voldemort.cluster.failuredetector
Class ThresholdFailureDetector

java.lang.Object
  extended by voldemort.cluster.failuredetector.AbstractFailureDetector
      extended by voldemort.cluster.failuredetector.AsyncRecoveryFailureDetector
          extended by voldemort.cluster.failuredetector.ThresholdFailureDetector
All Implemented Interfaces:
java.lang.Runnable, FailureDetector

public class ThresholdFailureDetector
extends AsyncRecoveryFailureDetector

ThresholdFailureDetector builds upon the AsyncRecoveryFailureDetector and provides a more lenient policy for marking nodes as unavailable. Fundamentally, for each node, the ThresholdFailureDetector keeps track of a "success ratio", which is the ratio of successful operations to total operations, and requires that ratio to meet or exceed a threshold. Every call to recordException or recordSuccess increments the total count, while only calls to recordSuccess increment the success count. Calls to recordSuccess therefore increase the success ratio, while calls to recordException decrease it.

As long as the success ratio meets or exceeds the threshold, the node is considered available. Once the success ratio dips below the threshold, the node is marked as unavailable. Because this class extends the AsyncRecoveryFailureDetector, an unavailable node is only marked as available again once a background thread has been able to contact the node asynchronously.
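
The counting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Voldemort implementation: the class name SuccessRatioSketch, its fields, the integer-percentage threshold, and the choice to report 100% when nothing has been recorded are all hypothetical.

```java
/** Hypothetical per-node counter; not part of Voldemort. */
class SuccessRatioSketch {

    private final int thresholdPercent; // assumed: threshold given as a whole percentage
    private long success;
    private long total;

    SuccessRatioSketch(int thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    /** A success increments both the success count and the total count. */
    synchronized void recordSuccess() {
        success++;
        total++;
    }

    /** A failure increments only the total count, which lowers the ratio. */
    synchronized void recordException() {
        total++;
    }

    /** Success ratio as a whole percentage; 100 when nothing is recorded (an assumption). */
    synchronized long successPercent() {
        return total == 0 ? 100 : (success * 100) / total;
    }

    /** True while the ratio meets or exceeds the threshold. */
    synchronized boolean meetsThreshold() {
        return successPercent() >= thresholdPercent;
    }
}
```

With a threshold of 80, nine successes followed by one failure leave the ratio at 90% and the node would still qualify as available; two further failures bring it to 75%, below the threshold.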

There is also a minimum number of requests that must occur before the success ratio is checked against the threshold. This prevents cases such as 1 failure out of 1 attempt yielding a success ratio of 0% and immediately marking the node as unavailable. In addition, there is a threshold interval: the success ratio for a given node is only "valid" for a certain period of time, after which it is reset. This prevents scenarios like 100,000,000 successful requests (and thus a 100% success ratio) overshadowing a subsequent stream of 10,000,000 failures, which would amount to only about 9% of the total and could leave the ratio above a given threshold.
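
The interaction between the minimum request count and the threshold interval can be sketched as below. This is an illustrative sketch only: the class name WindowedRatioSketch, its parameters, and the exact reset policy (clearing both counters when the interval has elapsed) are assumptions, and the current time is passed in explicitly to keep the example deterministic.

```java
/** Hypothetical sketch; names and reset policy are assumptions, not Voldemort's code. */
class WindowedRatioSketch {

    private final int thresholdPercent;
    private final long minimumRequests;
    private final long intervalMs;

    private long windowStartMs;
    private long success;
    private long total;

    WindowedRatioSketch(int thresholdPercent, long minimumRequests, long intervalMs, long nowMs) {
        this.thresholdPercent = thresholdPercent;
        this.minimumRequests = minimumRequests;
        this.intervalMs = intervalMs;
        this.windowStartMs = nowMs;
    }

    /** Records one outcome and returns whether the node still looks available. */
    synchronized boolean record(boolean isSuccess, long nowMs) {
        // Start a fresh window once the interval has elapsed,
        // so that old history cannot mask a new stream of failures.
        if (nowMs - windowStartMs >= intervalMs) {
            windowStartMs = nowMs;
            success = 0;
            total = 0;
        }
        total++;
        if (isSuccess) {
            success++;
        }
        // Too few samples in this window: do not judge the node yet.
        if (total < minimumRequests) {
            return true;
        }
        return (success * 100) / total >= thresholdPercent;
    }
}
```

With a minimum of 3 requests, the first two failures in a window do not affect availability; the third failure makes the ratio 0% and the check fails. A request arriving after the interval has elapsed starts from empty counters.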


Field Summary
 
Fields inherited from class voldemort.cluster.failuredetector.AbstractFailureDetector
failureDetectorConfig, idNodeStatusMap, listeners, logger
 
Constructor Summary
ThresholdFailureDetector(FailureDetectorConfig failureDetectorConfig)
           
 
Method Summary
protected  java.lang.String getCatastrophicError(UnreachableStoreException e)
           
 java.lang.String getNodeThresholdStats()
           
protected  void nodeRecovered(Node node)
          We delegate node recovery detection to the AsyncRecoveryFailureDetector class.
 void recordException(Node node, long requestTime, UnreachableStoreException e)
          Allows external callers to provide input to the FailureDetector that an error occurred when trying to access the node.
 void recordSuccess(Node node, long requestTime)
          Allows external callers to provide input to the FailureDetector that an access to the node succeeded.
protected  void update(Node node, boolean isSuccess, UnreachableStoreException e)
           
 
Methods inherited from class voldemort.cluster.failuredetector.AsyncRecoveryFailureDetector
destroy, isAvailable, run
 
Methods inherited from class voldemort.cluster.failuredetector.AbstractFailureDetector
addFailureDetectorListener, checkArgs, checkNodeArg, getAvailableNodeCount, getAvailableNodes, getConfig, getLastChecked, getNodeCount, getNodeStatus, getUnavailableNodes, removeFailureDetectorListener, setAvailable, setUnavailable, waitForAvailability
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ThresholdFailureDetector

public ThresholdFailureDetector(FailureDetectorConfig failureDetectorConfig)
Method Detail

recordException

public void recordException(Node node,
                            long requestTime,
                            UnreachableStoreException e)
Description copied from interface: FailureDetector
Allows external callers to provide input to the FailureDetector that an error occurred when trying to access the node. The implementation is free to use or ignore this input. It can be considered a "hint" to the FailureDetector rather than an absolute truth. For example, it is possible to call recordException for a given node and have an immediate call to isAvailable return true, depending on the implementation.

Specified by:
recordException in interface FailureDetector
Overrides:
recordException in class AsyncRecoveryFailureDetector
Parameters:
node - Node to check
requestTime - Length of time (in milliseconds) to perform request
e - Exception that occurred when trying to access the node

recordSuccess

public void recordSuccess(Node node,
                          long requestTime)
Description copied from interface: FailureDetector
Allows external callers to provide input to the FailureDetector that an access to the node succeeded. As with recordException, the implementation is free to use or ignore this input. It can be considered a "hint" to the FailureDetector rather than gospel truth.

Note for implementors: because of threading issues, multiple threads may attempt to access a node at the same time, with some attempts failing and some succeeding. In a classic last-one-in-wins scenario, it's possible for the failures to be recorded first and then the successes. It would be prudent for implementations not to immediately assume that the node is then available.

Specified by:
recordSuccess in interface FailureDetector
Overrides:
recordSuccess in class AsyncRecoveryFailureDetector
Parameters:
node - Node to check
requestTime - Length of time (in milliseconds) to perform request
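
A caller typically times each request and then reports the outcome through recordSuccess or recordException. The sketch below shows that pattern in a self-contained form. OutcomeSink and TimedCall are hypothetical stand-ins written for this example: OutcomeSink mirrors the shape of the two callbacks but uses a String node id and a plain Exception instead of Node and UnreachableStoreException, so it is not the FailureDetector interface itself.

```java
/** Stand-in for the two FailureDetector callbacks; hypothetical, uses String node ids. */
interface OutcomeSink {

    void recordSuccess(String node, long requestTimeMs);

    void recordException(String node, long requestTimeMs, Exception e);
}

/** Hypothetical helper that times a request and reports its outcome. */
class TimedCall {

    static <T> T call(String node,
                      java.util.concurrent.Callable<T> request,
                      OutcomeSink sink) throws Exception {
        long start = System.nanoTime();
        try {
            T result = request.call();
            // Report the elapsed time in milliseconds along with the success.
            sink.recordSuccess(node, elapsedMs(start));
            return result;
        } catch (Exception e) {
            // Report the failure as a hint, then let the caller handle the error.
            sink.recordException(node, elapsedMs(start), e);
            throw e;
        }
    }

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }
}
```

Because both callbacks are only hints, the caller still propagates the original exception rather than relying on the detector to act on it.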

getNodeThresholdStats

public java.lang.String getNodeThresholdStats()

nodeRecovered

protected void nodeRecovered(Node node)
We delegate node recovery detection to the AsyncRecoveryFailureDetector class. When it determines that the node has recovered, this callback is executed with the newly-recovered node.

Overrides:
nodeRecovered in class AsyncRecoveryFailureDetector

update

protected void update(Node node,
                      boolean isSuccess,
                      UnreachableStoreException e)

getCatastrophicError

protected java.lang.String getCatastrophicError(UnreachableStoreException e)


Jay Kreps, Roshan Sumbaly, Alex Feinberg, Bhupesh Bansal, Lei Gao, Chinmay Soman, Vinoth Chandar, Zhongjie Wu