voldemort.cluster.failuredetector
Class ThresholdFailureDetector
java.lang.Object
voldemort.cluster.failuredetector.AbstractFailureDetector
voldemort.cluster.failuredetector.AsyncRecoveryFailureDetector
voldemort.cluster.failuredetector.ThresholdFailureDetector
- All Implemented Interfaces:
- java.lang.Runnable, FailureDetector
public class ThresholdFailureDetector
- extends AsyncRecoveryFailureDetector
ThresholdFailureDetector builds upon the AsyncRecoveryFailureDetector and
provides a more lenient approach for marking nodes as unavailable. Fundamentally, for
each node, the ThresholdFailureDetector keeps track of a "success ratio"
which is a ratio of successful operations to total operations and requires
that ratio to meet or exceed a threshold. That is, every call to
recordException or recordSuccess increments the total count while only calls
to recordSuccess increment the success count. Calls to recordSuccess
increase the success ratio while calls to recordException by contrast
decrease the success ratio.
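The bookkeeping described above can be sketched roughly as follows. This is an illustrative sketch only, not the Voldemort implementation: the class name SuccessRatioSketch, its methods, and the choice to report the ratio as an integer percentage (and to treat "no operations yet" as 100%) are all assumptions made for the example.

```java
// Hypothetical sketch of per-node success-ratio counting.
// Not the actual ThresholdFailureDetector code.
final class SuccessRatioSketch {
    private long success; // incremented only when a success is recorded
    private long total;   // incremented for every recorded outcome

    public void recordSuccess() {
        success++;
        total++;
    }

    public void recordException() {
        total++;
    }

    // Success ratio as an integer percentage in [0, 100].
    // Assumption: with no recorded operations the ratio is treated as 100.
    public int successRatioPercent() {
        return total == 0 ? 100 : (int) (success * 100 / total);
    }

    // The ratio must meet or exceed the threshold.
    public boolean meetsThreshold(int thresholdPercent) {
        return successRatioPercent() >= thresholdPercent;
    }

    public static void main(String[] args) {
        SuccessRatioSketch stats = new SuccessRatioSketch();
        for (int i = 0; i < 9; i++) {
            stats.recordSuccess();
        }
        stats.recordException();
        System.out.println(stats.successRatioPercent()); // prints 90
        System.out.println(stats.meetsThreshold(95));    // prints false
    }
}
```

With nine successes and one exception the ratio is 90%, so a threshold of 95 is not met while a threshold of 90 is.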
As long as the success ratio continues to meet or exceed the threshold, the node
will be considered available. Once the success ratio dips below the
threshold, the node is marked as unavailable. As this class extends the
AsyncRecoveryFailureDetector, an unavailable node is only marked as available
once a background thread has been able to contact the node asynchronously.
There is also a minimum number of requests that must occur before the success
ratio is checked against the threshold. This is to prevent occurrences like 1
failure out of 1 attempt yielding a success ratio of 0%. There is also a
threshold interval which means that the success ratio for a given node is
only "valid" for a certain period of time, after which it is reset. This
prevents scenarios like 100,000,000 successful requests (and thus a 100%
success ratio) overshadowing a subsequent stream of 10,000,000 failures
because the failures are only about 9% of the total, leaving the success ratio
above a given threshold.
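The minimum request count and the threshold interval can be illustrated with a small sketch. Everything here is hypothetical: the class name ThresholdWindowSketch, the constructor parameters, the values 95 / 10 / 1000 ms, and the explicit nowMs argument (used instead of a system clock so the behavior is easy to follow) are assumptions for the example, not Voldemort's API or defaults. Recovery is deliberately not modeled, since the source states that a node is only marked available again by the background thread of AsyncRecoveryFailureDetector.

```java
// Hypothetical sketch of a success ratio with a minimum count and a reset interval.
// Not the actual ThresholdFailureDetector code.
final class ThresholdWindowSketch {
    private final int thresholdPercent; // ratio must meet or exceed this
    private final long minimumCount;    // requests needed before the ratio is checked
    private final long intervalMs;      // how long the counters stay valid
    private long windowStartMs;
    private long success;
    private long total;
    private boolean available = true;

    ThresholdWindowSketch(int thresholdPercent, long minimumCount, long intervalMs, long nowMs) {
        this.thresholdPercent = thresholdPercent;
        this.minimumCount = minimumCount;
        this.intervalMs = intervalMs;
        this.windowStartMs = nowMs;
    }

    public void record(boolean isSuccess, long nowMs) {
        // Once the interval has elapsed, old counts are discarded.
        if (nowMs - windowStartMs >= intervalMs) {
            windowStartMs = nowMs;
            success = 0;
            total = 0;
        }
        total++;
        if (isSuccess) {
            success++;
        }
        // The ratio is only checked after the minimum number of requests.
        if (total >= minimumCount && success * 100 / total < thresholdPercent) {
            available = false; // recovery is left to a separate mechanism
        }
    }

    public boolean isAvailable() {
        return available;
    }

    public static void main(String[] args) {
        // One failure out of one attempt does not trip the detector.
        ThresholdWindowSketch a = new ThresholdWindowSketch(95, 10, 1000, 0);
        a.record(false, 0);
        System.out.println(a.isAvailable()); // prints true

        // Old successes do not mask new failures once the interval has passed.
        ThresholdWindowSketch b = new ThresholdWindowSketch(95, 10, 1000, 0);
        for (int i = 0; i < 1000; i++) {
            b.record(true, 0);
        }
        for (int i = 0; i < 10; i++) {
            b.record(false, 1000);
        }
        System.out.println(b.isAvailable()); // prints false
    }
}
```

Without the reset, the second scenario would yield 1000 successes out of 1010 requests (99%), and the ten consecutive failures would go unnoticed at a 95% threshold.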
Methods inherited from class voldemort.cluster.failuredetector.AbstractFailureDetector
addFailureDetectorListener, checkArgs, checkNodeArg, getAvailableNodeCount, getAvailableNodes, getConfig, getLastChecked, getNodeCount, getNodeStatus, getUnavailableNodes, removeFailureDetectorListener, setAvailable, setUnavailable, waitForAvailability
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ThresholdFailureDetector
public ThresholdFailureDetector(FailureDetectorConfig failureDetectorConfig)
recordException
public void recordException(Node node,
long requestTime,
UnreachableStoreException e)
- Description copied from interface:
FailureDetector
- Allows external callers to provide input to the FailureDetector that an
error occurred when trying to access the node. The implementation is free
to use or ignore this input. It can be considered a "hint" to the
FailureDetector rather than an absolute truth. For example, it is
possible to call recordException for a given node and have an immediate
call to isAvailable return true, depending on the implementation.
- Specified by:
recordException
in interface FailureDetector
- Overrides:
recordException
in class AsyncRecoveryFailureDetector
- Parameters:
node
- Node to check
requestTime
- Length of time (in milliseconds) to perform request
e
- Exception that occurred when trying to access the node
recordSuccess
public void recordSuccess(Node node,
long requestTime)
- Description copied from interface:
FailureDetector
- Allows external callers to provide input to the FailureDetector that an
access to the node succeeded. As with recordException, the implementation
is free to use or ignore this input. It can be considered a "hint" to the
FailureDetector rather than gospel truth.
Note for implementors: because of threading issues it's possible
for multiple threads to attempt access to a node and some fail and some
succeed. In a classic last-one-in-wins scenario, it's possible for the
failures to be recorded first and then the successes. It would be prudent
for implementations not to immediately assume that the node is then
available.
- Specified by:
recordSuccess
in interface FailureDetector
- Overrides:
recordSuccess
in class AsyncRecoveryFailureDetector
- Parameters:
node
- Node to check
requestTime
- Length of time (in milliseconds) to perform request
getNodeThresholdStats
public java.lang.String getNodeThresholdStats()
nodeRecovered
protected void nodeRecovered(Node node)
- We delegate node recovery detection to the
AsyncRecoveryFailureDetector
class. When it determines that the
node has recovered, this callback is executed with the newly-recovered
node.
- Overrides:
nodeRecovered
in class AsyncRecoveryFailureDetector
update
protected void update(Node node,
boolean isSuccess,
UnreachableStoreException e)
getCatastrophicError
protected java.lang.String getCatastrophicError(UnreachableStoreException e)
Jay Kreps, Roshan Sumbaly, Alex Feinberg, Bhupesh Bansal, Lei Gao, Chinmay Soman, Vinoth Chandar, Zhongjie Wu