voldemort.cluster.failuredetector
Interface FailureDetector
- All Known Implementing Classes:
- AbstractFailureDetector, AsyncRecoveryFailureDetector, BannagePeriodFailureDetector, NoopFailureDetector, ThresholdFailureDetector
public interface FailureDetector
The FailureDetector API is used to determine a cluster's node availability.
Machines and servers can go down at any time and usage of this API can be
used by request routing in an attempt to avoid unavailable servers.
A FailureDetector is specific to a given cluster and as such there should
only be one instance per cluster per JVM.
Implementations can differ dramatically in how they approach the problem of
determining node availability. Some implementations may rely heavily on
invocations of recordException and recordSuccess to determine availability.
The result is that such a FailureDetector implementation performs little
logic other than bookkeeping, implicitly trusting users of the API. However,
other implementations may be more selective in using results of any external
users' calls to the recordException and recordSuccess methods.
Implementations may use these error/success calls as "hints" or may ignore
them outright.
To contrast the two approaches to implementing:
- Externally-based implementations use algorithms that rely heavily
on users for correctness. For example, let's say a user attempts to contact a
node which then fails. A responsible caller should invoke the recordException
API to inform the FailureDetector that an error has taken place for the node.
The FailureDetector itself hasn't really determined availability itself. So
if the caller is incorrect or buggy, the FailureDetector's accuracy is
compromised.
- Internally-based implementations rely on their own determination
of node availability. For example, a heartbeat style implementation may pay
only a modicum of attention when its recordException and/or recordSuccess
methods are invoked by outside callers.
Naturally there is a spectrum of implementations and external calls to
recordException and recordSuccess should (not must) provide some input to the
internal algorithm.
- See Also:
RoutedStore
Method Summary |
void |
addFailureDetectorListener(FailureDetectorListener failureDetectorListener)
Adds a FailureDetectorListener instance that can receive event callbacks
about node availability state changes. |
void |
destroy()
Cleans up any open resources in preparation for shutdown. |
int |
getAvailableNodeCount()
Returns the number of nodes that are considered to be available at the
time of calling. |
FailureDetectorConfig |
getConfig()
Retrieves the FailureDetectorConfig instance with which this
FailureDetector was constructed. |
long |
getLastChecked(Node node)
Returns the number of milliseconds since the node was last checked for
availability. |
int |
getNodeCount()
Returns the number of nodes that are in the set of all nodes at the time
of calling. |
boolean |
isAvailable(Node node)
Determines if the node is available or offline. |
void |
recordException(Node node,
long requestTime,
UnreachableStoreException e)
Allows external callers to provide input to the FailureDetector that an
error occurred when trying to access the node. |
void |
recordSuccess(Node node,
long requestTime)
Allows external callers to provide input to the FailureDetector that an
access to the node succeeded. |
void |
removeFailureDetectorListener(FailureDetectorListener failureDetectorListener)
Removes a FailureDetectorListener instance from the event listener list. |
void |
waitForAvailability(Node node)
waitForAvailability causes the calling thread to block until the given
Node is available. |
isAvailable
boolean isAvailable(Node node)
- Determines if the node is available or offline.
The isAvailable method is a simple boolean operation to determine if the
node in question is available. As expected, the result of this call is an
approximation given race conditions. However, the FailureDetector should
do its best to determine the then-current state of the cluster to produce
a minimum of false negatives and false positives.
Note: this determination is approximate and differs based upon the
algorithm used by the implementation.
- Parameters:
node
- Node to check
- Returns:
- True if available, false otherwise
getLastChecked
long getLastChecked(Node node)
- Returns the number of milliseconds since the node was last checked for
availability. Because of its lack of precision, this should really only
be used for status/reporting.
- Parameters:
node
- Node to check
- Returns:
- Number of milliseconds since the node was last checked for
availability
recordSuccess
void recordSuccess(Node node,
long requestTime)
- Allows external callers to provide input to the FailureDetector that an
access to the node succeeded. As with recordException, the implementation
is free to use or ignore this input. It can be considered a "hint" to the
FailureDetector rather than gospel truth.
Note for implementors: because of threading issues it's possible
for multiple threads to attempt access to a node and some fail and some
succeed. In a classic last-one-in-wins scenario, it's possible for the
failures to be recorded first and then the successes. It would be prudent
for implementations not to immediately assume that the node is then
available.
- Parameters:
node
- Node to checkrequestTime
- Length of time (in milliseconds) to perform request
recordException
void recordException(Node node,
long requestTime,
UnreachableStoreException e)
- Allows external callers to provide input to the FailureDetector that an
error occurred when trying to access the node. The implementation is free
to use or ignore this input. It can be considered a "hint" to the
FailureDetector rather than an absolute truth. For example, it is
possible to call recordException for a given node and have an immediate
call to isAvailable return true, depending on the implementation.
- Parameters:
node
- Node to checkrequestTime
- Length of time (in milliseconds) to perform requeste
- Exception that occurred when trying to access the node
addFailureDetectorListener
void addFailureDetectorListener(FailureDetectorListener failureDetectorListener)
- Adds a FailureDetectorListener instance that can receive event callbacks
about node availability state changes.
Notes:
- Make sure to clean up the listener by invoking
removeFailureDetectorListener
- Make sure that the FailureDetectorListener implementation properly
implements the hashCode/equals methods
Note for implementors: When adding a FailureDetectorListener that
has already been added, this should not add a second instance but should
effectively be a no-op.
- Parameters:
failureDetectorListener
- FailureDetectorListener that receives
events- See Also:
removeFailureDetectorListener(voldemort.cluster.failuredetector.FailureDetectorListener)
removeFailureDetectorListener
void removeFailureDetectorListener(FailureDetectorListener failureDetectorListener)
- Removes a FailureDetectorListener instance from the event listener list.
Note for implementors: When removing a FailureDetectorListener
that has already been removed or was never in the list, this should not
raise any errors but should effectively be a no-op.
- Parameters:
failureDetectorListener
- FailureDetectorListener that was receiving
events- See Also:
addFailureDetectorListener(voldemort.cluster.failuredetector.FailureDetectorListener)
getConfig
FailureDetectorConfig getConfig()
- Retrieves the FailureDetectorConfig instance with which this
FailureDetector was constructed.
- Returns:
- FailureDetectorConfig
getAvailableNodeCount
int getAvailableNodeCount()
- Returns the number of nodes that are considered to be available at the
time of calling. Letting
n
= the results of
getNodeCount()
, the return value is bounded in the range
[0..n]
.
- Returns:
- Number of available nodes
- See Also:
getNodeCount()
getNodeCount
int getNodeCount()
- Returns the number of nodes that are in the set of all nodes at the time
of calling.
- Returns:
- Number of nodes
- See Also:
getAvailableNodeCount()
waitForAvailability
void waitForAvailability(Node node)
throws java.lang.InterruptedException
- waitForAvailability causes the calling thread to block until the given
Node is available. If the node is already available, this will
simply return.
- Parameters:
node
- Node on which to wait
- Throws:
java.lang.InterruptedException
- Thrown if the thread is interrupted
destroy
void destroy()
- Cleans up any open resources in preparation for shutdown.
Note for implementors: After this method is called it is assumed
that attempts to call the other methods will either silently fail, throw
errors, or return stale information.
Jay Kreps, Roshan Sumbaly, Alex Feinberg, Bhupesh Bansal, Lei Gao, Chinmay Soman, Vinoth Chandar, Zhongjie Wu