voldemort.tools
Class Repartitioner

java.lang.Object
  extended by voldemort.tools.Repartitioner

public class Repartitioner
extends java.lang.Object

RepartitionUtils provides functions that balance the distribution of partitions across a cluster.


Field Summary
static int DEFAULT_GREEDY_MAX_PARTITIONS_PER_NODE
          Default (max) number of partition IDs per node to consider, if greedy swaps are enabled.
static int DEFAULT_GREEDY_MAX_PARTITIONS_PER_ZONE
          Default (max) number of partition IDs from all the other nodes in the cluster to consider, if greedy swaps are enabled.
static int DEFAULT_GREEDY_SWAP_ATTEMPTS
          Default number of greedy partition ID swaps to perform, if greedy swaps are enabled.
static java.util.List<java.lang.Integer> DEFAULT_GREEDY_SWAP_ZONE_IDS
          Default setting for which zone IDs to run greedy swap algorithm.
static int DEFAULT_MAX_CONTIGUOUS_PARTITIONS
          Default limit on length of contiguous partition ID run within a zone.
static int DEFAULT_RANDOM_SWAP_ATTEMPTS
          Default number of random partition ID swaps to attempt, if random swaps are enabled.
static int DEFAULT_RANDOM_SWAP_SUCCESSES
          Default number of successful random swaps (i.e., the random swap improves balance) after which reparitioning stops, if random swaps are enabled.
static java.util.List<java.lang.Integer> DEFAULT_RANDOM_SWAP_ZONE_IDS
          Default setting for which zone IDs to run random swap algorithm.
static int DEFAULT_REPARTITION_ATTEMPTS
          Recommended (default) number of times to attempt repartitioning.
 
Constructor Summary
Repartitioner()
           
 
Method Summary
static Cluster balanceContiguousPartitionsPerZone(Cluster nextCandidateCluster, int maxContiguousPartitionsPerZone)
          Ensures that no more than maxContiguousPartitionsPerZone partitions are contiguous within a single zone.
static Cluster balancePrimaryPartitions(Cluster nextCandidateCluster, boolean balanceZones)
          This method balances primary partitions among nodes within a zone, and optionally primary partitions among zones.
static java.util.HashMap<java.lang.Integer,java.util.List<java.lang.Integer>> getBalancedNumberOfPrimaryPartitionsPerNode(Cluster nextCandidateCluster, java.util.Map<java.lang.Integer,java.lang.Integer> targetPartitionsPerZone)
          Determines how many primary partitions each node within each zone should have.
static Pair<java.util.HashMap<Node,java.lang.Integer>,java.util.HashMap<Node,java.lang.Integer>> getDonorsAndStealersForBalance(Cluster nextCandidateCluster, java.util.Map<java.lang.Integer,java.util.List<java.lang.Integer>> numPartitionsPerNodePerZone)
          Assign target number of partitions per node to specific node IDs.
static Cluster greedyShufflePartitions(Cluster nextCandidateCluster, int greedyAttempts, int greedySwapMaxPartitionsPerNode, int greedySwapMaxPartitionsPerZone, java.util.List<java.lang.Integer> greedySwapZoneIds, java.util.List<StoreDefinition> storeDefs)
          Within a single zone, tries swapping some minimum number of random partitions per node with some minimum number of random partitions from other nodes within the zone.
static Cluster randomShufflePartitions(Cluster nextCandidateCluster, int randomSwapAttempts, int randomSwapSuccesses, java.util.List<java.lang.Integer> randomSwapZoneIds, java.util.List<StoreDefinition> storeDefs)
          Randomly shuffle partitions between nodes within every zone.
static Cluster repartition(Cluster currentCluster, java.util.List<StoreDefinition> currentStoreDefs, Cluster interimCluster, java.util.List<StoreDefinition> finalStoreDefs, java.lang.String outputDir, int attempts, boolean disableNodeBalancing, boolean disableZoneBalancing, boolean enableRandomSwaps, int randomSwapAttempts, int randomSwapSuccesses, java.util.List<java.lang.Integer> randomSwapZoneIds, boolean enableGreedySwaps, int greedySwapAttempts, int greedySwapMaxPartitionsPerNode, int greedySwapMaxPartitionsPerZone, java.util.List<java.lang.Integer> greedySwapZoneIds, int maxContiguousPartitionsPerZone)
          Runs a number of distinct algorithms over the specified clusters/store defs to better balance partition IDs over nodes such that all nodes have similar iops and capacity usage.
static Cluster repeatedlyBalanceContiguousPartitionsPerZone(Cluster nextCandidateCluster, int maxContiguousPartitionsPerZone)
          Loops over cluster and repeatedly tries to break up contiguous runs of partitions.
static Cluster swapGreedyRandomPartitions(Cluster nextCandidateCluster, java.util.List<java.lang.Integer> nodeIds, int greedySwapMaxPartitionsPerNode, int greedySwapMaxPartitionsPerZone, java.util.List<StoreDefinition> storeDefs)
          For each node in specified zones, tries swapping some minimum number of random partitions per node with some minimum number of random partitions from other specified nodes.
static Cluster swapPartitions(Cluster nextCandidateCluster, int nodeIdA, int partitionIdA, int nodeIdB, int partitionIdB)
          Swaps two specified partitions.
static Cluster swapRandomPartitionsAmongNodes(Cluster nextCandidateCluster, java.util.List<java.lang.Integer> nodeIds)
          Shuffles partitions among all nodes specified.
static Cluster swapRandomPartitionsWithinZone(Cluster nextCandidateCluster, int zoneId)
          Within a single zone, swaps one random partition on one random node with another random partition on different random node.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_REPARTITION_ATTEMPTS

public static final int DEFAULT_REPARTITION_ATTEMPTS
Recommended (default) number of times to attempt repartitioning.

See Also:
Constant Field Values

DEFAULT_RANDOM_SWAP_ATTEMPTS

public static final int DEFAULT_RANDOM_SWAP_ATTEMPTS
Default number of random partition ID swaps to attempt, if random swaps are enabled.

See Also:
Constant Field Values

DEFAULT_RANDOM_SWAP_SUCCESSES

public static final int DEFAULT_RANDOM_SWAP_SUCCESSES
Default number of successful random swaps (i.e., the random swap improves balance) after which reparitioning stops, if random swaps are enabled.

See Also:
Constant Field Values

DEFAULT_RANDOM_SWAP_ZONE_IDS

public static final java.util.List<java.lang.Integer> DEFAULT_RANDOM_SWAP_ZONE_IDS
Default setting for which zone IDs to run random swap algorithm. Empty implies all zones will be considered.


DEFAULT_GREEDY_SWAP_ATTEMPTS

public static final int DEFAULT_GREEDY_SWAP_ATTEMPTS
Default number of greedy partition ID swaps to perform, if greedy swaps are enabled. Each greedy partition ID swaps considers (some number of partitions per node) X (some number of partitions from rest of cluster) and selects the best such swap.

See Also:
Constant Field Values

DEFAULT_GREEDY_SWAP_ZONE_IDS

public static final java.util.List<java.lang.Integer> DEFAULT_GREEDY_SWAP_ZONE_IDS
Default setting for which zone IDs to run greedy swap algorithm. Empty implies all zones will be considered.


DEFAULT_GREEDY_MAX_PARTITIONS_PER_NODE

public static final int DEFAULT_GREEDY_MAX_PARTITIONS_PER_NODE
Default (max) number of partition IDs per node to consider, if greedy swaps are enabled.

See Also:
Constant Field Values

DEFAULT_GREEDY_MAX_PARTITIONS_PER_ZONE

public static final int DEFAULT_GREEDY_MAX_PARTITIONS_PER_ZONE
Default (max) number of partition IDs from all the other nodes in the cluster to consider, if greedy swaps are enabled.

See Also:
Constant Field Values

DEFAULT_MAX_CONTIGUOUS_PARTITIONS

public static final int DEFAULT_MAX_CONTIGUOUS_PARTITIONS
Default limit on length of contiguous partition ID run within a zone. 0 implies no limit on such runs.

See Also:
Constant Field Values
Constructor Detail

Repartitioner

public Repartitioner()
Method Detail

repartition

public static Cluster repartition(Cluster currentCluster,
                                  java.util.List<StoreDefinition> currentStoreDefs,
                                  Cluster interimCluster,
                                  java.util.List<StoreDefinition> finalStoreDefs,
                                  java.lang.String outputDir,
                                  int attempts,
                                  boolean disableNodeBalancing,
                                  boolean disableZoneBalancing,
                                  boolean enableRandomSwaps,
                                  int randomSwapAttempts,
                                  int randomSwapSuccesses,
                                  java.util.List<java.lang.Integer> randomSwapZoneIds,
                                  boolean enableGreedySwaps,
                                  int greedySwapAttempts,
                                  int greedySwapMaxPartitionsPerNode,
                                  int greedySwapMaxPartitionsPerZone,
                                  java.util.List<java.lang.Integer> greedySwapZoneIds,
                                  int maxContiguousPartitionsPerZone)
Runs a number of distinct algorithms over the specified clusters/store defs to better balance partition IDs over nodes such that all nodes have similar iops and capacity usage. The algorithms (in order): This method is used for three key use cases:

Parameters:
currentCluster - current cluster
currentStoreDefs - current store defs
interimCluster - interim cluster; needed for cluster or zone expansion, otherwise pass in same as currentCluster.
finalStoreDefs - final store defs; needed for zone expansion, otherwise pass in same as currentStores.
outputDir - Directory in which to dump cluster xml and analysis files.
attempts - Number of distinct repartitionings to attempt, the best of which is returned.
disableNodeBalancing - Disables the core algorithm that balances primaries among nodes within each zone.
disableZoneBalancing - For the core algorithm that balances primaries among nodes in each zone, disable balancing primaries among zones.
enableRandomSwaps - Enables random swap optimization.
randomSwapAttempts -
randomSwapSuccesses -
randomSwapZoneIds -
enableGreedySwaps - Enables greedy swap optimization.
greedySwapAttempts -
greedySwapMaxPartitionsPerNode -
greedySwapMaxPartitionsPerZone -
greedySwapZoneIds -
maxContiguousPartitionsPerZone -
Returns:
"final cluster" that has had all specified balancing algorithms run against it. The number of zones and number of nodes will match that of the specified "interim cluster".

getBalancedNumberOfPrimaryPartitionsPerNode

public static java.util.HashMap<java.lang.Integer,java.util.List<java.lang.Integer>> getBalancedNumberOfPrimaryPartitionsPerNode(Cluster nextCandidateCluster,
                                                                                                                                 java.util.Map<java.lang.Integer,java.lang.Integer> targetPartitionsPerZone)
Determines how many primary partitions each node within each zone should have. The list of integers returned per zone is the same length as the number of nodes in that zone.

Parameters:
nextCandidateCluster -
targetPartitionsPerZone -
Returns:
A map of zoneId to list of target number of partitions per node within zone.

getDonorsAndStealersForBalance

public static Pair<java.util.HashMap<Node,java.lang.Integer>,java.util.HashMap<Node,java.lang.Integer>> getDonorsAndStealersForBalance(Cluster nextCandidateCluster,
                                                                                                                                       java.util.Map<java.lang.Integer,java.util.List<java.lang.Integer>> numPartitionsPerNodePerZone)
Assign target number of partitions per node to specific node IDs. Then, separates Nodes into donorNodes and stealerNodes based on whether the node needs to donate or steal primary partitions.

Parameters:
nextCandidateCluster -
numPartitionsPerNodePerZone -
Returns:
a Pair. First element is donorNodes, second element is stealerNodes. Each element in the pair is a HashMap of Node to Integer where the integer value is the number of partitions to store.

balancePrimaryPartitions

public static Cluster balancePrimaryPartitions(Cluster nextCandidateCluster,
                                               boolean balanceZones)
This method balances primary partitions among nodes within a zone, and optionally primary partitions among zones. The balancing is done at the level of partitionIds. Such partition Id movement may, or may not, result in data movement during a rebalancing. See RebalancePlan for the object responsible for determining which partition-stores move where for a specific repartitioning.

Parameters:
nextCandidateCluster -
balanceZones - indicates whether or not number of primary partitions per zone should be balanced.
Returns:
updated cluster

repeatedlyBalanceContiguousPartitionsPerZone

public static Cluster repeatedlyBalanceContiguousPartitionsPerZone(Cluster nextCandidateCluster,
                                                                   int maxContiguousPartitionsPerZone)
Loops over cluster and repeatedly tries to break up contiguous runs of partitions. After each phase of breaking up contiguous partitions, random partitions are selected to move between zones to balance the number of partitions in each zone. The second phase may re-introduce contiguous partition runs in another zone. Therefore, this overall process is repeated multiple times.

Parameters:
nextCandidateCluster -
maxContiguousPartitionsPerZone - See RebalanceCLI.
Returns:
updated cluster

balanceContiguousPartitionsPerZone

public static Cluster balanceContiguousPartitionsPerZone(Cluster nextCandidateCluster,
                                                         int maxContiguousPartitionsPerZone)
Ensures that no more than maxContiguousPartitionsPerZone partitions are contiguous within a single zone. Moves the necessary partitions to break up contiguous runs from each zone to some other random zone/node. There is some chance that such random moves could result in contiguous partitions in other zones.

Parameters:
nextCandidateCluster - cluster metadata
maxContiguousPartitionsPerZone - See RebalanceCLI.
Returns:
Return updated cluster metadata.

swapPartitions

public static Cluster swapPartitions(Cluster nextCandidateCluster,
                                     int nodeIdA,
                                     int partitionIdA,
                                     int nodeIdB,
                                     int partitionIdB)
Swaps two specified partitions. Pair-wase partition swapping may be more prone to local minima than larger perturbations. Could consider "swapping" a list of . This would allow a few nodes to be identified (random # btw 2-5?) and then "swapped" (shuffled? rotated?).

Returns:
modified cluster metadata.

swapRandomPartitionsWithinZone

public static Cluster swapRandomPartitionsWithinZone(Cluster nextCandidateCluster,
                                                     int zoneId)
Within a single zone, swaps one random partition on one random node with another random partition on different random node.

Parameters:
nextCandidateCluster -
zoneId - Zone ID within which to shuffle partitions
Returns:
updated cluster

swapRandomPartitionsAmongNodes

public static Cluster swapRandomPartitionsAmongNodes(Cluster nextCandidateCluster,
                                                     java.util.List<java.lang.Integer> nodeIds)
Shuffles partitions among all nodes specified.

Parameters:
nextCandidateCluster -
nodeIds -
Returns:
shuffled cluster

randomShufflePartitions

public static Cluster randomShufflePartitions(Cluster nextCandidateCluster,
                                              int randomSwapAttempts,
                                              int randomSwapSuccesses,
                                              java.util.List<java.lang.Integer> randomSwapZoneIds,
                                              java.util.List<StoreDefinition> storeDefs)
Randomly shuffle partitions between nodes within every zone.

Parameters:
nextCandidateCluster - cluster object.
randomSwapAttempts - See RebalanceCLI.
randomSwapSuccesses - See RebalanceCLI.
randomSwapZoneIds - The set of zoneIds to consider. Each zone is done independently.
storeDefs - List of store definitions
Returns:
updated cluster

swapGreedyRandomPartitions

public static Cluster swapGreedyRandomPartitions(Cluster nextCandidateCluster,
                                                 java.util.List<java.lang.Integer> nodeIds,
                                                 int greedySwapMaxPartitionsPerNode,
                                                 int greedySwapMaxPartitionsPerZone,
                                                 java.util.List<StoreDefinition> storeDefs)
For each node in specified zones, tries swapping some minimum number of random partitions per node with some minimum number of random partitions from other specified nodes. Chooses the best swap in each iteration. Large values of the greedSwapMaxPartitions... arguments make this method equivalent to comparing every possible swap. This may get very expensive. So if a node had partitions P1, P2, P3 and P4 and the other partitions set was Q1, Q2, Q3, Q4, Q5 The combinations that will be tried for swapping will be the cartesian product of the two sets. That is, {P1, Q1}, {P2, Q2}...{P2,Q1}, {P2,Q2}, in total 20 such swap pairs will be generated. The best among these swap pairs will be chosen.

Parameters:
nextCandidateCluster -
nodeIds - Node IDs within which to shuffle partitions
greedySwapMaxPartitionsPerNode - See RebalanceCLI.
greedySwapMaxPartitionsPerZone - See RebalanceCLI.
storeDefs -
Returns:
updated cluster

greedyShufflePartitions

public static Cluster greedyShufflePartitions(Cluster nextCandidateCluster,
                                              int greedyAttempts,
                                              int greedySwapMaxPartitionsPerNode,
                                              int greedySwapMaxPartitionsPerZone,
                                              java.util.List<java.lang.Integer> greedySwapZoneIds,
                                              java.util.List<StoreDefinition> storeDefs)
Within a single zone, tries swapping some minimum number of random partitions per node with some minimum number of random partitions from other nodes within the zone. Chooses the best swap in each iteration. Large values of the greedSwapMaxPartitions... arguments make this method equivalent to comparing every possible swap. This is very expensive. Normal case should be : #zones X #nodes/zone X max partitions/node X max partitions/zone

Parameters:
nextCandidateCluster - cluster object.
greedyAttempts - See RebalanceCLI.
greedySwapMaxPartitionsPerNode - See RebalanceCLI.
greedySwapMaxPartitionsPerZone - See RebalanceCLI.
greedySwapZoneIds - The set of zoneIds to consider. Each zone is done independently.
storeDefs -
Returns:
updated cluster


Jay Kreps, Roshan Sumbaly, Alex Feinberg, Bhupesh Bansal, Lei Gao, Chinmay Soman, Vinoth Chandar, Zhongjie Wu