Outline

Overview
Terminology
Design and Implementation
Tools
Utilities
Run book

Overview

Rebalancing a Voldemort instance involves deciding upon a repartitioning that re-assigns some partitions to other servers in the cluster and then a rebalancing that migrates data to achieve the desired new layout. The supported use cases for rebalancing are as follows:

Adding new nodes to existing cluster (cluster expansion)
Load balancing of existing cluster (shuffle)
Expansion of cluster into new zone (zone expansion)

In the future, rebalancing should also support the following additional use cases:

Shrinking number of nodes (cluster contraction)
Shrinking number of zones (zone contraction)

The rebalancing process is intended to be able to rebalance a Voldemort instance, while providing the following guarantees:

No downtime.
Transparent to client in terms of functionality and performance.
Maintain data consistency guarantees while rebalancing, as well as in the case of aborting a rebalance.

Terminology

Controller: Also known as the "Rebalance Controller" and "RebalanceController". Executes rebalancing to move from existing configuration (cluster.xml and stores.xml) to specified configuration (final-cluster.xml and final-stores.xml).
Plan: Also known as a "Rebalance Plan" and "RebalancePlan". A list of Batches to complete the rebalance. Between each Batch of work, cluster metadata is updated.
Batch: Also known as a "Rebalance Batch" and "RebalanceBatchPlan". An unordered list of the Tasks necessary to complete this Batch of rebalancing. The size of a Batch is specified in the number of primary partitions completely rebalanced.
Task: Also known as "Rebalance Task" and "RebalanceTask". The list of all partition-stores that must be cloned from the donor-node to the stealer-node to complete this Batch.
Partition-store: The smallest unit of data cloning: a single partition's worth of data for a single store.
Stealer-node: Also known as "Stealer". The node that receives data in a Task. Note that the Stealer-node drives rebalancing (see class StealerBasedDonorTask).
Donor-node: Also known as "Donor". The node that gives up data in a Task.
Scheduler: Also known as the "Rebalance Scheduler" and "RebalanceScheduler". Used by the Controller to determine which Task to schedule next. The Scheduler ensures Tasks are scheduled up to some level of parallelism and also ensures that no one node participates in more than one Task at a time.
Zone N-Ary: Also known as "Zone N-Aries", "Zone Primary"/"Zone 0-ary", "Zone Secondary"/"Zone 1-ary", and "Zone Tertiary"/"Zone 2-ary". Every store has a replication factor. In each zone of a zoned cluster, or in the single zone of a non-zoned cluster, the replication factor for a store dictates how many Zone N-Aries there will be (for that store). Stores that have the same replication factor have the same number of Zone N-Aries. Zone N-Aries of the same degree (e.g., all 1-aries) are common across all stores that have such an N-Ary. Zone N-Aries are used throughout rebalancing execution do determine stealer-donor relationships.
Proxy Bridge: Established for the duration of a Batch. One Proxy Bridge is established for every stealer-donor pair in a Batch.
Proxy-Put: A server that is rebalancing does a Proxy-Put for any put issued to a partition-store that will be cloned to it in the current Batch, i.e. for any partition-store that the server acts as a Stealer in this Batch. A server locally does the put and then, in the background, asynchronously issues the put to the "old partition-store".
Proxy-Get: A server that is rebalancing does a Proxy-Get for any get (or getall) issued to a partition-store that will be cloned to it in the current Batch, i.e. for any partition-store that the server acts as a Stealer in this Batch. First, a server locally does the get. If the local get returns null, then it issues a Proxy-Get.
Clone: Rebalancing clones partition-stores from Donor-Nodes to Stealer-Nodes. After rebalancing completes, "old" data for historic partition-stores still exists on Donor-Nodes. Logically, you can think of data "migrating" during a rebalance, but practically, it is actually cloned.
Orphaned Keys: Also known as orphans, orphaned values, orphaned entries, and orphaned partition-stores. Once a Batch of rebalancing completes, some partition-stores, on donor-nodes, are no longer relevant. All of the keys/values/entries in these partition-stores have been orphaned, i.e. they consume disk space but are of no value.
Repair Job: An administrative task that scans a local disk and deletes all entries in orphaned partition-stores.

Design and Implementation

Rebalancing involves two sets of tooling: offline and online. The offline tooling produces repartitionings of clusters and plans to achieve such a repartitioning. The online tooling takes such plans and execute them.

Repartitioning

Please see documentation below about the tools voldemort.tools.RepartitionerCLI and voldemort.tools.PartitionAnalysisCLI. The former embodies all the algorithms we have developed for repartitioning. The latter embodies the algorithms used to determine the balance of a specific cluster metadata (cluster.xml and stores.xml). Please also see the bin/rebalance-*sh tools described below for specific repartitioning use cases.

The key classes of interest for these offline tools are as follows:

voldemort.tools.RepartitionerCLI : Command line tool to repartition a cluster.
voldemort.tools.Repartitioner : Collection of techniques to repartition a cluster.
voldemort.tools.PartitionAnalysisCLI : Command line tool to analyze balance of a cluster.
voldemort.tools.PartitionBalance : Analyzes a cluster to determine balance.

Planning

Planning involves determining which partition-stores have to move where to achieve a specified repartitioning (final-cluster.xml). When some node needs to steal some partition-stores, there are often many donor nodes from which it could steal. The design principles we chose for deciding which donor node to steal from are as follows:

Only steal if you do not already host the necessary partition-stores.
If possible, steal from a donor in the same zone. If not possible, steal from a donor in the same zone as the primary partition for the partition-store.
If you must steal, steal from the "same" zone n-ary, i.e. a stealer-node that is a zone 2-ary in the final-cluster.xml should steal from the donor-node that is the zone 2-ary in the "old" cluster.xml

These design principles ensure that determination of donor-node is deterministic and minimizes data movement across zones. These principles also match up with the logic for proxy-bridges, i.e. proxy-bridges align with stealer-donor pairs.

The key classes of interest for this offline tool are as follows:

voldemort.tools.RebalancePlanCLI : Command line tool to generate a plan.
voldemort.client.rebalance.RebalancePlan : Constructs rebalancing plan comprised of batches.
voldemort.client.rebalance.RebalanceBatchPlan : Constructs rebalancing batch comprised of tasks.
voldemort.client.rebalance.RebalanceTaskInfo : Structure that tracks which partition-stores must be cloned from donor-node to stealer-node.

Historically, a plan was broken into batches whose size is limited to some number of primary partition moves. The rebalance tooling still supports batches, but LinkedIn tends to do single batch rebalances. The rationale for single-batch plans is that each batch introduces some additional overhead (extraneous data movement) because of the nature of the partition ring in Voldemort.

Execution

Rebalancing is an administrative operation that affects every server in the cluster. Because all servers are involved in rebalancing, clients interacting with the cluster are also affected. We only describe high-level design issues in this documentation; please see the code for details.

Client impact

Voldemort clients currently bootstrap from the bootstrap URL at client start time and use the returned cluster/stores metadata for all subsequent operations. Rebalancing results in a cluster metadata change and so we need a mechanism to tell clients that they should re-bootstrap if they have old metadata.

For client-side routing re-bootstrapping, there are two possibilities. First, clients may have ZenStoreClient enabled which checks the metadata version on every operation and so will trigger a re-bootstrap as soon as the new cluster metadata is deployed for a rebalance. Second, for historic clients, the client will use the old metadata during rebalancing. A server throws an InvalidMetadataException if it receives a request for a partition which does not belong to it. On receiving this exception, the client re-bootstraps from the bootstrap URL and will pick up the correct cluster metadata. This latter approach only leads to clients re-bootstrapping once they issue a request to a partition that has moved. The proxy pause provides a time window during which clients re-bootstrap before any server-side rebalancing work is performed.

Classes of interest

The key classes of interest for executing a rebalance are as follows:

Client-side control of rebalancing.
- voldemort.tools.RebalanceControllerCLI : Command line tool to initiate rebalancing.
- voldemort.client.rebalance.RebalanceController : Constructs plan and then schedules batches to complete the plan.
- voldemort.client.rebalance.RebalanceScheduler : Schedules rebalancing tasks from current rebalancing batch.
- voldemort.client.rebalance.task.RebalanceTask : Clones partition-stores from donor-node to stealer-node.
- voldemort.client.rebalance.task.StealerBasedRebalanceTask : Manages execution of specific rebalance task on stealer-node.
Rebalance plan and data structures
- voldemort.client.rebalance.RebalancePlan : Constructs rebalancing plan comprised of batches.
- voldemort.client.rebalance.RebalanceBatchPlan : Constructs rebalancing batch comprised of tasks.
- voldemort.client.rebalance.RebalanceTaskInfo : Structure that tracks which partition-stores must be cloned from donor-node to stealer-node.
Server-side control of rebalancing
- voldemort.server.rebalance.Rebalancer : Initiates server-side rebalancing work and handles server-level abort/rollback.
- voldemort.server.rebalance.async.RebalanceAsyncOperation : Parent class for server-side rebalancing operations.
- voldemort.server.rebalance.async.StealerBasedRebalanceAsyncOperation : Stealer-based server-side rebalancing operation.
- voldemort.store.metadata.MetadataStore : Rebalancing state is kept in this store.

Client-side

At a very high level, the controller does the following:

Fetch current cluster XML and stores XML from the cluster.
Compute the Plan based on the specified final cluster XML.
Verify state of cluster is ready for rebalancing.
Set the cluster state to rebalancing
Execute the plan, batch by batch.
Upon completion of all batches, set the cluster state to normal.

Again, at a high level, to execute a batch of a plan, the controller does the following:

Schedule rebalance tasks on stealer nodes until specified parallelism is achieved (or until no more such tasks can be scheduled).
Upon completion of a rebalance task, schedule more rebalance tasks.
Upon completion of all rebalance tasks, return to controller so the next batch, if any, can be started.
Upon failure of any rebalance task, abort rebalance and roll back cluster metadata.

Server-side

At a very high-level, servers do the following:

Upon receiving new cluster xml and state change to rebalance, start proxying gets and puts as needed.
Upon receiving rebalance task, act as stealer-node and fetch necessary partition-stores from donor-node.
Upon receiving fetch request, act as donor-node and stream requested partition-stores store-by-store (with some parallelism permitted among stores).
Upon receiving abort, roll back metadata locally, set state to normal, and stop proxying.
Upon receiving state change to normal, set state locally and stop proxying.

Read-write vs read-only stores

The execution of rebalance is fundamentally different for read-write stores than for read-only stores. This is necessary because the on-disk formats are different and, in particular, the on-disk format for read-only stores does not permit updates in place. The rebalance implementation first rebalances read-only stores and then read-write stores. Note though that LinkedIn does not have any clusters that host both read-write and read-only stores so we are not positive the code path that rebalances both read-only and read-write stores is truly production quality.

For read-write stores, new cluster metadata is deployed and then data is cloned to the necessary nodes in the "new" cluster xml. During the execution of rebalance, proxying ensures that all data ("old" and "new" cluster xml) is kept in sync.

For read-only stores, data is cloned to the "new" cluster xml and then cluster metadata is deployed. The rationale for this is that read-only stores have one file per partition on disk and can only operate on whole files. It is therefore necessary to completely clone a partition before allowing access to it. During the execution of rebalance, new data cannot be pushed into a read-only store and so proxying is unnecessary.

Proxying

During rebalance, servers establish Proxy Bridges with other servers. A Proxy-Get is performed if a server now responsible for data has not yet cloned the data locally. Once the server has cloned the data locally, it does not issue additional proxy-get operations. A Proxy-Put is performed for all puts until rebalance completes. This ensures that the rebalance can be aborted.

Proxy relationships are kept within a zone. A stealer-node choses a donor-node with which to proxy from within the zone based on zone n-aries in the "old" and "new" cluster.xml. The donor-node is the node, in the same zone as the stealer-node, that hosts the same Zone N-Ary in the "old" cluster.xml as the Stealer-Node hosts in the "new" final-cluster.xml. A Proxy-Put is issued only if the "old" partition-store will be orphaned when the Batch completes; if it will be retained, then it is directly receiving put operations during rebalance anyhow. For the zone-expansion use case, nodes in the new zone do not do Proxy-Put operations; Proxy-Get operations may be performed and will incur cross-zone latencies.

Before any rebalance batches begin, there is a proxy pause during which the "new" final-cluster.xml (and, if necessary, final-stores.xml) is deployed to the cluster. The rebalancing state is also set on the servers. During this period, proxy bridges are established, but no rebalance tasks are actually scheduled. This period allows for a baseline performance cost of proxying to be established before servers begin additional work on rebalancing tasks.

To perform proxy-put and proxy-get operations, a server effectively acts as a client to other servers. This means that it creates a store client and failure detector, that it establishes connections with other servers, and so on. The configuration of the store client and of the failure detector is controlled by the VoldemortConfig.

At this time, delete operations are not proxied. Delete operations do not interact well with eventual consistency in the current implementation and so were excluded from being proxied.

Aborting rebalances

Proxy-put operations ensure that a rebalance can safely be aborted. At any time during a rebalance batch, "old" and "new" partition-stores are up-to-date and so rebalance can be aborted without data loss.

If rebalance tasks fail on the server-side, the controller safely aborts the rebalance and rolls back the metadata (cluster.xml and, if necessary, stores.xml) to the beginning of the batch. If the rebalance controller hangs, or is otherwise interrupted, then the rebalance may have to be aborted by hand. See notes below on the run book for aborting a rebalance.

Historical context

For a while, there were two rebalancing implementations: stealer-based and donor-based. Stealer-based was the original type of rebalancing. In stealer-based, the node receiving data drives the process on the server side. Donor-based was added to mitigate the issue that stealer-based rebalancing could result in multiple full scans of all stores on a node. In donor-based, each server did a single full scan and sent pertinent partition-stores to the appropriate stealers. Since the addition of storage engines that support partition scans (see StorageEngine.isPartitionScanSupported()), stealer-based rebalancing is as, or more, efficient than donor-based rebalancing. Voldemort open source release 1.3 was the last release that included donor-based rebalancing. If donor-based ever needs to be resurrected, then it needs to be changed to use zone n-ary logic (as opposed to the historic "ReplicaType" and "RebalancePartitionsInfo" types).

Rebalancing tools

This is a summary of each of the tools used in rebalancing.

voldemort.tools.PartitionAnalysisCLI

Analyzes the partition layout of a cluster to determine its balance:

voldemort.tools.PartitionAnalysisCLI --cluster <current-cluster> --stores <current-stores>

The output of this tool is fairly verbose:

Nodes in each zone are listed.
Number of primary partitions per node are listed.
Partitions in each zone are listed.
Histograms of "contiguous partition run lengths in each zone" are listed. This is useful for understanding the balance of zoned clusters. In particular, if many primary partition IDs in a row are stored in one zone, these histograms will show how long such runs are.
Partition dump for each store. For each node, the partitions hosted as zone primary ("0"), zone secondary ("1"), zone tertiary ("2") are listed. These partitions are referred to collectively as the zone n-aries.
The "aggregate primary-partition count" is listed. This value is aggregated across all stores for each node. The value is the number of primary partitions on a node multiplied by the number of stores on the cluster.
The "aggregate zone-primary partition count" is listed. This value is aggregated across all stores for each node. The value is the number of zone primary partitions on the node multiplied by the number of stores on the cluster. This is a measure of how well balanced get operations will be in failure-free periods since get operations are in a zone are served by the zone primary. Note that every zone has a "primary", thus the use of the term zone primary.
The "aggregate nary-partition count" is listed. This value is aggregated across all stores for each node. The value is the number of partition-stores hosted on the node. A partition-store is a specific zone n-ary partition for a specific store. This is a measure of how well balanced put operations will be in failure-free periods since each zone n-ary ought to (eventually) receive a replica. Note, this is therefore also a measure of how well balanced disk utilization should be across all nodes.
The utility value is also listed. The lower the utility value, the better. Given current constants, the "optimal" utility value is three times the number of zones in the cluster. So, a zoned cluster with two zones has an optimal utility value of 6 whereas a non-zoned cluster has an optimal utility value of 3.

voldemort.tools.RepartitionerCLI

Repartitions a cluster to achieve better balance. For the shuffle use case, a cluster is repartitioned in-place. For the cluster expansion use case, primary partitions are moved onto new nodes and, optionally, additional partitions are swapped around to achieve better balance. For the zone expansion use case, primary partitions are moved into the new zone from existing zone(s) and, optionally, additional partitions are swapped around to achieve better balance.

The cluster output by this tool can be used as input to voldemort.tools.RebalanceControllerCLI to actually rebalance the cluster. Please see the notes below on the voldemort.tools.RebalanceControllerCLI tool.

There are two ways in which partitions are "swapped around": random and greedy. Random swapping identifies random pairs of partitions to swap; if the swap improves the utility value then the swap is kept, otherwise it is not. Greedy swapping attempts many distinct random swaps and selects the one best such swap, i.e. the one swap that most improved the utility value. These techniques are variants of "hill climbing" algorithms or "simulated annealing".

Examples of usages of the tool are as follows:

Random shuffle:

voldemort.tools.RepartitionerCLI --current-cluster <current-cluster> --current-stores <current_stores> --output-dir <output_dir> --enable-random-swaps

Greedy shuffle:

voldemort.tools.RepartitionerCLI --current-cluster <current-cluster> --current-stores <current_stores> --output-dir <output_dir> --enable-greedy-swaps

Cluster expansion:

voldemort.tools.RepartitionerCLI --current-cluster <current-cluster> --current-stores <current_stores>  --interim-cluster <interim-cluster> --output-dir <output_dir>

Zone expansion:

voldemort.tools.RepartitionerCLI --current-cluster <current-cluster> --current-stores <current_stores>  --interim-cluster <interim-cluster> --final-stores  <final_stores>  --output-dir <output_dir>

For cluster expansion, an interim-cluster must be provided which includes the new nodes without any partition IDs assigned to them. For zone expansion, both an interim-cluster and final-stores must be provided. The latter is necessary because zone replication factors must be updated in the stores.xml file.

voldemort.tools.RebalancePlanCLI

This tool generates the plan for moving from the current cluster.xml to some specified final.xml. A plan is an unordered list of partition-stores that must move from a donor node to a stealer node.

There are two common usages for this tool:

voldemort.tools.RebalancePlanCLI --current-cluster  <current-cluster> --current-stores <current_stores> --final-cluster <final-cluster> --output-dir <output_dir>

and

voldemort.tools.RebalancePlanCLI --current-cluster  <current-cluster> --current-stores <current_stores> --final-cluster <final-cluster> --final-stores <final_stores> --output-dir <output_dir>

The difference lies in the inclusion of argument "--final-stores <final_stores>". or shuffle and cluster expansion use cases, the final stores argument is not needed.

For zone expansion, the final stores argument is needed since storage definitions change when the number of zones changes. The following snippets of storage definitions illustrate how they need to change for zone expansion.

Storage definition for two zones:

...
<replication-factor>4</replication-factor>
<zone-replication-factor>
  <replication-factor zone-id="0">2</replication-factor>
  <replication-factor zone-id="1">2</replication-factor>
</zone-replication-factor>
...

Storage definition change for three zones (zone expansion use case):

...
<replication-factor>6</replication-factor>
<zone-replication-factor>
  <replication-factor zone-id="0">2</replication-factor>
  <replication-factor zone-id="1">2</replication-factor>
  <replication-factor zone-id="2">2</replication-factor>
</zone-replication-factor>
...

There is an optional batch-size parameter for this tool. Current best practice is to use the default value of "infinite" (Long.MAX_VALUE). Breaking plans into batches tends to result in more work and slower rebalances.

There are three metrics listed in the plan that are of interest:

"invalid metadata rates per zone" : The likelihood that a random client put/get will yield an invalid metadata exception. This is important for default store clients (as opposed to Zen store clients).
"Total number of rebalance tasks" : The number of primary partition IDs moved to different nodes. Historically, this number was used to understand the size of plans. The following metric is more informative.
"Total number of partition-store moves" : The number of partition-stores moved between nodes. This is indicative of the size of the rebalance. Unfortunately, this measure is a function of total number of partition IDs, number of stores, and replication factor of each store. This makes the measure useful for comparing different plans for a specific cluster at a specific time. This measure is not useful for comparing plans for different clusters, or for one cluster that has had many stores added/removed from it.
"Max per-node storage overhead" : The maximum fractional storage overhead of the plan, e.g. 1.4X means that one server will have 40% more data on disk after rebalancing until the repair job is run. This estimate is conservative.

voldemort.tools.RebalanceControllerCLI

This tool actually executes the rebalance. Example usage:

bin/run-class.sh voldemort.tools.RebalanceControllerCLI --url $URL  --final-cluster final-cluster.xml --parallelism 8 --proxy-pause 900

The URL is the bootstrap URL for the cluster. The final-cluster.xml is the final, repartitioned cluster. A successful rebalance will move the cluster to this specified state. The parallelism limits the number of rebalance tasks that can execute at one time. A single rebalance task moves data from one donor node to one stealer node. The rebalance controller ensures that no one node participates in more than one rebalance task at a time, i.e. during rebalance, every node is executing at most one rebalance task at a time. The proxy pause parameter is the number of seconds to wait between rolling cluster metadata forward to the specified final-cluster.xml and beginning to clone data. This period allows performance to be monitored while put and get operations are "proxied" without the servers doing any additional work cloning data.

Killing this tool during rebalance will not abort the rebalance cleanly. If this tool crashes, then the rebalance may have to be manually aborted. (See below.)

Rebalancing utilities

This section describes additional programs and scripts that assist with planning a rebalance and confirming a rebalance was successful.

Rebalance scripts

For each rebalancing use case (shuffle, cluster expansion, and zone expansion), there is a script that embodies a suggested best practice for repartitioning. These scripts can be chained together (where it makes sense). For example, the rebalance-cluster-expansion script could be used followed by the rebalance-shuffle script to improve the balance of the expanded cluster. These scripts help with planning a rebalance because in addition to repartitioning to improve balance, the rebalance plan is output which shows how "big" the execution of rebalancing will be. Note that when chained, the plans output by these scripts only describe the cost of each specific change so you should generate a single overall plan to determine overall cost.

bin/generate_cluster_xml.py

At a high level there is a python script (bin/generate_cluster_xml.py) that can generate cluster.xml for both zoned and non zoned clusters. This section describes that script and its usage.

For non-zoned clusters, try the following usage:

python generate_cluster_xml.py --file <file with host names, one host per line>  --name <name of the cluster> --nodes <number of nodes> --partitions <number of partitions> --sock-port <port no> --admin-port <port no> --http-port <port no> --current-stores <current_stores.xml> --output-dir <output directory>

The newly generated cluster.xml file is placed in the output directory.

For zoned clusters, try the following usage which includes passing "--zones <num of zones>":

python generate_cluster_xml.py --file <file with host names, one host per line>  --name <name of the cluster> --nodes <number of nodes> --partitions <number of partitions> --sock-port <port no> --admin-port <port no> --http-port <port no> --current-stores <current_stores.xml> --output-dir <output directory> --zones <number of zones>

For the zoned cluster use case, the python script internally calls (bin/rebalance-new-cluster.sh) as a last step which is explained below.

bin/rebalance-new-cluster.sh

Example usage:

bin/rebalance-new-cluster.sh -c current_cluster -s current_stores -o output_dir

Note that it is unnecessary to use this utility on its own since bin/generate_cluster_xml.py invokes this tool for zoned clusters.

rebalance-new-cluster.sh works in three steps:

Step 1: RepartitionerCLI is executed to limit the maximum contiguous partition. Four such runs are attempted and the best cluster.xml is passed to next step.
Step 2: RepartitionerCLI is invoked with random swap attempts to better balance the cluster.
Step 3: Same as step 1 to ensure no contiguous runs were introduced during repartitioning.

The rationale behind step 1 is to break long running contiguous partitions to help the repartitioner achieve a better balanced cluster in step 2. Smaller runs of contiguous partitions help ensure that during random swapping in step 2, a swap attempt would generally lead to an improvement in the balance. An additional run in step 3 (same as step 1) makes sure that long runs of contiguous partitions do not exist in the final cluster.xml, as some new contiguous runs might have been created during step 2.

bin/rebalance-shuffle.sh

There 2 variations on how this script can be run:

bin/rebalance-shuffle.sh -c current_cluster -s current_stores -o output_dir

and

bin/rebalance-shuffle.sh -c current_cluster -s current_stores -o output_dir -m max_contiguous_partitions

The former is the common usage, whereas the latter is only used for zoned clusters that are known to have long runs of contiguous partitions in some zones. Note that some historic tooling could have resulted in such contiguous runs.

The script shuffles the partitions around based on the current_cluster xml to achieve a more balanced cluster. The xml file corresponding to the new cluster is then written to output_dir. The script also generates a plan file "plan.out" and places it in the output_dir. Note that this script can be used again on its own output. That is, if your first attempt only gets you half-way to where you want to go in terms of repartitioning, then you can take the output final-cluster xml and use it as input to this tool again.

When invoked with the "-m" option the script first splits runs of contiguous partition that are greater than the maximum number of contiguous partitions. Generally speaking, a more balanced cluster is achievable when the runs of contiguous partitions are smaller in value (2 or 3) than when its higher (7 or 8). However, using "-m" option can create a more expensive plan so this option must be used carefully.

This script shuffles a current cluster.xml (to achieve a more balanced cluster) and then outputs a final-cluster.xml and a plan corresponding to it. The utility function sometimes hits local maxima/minima that it cannot get around and hence we use a relatively higher value of overall attempts (=10), and lower number of random swap attempts (=250).

bin/rebalance-cluster-expansion.sh

The script is used when new nodes are added to an existing zone (or, existing zones) in an existing cluster. The interim_cluster.xml file must include the new nodes in the cluster with each new nodes having an empty partition list. This script will fill out that partition list by 'stealing' partitions from other nodes. Script usage is as follows:

bin/rebalance-cluster-expansion.sh -c current_cluster -s current_stores -i interim_cluster -o output dir

The final-cluster.xml and the plan is saved inside the output_dir.

This script steals partitions from other nodes and assigns them to the new nodes in the expanded cluster. Note that while stealing partitions, the script doesn't make an effort to balance the cluster. This script has a high value of overall attempts to make sure that we aren't caught into a local minimum. However, because the repartitioner is invoked multiple distinct times, the most balanced such expansion is chosen. Chaining bin/rebalance-shuffle.sh after this script to further balance the cluster is the expected usage.

bin/rebalance-zone-expansion.sh

This script handles the use case of adding a new zone to an existing zoned cluster. The script can be executed as shown below where interim_cluster is the cluster.xml with new nodes and zone information but with an empty partition list. The final-cluster.xml (with partitions list populated) and the plan is written to output_dir. Note that the final_stores.xml must be passed in since storage definitions for zoned clusters change when a zone is added. Script usage is as follows:

bin/rebalance-zone-expansion.sh -c current_cluster -s current_stores -i interim_cluster -f final_stores -o output_dir

This script works as follows:

RepartitionerCLI is invoked in default mode. The tool makes sure that all zones have the same number of primary partitions assigned to them, and then within each zone, makes sure that each node has the same number of primary partitions assigned to them. Of course, by "same number" we mean plus or minus one. At the end of this step, all of the nodes in the new zone are populated with primary partition IDs that have been "stolen" from the existing zones.
RepartitionerCLI is invoked again, but this time with --enable-random-swaps option. Random swaps are only enabled in the new zone. Randomly swapping only within the new zone is done to attempt to minimize the overall plan size necessary to realize the repartitioning.
The final step simply produces an overall plan based on the repartitioning done in the above two steps.

Rebalancing utilities

The following utilities can be used as part of the rebalancing run book.

bin/voldemort-admin-tool.sh

The Voldemort admin tool offers many features that are useful during rebalancing.

Repair job

The repair job deletes keys (partition-stores) that have been orphaned by a successful rebalance:

bin/voldemort-admin-tool.sh --repair-job --node '-1' --url $URL

Passing '-1' for node means all nodes in cluster will run the repair job. The URL is the bootstrap URL of the cluster. An orphaned key is one that is stored on a server but belongs to a partition for which the server is not (no longer) responsible. The repair job is not well-named.

Metadata management

The Voldemort admin tool can be used to fetch current metadata:

bin/voldemort-admin-tool.sh --get-metadata cluster.xml --node 0 --url $URL

bin/voldemort-admin-tool.sh --get-metadata stores.xml --node 0 --url $URL

The former fetches current cluster.xml and the latter fetches current stores.xml. These metadata files should be used as input to repartitioning and planning.

The Voldemort admin tool can also be used to set metadata:

bin/voldemort-admin-tool.sh --url $URL --set-metadata cluster.xml --set-metadata-value $OUTDIR/interim-cluster.xml

bin/voldemort-admin-tool.sh --url $URL --set-metadata stores.xml --set-metadata-value $OUTDIR/final-stores.xml

The former sets cluster metadata, the latter sets store metadata. This feature can be used to prepare for a cluster expansion by setting the cluster to include new nodes with no partitions. This feature can be used to prepare for a zone expansion by setting the cluster to include new nodes with no partitions and setting stores to comply with more zones. This feature can also be used to manually roll back metadata if a rebalance goes awry.

Rebalancing state

The Voldemort admin tool can be used to clear rebalancing state:

bin/voldemort-admin-tool.sh --clear-rebalancing-metadata --url $URL

This feature can be used to manually clear rebalancing state if a rebalance goes awry.

Asynchronous task management

The Voldemort admin tool can be used to enumerate asynchronous tasks executing on a cluster, and then forcibly kill such asynchronous tasks:

bin/voldemort-admin-tool.sh --async get --url $URL

and, then

bin/voldemort-admin-tool.sh --async stop --url $URL --async-id <id,id,...> --node <node>

This feature is necessary since rebalance may fail and leave asynchronous tasks running on the servers. In fact, rebalance tends to fail if there are other asynchronous tasks (e.g., from the retention job) running on the cluster.

voldemort.utils.KeySamplerCLI

This tool selects some number of keys for each store for each partition. Recommended usage is as follows:

bin/run-class.sh voldemort.utils.KeySamplerCLI --url $BOOTSTRAP_URL --out-dir key-samples --records-per-partition 5

The BOOTSTRAP_URL is the bootstrap URL of the cluster. For each store, a file of keys will be dumped into output directory key-samples. 5 keys will be selected from each partition for each store.

voldemort.utils.KeyVersionFetcherCLI

This tool consumes the output of the voldemort.utils.KeySamplerCLI to fetch the version of each specified key from all servers that ought to host the key. Recommended usage is as follows:

bin/run-class.sh voldemort.utils.KeyVersionFetcherCLI --url $URL --in-dir key-samples --out-dir key-version-fetches

The BOOTSTRAP_URL is the bootstrap URL of the cluster. The output directory of voldemort.utils.KeySamplerCLI is provided as input directory. (Any directory in which there are files with the same names as stores in the cluster and that contain keys can be used.) For each store, the file of keys is processed and this tool fetches the version from each server that ought to store the key. The version is output into a file in the output directory.

We recommend using this tool before and after rebalancing a cluster. Then, the post-rebalancing results can be compared to the pre-rebalancing results. This comparison is a bit of an art since put requests during the rebalance, or delete requests, can make the post-rebalancing output differ from the pre-rebalancing output. For the zone expansion use case, this is especially true since more servers are expected to host a given key after zone expansion.

Rebalance run book

This is a simplified version of the run book LinkedIn uses to rebalance Voldemort clusters.

Tune server configs

We do not have specific recommendations for how to tune servers for rebalancing. There is a trade off between how quickly rebalance completes and how much it (may) affect foreground performance. That said, we thought about the following parameters in the server.properties file (read by VoldemortConfig) when tuning for rebalancing:

# No specific recommendations for these parameters, but each may
# affect rebalancing performance:
admin.max.threads=
client.connection.timeout.ms=
client.max.connections.per.node=
client.routing.timeout.ms=
max.parallel.stores.rebalancing=
max.proxy.put.threads=
nio.connector.selectors=
stream.read.byte.per.sec=
stream.write.byte.per.sec=

# if your storage engine supports isPartitionScanSupported() then the
# following setting is the most important one for improving
# performance of rebalance:
use.partition.scan.for.rebalance=true

Preparation

Get current metadata from the cluster, i.e. use bin/voldemort-admin-tool.sh to retrieve current cluster.xml and stores.xml.
Generate new final-cluster.xml, and if necessary, stores.xml. Use the appropriate rebalance script for your use case.
Given the repartitioned final-cluster.xml, review the plan for the rebalance. If the plan is "too big" (this is necessarily a judgment call), try to generate a repartitioning that yields a smaller plan (possibly by being less balanced).

Execution

For the cluster expansion and zone expansion use case, use bin/voldemort-admin-tool.sh to set interim metadata, i.e. set the cluster xml so that new nodes with empty partitions are present in the cluster. For zone expansion, set the stores XML to be aware of the new zone.
Gather a sample set of keys with the voldemort.utils.KeySamplerCLI.
Gather reference set of key-versions for sampled keys with the voldemort.utils.KeyVersionFetcherCLI.
Defer the retention job and bounce cluster. (Asynchronous tasks such as the retention job can cause rebalancing to abort.)
Run the rebalance using voldemort.tools.RebalanceControllerCLI.

Verification

Gather comparison set of key-versions for sampled keys with the voldemort.utils.KeyVersionFetcherCLI. Compare these key-versions to those fetched before the rebalance to confirm that rebalance completed successfully.
Run the repair job via the bin/voldemort-admin-tool.sh to reclaim disk space consumed by orphaned keys.
Reset retention job and bounce the cluster.

Roll-back failed rebalance

If rebalance fails, then manually roll-back everything.

Reset cluster metadata using bin/voldemort-admin-tool.sh. For zone expansion, the stores.xml metadata must be rolled back before the cluster.xml metadata.
Clear the rebalancing state using bin/voldemort-admin-tool.sh.
Reset retention job and bounce the cluster, or retry the rebalance!