Configuration

There are three configuration files that control server operation:

 - cluster.xml: describes all the nodes in the cluster, their hostnames and ports, and the data partitions each one owns
 - stores.xml: describes the stores in the cluster, including their replication factors, required reads and writes, persistence engines, and key/value serialization
 - server.properties: holds the per-node settings, such as the node id and tuning parameters for the local persistence engine

Finally, there is an environment variable, VOLDEMORT_HOME, that controls the directory in which data and configuration reside. You can see an example of how the configuration is laid out in the config/ subdirectory of the project, which includes sample configurations that you can modify with your own specifics.

Cluster configuration

Here is an example cluster.xml for a 2-node cluster with 8 data partitions. There are also optional 'zone' fields that let you map nodes to logical groupings (datacenter, rack, etc.) called zones:

 
  <cluster>
    <!-- The name is just to help users identify this cluster from the gui -->
    <name>mycluster</name>
    <zone>
      <zone-id>0</zone-id>
      <proximity-list>1</proximity-list>
    </zone>
    <zone>
      <zone-id>1</zone-id>
      <proximity-list>0</proximity-list>
    </zone>
    <server>
      <!-- The node id is a unique, sequential id beginning with 0 that identifies each server in the cluster-->
      <id>0</id>
      <host>vldmt1.prod.linkedin.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <!-- A list of data partitions assigned to this server -->
      <partitions>0,1,2,3</partitions>
      <zone-id>0</zone-id>
    </server>
    <server>
      <id>1</id>
      <host>vldmt2.prod.linkedin.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <partitions>4,5,6,7</partitions>
      <zone-id>1</zone-id>
    </server>
  </cluster>

One important thing to understand is that partitions are not static partitions of the servers; rather, they are a mechanism for partitioning the key space such that each key is statically mapped to a particular data partition. This means a particular cluster may support many stores, each with a different replication factor; the replication factor is not hardcoded into the cluster design. This is important, since some data is more important than other data, and the correct trade-off between performance and consistency for one store may be different from another store.
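To make the key-to-node mapping concrete, here is a minimal sketch of this style of routing. It is illustrative only, not Voldemort's actual RoutingStrategy implementation; the hash function and class name are assumptions made for the example:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.Set;

  // Illustrative sketch only: a key always hashes to the same partition;
  // the store's replication factor then decides how many distinct nodes
  // hold a copy, without changing the cluster layout itself.
  public class RoutingSketch {

      // partitionToNode[p] = id of the node that owns partition p (from cluster.xml)
      private final int[] partitionToNode;

      public RoutingSketch(int[] partitionToNode) {
          this.partitionToNode = partitionToNode;
      }

      public int masterPartition(byte[] key) {
          return (Arrays.hashCode(key) & Integer.MAX_VALUE) % partitionToNode.length;
      }

      // Walk the partition ring from the master partition and collect the
      // first replicationFactor distinct nodes: the key's preference list.
      public List<Integer> preferenceList(byte[] key, int replicationFactor) {
          Set<Integer> nodes = new LinkedHashSet<>();
          int start = masterPartition(key);
          for (int i = 0; i < partitionToNode.length && nodes.size() < replicationFactor; i++) {
              nodes.add(partitionToNode[(start + i) % partitionToNode.length]);
          }
          return new ArrayList<>(nodes);
      }
  }

In this sketch, with the 2-node cluster above, a key whose master partition is 5 and a store with replication-factor 2 would get the preference list [node 1, node 0]: partitions 6 and 7 belong to the same node 1, so the walk continues around the ring to partition 0 on node 0.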

Another important point to remember is that the number of data partitions cannot be changed. We do support an online redistribution (rebalancing) of partitions: adding new nodes moves ownership of partitions, but the total number of partitions always remains the same, as does the mapping of key to partition. This means it is important to choose a good number of partitions to start with. The script here will generate this part of the config for you.
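As a rough illustration of what such generation does (this is a sketch of the idea, not the project's actual script), the partition ids can simply be shuffled and dealt out round-robin across the nodes:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.Random;

  // Sketch: randomly spread P partitions across N nodes and print the
  // <partitions> element for each <server> entry in cluster.xml.
  public class PartitionAssigner {
      public static void main(String[] args) {
          int numNodes = 2;        // number of servers in the cluster
          int numPartitions = 8;   // fixed for the life of the cluster

          List<Integer> ids = new ArrayList<>();
          for (int p = 0; p < numPartitions; p++)
              ids.add(p);
          Collections.shuffle(ids, new Random());

          List<List<Integer>> assignment = new ArrayList<>();
          for (int n = 0; n < numNodes; n++)
              assignment.add(new ArrayList<>());
          for (int i = 0; i < ids.size(); i++)
              assignment.get(i % numNodes).add(ids.get(i));

          for (int n = 0; n < numNodes; n++) {
              Collections.sort(assignment.get(n));
              String csv = assignment.get(n).toString().replaceAll("[\\[\\] ]", "");
              System.out.println("node " + n + ": <partitions>" + csv + "</partitions>");
          }
      }
  }

The printed lists can then be pasted into the <partitions> element of each <server> block.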

Note that the configuration currently lives in simple files, so it is important that the data in cluster.xml and stores.xml be exactly the same on each server, and that the node ids and partitions not be changed, since a change could mean clients think their data should be on node X when it was really stored on node Y. This limitation will be removed as the configuration is moved into Voldemort itself.

Store configuration

Here is an example stores.xml for a store named 'test' that requires only a single read and write and uses bdb for persistence:

  <stores>
      <store>
	  <name>test</name>
	  <replication-factor>2</replication-factor>
	  <preferred-reads>2</preferred-reads>
	  <required-reads>1</required-reads>
	  <preferred-writes>2</preferred-writes>
	  <required-writes>1</required-writes>
	  <persistence>bdb</persistence>
	  <routing>client</routing>
	  <routing-strategy>consistent-routing</routing-strategy>
	  <key-serializer>
	      <type>string</type>
	      <schema-info>utf8</schema-info>
	  </key-serializer>
	  <value-serializer>
	      <type>json</type>
	      <schema-info version="1">[{"id":"int32", "name":"string"}]</schema-info>
	      <compression>
		  <type>gzip</type>
	      </compression>
	  </value-serializer>
      </store>
  </stores>
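For reference, a value conforming to the JSON schema above is a list of records, each with an int32 "id" and a string "name". The sketch below shows how such a value might be written and read with the Java client (constructed as shown under Client configuration later on); the key and field values are made up for illustration:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  import voldemort.client.StoreClient;

  // Sketch: a value matching the schema [{"id":"int32", "name":"string"}]
  // is a list of maps; the key is a plain UTF-8 string.
  public class TestStoreExample {
      public static void writeAndRead(StoreClient<String, List<Map<String, Object>>> client) {
          Map<String, Object> record = new HashMap<>();
          record.put("id", 42);
          record.put("name", "jay");

          List<Map<String, Object>> value = new ArrayList<>();
          value.add(record);

          client.put("some_key", value);   // succeeds once required-writes (1) replicas ack
          List<Map<String, Object>> read = client.getValue("some_key"); // needs required-reads (1)
          System.out.println(read);
      }
  }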

Each of these parameters deserves a quick discussion:

 - name: the name of the store; this is the "table name" clients use to reference it
 - replication-factor: the total number of copies of each key-value pair that are stored (N)
 - preferred-reads (optional): the number of successful reads the client tries to wait for before returning
 - required-reads: the minimum number of reads that must succeed without the client getting an exception (R)
 - preferred-writes (optional): the number of successful writes the client tries to wait for before returning
 - required-writes: the minimum number of writes that must succeed without the client getting an exception (W)
 - persistence: the storage engine backing the store (e.g. bdb or mysql)
 - routing: whether routing is performed on the client or the server side
 - routing-strategy: how replicas are chosen for a key; consistent-routing is the default, zone-routing is described below
 - key-serializer / value-serializer: how keys and values are serialized, with an optional schema-info giving the schema and an optional compression block (e.g. gzip)

If you intend to use the zone-routing strategy, you need to extend the store definition to tell it how to replicate with respect to zones. Here is an example of a store definition with zone-routing enabled:

  <stores>
      <store>
	  <name>test</name>
	  ...
	  <routing-strategy>zone-routing</routing-strategy>
	  <!-- This number should be the sum of the individual zone replication factors -->
	  <replication-factor>2</replication-factor>
	  <zone-replication-factor>
	      <replication-factor zone-id="0">1</replication-factor>
	      <replication-factor zone-id="1">1</replication-factor>
	  </zone-replication-factor>
	  <zone-count-reads>0</zone-count-reads>
	  <zone-count-writes>0</zone-count-writes>
	  <hinted-handoff-strategy>proximity-handoff</hinted-handoff-strategy>
	  ...
      </store>
  </stores>

The important change here is the introduction of zone-replication-factor, which lists the number of replicas you want in each zone. The other zone-related parameters are:

 - zone-count-reads: the number of zones, in addition to the local one, whose responses we block for on reads
 - zone-count-writes: the same, for writes
 - hinted-handoff-strategy: how nodes are picked for hinted handoff; proximity-handoff uses the zone proximity list

Server configuration

We store per-node configuration in the server.properties file. Most of the properties have sane defaults (hopefully). The bare minimal file should have the following property:

  # The ID of *this* particular cluster node (different for each node in cluster)
  node.id=0

You can find the full list of supported server-side configuration options here.

BDB Management

The underlying key-value store also matters for configuration and operational management. If BDB is used, all configuration is done through the server.properties file. If MySQL is used, the usual MySQL administration applies.

Oracle has a writeup that gives a good overview of the operational side of BDB.

Client configuration

The above settings are all for the server. It is important to correctly configure the client as well; the full list of client configuration parameters can be found here.
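These parameters are set on the ClientConfig object when the client is constructed. As a minimal sketch (only the bootstrap URL is shown, pointing at the example node from cluster.xml; other parameters have corresponding setters), a client for the test store could be created like this:

  import voldemort.client.ClientConfig;
  import voldemort.client.SocketStoreClientFactory;
  import voldemort.client.StoreClient;
  import voldemort.client.StoreClientFactory;
  import voldemort.versioning.Versioned;

  // Minimal client bootstrap sketch: connect to one node's socket port and
  // obtain a store client for the "test" store defined in stores.xml.
  public class ClientSetup {
      public static void main(String[] args) {
          ClientConfig config = new ClientConfig()
              .setBootstrapUrls("tcp://vldmt1.prod.linkedin.com:6666");

          StoreClientFactory factory = new SocketStoreClientFactory(config);
          StoreClient<String, Object> client = factory.getStoreClient("test");

          client.put("some_key", "some_value");
          Versioned<Object> v = client.get("some_key");
          System.out.println(v == null ? null : v.getValue());
      }
  }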

Some additional suggestions

JVM Settings

At LinkedIn we maintain two sets of clusters, read-only and read-write. The read-write clusters use BDB stores and have totally different JVM characteristics from those using read-only stores. Here is what we use at LinkedIn for our read-write stores:

  # Min, max, total JVM size
  JVM_SIZE="-server -Xms32g -Xmx32g"

  # New Generation Sizes
  JVM_SIZE_NEW="-XX:NewSize=2048m -XX:MaxNewSize=2048m"

  # Type of Garbage Collector to use
  JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

  # Tuning options for the above garbage collector
  JVM_GC_OPTS="-XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2"

  # JVM GC activity logging settings
  JVM_GC_LOG="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$LOG_DIR/gc.log"

Note that you must use the concurrent mark-and-sweep collector, or else the GC pauses from collecting such a large heap will cause unresponsive periods (this doesn't happen at first; it creeps up and eventually goes into a spiral of GC pause death).

This is the setup on a 48GB RAM box with a BDB cache size of 20GB and one cleaner thread, on SSDs. You can find the entire configuration under config/prod_single_node_cluster. To bring a server up with these settings, use bin/voldemort-prod-server.sh.

For the read-only clusters we use the same JVM GC settings, except the heap size is set to a smaller value.

  # Min, max, total JVM size
  JVM_SIZE="-server -Xms4096m -Xmx4096m"

This is done because in the case of read-only stores we rely on the OS page cache and don't really want our JVM heap to take up space.
