Configuration

There are three configuration files that control server operation:

Finally there is an environment variable, VOLDEMORT_HOME, that controls the directory in which data and configuration reside. You can see an example of how the configuration is layed out in the config/ subdirectory of the project. This includes sample configurations that you can modify with your own specifics.

Cluster configuration

Here is an example cluster.xml for a 2-node cluster with 8 data partitions. We also have optional 'zone' fields which allow you to map nodes to certain logical clusters ( datacenter, rack, etc ) called zones:

 
  <cluster>
    <!-- The name is just to help users identify this cluster from the gui -->
    <name>mycluster</name>
    <zone>
      <zone-id>0</zone-id>
      <proximity-list>1</proximity-list>
    <zone>
    <zone>
      <zone-id>1</zone-id>
      <proximity-list>0</proximity-list>
    <zone>
    <server>
      <!-- The node id is a unique, sequential id beginning with 0 that identifies each server in the cluster-->
      <id>0</id>
      <host>vldmt1.prod.linkedin.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <!-- A list of data partitions assigned to this server -->
      <partitions>0,1,2,3</partitions>
      <zone-id>0</zone-id>
    </server>
    <server>
      <id>1</id>
      <host>vldmt2.prod.linkedin.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <partitions>4,5,6,7</partitions>
      <zone-id>1</zone-id>
    </server>
  </cluster>

One thing that is important to understand is that partitions are not static partitions of servers, but rather they are a mechanism for partitioning the key space in such a way that each key is statically mapped to a particular data partition. What this means is that a particular cluster may support many stores each with different replication factors—the replication factor is not hardcoded in the cluster design. This is important, since some data is more important than other data, and the correct trade-off between performance and consistency for one store may be different from another store.

Another important point to remember is that the number of data partitions cannot be changed. We do support an online redistribution ( rebalancing ) of partitions. In other words inclusion of new nodes results in moving ownership of partitions, but the total number of partitions will always remain the same, as will the mapping of key to partition. This means it is important to give a good number of partitions to start with. The script here will generate this part of the config for you.

Note that the configuration is currently simple files so it is important that the data in cluster.xml and stores.xml be exactly the same on each server, and that the node ids and partitions not be changed, since that can mean that clients will think their data should be on node X when really it was stored on node Y. This limitation will be removed as the configuration is moved into voldemort itself.

Store configuration

Here is an examples stores.xml for a store named test, that requires only a single read and write and uses bdb for persistence:

  <stores>
      <store>
          <name>test</name>
          <replication-factor>2</replication-factor>
          <preferred-reads>2</preferred-reads>
          <required-reads>1</required-reads>
          <preferred-writes>2</preferred-writes>
          <required-writes>1</required-writes>
          <persistence>bdb</persistence>
          <routing>client</routing>
          <routing-strategy>consistent-routing</routing-strategy>
          <key-serializer>
              <type>string</type>
              <schema-info>utf8</schema-info>
          </key-serializer>
          <value-serializer>
              <type>json</type>
              <schema-info version="1">[{"id":"int32", "name":"string"}]</schema-info>
              <compression>
      <type>gzip<type>
        </compression>
          </value-serializer>
      </store>
  </stores>

Each of these parameters deserves a quick discussion:

If you intend to use the zone-routing strategy we need to extend the store definition to tell it how to replicate w.r.t. zones. Here is an example of a store definition with 'zone-routing' enabled.

  <stores>
      <store>
          <name>test</name>
          ...
                <routing-strategy>zone-routing</routing-strategy>
                <!-- This number should be total of individual zone-replication-factor's -->
                <replication-factor>2</replication-factor>
          <zone-replication-factor>
              <replication-factor zone-id="0">1</replication-factor>
              <replication-factor zone-id="1">1</replication-factor>
          </zone-replication-factor>
          <zone-count-reads>0</zone-count-reads>
          <zone-count-writes>0</zone-count-writes>
          <hinted-handoff-strategy>proximity-handoff</hinted-handoff-strategy>
          ... 
           </store>
  </stores>
The important change here is the introduction of zone-replication-factor which should contain a replication factor that you would want in every zone. Other parameters :

Per-node configuration

We store per-node based configuration in the server.properties file. Most of the properties have sane defaults ( hopefully ). The bare minimal file should have the following property.

  # The ID of *this* particular cluster node (different for each node in cluster)
  node.id=0
Here is a list of all the configuration options supported:
name default description
node.id none The unique, sequential identifier for this server in the cluster (starts with 0)
voldemort.home none The base directory for voldemort. Can also be specified via the environment variable VOLDEMORT_HOME or via a command line option.
data.directory ${voldemort.home}/data The directory where voldemort data is stored
metadata.directory ${voldemort.home}/config The directory where voldemort configuration is stored
BDB stores configuration
enable.bdb.engine true Should the BDB engine be enabled?
bdb.cache.size 200MB The BDB cache that is shared by all BDB tables. Bigger is better.
bdb.write.transactions false Should transactions be immediately written to disk?
bdb.flush.transactions false When the transaction has been written to disk should we force the disk to flush the OS cache. This is a fairly expensive operation.
bdb.data.directory ${data.directory}/bdb The directory where the BDB environment is located
bdb.max.logfile.size 1GB The size of an individual log file
bdb.btree.fanout 512 The fanout size for the btree. Bigger fanout more effienciently supports larger btrees.
bdb.checkpoint.interval.bytes 20 * 1024 * 1024 How often (in bytes) should we checkpoint the transaction log? Checkpoints make startup and shutdown faster.
bdb.checkpoint.interval.ms 30000 How often in ms should we checkpoint the transaction log
bdb.one.env.per.store false Use one BDB environment for every store
bdb.cleaner.threads 1 Number of BDB cleaner threads
MySQL stores configuration
enable.mysql.engine false Should we enabled the mysql storage engine? Doing so will create a connection pool that will be used for the mysql instance
mysql.user root The mysql username to user
mysql.password The mysql password to user
mysql.host localhost The host of the mysql instance
mysql.port 3306 The port of the mysql instance
mysql.database voldemort The name of the mysql database
Read-only stores configuration
enable.readonly.engine false Should we enable the readonly storage engine?
readonly.backups 1 The number of backup copies of the data to keep around for rollback.
readonly.search.strategy BinarySearchStrategy Class name of search strategy to use while finding key. We support BinarySearchStrategy and InterpolationSearchStrategy
readonly.data.directory ${data.directory}/read-only The directory in which to store readonly data files.
readonly.delete.backup.ms 0 Millisecond we wait for before deleting old data. Useful to decreasing IO during swap.
Slop store configuration
slop.enable true Do we want to initialize a storage engine for slops + have the job enabled?
slop.store.engine bdb What storage engine should we use for storing misdelivered messages that need to be rerouted?
slop.pusher.enable true Enable the slop pusher job which pushes every 'slop.frequency.ms' ms ( Prerequisite - slop.enable=true )
slop.read.byte.per.sec 10 * 1000 * 1000 Slop max read throughput
slop.write.byte.per.sec 10 * 1000 * 1000 Slop max write throughput
pusher.type StreamingSlopPusherJob Job type to use for pushing out the slops
slop.frequency.ms 5 * 60 * 1000 Frequency at which we'll try to push out the slops
Rebalancing configuration
enable.rebalancing true Enable rebalance service?
max.rebalancing.attempts 3 Number of attempts the server side rebalancer makes to fetch data
rebalancing.timeout.seconds 10 * 24 * 60 * 60 Time we give for the server side rebalancing to finish copying data
max.parallel.stores.rebalancing 3 Stores to rebalancing in parallel
rebalancing.optimization true Should we run our rebalancing optimization for non-partition aware stores?
Retention configuration
retention.cleanup.first.start.hour 0 Hour when we want to start the first retention cleanup job
retention.cleanup.period.hours 24 Run the retention clean up job every n hours
Gossip configuration
enable.gossip false Enable gossip to synchronize state
gossip.interval.ms 30*1000 Enable gossup every n ms
Admin service
admin.enable true Enable the Admin service?
admin.max.threads 20 Max Number of threads used by Admin services. Used by BIO ( i.e. if enable.nio.connector = false )
admin.core.threads max(1, ${admin.max.threads} / 2) The number of threads to keep alive by Admin service even when idle. Used by BIO ( i.e. if enable.nio.connector = false )
nio.admin.connector.selectors max ( 8, number of processors ) Number of selector threads for admin operations. Used by NIO ( i.e. if enable.nio.connector = true )
Core Voldemort server configuration
enable.nio.connector true Enable NIO on server side
nio.connector.selectors max ( 8, number of processors ) Number of selector threads for normal operations. Used by NIO ( i.e. if enable.nio.connector = true )
max.threads 100 The maximum number of threads the server can use ( Used by HTTP and BIO - enable.nio.connector = false - service only )
core.threads max(1, ${max.threads} / 2) The number of threads to keep alive even when idle ( Used by HTTP and BIO - enable.nio.connector = false - service only )
socket.timeout.ms 4000 The socket SO_TIMEOUT. Essentially the amount of time to block on a low-level network operation before throwing an error.
routing.timeout.ms 5000 The total amount of time to wait for adequate responses from all nodes before throwing an error.
stream.read.byte.per.sec 10 * 1000 * 1000 Max read throughput allowed when Admin service streams data
stream.write.byte.per.sec 10 * 1000 * 1000 Max write throughput allowed when Admin service streams data
http.enable true Enable the HTTP data server?
socket.enable true Enable the socket data server?
jmx.enable true Enable JMX monitoring?
enable.verbose.logging true Log every operation on all stores.
enable.stat.tracking true Track load statistics on the stores.
scheduler.threads 6 Number of threads to use for scheduled jobs

BDB Management

The underlying key-value store is also important for configuration and operation management. If BDB is used then all configuration is done through the server.properties file. If MySQL is used then usual mysql administration must be done.

Oracle has a writeup that gives a good overview of the operational side of BDB.

Client configuration

The above settings were all for the server. It is important to correctly configure the client as well. Following is a list of configuration options for the clients:

name default description
max_connections 50 Maximum number of connection allowed to each voldemort node
max_threads 5 The maximum number of client threads ( Used by the client thread pool )
max_queued_requests 50 The maximum number of queued node operations before client actions will be blocked ( Used by the client thread pool )
thread_idle_ms 100000 The amount of time to keep an idle client thread alive ( Used by the client thread pool )
connection_timeout_ms 500 Set the maximum allowable time to block waiting for a free connection
socket_timeout_ms 5000 Maximum amount of time the socket will block waiting for network activity
routing_timeout_ms 15000 Set the timeout for all blocking operations to complete on all nodes. The number of blocking operations can be configured using the preferred-reads and preferred-writes configuration for the store.
selectors 8 Number of selectors used for multiplexing requests in our NIO client
socket_buffer_size 64 * 1024 Set the size of the socket buffer (in bytes) to use for both socket reads and socket writes
enable_jmx true Enable JMX monitoring
enable_pipeline_routed_store true Use the new pipeline routed store for client side routing
max_bootstrap_retries 2 Number of times we'll try to connect to bootstrap url
bootstrap_urls Compulsory parameter Comma separated list of URLs to use as bootstrap servers
serializer_factory_class Default serializer factory with support for avro, pb, java, etc. Custom serializer factory class name
client_zone_id 0 Zone id where the client resides. Used to make smarter routing decision in case of 'zone-routing'
Failure detector configs
failuredetector_implementation ThresholdFailureDetector Class name of the failure detector that the client will use. We support BannagePeriodFailureDetector and ThresholdFailureDetector
failuredetector_bannage_period 30000 BannagePeriodFailureDetector : The number of milliseconds this node is considered as 'banned'
failuredetector_threshold_countminimum 30 ThresholdFailureDetector : Minimum number of failures that must occur before the success ratio is checked against the threshold
failuredetector_threshold_interval 30000 ThresholdFailureDetector : Millisecond interval for which the threshold is valid; it is 'reset' after this period is exceeded
failuredetector_threshold 95 ThresholdFailureDector : The integer percentage representation of the threshold that must be met or exceeded

Some additional suggestions

JVM Settings

At LinkedIn we maintain two sets of clusters, read-only and read-write. The read-write clusters are clusters using BDB stores and have totally different JVM characteristics from those using read-only stores. Here is what we use at LinkedIn for our read-write stores:
  # Min, max, total JVM size 
  JVM_SIZE="-server -Xms22g -Xmx22g"

  # New Generation Sizes 
  JVM_SIZE_NEW="-XX:NewSize=2048m -XX:MaxNewSize=2048m"

  # Type of Garbage Collector to use
  JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

  # Tuning options for the above garbage collector
  JVM_GC_OPTS="-XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2"

  # JVM GC activity logging settings
  JVM_GC_LOG="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$LOG_DIR/gc.log"
This is the setup on a 32GB RAM box with a BDB cache size of 10GB and 3 cleaner threads. There are two key things here: (1) BDB cache must fit in heap or else it won't work (obviously), (2) you must use the concurrent mark and sweep gc or else the GC pauses from collecting such a large heap will cause unresponsive periods (it doesn't happen at first either, it creeps up and then eventually goes into a spiral of gc pause death).

For the read-only clusters we use the same JVM GC settings, except the heap size is set to a smaller value.

  # Min, max, total JVM size 
  JVM_SIZE="-server -Xms4096m -Xmx4096m"
This is done because in the case of read-only stores we rely on the OS page cache and don't really want our JVM heap to take up space.
Fork me on GitHub