Configuration

There are three configuration files that control server operation:

Finally there is an environment variable, VOLDEMORT_HOME, that controls the directory in which data and configuration reside. You can see an example of how the configuration is layed out in the config/ subdirectory of the project. This includes sample configurations that you can modify with your own specifics.

Cluster configuration

Here is an example cluster.xml for a 2-node cluster with 8 data partitions:

	
  <cluster>
    <!-- The name is just to help users identify this cluster from the gui -->
    <name>mycluster</name>
    <server>
      <!-- The node id is a unique, sequential id beginning with 0 that identifies each server in the cluster-->
      <id>0</id>
      <host>vldmt1.prod.linkedin.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <!-- A list of data partitions assigned to this server -->
      <partitions>0,1,2,3</partitions>
    </server>
    <server>
      <id>1</id>
      <host>vldmt2.prod.linkedin.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <partitions>4,5,6,7</partitions>
    </server>
  </cluster>
One thing that is important to understand is that partitions are not static partitions of servers, but rather they are a mechanism for partitioning the key space in such a way that each key is statically mapped to a particular data partition. What this means is that a particular cluster may support many stores each with different replication factors—the replication factor is not hardcoded in the cluster design. This is important, since some data is more important than other data, and the correct trade-off between performance and consistency for one store may be different from another store. The number of data partitions cannot be changed. Online redistribution of data is not yet supported, but this provides the mechanism by which it will work when it is. Partitions will be moved to the new servers (or rebalanced between servers), but the total number of partitions will always remain the same, as will the mapping of key to partition. This means it is important to give a good number of partitions to start with. The script test/integration/generate_partitions.py will generate this part of the config for you. Note that the configuration is currently simple files so it is important that the data in cluster.xml and stores.xml be exactly the same on each server, and that the node ids and partitions not be changed, since that can mean that clients will think their data should be on node X when really it was stored on node Y. This limitation will be removed as the configuration is moved into voldemort itself.

Store configuration

Here is an examples stores.xml for a store named test, that requires only a single read and write and uses bdb for persistence:

	<stores>
	    <store>
	        <name>test</name>
	        <replication-factor>2</replication-factor>
	        <preferred-reads>2</preferred-reads>
	        <required-reads>1</required-reads>
	        <preferred-writes>2</preferred-writes>
	        <required-writes>1</required-writes>
	        <persistence>bdb</persistence>
	        <routing>client</routing>
	        <key-serializer>
	            <type>string</type>
	            <schema-info>utf8</schema-info>
	        </key-serializer>
	        <value-serializer>
	            <type>json</type>
	            <schema-info version="1">[{"id":"int32", "name":"string"}]</schema-info>
	            <compression>
			<type>gzip<type>
		    </compression>
	        </value-serializer>
	    </store>
	</stores>

Each of these parameters deserves a quick discussion:

Per-node configuration

Here is an example server.properties for node 0 in the cluster. Most properties have (hopefully) sane defaults and can be skipped. Here is a minimal server.properties:

	# The ID of *this* particular cluster node (different for each node in cluster)
	node.id=0
Here is a list of all the configuration options supported:
name default description
node.id none The unique, sequential identifier for this server in the cluster (starts with 0)
voldemort.home none The base directory for voldemort. Can also be specified via the environment variable VOLDEMORT_HOME or via a command line option.
data.directory ${voldemort.home}/data The directory where voldemort data is stored
metadata.directory ${voldemort.home}/config The directory where voldemort configuration is stored
enable.bdb.engine true Should the BDB engine be enabled?
bdb.cache.size 200MB (make it bigger!!!) The BDB cache that is shared by all BDB tables. Bigger is better.
bdb.write.transactions false Should transactions be immediately written to disk?
bdb.flush.transactions false When the transaction has been written to disk should we force the disk to flush the OS cache. This is a fairly expensive operation.
bdb.data.directory ${data.directory}/bdb The directory where the BDB environment is located
bdb.max.logfile.size 1GB The size of an individual log file
bdb.btree.fanout 512 The fanout size for the btree. Bigger fanout more effienciently supports larger btrees.
bdb.checkpoint.interval.bytes 20 * 1024 * 1024 How often (in bytes) should we checkpoint the transaction log? Checkpoints make startup and shutdown faster.
bdb.checkpoint.interval.ms 30000 How often in ms should we checkpoint the transaction log
enable.mysql.engine false Should we enabled the mysql storage engine? Doing so will create a connection pool that will be used for the mysql instance
mysql.user root The mysql username to user
mysql.password The mysql password to user
mysql.host localhost The host of the mysql instance
mysql.port 3306 The port of the mysql instance
mysql.database voldemort The name of the mysql database
enable.memory.engine true Should we enable the memory storage engine? Might as well this takes no resources and is just here for consistency.
enable.cache.engine true Should we enable the cache storage engine? Might as well this takes no resources and is just here for consistency.
enable.readonly.engine false Should we enable the readonly storage engine?
readonly.file.wait.timeout.ms 4000 The maximum time to wait to acquire a filehandle to perform reads.
readonly.backups 1 The number of backup copies of the data to keep around for rollback.
readonly.file.handles 5 The number of file descriptors to pool per store.
readonly.data.directory ${data.directory}/read-only The directory in which to store readonly data files.
slop.store.engine bdb What storage engine should we use for storing misdelivered messages that need to be rerouted?
max.threads 100 The maximum number of threads the server can use.
core.threads max(1, ${max.threads} / 2) The number of threads to keep alive even when idle.
socket.timeout.ms 4000 The socket SO_TIMEOUT. Essentially the amount of time to block on a low-level network operation before throwing an error.
routing.timeout.ms 5000 The total amount of time to wait for adequate responses from all nodes before throwing an error.
http.enable true Enable the HTTP data server?
socket.enable true Enable the socket data server?
jmx.enable true Enable JMX monitoring?
slop.detection.enable false Enable detection of misdelivered messages for persistence and redelivery.
enable.verbose.logging true Log every operation on all stores.
enable.stat.tracking true Track load statistics on the stores.
enable.gossip false Enable gossip to synchronize state
pusher.poll.ms 2 * 60 * 1000 How often should misdelivered "slop" data be pushed out to nodes?
scheduler.threads 3 The number of threads to use for scheduling periodic jobs
admin.enable true Enable the Admin service?
admin.max.threads 20 Max Number of threads used by Admin services
admin.core.threads max(1, ${admin.max.threads} / 2) The number of threads to keep alive by Admin service even when idle
stream.read.byte.per.sec 10 * 1000 * 1000 Max read throughput allowed when Admin service streams data
stream.write.byte.per.sec 10 * 1000 * 1000 Max write throughput allowed when Admin service streams data
enable.rebalancing true Enable rebalance service?
max.rebalancing.attempts 3 Number of attempts made during rebalancing

BDB Management

The underlying key-value store is also important for configuration and operation management. If BDB is used then all configuration is done through the server.properties file. If MySQL is used then usual mysql administration must be done.

Oracle has a writeup that gives a good overview of the operational side of BDB.

Client configuration

The above settings were all for the server. It is important to correctly configure the client as well. Following is a list of configuration options for the clients:

name default description
max_connections 6 Maximum number of connection allowed to each voldemort node
max_total_connections 500 Maximum number of connection allowed to all voldemort node
max_threads 5 The maximum number of client threads
max_queued_requests 50 The maximum number of queued node operations before client actions will be blocked
thread_idle_ms 100000 The amount of time to keep an idle client thread alive
connection_timeout_ms 500 Set the maximum allowable time to block waiting for a free connection
socket_timeout_ms 500 Maximum amount of time the socket will block waiting for network activity
routing_timeout_ms 15000 Set the timeout for all blocking operations to complete on all nodes. The number of blocking operations can be configured using the preferred-reads and preferred-writes configuration for the store.
socket_buffer_size 64 * 1024 Set the size of the socket buffer (in bytes) to use for both socket reads and socket writes
enable_jmx true Enable JMX monitoring
enable_pipeline_routed_store false Use the new pipeline routed store for client side routing
max_bootstrap_retries 2 Number of times we'll try to connect to bootstrap url
bootstrap_urls Compulsory parameter Comma separated list of URLs to use as bootstrap servers

Some additional suggestions

JVM Settings

Since the Voldemort servers will likely have fairly large heap sizes, getting good JVM garbage collector settings is important. Here is what we use at LinkedIn, with some success:
	# Min, max, total JVM size (-Xms -Xmx)
	JVM_SIZE="-server -Xms12g -Xmx12g"

	# New Generation Sizes (-XX:NewSize -XX:MaxNewSize)
	JVM_SIZE_NEW="-XX:NewSize=2048m -XX:MaxNewSize=2048m"

	# Type of Garbage Collector to use
	JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

	# Tuning options for the above garbage collector
	JVM_GC_OPTS="-XX:CMSInitiatingOccupancyFraction=70"

	# JVM GC activity logging settings ($LOG_DIR set in the ctl script)
	JVM_GC_LOG="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$LOG_DIR/gc.log"
This setup was used with an 8GB BDB cache. There are two key things here: (1) BDB cache must fit in heap or else it won't work (obviously), (2) you must use the concurrent mark and sweep gc or else the GC pauses from collecting such a large heap will cause unresponsive periods (it doesn't happen at first either, it creeps up and then eventually goes into a spiral of gc pause death).