Configuration
There are three configuration files that control server operation:
- cluster.xml – This holds the information about all the nodes (i.e. servers) in the cluster, what hostname they are at, the ports they use, etc. It is exactly the same for all voldemort nodes. It does not hold tuning parameters or data directories for those nodes, since that is not information public to the cluster but is specific to that particular nodes configuration.
- stores.xml – This holds the information about all the stores (i.e. tables) in the cluster. This includes information about the required number of successful reads to maintain consistency, the required number of writes, as well as how keys and values are serialized into bytes. It is the same on all nodes in the cluster.
- server.properties – This contains the tuning parameters that control a particular node. This includes the id of the local node so it knows which entry in cluster.xml corresponds to itself, also the threadpool size, as well as any configuration needed for the local persistence engine such as BDB or mysql. This file is different on each node.
Finally there is an environment variable, VOLDEMORT_HOME, that controls the directory in which data and configuration reside. You can see an example of how the configuration is layed out in the config/ subdirectory of the project. This includes sample configurations that you can modify with your own specifics.
Cluster configuration
Here is an example cluster.xml for a 2-node cluster with 8 data partitions:
<cluster>
<!-- The name is just to help users identify this cluster from the gui -->
<name>mycluster</name>
<server>
<!-- The node id is a unique, sequential id beginning with 0 that identifies each server in the cluster-->
<id>0</id>
<host>vldmt1.prod.linkedin.com</host>
<http-port>8081</http-port>
<socket-port>6666</socket-port>
<admin-port>6667</admin-port>
<!-- A list of data partitions assigned to this server -->
<partitions>0,1,2,3</partitions>
</server>
<server>
<id>1</id>
<host>vldmt2.prod.linkedin.com</host>
<http-port>8081</http-port>
<socket-port>6666</socket-port>
<admin-port>6667</admin-port>
<partitions>4,5,6,7</partitions>
</server>
</cluster>
One thing that is important to understand is that partitions are not static partitions of servers, but rather they are a mechanism for partitioning the key space in such a way that each key is statically mapped to a particular data
partition. What this means is that a particular cluster may support many stores each with different replication factors—the replication factor is not hardcoded in the cluster design. This is important, since some data is more important than other data, and the correct trade-off between performance and consistency for one store may be different from another store. The number of data partitions cannot be changed. Online redistribution of data is not yet supported, but this provides the mechanism by which it will work when it is. Partitions will be moved to the new servers (or rebalanced between servers), but the total number of partitions will always remain the same, as will the mapping of key to partition. This means it is important to give a good number of partitions to start with. The script test/integration/generate_partitions.py will generate this part of the config for you.
Note that the configuration is currently simple files so it is important that the data in cluster.xml and stores.xml be exactly the same on each server, and that the node ids and partitions not be changed, since that can mean that clients will think their data should be on node X when really it was stored on node Y. This limitation will be removed as the configuration is moved into voldemort itself.
Store configuration
Here is an examples stores.xml for a store named test, that requires only a single read and write and uses bdb for persistence:
<stores>
<store>
<name>test</name>
<replication-factor>2</replication-factor>
<preferred-reads>2</preferred-reads>
<required-reads>1</required-reads>
<preferred-writes>2</preferred-writes>
<required-writes>1</required-writes>
<persistence>bdb</persistence>
<routing>client</routing>
<key-serializer>
<type>string</type>
<schema-info>utf8</schema-info>
</key-serializer>
<value-serializer>
<type>json</type>
<schema-info version="1">[{"id":"int32", "name":"string"}]</schema-info>
<compression>
<type>gzip<type>
</compression>
</value-serializer>
</store>
</stores>
Each of these parameters deserves a quick discussion:
- name— The name of the store. This is the string by which clients will be able to connect and operate on this store. It is equivalent to the table name in sql.
- replication-factor— This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
- preferred-reads (optional)—The number of successful reads the client will attempt to do before returning a value to the application. This defaults to be equal to required reads
- required-reads—The least number of reads that can succeed without throwing an exception. Consider a case where the replication factor is 5, preferred reads is 4, and required reads is 2. If 3 of the 5 nodes are operational then the client may try all the nodes to try to reach the preferred 4 reads, but since only 3 are responsive it will allow the read to complete. Had only 1 been responsive it would have thrown an exception, since that was lower than the consistency guarantee requested for this table (and that could mean returning stale data).
- preferred-writes(optional)—The number of successful writes the client attempts to block for before returning success. Defaults to required-writes
- required-writes— The least number of writes that can succeed without the client getting back an exception.
- persistence— The persistence backend used by the store. Currently this could be one of bdb, mysql, memory, readonly, and cache. The difference between cache and memory is that memory will throw and OutOfMemory exception if it grows larger than the JVM heap whereas cache will discard data.
- routing— Determines the routing policy. Currently only client-side routing is fully supported. Server side routing will be coming soon, as will a few more interesting policies.
- key-serializer— The serialization type used for reading and writing keys. The type can be json, java-serialization, string, protobuff, thrift, or identity (meaning raw bytes). The schema-info gives information to the serializer about how to perform the mapping (e.g. the JSON schema described in here).
- value-serializer— The serialization type used for reading and writing values. The supported types are the same as for keys. In the above example we also highlight the subelement 'compression' which currently supports 'gzip' and 'lzf' compression. The subelements are same as for the key-serializer, except that the the value serializer can have multiple schema-infos with different versions. The highest version is the one used for writing data, but data is always read with the version it was written with. This allows for gradual schema evolution. Versioning is only supported by the JSON serializer as other serialization formats have their own versioning systems.
Here are some example serializers:
<!-- A serializer that serializes plain strings in UTF8 encoding --> <value-serializer> <type>string</type> <schema-info>utf8</schema-info> </value-serializer> <!-- A serializer that serializes binary-format JSON data with the given schema. Each value is a List<Map<String, ?>> where the keys "id" and "name" and the values are a 32-bit integer id and a string name. --> <value-serializer> <type>json</type> <schema-info>[{"id":"int32", "name":"string"}]</schema-info> </value-serializer> <!-- A serializer that serializes protocol buffer objects of the given class. --> <value-serializer> <type>protobuff</type> <schema-info>java=com.something.YourProtoBuffClassName</schema-info> </value-serializer> <!-- A serializer that serializes thrift generated objects using one of the following protocols - 'binary', 'json' or 'simple-json'. Current support for Java clients only. --> <value-serializer> <type>thrift</type> <schema-info>java=com.something.YourThriftClassName,protocol=binary</schema-info> </value-serializer> <!-- Avro serialization - either 'avro-generic', 'avro-specific' or 'avro-reflective' --> <value-serializer> <type>avro-generic</type> <schema-info>{"name": "Kind", "type": "enum", "symbols": ["FOO","BAR"]}</schema-info> </value-serializer>
Per-node configuration
Here is an example server.properties for node 0 in the cluster. Most properties have (hopefully) sane defaults and can be skipped. Here is a minimal server.properties:
# The ID of *this* particular cluster node (different for each node in cluster) node.id=0Here is a list of all the configuration options supported:
| name | default | description |
|---|---|---|
| node.id | none | The unique, sequential identifier for this server in the cluster (starts with 0) |
| voldemort.home | none | The base directory for voldemort. Can also be specified via the environment variable VOLDEMORT_HOME or via a command line option. |
| data.directory | ${voldemort.home}/data | The directory where voldemort data is stored |
| metadata.directory | ${voldemort.home}/config | The directory where voldemort configuration is stored |
| enable.bdb.engine | true | Should the BDB engine be enabled? |
| bdb.cache.size | 200MB (make it bigger!!!) | The BDB cache that is shared by all BDB tables. Bigger is better. |
| bdb.write.transactions | false | Should transactions be immediately written to disk? |
| bdb.flush.transactions | false | When the transaction has been written to disk should we force the disk to flush the OS cache. This is a fairly expensive operation. |
| bdb.data.directory | ${data.directory}/bdb | The directory where the BDB environment is located |
| bdb.max.logfile.size | 1GB | The size of an individual log file |
| bdb.btree.fanout | 512 | The fanout size for the btree. Bigger fanout more effienciently supports larger btrees. |
| bdb.checkpoint.interval.bytes | 20 * 1024 * 1024 | How often (in bytes) should we checkpoint the transaction log? Checkpoints make startup and shutdown faster. |
| bdb.checkpoint.interval.ms | 30000 | How often in ms should we checkpoint the transaction log |
| enable.mysql.engine | false | Should we enabled the mysql storage engine? Doing so will create a connection pool that will be used for the mysql instance |
| mysql.user | root | The mysql username to user |
| mysql.password | The mysql password to user | |
| mysql.host | localhost | The host of the mysql instance |
| mysql.port | 3306 | The port of the mysql instance |
| mysql.database | voldemort | The name of the mysql database |
| enable.memory.engine | true | Should we enable the memory storage engine? Might as well this takes no resources and is just here for consistency. |
| enable.cache.engine | true | Should we enable the cache storage engine? Might as well this takes no resources and is just here for consistency. |
| enable.readonly.engine | false | Should we enable the readonly storage engine? |
| readonly.file.wait.timeout.ms | 4000 | The maximum time to wait to acquire a filehandle to perform reads. |
| readonly.backups | 1 | The number of backup copies of the data to keep around for rollback. |
| readonly.file.handles | 5 | The number of file descriptors to pool per store. |
| readonly.data.directory | ${data.directory}/read-only | The directory in which to store readonly data files. |
| slop.store.engine | bdb | What storage engine should we use for storing misdelivered messages that need to be rerouted? |
| max.threads | 100 | The maximum number of threads the server can use. |
| core.threads | max(1, ${max.threads} / 2) | The number of threads to keep alive even when idle. |
| socket.timeout.ms | 4000 | The socket SO_TIMEOUT. Essentially the amount of time to block on a low-level network operation before throwing an error. |
| routing.timeout.ms | 5000 | The total amount of time to wait for adequate responses from all nodes before throwing an error. |
| http.enable | true | Enable the HTTP data server? |
| socket.enable | true | Enable the socket data server? |
| jmx.enable | true | Enable JMX monitoring? |
| slop.detection.enable | false | Enable detection of misdelivered messages for persistence and redelivery. |
| enable.verbose.logging | true | Log every operation on all stores. |
| enable.stat.tracking | true | Track load statistics on the stores. |
| enable.gossip | false | Enable gossip to synchronize state |
| pusher.poll.ms | 2 * 60 * 1000 | How often should misdelivered "slop" data be pushed out to nodes? |
| scheduler.threads | 3 | The number of threads to use for scheduling periodic jobs |
| admin.enable | true | Enable the Admin service? |
| admin.max.threads | 20 | Max Number of threads used by Admin services |
| admin.core.threads | max(1, ${admin.max.threads} / 2) | The number of threads to keep alive by Admin service even when idle |
| stream.read.byte.per.sec | 10 * 1000 * 1000 | Max read throughput allowed when Admin service streams data |
| stream.write.byte.per.sec | 10 * 1000 * 1000 | Max write throughput allowed when Admin service streams data |
| enable.rebalancing | true | Enable rebalance service? |
| max.rebalancing.attempts | 3 | Number of attempts made during rebalancing |
BDB Management
The underlying key-value store is also important for configuration and operation management. If BDB is used then all configuration is done through the server.properties file. If MySQL is used then usual mysql administration must be done.
Oracle has a writeup that gives a good overview of the operational side of BDB.
Client configuration
The above settings were all for the server. It is important to correctly configure the client as well. Following is a list of configuration options for the clients:
| name | default | description |
|---|---|---|
| max_connections | 6 | Maximum number of connection allowed to each voldemort node |
| max_total_connections | 500 | Maximum number of connection allowed to all voldemort node |
| max_threads | 5 | The maximum number of client threads |
| max_queued_requests | 50 | The maximum number of queued node operations before client actions will be blocked |
| thread_idle_ms | 100000 | The amount of time to keep an idle client thread alive |
| connection_timeout_ms | 500 | Set the maximum allowable time to block waiting for a free connection |
| socket_timeout_ms | 500 | Maximum amount of time the socket will block waiting for network activity |
| routing_timeout_ms | 15000 | Set the timeout for all blocking operations to complete on all nodes. The number of blocking operations can be configured using the preferred-reads and preferred-writes configuration for the store. |
| socket_buffer_size | 64 * 1024 | Set the size of the socket buffer (in bytes) to use for both socket reads and socket writes |
| enable_jmx | true | Enable JMX monitoring |
| enable_pipeline_routed_store | false | Use the new pipeline routed store for client side routing |
| max_bootstrap_retries | 2 | Number of times we'll try to connect to bootstrap url |
| bootstrap_urls | Compulsory parameter | Comma separated list of URLs to use as bootstrap servers |
Some additional suggestions
JVM Settings
Since the Voldemort servers will likely have fairly large heap sizes, getting good JVM garbage collector settings is important. Here is what we use at LinkedIn, with some success:# Min, max, total JVM size (-Xms -Xmx) JVM_SIZE="-server -Xms12g -Xmx12g" # New Generation Sizes (-XX:NewSize -XX:MaxNewSize) JVM_SIZE_NEW="-XX:NewSize=2048m -XX:MaxNewSize=2048m" # Type of Garbage Collector to use JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" # Tuning options for the above garbage collector JVM_GC_OPTS="-XX:CMSInitiatingOccupancyFraction=70" # JVM GC activity logging settings ($LOG_DIR set in the ctl script) JVM_GC_LOG="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$LOG_DIR/gc.log"This setup was used with an 8GB BDB cache. There are two key things here: (1) BDB cache must fit in heap or else it won't work (obviously), (2) you must use the concurrent mark and sweep gc or else the GC pauses from collecting such a large heap will cause unresponsive periods (it doesn't happen at first either, it creeps up and then eventually goes into a spiral of gc pause death).