Configuration
There are three configuration files that control server operation:
- cluster.xml – This holds the information about all the nodes (i.e. servers) in the cluster, what hostname they are at, the ports they use, etc. It is exactly the same for all voldemort nodes. It does not hold tuning parameters or data directories for those nodes, since that is not information public to the cluster but is specific to that particular nodes configuration.
- stores.xml – This holds the information about all the stores (i.e. tables) in the cluster. This includes information about the required number of successful reads to maintain consistency, the required number of writes, as well as how keys and values are serialized into bytes. It is the same on all nodes in the cluster.
- server.properties – This contains the tuning parameters that control a particular node. This includes the id of the local node so it knows which entry in cluster.xml corresponds to itself, also the threadpool size, as well as any configuration needed for the local persistence engine such as BDB or mysql. This file is different on each node.
Finally there is an environment variable, VOLDEMORT_HOME, that controls the directory in which data and configuration reside. You can see an example of how the configuration is layed out in the config/ subdirectory of the project. This includes sample configurations that you can modify with your own specifics.
Cluster configuration
Here is an example cluster.xml for a 2-node cluster with 8 data partitions. We also have optional 'zone' fields which allow you to map nodes to certain logical clusters ( datacenter, rack, etc ) called zones:
<cluster>
<!-- The name is just to help users identify this cluster from the gui -->
<name>mycluster</name>
<zone>
<zone-id>0</zone-id>
<proximity-list>1</proximity-list>
<zone>
<zone>
<zone-id>1</zone-id>
<proximity-list>0</proximity-list>
<zone>
<server>
<!-- The node id is a unique, sequential id beginning with 0 that identifies each server in the cluster-->
<id>0</id>
<host>vldmt1.prod.linkedin.com</host>
<http-port>8081</http-port>
<socket-port>6666</socket-port>
<admin-port>6667</admin-port>
<!-- A list of data partitions assigned to this server -->
<partitions>0,1,2,3</partitions>
<zone-id>0</zone-id>
</server>
<server>
<id>1</id>
<host>vldmt2.prod.linkedin.com</host>
<http-port>8081</http-port>
<socket-port>6666</socket-port>
<admin-port>6667</admin-port>
<partitions>4,5,6,7</partitions>
<zone-id>1</zone-id>
</server>
</cluster>
One thing that is important to understand is that partitions are not static partitions of servers, but rather they are a mechanism for partitioning the key space in such a way that each key is statically mapped to a particular data partition. What this means is that a particular cluster may support many stores each with different replication factors—the replication factor is not hardcoded in the cluster design. This is important, since some data is more important than other data, and the correct trade-off between performance and consistency for one store may be different from another store.
Another important point to remember is that the number of data partitions cannot be changed. We do support an online redistribution ( rebalancing ) of partitions. In other words inclusion of new nodes results in moving ownership of partitions, but the total number of partitions will always remain the same, as will the mapping of key to partition. This means it is important to give a good number of partitions to start with. The script here will generate this part of the config for you.
Note that the configuration is currently simple files so it is important that the data in cluster.xml and stores.xml be exactly the same on each server, and that the node ids and partitions not be changed, since that can mean that clients will think their data should be on node X when really it was stored on node Y. This limitation will be removed as the configuration is moved into voldemort itself.
Store configuration
Here is an examples stores.xml for a store named test, that requires only a single read and write and uses bdb for persistence:
<stores>
<store>
<name>test</name>
<replication-factor>2</replication-factor>
<preferred-reads>2</preferred-reads>
<required-reads>1</required-reads>
<preferred-writes>2</preferred-writes>
<required-writes>1</required-writes>
<persistence>bdb</persistence>
<routing>client</routing>
<routing-strategy>consistent-routing</routing-strategy>
<key-serializer>
<type>string</type>
<schema-info>utf8</schema-info>
</key-serializer>
<value-serializer>
<type>json</type>
<schema-info version="1">[{"id":"int32", "name":"string"}]</schema-info>
<compression>
<type>gzip<type>
</compression>
</value-serializer>
</store>
</stores>
Each of these parameters deserves a quick discussion:
- name— The name of the store. This is the string by which clients will be able to connect and operate on this store. It is equivalent to the table name in sql.
- replication-factor— This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
- preferred-reads (optional)—The number of successful reads the client will attempt to do before returning a value to the application. This defaults to be equal to required reads
- required-reads—The least number of reads that can succeed without throwing an exception. Consider a case where the replication factor is 5, preferred reads is 4, and required reads is 2. If 3 of the 5 nodes are operational then the client may try all the nodes to try to reach the preferred 4 reads, but since only 3 are responsive it will allow the read to complete. Had only 1 been responsive it would have thrown an exception, since that was lower than the consistency guarantee requested for this table (and that could mean returning stale data).
- preferred-writes(optional)—The number of successful writes the client attempts to block for before returning success. Defaults to required-writes
- required-writes— The least number of writes that can succeed without the client getting back an exception.
- persistence— The persistence backend used by the store. Currently this could be one of bdb, mysql, memory, readonly, and cache. The difference between cache and memory is that memory will throw and OutOfMemory exception if it grows larger than the JVM heap whereas cache will discard data.
- routing— Determines the routing policy. We support both client ( Client side routing ) and server ( Server side routing ).
- routing-strategy— Determines how we store the replicas. Currently we support three routing-strategies - consistent-routing (default), zone-routing and all-routing.
- key-serializer— The serialization type used for reading and writing keys. The type can be json, java-serialization, string, protobuff, thrift, or identity (meaning raw bytes). The schema-info gives information to the serializer about how to perform the mapping (e.g. the JSON schema described in here).
- value-serializer— The serialization type used for reading and writing values. The supported types are the same as for keys. In the above example we also highlight the subelement 'compression' which currently supports 'gzip' and 'lzf' compression. The subelements are same as for the key-serializer, except that the the value serializer can have multiple schema-infos with different versions. The highest version is the one used for writing data, but data is always read with the version it was written with. This allows for gradual schema evolution. Versioning is only supported by the JSON serializer as other serialization formats have their own versioning systems.
Here are some example serializers:
<!-- A serializer that serializes plain strings in UTF8 encoding --> <value-serializer> <type>string</type> <schema-info>utf8</schema-info> </value-serializer> <!-- A serializer that serializes binary-format JSON data with the given schema. Each value is a List<Map<String, ?>> where the keys "id" and "name" and the values are a 32-bit integer id and a string name. --> <value-serializer> <type>json</type> <schema-info>[{"id":"int32", "name":"string"}]</schema-info> </value-serializer> <!-- A serializer that serializes protocol buffer objects of the given class. --> <value-serializer> <type>protobuff</type> <schema-info>java=com.something.YourProtoBuffClassName</schema-info> </value-serializer> <!-- A serializer that serializes thrift generated objects using one of the following protocols - 'binary', 'json' or 'simple-json'. Current support for Java clients only. --> <value-serializer> <type>thrift</type> <schema-info>java=com.something.YourThriftClassName,protocol=binary</schema-info> </value-serializer> <!-- Avro serialization - either 'avro-generic', 'avro-specific' or 'avro-reflective' --> <value-serializer> <type>avro-generic</type> <schema-info>{"name": "Kind", "type": "enum", "symbols": ["FOO","BAR"]}</schema-info> </value-serializer> - retention-days (optional)— This optional parameter allows you to set a retention property to your data. Then every day, at a specified time on the servers, a scheduled job will be run to delete all data having timestamp > retention-days. This is useful to keep your data trimmed.
- retention-scan-throttle-rate (optional)— If retention-days is specified this is the rate at which we'll scan the tuples to delete data.
If you intend to use the zone-routing strategy we need to extend the store definition to tell it how to replicate w.r.t. zones. Here is an example of a store definition with 'zone-routing' enabled.
<stores>
<store>
<name>test</name>
...
<routing-strategy>zone-routing</routing-strategy>
<!-- This number should be total of individual zone-replication-factor's -->
<replication-factor>2</replication-factor>
<zone-replication-factor>
<replication-factor zone-id="0">1</replication-factor>
<replication-factor zone-id="1">1</replication-factor>
</zone-replication-factor>
<zone-count-reads>0</zone-count-reads>
<zone-count-writes>0</zone-count-writes>
<hinted-handoff-strategy>proximity-handoff</hinted-handoff-strategy>
...
</store>
</stores>
The important change here is the introduction of zone-replication-factor which should contain a replication factor that you would want in every zone. Other parameters :
- zone-count-*— The number of zones we want to block for during reads / writes before we return the request. The number 0 means we'll block for atleast one request from the local zone only. The number 1 means we'll block for atleast one request from one other zone.
- hinted-handoff-strategy (optional) — Another consistency mechanism which we've added recently is Hinted handoff. We can turn on this feature on a per store basis. This parameter defines the strategy we would use to decide which live nodes to write our "hint" to. The various options are any-handoff, consistent-handoff and proximity-handoff.
Per-node configuration
We store per-node based configuration in the server.properties file. Most of the properties have sane defaults ( hopefully ). The bare minimal file should have the following property.
# The ID of *this* particular cluster node (different for each node in cluster) node.id=0Here is a list of all the configuration options supported:
| name | default | description |
|---|---|---|
| node.id | none | The unique, sequential identifier for this server in the cluster (starts with 0) |
| voldemort.home | none | The base directory for voldemort. Can also be specified via the environment variable VOLDEMORT_HOME or via a command line option. |
| data.directory | ${voldemort.home}/data | The directory where voldemort data is stored |
| metadata.directory | ${voldemort.home}/config | The directory where voldemort configuration is stored |
| BDB stores configuration | ||
| enable.bdb.engine | true | Should the BDB engine be enabled? |
| bdb.cache.size | 200MB | The BDB cache that is shared by all BDB tables. Bigger is better. |
| bdb.write.transactions | false | Should transactions be immediately written to disk? |
| bdb.flush.transactions | false | When the transaction has been written to disk should we force the disk to flush the OS cache. This is a fairly expensive operation. |
| bdb.data.directory | ${data.directory}/bdb | The directory where the BDB environment is located |
| bdb.max.logfile.size | 1GB | The size of an individual log file |
| bdb.btree.fanout | 512 | The fanout size for the btree. Bigger fanout more effienciently supports larger btrees. |
| bdb.checkpoint.interval.bytes | 20 * 1024 * 1024 | How often (in bytes) should we checkpoint the transaction log? Checkpoints make startup and shutdown faster. |
| bdb.checkpoint.interval.ms | 30000 | How often in ms should we checkpoint the transaction log |
| bdb.one.env.per.store | false | Use one BDB environment for every store |
| bdb.cleaner.threads | 1 | Number of BDB cleaner threads |
| MySQL stores configuration | ||
| enable.mysql.engine | false | Should we enabled the mysql storage engine? Doing so will create a connection pool that will be used for the mysql instance |
| mysql.user | root | The mysql username to user |
| mysql.password | The mysql password to user | |
| mysql.host | localhost | The host of the mysql instance |
| mysql.port | 3306 | The port of the mysql instance |
| mysql.database | voldemort | The name of the mysql database |
| Read-only stores configuration | ||
| enable.readonly.engine | false | Should we enable the readonly storage engine? |
| readonly.backups | 1 | The number of backup copies of the data to keep around for rollback. |
| readonly.search.strategy | BinarySearchStrategy | Class name of search strategy to use while finding key. We support BinarySearchStrategy and InterpolationSearchStrategy |
| readonly.data.directory | ${data.directory}/read-only | The directory in which to store readonly data files. |
| readonly.delete.backup.ms | 0 | Millisecond we wait for before deleting old data. Useful to decreasing IO during swap. |
| Slop store configuration | ||
| slop.enable | true | Do we want to initialize a storage engine for slops + have the job enabled? |
| slop.store.engine | bdb | What storage engine should we use for storing misdelivered messages that need to be rerouted? |
| slop.pusher.enable | true | Enable the slop pusher job which pushes every 'slop.frequency.ms' ms ( Prerequisite - slop.enable=true ) |
| slop.read.byte.per.sec | 10 * 1000 * 1000 | Slop max read throughput |
| slop.write.byte.per.sec | 10 * 1000 * 1000 | Slop max write throughput |
| pusher.type | StreamingSlopPusherJob | Job type to use for pushing out the slops |
| slop.frequency.ms | 5 * 60 * 1000 | Frequency at which we'll try to push out the slops |
| Rebalancing configuration | ||
| enable.rebalancing | true | Enable rebalance service? |
| max.rebalancing.attempts | 3 | Number of attempts the server side rebalancer makes to fetch data |
| rebalancing.timeout.seconds | 10 * 24 * 60 * 60 | Time we give for the server side rebalancing to finish copying data |
| max.parallel.stores.rebalancing | 3 | Stores to rebalancing in parallel |
| rebalancing.optimization | true | Should we run our rebalancing optimization for non-partition aware stores? |
| Retention configuration | ||
| retention.cleanup.first.start.hour | 0 | Hour when we want to start the first retention cleanup job |
| retention.cleanup.period.hours | 24 | Run the retention clean up job every n hours |
| Gossip configuration | ||
| enable.gossip | false | Enable gossip to synchronize state |
| gossip.interval.ms | 30*1000 | Enable gossup every n ms |
| Admin service | ||
| admin.enable | true | Enable the Admin service? |
| admin.max.threads | 20 | Max Number of threads used by Admin services. Used by BIO ( i.e. if enable.nio.connector = false ) |
| admin.core.threads | max(1, ${admin.max.threads} / 2) | The number of threads to keep alive by Admin service even when idle. Used by BIO ( i.e. if enable.nio.connector = false ) |
| nio.admin.connector.selectors | max ( 8, number of processors ) | Number of selector threads for admin operations. Used by NIO ( i.e. if enable.nio.connector = true ) |
| Core Voldemort server configuration | ||
| enable.nio.connector | true | Enable NIO on server side |
| nio.connector.selectors | max ( 8, number of processors ) | Number of selector threads for normal operations. Used by NIO ( i.e. if enable.nio.connector = true ) |
| max.threads | 100 | The maximum number of threads the server can use ( Used by HTTP and BIO - enable.nio.connector = false - service only ) |
| core.threads | max(1, ${max.threads} / 2) | The number of threads to keep alive even when idle ( Used by HTTP and BIO - enable.nio.connector = false - service only ) |
| socket.timeout.ms | 4000 | The socket SO_TIMEOUT. Essentially the amount of time to block on a low-level network operation before throwing an error. |
| routing.timeout.ms | 5000 | The total amount of time to wait for adequate responses from all nodes before throwing an error. |
| stream.read.byte.per.sec | 10 * 1000 * 1000 | Max read throughput allowed when Admin service streams data |
| stream.write.byte.per.sec | 10 * 1000 * 1000 | Max write throughput allowed when Admin service streams data |
| http.enable | true | Enable the HTTP data server? |
| socket.enable | true | Enable the socket data server? |
| jmx.enable | true | Enable JMX monitoring? |
| enable.verbose.logging | true | Log every operation on all stores. |
| enable.stat.tracking | true | Track load statistics on the stores. |
| scheduler.threads | 6 | Number of threads to use for scheduled jobs |
BDB Management
The underlying key-value store is also important for configuration and operation management. If BDB is used then all configuration is done through the server.properties file. If MySQL is used then usual mysql administration must be done.
Oracle has a writeup that gives a good overview of the operational side of BDB.
Client configuration
The above settings were all for the server. It is important to correctly configure the client as well. Following is a list of configuration options for the clients:
| name | default | description |
|---|---|---|
| max_connections | 50 | Maximum number of connection allowed to each voldemort node |
| max_threads | 5 | The maximum number of client threads ( Used by the client thread pool ) |
| max_queued_requests | 50 | The maximum number of queued node operations before client actions will be blocked ( Used by the client thread pool ) |
| thread_idle_ms | 100000 | The amount of time to keep an idle client thread alive ( Used by the client thread pool ) |
| connection_timeout_ms | 500 | Set the maximum allowable time to block waiting for a free connection |
| socket_timeout_ms | 5000 | Maximum amount of time the socket will block waiting for network activity |
| routing_timeout_ms | 15000 | Set the timeout for all blocking operations to complete on all nodes. The number of blocking operations can be configured using the preferred-reads and preferred-writes configuration for the store. |
| selectors | 8 | Number of selectors used for multiplexing requests in our NIO client |
| socket_buffer_size | 64 * 1024 | Set the size of the socket buffer (in bytes) to use for both socket reads and socket writes |
| enable_jmx | true | Enable JMX monitoring |
| enable_pipeline_routed_store | true | Use the new pipeline routed store for client side routing |
| max_bootstrap_retries | 2 | Number of times we'll try to connect to bootstrap url |
| bootstrap_urls | Compulsory parameter | Comma separated list of URLs to use as bootstrap servers |
| serializer_factory_class | Default serializer factory with support for avro, pb, java, etc. | Custom serializer factory class name |
| client_zone_id | 0 | Zone id where the client resides. Used to make smarter routing decision in case of 'zone-routing' |
| Failure detector configs | ||
| failuredetector_implementation | ThresholdFailureDetector | Class name of the failure detector that the client will use. We support BannagePeriodFailureDetector and ThresholdFailureDetector |
| failuredetector_bannage_period | 30000 | BannagePeriodFailureDetector : The number of milliseconds this node is considered as 'banned' |
| failuredetector_threshold_countminimum | 30 | ThresholdFailureDetector : Minimum number of failures that must occur before the success ratio is checked against the threshold |
| failuredetector_threshold_interval | 30000 | ThresholdFailureDetector : Millisecond interval for which the threshold is valid; it is 'reset' after this period is exceeded |
| failuredetector_threshold | 95 | ThresholdFailureDector : The integer percentage representation of the threshold that must be met or exceeded |
Some additional suggestions
JVM Settings
At LinkedIn we maintain two sets of clusters, read-only and read-write. The read-write clusters are clusters using BDB stores and have totally different JVM characteristics from those using read-only stores. Here is what we use at LinkedIn for our read-write stores:# Min, max, total JVM size JVM_SIZE="-server -Xms22g -Xmx22g" # New Generation Sizes JVM_SIZE_NEW="-XX:NewSize=2048m -XX:MaxNewSize=2048m" # Type of Garbage Collector to use JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" # Tuning options for the above garbage collector JVM_GC_OPTS="-XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2" # JVM GC activity logging settings JVM_GC_LOG="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$LOG_DIR/gc.log"This is the setup on a 32GB RAM box with a BDB cache size of 10GB and 3 cleaner threads. There are two key things here: (1) BDB cache must fit in heap or else it won't work (obviously), (2) you must use the concurrent mark and sweep gc or else the GC pauses from collecting such a large heap will cause unresponsive periods (it doesn't happen at first either, it creeps up and then eventually goes into a spiral of gc pause death).
For the read-only clusters we use the same JVM GC settings, except the heap size is set to a smaller value.
# Min, max, total JVM size JVM_SIZE="-server -Xms4096m -Xmx4096m"This is done because in the case of read-only stores we rely on the OS page cache and don't really want our JVM heap to take up space.