Configuration
There are three configuration files that control server operation:
- cluster.xml – This holds the information about all the nodes (i.e. servers) in the cluster: what hostname they are at, the ports they use, etc. It is exactly the same on all Voldemort nodes. It does not hold tuning parameters or data directories for those nodes, since that is not information public to the cluster but is specific to that particular node's configuration.
- stores.xml – This holds the information about all the stores (i.e. tables) in the cluster. This includes information about the required number of successful reads to maintain consistency, the required number of writes, as well as how keys and values are serialized into bytes. It is the same on all nodes in the cluster.
- server.properties – This contains the tuning parameters that control a particular node. This includes the id of the local node (so it knows which entry in cluster.xml corresponds to itself), the threadpool size, and any configuration needed for the local persistence engine such as BDB or MySQL. This file is different on each node.
Finally there is an environment variable, VOLDEMORT_HOME, that controls the directory in which data and configuration reside. You can see an example of how the configuration is laid out in the config/ subdirectory of the project. This includes sample configurations that you can modify with your own specifics.
Cluster configuration
Here is an example cluster.xml for a 2-node cluster with 8 data partitions. We also have optional 'zone' fields which allow you to map nodes to certain logical clusters ( datacenter, rack, etc ) called zones:
<cluster>
  <!-- The name is just to help users identify this cluster from the gui -->
  <name>mycluster</name>
  <zone>
    <zone-id>0</zone-id>
    <proximity-list>1</proximity-list>
  </zone>
  <zone>
    <zone-id>1</zone-id>
    <proximity-list>0</proximity-list>
  </zone>
  <server>
    <!-- The node id is a unique, sequential id beginning with 0 that
         identifies each server in the cluster -->
    <id>0</id>
    <host>vldmt1.prod.linkedin.com</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <!-- A list of data partitions assigned to this server -->
    <partitions>0,1,2,3</partitions>
    <zone-id>0</zone-id>
  </server>
  <server>
    <id>1</id>
    <host>vldmt2.prod.linkedin.com</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>4,5,6,7</partitions>
    <zone-id>1</zone-id>
  </server>
</cluster>
One thing that is important to understand is that partitions are not static partitions of servers, but rather they are a mechanism for partitioning the key space in such a way that each key is statically mapped to a particular data partition. What this means is that a particular cluster may support many stores each with different replication factors—the replication factor is not hardcoded in the cluster design. This is important, since some data is more important than other data, and the correct trade-off between performance and consistency for one store may be different from another store.
Another important point to remember is that the number of data partitions cannot be changed. We do support online redistribution (rebalancing) of partitions: adding new nodes moves ownership of partitions between servers, but the total number of partitions always remains the same, as does the mapping of keys to partitions. This means it is important to choose a good number of partitions to start with. The script here will generate this part of the config for you.
Note that the configuration currently lives in simple files, so it is important that the data in cluster.xml and stores.xml be exactly the same on each server, and that node ids and partitions not be changed, since otherwise clients may think their data should be on node X when really it was stored on node Y. This limitation will be removed as the configuration is moved into Voldemort itself.
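To make the partitioning scheme above concrete, here is a toy sketch of the key-to-partition mapping. The hash function and names here are illustrative only (Voldemort's actual routing strategy is its own implementation); the point is that the key → partition mapping is fixed for the life of the cluster, while partition → node ownership can move during rebalancing.

```python
import hashlib

NUM_PARTITIONS = 8  # fixed for the life of the cluster

def partition_for_key(key: bytes) -> int:
    """Map a key to a partition. This mapping never changes."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Ownership of partitions by nodes can change (e.g. after rebalancing),
# but the key -> partition mapping above stays fixed.
ownership = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}  # node id -> partition list

def node_for_key(key: bytes) -> int:
    """Find the node currently owning the partition for this key."""
    p = partition_for_key(key)
    return next(n for n, parts in ownership.items() if p in parts)
```

Rebalancing would rewrite the `ownership` table, but since `partition_for_key` never changes, old and new clients still agree about which partition each key lives in.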
Store configuration
Here is an example stores.xml for a store named test that requires only a single read and write and uses BDB for persistence:
<stores>
  <store>
    <name>test</name>
    <replication-factor>2</replication-factor>
    <preferred-reads>2</preferred-reads>
    <required-reads>1</required-reads>
    <preferred-writes>2</preferred-writes>
    <required-writes>1</required-writes>
    <persistence>bdb</persistence>
    <routing>client</routing>
    <routing-strategy>consistent-routing</routing-strategy>
    <key-serializer>
      <type>string</type>
      <schema-info>utf8</schema-info>
    </key-serializer>
    <value-serializer>
      <type>json</type>
      <schema-info version="1">[{"id":"int32", "name":"string"}]</schema-info>
      <compression>
        <type>gzip</type>
      </compression>
    </value-serializer>
  </store>
</stores>
Each of these parameters deserves a quick discussion:
- name— The name of the store. This is the string by which clients will be able to connect and operate on this store. It is equivalent to the table name in SQL.
- replication-factor— This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it is possible to tolerate up to n - 1 node failures without data loss.
- preferred-reads (optional)—The number of successful reads the client will attempt to do before returning a value to the application. This defaults to be equal to required-reads.
- required-reads—The minimum number of reads that must succeed for the operation not to throw an exception. Consider a case where the replication factor is 5, preferred reads is 4, and required reads is 2. If 3 of the 5 nodes are operational then the client may try all the nodes to reach the preferred 4 reads, but since only 3 are responsive it will allow the read to complete. Had only 1 been responsive it would have thrown an exception, since that was lower than the consistency guarantee requested for this table (and that could mean returning stale data).
- preferred-writes (optional)—The number of successful writes the client attempts to block for before returning success. Defaults to required-writes.
- required-writes— The minimum number of writes that must succeed for the client not to get back an exception.
- persistence— The persistence backend used by the store. Currently this can be one of bdb, mysql, memory, readonly, and cache. The difference between cache and memory is that memory will throw an OutOfMemory exception if it grows larger than the JVM heap, whereas cache will discard data.
- routing— Determines the routing policy. We support both client (client-side routing) and server (server-side routing).
- routing-strategy— Determines how we store the replicas. Currently we support three routing strategies: consistent-routing (default), zone-routing and all-routing.
- key-serializer— The serialization type used for reading and writing keys. The type can be json, java-serialization, string, protobuff, thrift, or identity (meaning raw bytes). The schema-info gives information to the serializer about how to perform the mapping (e.g. the JSON schema described here).
- value-serializer— The serialization type used for reading and writing values. The supported types are the same as for keys. In the above example we also highlight the subelement 'compression', which currently supports 'gzip' and 'lzf' compression. The subelements are the same as for the key-serializer, except that the value-serializer can have multiple schema-infos with different versions. The highest version is the one used for writing data, but data is always read with the version it was written with. This allows for gradual schema evolution. Versioning is only supported by the JSON serializer, as other serialization formats have their own versioning systems.
Here are some example serializers:
<!-- A serializer that serializes plain strings in UTF8 encoding -->
<value-serializer>
  <type>string</type>
  <schema-info>utf8</schema-info>
</value-serializer>

<!-- A serializer that serializes binary-format JSON data with the given
     schema. Each value is a List<Map<String, ?>> where the keys are "id"
     and "name" and the values are a 32-bit integer id and a string name. -->
<value-serializer>
  <type>json</type>
  <schema-info>[{"id":"int32", "name":"string"}]</schema-info>
</value-serializer>

<!-- A serializer that serializes protocol buffer objects of the given class. -->
<value-serializer>
  <type>protobuff</type>
  <schema-info>java=com.something.YourProtoBuffClassName</schema-info>
</value-serializer>

<!-- A serializer that serializes thrift-generated objects using one of the
     following protocols: 'binary', 'json' or 'simple-json'. Currently
     supported for Java clients only. -->
<value-serializer>
  <type>thrift</type>
  <schema-info>java=com.something.YourThriftClassName,protocol=binary</schema-info>
</value-serializer>

<!-- Avro serialization - either 'avro-generic', 'avro-specific' or 'avro-reflective' -->
<value-serializer>
  <type>avro-generic</type>
  <schema-info>{"name": "Kind", "type": "enum", "symbols": ["FOO","BAR"]}</schema-info>
</value-serializer>
- retention-days (optional)— This optional parameter allows you to set a retention property on your data. Then every day, at a specified time on the servers, a scheduled job will run to delete all data older than retention-days. This is useful to keep your data trimmed.
- retention-scan-throttle-rate (optional)— If retention-days is specified this is the rate at which we'll scan the tuples to delete data.
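The preferred/required read semantics above can be sketched in a few lines. This is a toy illustration of the quorum logic, not Voldemort's actual client code; the function and exception names are made up.

```python
def read_with_quorum(responses, preferred_reads, required_reads):
    """responses: list of (value, succeeded) pairs from the replicas tried.

    The client tries to collect `preferred_reads` successes, but the read
    only fails if fewer than `required_reads` of them succeeded."""
    successes = [value for value, ok in responses if ok]
    if len(successes) < required_reads:
        # Corresponds to the exception thrown when the consistency
        # guarantee cannot be met (name is illustrative).
        raise RuntimeError("insufficient operational nodes")
    return successes

# Replication factor 5, preferred 4, required 2: with only 3 live
# replicas the read still completes, as in the example above.
values = read_with_quorum(
    [("v", True), ("v", True), ("v", True), (None, False), (None, False)],
    preferred_reads=4, required_reads=2)
```

With only one responsive replica the same call would raise, matching the failure case described for required-reads.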
If you intend to use the zone-routing strategy we need to extend the store definition to tell it how to replicate w.r.t. zones. Here is an example of a store definition with 'zone-routing' enabled.
<stores>
  <store>
    <name>test</name>
    ...
    <routing-strategy>zone-routing</routing-strategy>
    <!-- This number should be the total of the individual zone-replication-factors -->
    <replication-factor>2</replication-factor>
    <zone-replication-factor>
      <replication-factor zone-id="0">1</replication-factor>
      <replication-factor zone-id="1">1</replication-factor>
    </zone-replication-factor>
    <zone-count-reads>0</zone-count-reads>
    <zone-count-writes>0</zone-count-writes>
    <hinted-handoff-strategy>proximity-handoff</hinted-handoff-strategy>
    ...
  </store>
</stores>
The important change here is the introduction of zone-replication-factor, which should contain the replication factor you want in every zone. Other parameters:
- zone-count-*— The number of zones we want to block for during reads / writes before we return the request. The number 0 means we'll block for at least one request from the local zone only. The number 1 means we'll block for at least one request from one other zone.
- hinted-handoff-strategy (optional) — Another consistency mechanism which we've added recently is Hinted handoff. We can turn on this feature on a per store basis. This parameter defines the strategy we would use to decide which live nodes to write our "hint" to. The various options are any-handoff, consistent-handoff and proximity-handoff.
Server configuration
We store per-node configuration in the server.properties file. Most of the properties have sane defaults (hopefully). A bare minimal file should have the following property:
# The ID of *this* particular cluster node (different for each node in the cluster)
node.id=0
You can find the list of all supported server-side configuration parameters here.
BDB Management
The underlying key-value store is also important for configuration and operational management. If BDB is used then all configuration is done through the server.properties file. If MySQL is used then the usual MySQL administration must be done.
Oracle has a writeup that gives a good overview of the operational side of BDB.
Client configuration
The above settings were all for the server. It is important to correctly configure the client as well; the list of all the client configuration parameters can be found here.
Some additional suggestions
JVM Settings
At LinkedIn we maintain two sets of clusters, read-only and read-write. The read-write clusters use BDB stores and have totally different JVM characteristics from those using read-only stores. Here is what we use at LinkedIn for our read-write stores:
# Min, max, total JVM size
JVM_SIZE="-server -Xms32g -Xmx32g"

# New generation sizes
JVM_SIZE_NEW="-XX:NewSize=2048m -XX:MaxNewSize=2048m"

# Type of garbage collector to use
JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

# Tuning options for the above garbage collector
JVM_GC_OPTS="-XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2"

# JVM GC activity logging settings
JVM_GC_LOG="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$LOG_DIR/gc.log"
Note that you must use the concurrent mark-and-sweep GC, or else the GC pauses from collecting such a large heap will cause unresponsive periods (this doesn't happen at first; it creeps up and then eventually goes into a spiral of GC-pause death).
This is the setup on a 48GB RAM box with a BDB cache size of 20GB and one cleaner thread, on SSDs. You can find the entire configuration under config/prod_single_node_cluster. To bring a server up with these settings, use bin/voldemort-prod-server.sh.
For the read-only clusters we use the same JVM GC settings, except the heap size is set to a smaller value.
# Min, max, total JVM size
JVM_SIZE="-server -Xms4096m -Xmx4096m"
This is done because in the case of read-only stores we rely on the OS page cache and don't really want our JVM heap to take up space.