Build and Push Jobs for Voldemort Read Only Stores
Introduction
We have been using the Build and Push Job at Linkedin to create Voldemort Read-Only stores from data present in sequence files/ Avro Container files on HDFS.
The Voldemort build and push job uses the fault tolerance and parallelism of Hadoop and builds individual Voldemort node/partition-level data stores, which are transferred to Voldemort for serving.
A Hadoop job reads data from a source in HDFS, repartitions it on a per-node basis, and finally writes the data to individual Read-only storage engine [1].
The VoldemortBuildAndPushJob will behave in the following way:
- Build an XML storeDef for your data (based off of the key and value metadata in your JsonSequenceFile/Avro on HDFS).
- Connect to push.cluster
- Get the storeDefs for all stores in the push.cluster
- Look through the storeDefs for a store with the same name as push.store.name. If one is found, validate that the storeDef in the cluster matches the storeDef for your data. If it doesn’t, fail. If no storeDef exists on the cluster that matches push.store.name, then add your storeDef to the cluster.
- Build the Voldemort store in Hadoop
- Push the Voldemort store to push.cluster
Azkaban Job
The Build and Push Job is an Azkaban job. Azkaban is a workflow scheduler used at LinkedIn [2]. You provide a job file to Azkaban with a set of properties and the jars and Azkaban executes this job.
You candownloadthe tarball. Then Untar it
tar -xvf build-and-push.tar.gz
Structure
/conf All Azkaban .job and .properties files must go in this directory. /lib All .jar files should go into this directory. /src All code should go into this directory. This includes Java, as well as scripting languages and shell scripts. /test Any testing code should be placed in this folder.
PS: In case you are running Voldemort Server locally before you start the Voldemort server ensure that the server.properties file has the following entry
file.fetcher.class=voldemort.store.readonly.fetcher.HdfsFetcher
This instructs the server to use that class while fetching files from Hadoop during the push phase
Pushing JSON Data - Job File
type=java job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob hadoop.job.ugi=anagpal,hadoop build.input.path=/tmp/new_op build.output.dir=/tmp/build_output/ push.store.name=anagpal-test-old push.cluster=tcp://localhost:6666 push.store.description="test store" push.store.owners=myemail@myworkplace.com build.replication.factor=1
Pushing AVRO Data - Job File
type=java job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob build.input.path=/user/anagpal/avro-data build.output.dir=/tmp push.cluster=tcp://localhost:6666 azkaban.should.proxy=true user.to.proxy= anagpal build.replication.factor= 1 build.type.avro=true build.output.dir=/tmp/ avro.key.field=memberId avro.value.field=localizedFirstNames push.store.name=test-avro-store push.store.description="Testing avro build and push" push.store.owners= myemail@myworkplace.com build.input.path=/user/anagpal/avro-data build.output.dir=/tmp
Notice the following properties:
- build.type.avro=true This specifies that input data is Avro.
- avro.key.field=memberId This specifies the field to be used as the key
- avro.value.field=localizedFirstNames This specifies the field to be used as the value
Running the Job
ant
Then after compiling on your hadoop gateway/local machine copy the directory and execute the following command:
./run-job dist/build-and-push-1.00-all.jar -j dist/package/ -c dist/package/ --ignore-deps <job name>
You need to change the input/output paths along with the ugi name,store name and server location. You can then query the voldemort server to see the new store entries
File Format
We use a custom data and index format for the Read-Only store.
Ony every Node you will find a node directory containing one or mutliple data and index files with the following naming convention:
partition_id.replica_id.chunkset_id
Troubleshoot
Chunk size issue:
- Symptom (possible):
- Caused by: java.io.IOException: Job failed! :
- Cause:
- Check the number of mappers and / or reducers (limit = 10000). If they're over the limit, use the num.chunks parameter to reduce number of chunks and hence #mappers, reducer
Size limit Chunk overflow exception: chunk
- Cause:
- Each chunk data file is capped at 2 Gb and hence you may want to increase the num.chunks to break it down into multiple chunks