Build & Push Jobs for Voldemort Read-Only

Introduction

We have been using the Build and Push Job at Linkedin to create Voldemort Read-Only stores from data present in sequence files/ Avro Container files on HDFS. The Voldemort build and push job uses the fault tolerance and parallelism of Hadoop and builds individual Voldemort node/partition-level data stores, which are transferred to Voldemort for serving. A Hadoop job reads data from a source in HDFS, repartitions it on a per-node basis, and finally writes the data to individual Read-only storage engine [1].

The VoldemortBuildAndPushJob will behave in the following way:

  1. Build an XML storeDef for your data (based off of the key and value metadata in your JsonSequenceFile/Avro on HDFS).
  2. Connect to push.cluster
  3. Get the storeDefs for all stores in the push.cluster
  4. Look through the storeDefs for a store with the same name as push.store.name. If one is found, validate that the storeDef in the cluster matches the storeDef for your data. If it doesn’t, fail. If no storeDef exists on the cluster that matches push.store.name, then add your storeDef to the cluster.
  5. Build the Voldemort store in Hadoop
  6. Push the Voldemort store to push.cluster

Getting Started

First, you should clone and build the latest version of Voldemort, as described in the Quickstart guide. The "server.properties" file on the Voldemort server needs to have the following entry:

file.fetcher.class=voldemort.store.readonly.fetcher.HdfsFetcher

This instructs the server to use that class while fetching files from Hadoop during the push phase.

Then, from the root of the Voldemort git repository, you can generate a fat jar containing all of the dependencies that the Build and Push job needs with the following command:

	./gradlew bnpJar

After that, you can launch the BnP job from your local machine using the following command:

	./bin/run-bnp.sh <config_file>

The config_file argument should point to the path of a text file containing configuration parameters for the BnP job. Below are two such example configurations.

Pushing JSON Data - Job File

type=java
job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob
hadoop.job.ugi=anagpal,hadoop
build.input.path=/tmp/new_op
build.output.dir=/tmp/build_output/
push.store.name=anagpal-test-old
push.cluster=tcp://localhost:6666
push.store.description="test store"
push.store.owners=myemail@myworkplace.com
build.replication.factor=1

Pushing AVRO Data - Job File

type=java
job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob
build.input.path=/user/anagpal/avro-data
build.output.dir=/tmp
push.cluster=tcp://localhost:6666
azkaban.should.proxy=true
user.to.proxy= anagpal
build.replication.factor= 1
build.type.avro=true
build.output.dir=/tmp/
avro.key.field=memberId
avro.value.field=localizedFirstNames
push.store.name=test-avro-store
push.store.description="Testing avro build and push"
push.store.owners= myemail@myworkplace.com
build.input.path=/user/anagpal/avro-data
build.output.dir=/tmp

Notice the following properties:

  1. build.type.avro=true This specifies that input data is Avro.
  2. avro.key.field=memberId This specifies the field to be used as the key
  3. avro.value.field=localizedFirstNames This specifies the field to be used as the value

File Format

We use a custom data and index format for the Read-Only store.

On every Node you will find a node directory containing one or mutliple data and index files with the following naming convention:

partition_id.replica_id.chunkset_id

Troubleshooting

If you have any issues with Voldemort Read-Only or the BnP job, please reach out to the open-source community on the Voldemort mailing list.

References

  1. Serving Large-scale Batch Computed Data with Project Voldemort
  2. Configuring the Build and Push job as an Azkaban plugin
Fork me on GitHub