Running Apache Beam pipeline using Spark Runner on a local standalone Spark Cluster

The best thing about Apache Beam (Batch + Stream) is that multiple runners can be plugged in and same pipeline can be run using Spark, Flink or Google Cloud Dataflow.

If you are a beginner like me and want to run a simple pipeline using Spark Runner then whole setup may be tad daunting.

Start with Beam's WordCount examples which help you quickstart with running pipelines using different types of runners. There are code snippets for running the same pipeline using different types of runners but here the code is running on your local system using Spark libraries which is good for testing and debugging pipeline. If you want to run the pipeline on a Spark cluster you need to do a little more work!

Let's start by setting up a simple standalone single-node cluster on our local machine. Extending the cluster is as easy as running a command on another machine, which you want to add to cluster.

Start with the obvious: install spark on your machine! (Remember to have Java and Scala downloaded and installed beforehand.)

To setup master and slave nodes, follow these steps:

1. Go to “SPARK_HOME/libexec/conf/” directory.
(SPARK_HOME is the complete path to root directory of Apache Spark in your machine.)

2. Edit the spark-env.sh.template file and save it as spark-env.sh.
Add following line: SPARK_MASTER_HOST=''

3. Goto SPARK_HOME/libexec/sbin and execute the following command.
$ ./start-master.sh

4. The output of above command will be a log file. Go to the log file and check the port at which spark master webUI is running. For me it is 8080.

Don't confuse this with 7077. That is the port at which master will listen for slave nodes.

5. Now we need to setup slave node for above master. On the machine where you want to setup the slave repeat the steps #1 and #2 above. If you are setting the slave on same machine, then no need to do these.

6. Goto SPARK_HOME/libexec/sbin and execute the following command.
$ ./start-slave.sh spark://:7077

7. Once again verify the logs. Now if you go to the master's webUI, you will see your slave listed under 'Workers'.

Our cluster is setup and we need to now run the same Beam WordCount examples on this cluster. We need to create a jar (which will be submitted to spark cluster) which bundles our code with required spark libraries. Without these libraries Beam will not accept spark-runner as an eligible runner for the pipeline and will throw an error for this when you run the pipeline.

mvn package -Pspark-runner

Now submit the job with same parameters as earlier.

spark-submit --class org.apache.beam.examples.WordCount --master spark://:7077 target/word-count-beam-bundled-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts

While the code is running you can now refresh the master's webUI and you will see an entry in running applications!

This should help you start the learning. Any questions or comments, please leave in comments section.

Running Commentary

Search This Blog

Running Apache Beam pipeline using Spark Runner on a local standalone Spark Cluster

Labels

Comments

Popular posts from this blog

How to upload to Google Cloud Storage buckets using CURL

Changing Eclipse Workspace Directory