Skip to main content

Running Apache Beam pipeline using Spark Runner on a local standalone Spark Cluster

The best thing about Apache Beam (Batch + Stream) is that multiple runners can be plugged in and same pipeline can be run using Spark, Flink or Google Cloud Dataflow.

If you are a beginner like me and want to run a simple pipeline using Spark Runner then whole setup may be tad daunting.

Start with Beam's WordCount examples which help you quickstart with running pipelines using different types of runners. There are code snippets for running the same pipeline using different types of runners but here the code is running on your local system using Spark libraries which is good for testing and debugging pipeline. If you want to run the pipeline on a Spark cluster you need to do a little more work!

Let's start by setting up a simple standalone single-node cluster on our local machine. Extending the cluster is as easy as running a command on another machine, which you want to add to cluster.

Start with the obvious: install spark on your machine! (Remember to have Java and Scala downloaded and installed beforehand.)

To setup master and slave nodes, follow these steps:

1. Go to “SPARK_HOME/libexec/conf/” directory.
(SPARK_HOME is the complete path to root directory of Apache Spark in your machine.)

2. Edit the spark-env.sh.template file and save it as spark-env.sh.
Add following line: SPARK_MASTER_HOST=''

3. Goto SPARK_HOME/libexec/sbin and execute the following command.
$ ./start-master.sh

4. The output of above command will be a log file. Go to the log file and check the port at which spark master webUI is running. For me it is 8080.


Don't confuse this with 7077. That is the port at which master will listen for slave nodes.

5. Now we need to setup slave node for above master. On the machine where you want to setup the slave repeat the steps #1 and #2 above. If you are setting the slave on same machine, then no need to do these.

6.  Goto SPARK_HOME/libexec/sbin and execute the following command.
$ ./start-slave.sh spark://:7077

7. Once again verify the logs. Now if you go to the master's webUI, you will see your slave listed under 'Workers'.

Our cluster is setup and we need to now run the same Beam WordCount examples on this cluster. We need to create a jar (which will be submitted to spark cluster) which bundles our code with required spark libraries. Without these libraries Beam will not accept spark-runner as an eligible runner for the pipeline and will throw an error for this when you run the pipeline.
mvn package -Pspark-runner

Now submit the job with same parameters as earlier. 

spark-submit --class org.apache.beam.examples.WordCount --master spark://:7077 target/word-count-beam-bundled-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts

While the code is running you can now refresh the master's webUI and you will see an entry in running applications! 

This should help you start the learning. Any questions or comments, please leave in comments section. 

Comments

Popular posts from this blog

How to upload to Google Cloud Storage buckets using CURL

Signed URLs are pretty nifty feature given by Google Cloud Platform to let anyone access your cloud storage (bucket or any file in the bucket) without need to sign in. Official documentation gives step by step details as to how to read/write to the bucket using gsutil or through a program. This article will tell you how to upload a file to the bucket using curl so that any client which doesn't have cloud SDK installed can do this using a simple script. This command creates a signed PUT URL for your bucket. gsutil signurl -c 'text/plain' -m PUT serviceAccount.json gs://test_bucket_location Here is my URL: https://storage.googleapis.com/test_sl?GoogleAccessId=my-project-id@appspot.gserviceaccount.com&Expires=1490266627&Signature=UfKBNHWtjLKSBEcUQUKDeQtSQV6YCleE9hGG%2BCxVEjDOmkDxwkC%2BPtEg63pjDBHyKhVOnhspP1%2FAVSr%2B%2Fty8Ps7MSQ0lM2YHkbPeqjTiUcAfsbdcuXUMbe3p8FysRUFMe2dSikehBJWtbYtjb%2BNCw3L09c7fLFyAoJafIcnoIz7iJGP%2Br6gAUkSnZXgbVjr6wjN%2FIaudXIqA...

Changing Eclipse Workspace Directory

Recently I moved my entire Eclipse installation directory but the workspace was still getting created in the older location only. And worst there was no option to select the Workspace directory in the Window->Options->Workspace menu. To change the workspace location in Eclipse do this. Goto ECLIPSE_HOME\configuration\.settings directory, edit the org.eclipse.ui.ide.prefs file and change the RECENT_WORKSPACES value to the desired location. If you want that Eclipse prompts you to select workspace when you start it, change the SHOW_WORKSPACE_SELECTION_DIALOG value to true. And you are done!