Skip to main content

Posts

Showing posts from 2018

Running Apache Beam pipeline using Spark Runner on a local standalone Spark Cluster

The best thing about Apache Beam ( B atch + Str eam ) is that multiple runners can be plugged in and same pipeline can be run using Spark, Flink or Google Cloud Dataflow. If you are a beginner like me and want to run a simple pipeline using Spark Runner then whole setup may be tad daunting. Start with Beam's WordCount examples  which help you quickstart with running pipelines using different types of runners. There are code snippets for running the same pipeline using different types of runners but here the code is running on your local system using Spark libraries which is good for testing and debugging pipeline. If you want to run the pipeline on a Spark cluster you need to do a little more work! Let's start by setting up a simple standalone single-node cluster on our local machine. Extending the cluster is as easy as running a command on another machine, which you want to add to cluster. Start with the obvious: install spark on your machine! (Remember to have Java a