

Running an Apache Beam pipeline using the Spark Runner on a local standalone Spark cluster

The best thing about Apache Beam (Batch + Stream) is that multiple runners can be plugged in and the same pipeline can be run on Spark, Flink, or Google Cloud Dataflow.

If you are a beginner like me and want to run a simple pipeline using the Spark Runner, the whole setup may be a tad daunting.

Start with Beam's WordCount examples, which help you quickstart running pipelines on different types of runners. There are code snippets for running the same pipeline with each runner, but there the code runs on your local system using the Spark libraries, which is good for testing and debugging a pipeline. If you want to run the pipeline on a Spark cluster, you need to do a little more work!

Let's start by setting up a simple standalone single-node cluster on our local machine. Extending the cluster is as easy as running a command on each additional machine you want to add to the cluster.

Start with the obvious: install Spark on your machine! (Remember to have Java and Scala dow…
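Once Spark is installed, the standalone cluster and the job submission can be sketched as below. This is a hedged outline, not the post's exact steps: the paths, the master URL/port, and the jar and class names are assumptions (the class is Beam's example WordCount; the jar name is illustrative).

```shell
# Start a standalone Spark master on this machine
# (by default the master listens at spark://localhost:7077)
$SPARK_HOME/sbin/start-master.sh

# Start a worker and register it with the master; running this same
# command on another machine is how you extend the cluster
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077

# Submit the bundled Beam WordCount jar to the cluster with the Spark runner
spark-submit \
  --master spark://localhost:7077 \
  --class org.apache.beam.examples.WordCount \
  word-count-beam-bundled-0.1.jar \
  --runner=SparkRunner --inputFile=pom.xml --output=counts
```

These commands assume a local Spark installation and are shown as an invocation sketch rather than a runnable script.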
Recent posts

Running Jenkins inside Docker

Whether you are new to Jenkins and want to get your setup done quickly, or you want to harness the full power of containers by running Jenkins in Docker, this simple step-by-step tutorial is for you.
Running Jenkins in Docker
We will pull the Jenkins Docker image from the Docker repository:
docker pull jenkins
Using the command below, we will map the 'jenkins' directory in my working directory ($PWD/jenkins) to the /var/jenkins_home directory in the container, and map port 49001 on the host to 8080 in the container.
docker run -d -p 49001:8080 -v $PWD/jenkins:/var/jenkins_home:z -t jenkins
Now, to reach Jenkins, just open http://localhost:49001/ in your browser. You can proceed with the setup and management of Jenkins now.
Your entire Jenkins setup has been created in the 'jenkins' directory which you mapped earlier. Go ahead and take a look! 
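The first-run setup wizard asks for an initial admin password, which Jenkins writes inside JENKINS_HOME. Because we bind-mounted $PWD/jenkins above, it can be read straight from the host; a small sketch assuming the mapping used in this tutorial (the file may be root-owned, so sudo might be needed):

```shell
# The setup wizard's initial admin password lives under JENKINS_HOME;
# with the bind mount above it is visible on the host at:
cat $PWD/jenkins/secrets/initialAdminPassword
```

Paste the printed value into the setup wizard to continue.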
Packaging app in Docker and pushing to repository
Sometimes you would want to package your app as a Docker image and publish it to Docker Hub (or any other container registry of your choice) …

Introduction to Mutation Testing

You have written unit tests, integration tests, acceptance tests and all sorts of tests, and code coverage is 99%, but there are still bugs! How is that possible? It is possible if your test cases are incorrect and the quality of your test code is low. With mutation testing, you can find out how good your tests really are. Know more about this in this presentation!
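The post doesn't name a tool, but as a concrete way to try the idea on a Java/Maven project, here is a sketch using PIT (a mutation-testing tool for the JVM). The Maven goal is PIT's; the project layout and report path are what PIT uses by default:

```shell
# PIT mutates your compiled classes (e.g. flips a > to a >=) and re-runs
# your test suite against each mutant; a mutant your tests fail to "kill"
# points at a weak test, even if line coverage looks perfect
mvn org.pitest:pitest-maven:mutationCoverage

# An HTML mutation report is written under target/pit-reports/
```

This requires a Maven project with a test suite, so it is shown as an invocation sketch rather than a runnable script.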

How to query data using GQL if your Kind name has a dot in it?

There is a big chance that your kind (table) name is of the format ServiceName.KindName, because there are various tables (kinds) in your module.

e.g. I had StoreService.Store and StoreService.Employee as two of the kinds in my service named StoreService.

In the Google Cloud Console, if you write a GQL query to fetch/filter the entities, you will get an error:

SELECT * from StoreService.Store
GQL query error: Encountered "." at line 1, column 27. Was expecting one of: "group", "limit", "offset", "order", "where"

Enclosing the Kind name in single quotes doesn't help either:

SELECT * from 'StoreService.Store'
GQL query error: Encountered "'StoreService.Store'" at line 1, column 15. Was expecting one of: ,

To do it right, you need to enclose the Kind name in backticks (the ` character, on the same key as ~ on the keyboard)!
SELECT * from `StoreService.Store`

This information is hidden deep in the Datastore documentation. Try finding it!

How to transfer data from Cloud Datastore to BigQuery in Google Cloud Platform

If you are here, I am assuming that you are looking to migrate data from Cloud Datastore to BigQuery because you want to do some analysis and are frustrated by the limitations imposed by GQL (Google Query Language).

First of all, you need to create a backup of the data in Datastore. Use the Datastore Admin tool provided by Google to take a backup and store it automatically in a Cloud Storage bucket.

Select all the entities and press 'Backup Entities'. Give the backup a name, select Google Cloud Storage as the backup storage destination, and specify a bucket name.

Once the backup job is completed, you will see the backup listed. You can select a backup and press 'Info' to see the details (entities are masked in the screenshot below).

Go to the bucket mentioned in 'Handle' and you will see the file mentioned above. You will also see many more files with similar names, ending with .backup_info (e.g. ahRzfmpkYS1wZC1zbG8tc2FuZGJveHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGl…
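From here, the backup can be loaded into BigQuery straight from the bucket with the bq command-line tool, using its Datastore-backup source format. A sketch; the dataset, table, and file names below are placeholders, not values from this post:

```shell
# Load one kind's .backup_info file from Cloud Storage into a BigQuery
# table; DATASTORE_BACKUP tells bq how to parse the backup files
bq load --source_format=DATASTORE_BACKUP \
  my_dataset.store_entities \
  gs://my_backup_bucket/path/to/StoreService.Store.backup_info
```

Once loaded, you can query the table with full standard SQL instead of GQL.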

How to upload to Google Cloud Storage buckets using CURL

Signed URLs are a pretty nifty feature of Google Cloud Platform that lets anyone access your Cloud Storage (a bucket or any file in it) without needing to sign in.

The official documentation gives step-by-step details on how to read/write to the bucket using gsutil or through a program. This article will show you how to upload a file to the bucket using curl, so that any client which doesn't have the Cloud SDK installed can do this with a simple script. This command creates a signed PUT URL for your bucket:
gsutil signurl -c 'text/plain' -m PUT serviceAccount.json gs://test_bucket_location
Here is my URL:…
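With the signed URL in hand, the upload itself is a single curl call. A sketch: the file name is a placeholder, the URL must be the one printed by gsutil signurl (not shown here), and the HTTP method and Content-Type must match the -m and -c values used when signing:

```shell
# Paste the URL printed by `gsutil signurl` here
SIGNED_URL='...signed URL from gsutil signurl...'

# PUT the file to the signed URL; the method (-X PUT) and the
# Content-Type header must match what the URL was signed for
curl -X PUT -H 'Content-Type: text/plain' \
  --upload-file hello.txt \
  "$SIGNED_URL"
```

Any machine with curl can now write to the bucket until the signed URL expires, with no Cloud SDK or sign-in required.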