
Hadoop for beginners, by a beginner (Part 1) - Using Cloudera Quickstart VM

So you have read white papers, blogs and even some books on what Big Data is and how it is transforming the world by giving us insights into data usage through advanced analytics. You might also have read about Hadoop and MapReduce.

But now what? How do you begin? Theory is not going to cut it, right? You want to get your hands dirty, write some code and set up some clusters, right? Right. So let's start.

Admittedly, Hadoop is intimidating. Apart from requiring a plethora of software (first of all, you need a Linux box!), you also need a 'cluster' of machines, because Hadoop running on a single machine is not what a real-life Hadoop installation looks like. As a beginner you would rather quickly write the 'Hello World' of Hadoop than spend your time setting up an environment.

The easiest way to start instantly is to use Cloudera's Quickstart VM for Hadoop (Cloudera is one of the three biggest Hadoop distributors). First of all, we need to install virtualization software like Oracle VirtualBox or VMware. Cloudera's Quickstart VM gives you a CentOS (a Linux distribution) installation with the entire suite of Hadoop software, along with 'Cloudera Manager', a browser-based tool to manage these services.

Download and install Oracle VM VirtualBox from here. Once you are done, download Cloudera's Quickstart VM for Oracle VM VirtualBox from here.

Start VirtualBox, click File->Import Appliance and select the Quickstart VM you just downloaded. It will ask you to assign RAM, hard disk space, processors and a location for snapshots (the C:\Users directory by default). Ideally, 6-8 GB of RAM, 20-40 GB of hard disk and a minimum of 2 cores are required.
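If you prefer the command line, the same import can be scripted with VBoxManage, VirtualBox's CLI. This is only a sketch: the OVA file name below is a placeholder for whatever you actually downloaded, and the memory/CPU values simply mirror the recommendations above.

    # import the appliance with 8 GB RAM and 2 CPUs (adjust to taste)
    VBoxManage import cloudera-quickstart-vm.ova --vsys 0 --vmname "Cloudera Quickstart" --memory 8192 --cpus 2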



It will take a few minutes. Once it is done, you will get this screen:


Click on the 'Start' button and once the VM is ready you will see this.


Click on 'Launch Cloudera Express'. Follow the instructions. Open the URL for Cloudera Manager in the browser. Congratulations, you have set up your Hadoop 'Learning' Environment successfully!
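(In case the URL is not obvious: on the Quickstart VM, Cloudera Manager typically listens on port 7180, so the address looks something like http://quickstart.cloudera:7180, and the default login is usually cloudera/cloudera. The exact URL is printed on the console when 'Launch Cloudera Express' finishes, so trust that if yours differs.)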

You will see various services in stopped status. Go ahead and start HDFS, YARN and HUE (in this order; the others can wait until later).
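Once HDFS reports as started, you can sanity-check it from a terminal inside the VM. These are standard Hadoop commands rather than anything Quickstart-specific, and the sudo usage assumes the default cloudera user with passwordless sudo:

    # list the Java daemons that are running (NameNode, DataNode, ResourceManager, ...)
    sudo jps

    # ask HDFS for a cluster health report (needs the hdfs superuser)
    sudo -u hdfs hdfs dfsadmin -report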

Click on Hue, go to the Hue home page and click on 'Hue Web UI'. You can now interact with various Hadoop services using a web interface (HUE stands for 'Hadoop User Experience'; it gives you a UI alternative to the command line).


Click on 'File Browser' on the top right side of the page. Explore the file browser and try the various actions available. Play around. Give yourself a pat on the back!
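For comparison, the same kind of file operations can be done from a terminal with the HDFS command line. A minimal sketch, assuming the default cloudera user (whose HDFS home directory is /user/cloudera); the paths and file names are just examples:

    # create a directory in HDFS
    hdfs dfs -mkdir /user/cloudera/demo

    # copy a local file into HDFS
    hdfs dfs -put /etc/hosts /user/cloudera/demo/

    # list the directory and print the file's contents
    hdfs dfs -ls /user/cloudera/demo
    hdfs dfs -cat /user/cloudera/demo/hosts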

I will be back with the next steps in Part 2.
