Hadoop for beginners, by a beginner (Part 1) - Using Cloudera Quickstart VM

So you have read white papers, blogs and even some books on what Big Data is and how it is transforming the world by giving us insights into data through advanced analytical strategies. You might also have read about Hadoop and Map/Reduce.

But now what? How do you begin? Theory is not going to cut it, right? You want to get your hands dirty, write some code and set up some clusters, right? Right. So let's start.

Admittedly, Hadoop is intimidating. Apart from needing a plethora of software (first of all, you need a Linux box!), you also need a 'cluster' of machines, because Hadoop running on one machine is not really what a real-life Hadoop installation looks like. As a beginner, you would rather write a quick 'Hello World' of Hadoop than spend your time setting up an environment.

The easiest way to start right away is to use Cloudera's Quickstart VM for Hadoop. (Cloudera is one of the three biggest Hadoop distributors.) First of all, we need to install virtualization software such as Oracle VM VirtualBox or VMware. Cloudera's Quickstart VM gives you a CentOS (a Linux distribution) installation with the entire suite of Hadoop software, along with 'Cloudera Manager', a browser-based tool for managing that software.

Download and install Oracle VM VirtualBox from here. Once you are done, download Cloudera's Quickstart VM for Oracle VM VirtualBox from here.

Start VirtualBox, click on File -> Import Appliance and select the Quickstart VM you just downloaded. It will ask you to assign some RAM, hard disk space and processors, and to choose a location for the snapshot (the C:\Users directory by default). Ideally, assign 6-8 GB of RAM, 20-40 GB of hard disk and at least 2 cores.
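If you prefer the command line to the import wizard, VirtualBox also ships a `VBoxManage` tool. The snippet below is only a rough sketch of how the same import could be scripted from Java. The file name `cloudera-quickstart-vm.ovf` is a made-up placeholder (use whatever you actually downloaded), and the 8192 MB / 2 cores values are just the numbers suggested above. I am assuming `VBoxManage` is on your PATH.

```java
import java.util.List;

class ImportQuickstart {
    // Builds a VBoxManage command roughly equivalent to File -> Import Appliance.
    // "--vsys 0" addresses the first (and here, only) virtual system in the appliance.
    static List<String> importCommand(String appliancePath, int memoryMb, int cpus) {
        return List.of(
            "VBoxManage", "import", appliancePath,
            "--vsys", "0",
            "--memory", String.valueOf(memoryMb),
            "--cpus", String.valueOf(cpus));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file name: replace with the path of your downloaded appliance.
        List<String> cmd = importCommand("cloudera-quickstart-vm.ovf", 8192, 2);
        System.out.println(String.join(" ", cmd));

        // Only run the import when explicitly asked to, e.g. "java ImportQuickstart.java --run".
        if (args.length > 0 && args[0].equals("--run")) {
            int exitCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            System.out.println("VBoxManage exited with code " + exitCode);
        }
    }
}
```

Run without arguments, it only prints the command so you can check it before importing anything.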

It will take a few minutes. Once it is done, you will get this screen:


Click on the 'Start' button, and once the VM is ready you will see this.

Click on 'Launch Cloudera Express' and follow the instructions. Open the URL for Cloudera Manager in the browser. Congratulations, you have set up your Hadoop 'learning' environment successfully!

You will see various services in stopped status. Go ahead and start HDFS, YARN and Hue, in this order; the others can be started later.

Click on Hue to go to the Hue home page, then click on Hue Web UI. You can now interact with various Hadoop services through a web interface. (Hue stands for 'Hadoop User Experience'; it gives you a UI alternative to the command line.)

Click on 'File Browser' at the top right of the page. Explore the file browser and try the various actions available. Play around. Give yourself a pat on the back!
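If you are curious what a directory listing looks like from code rather than from the browser, here is a minimal sketch that asks HDFS for a listing over its WebHDFS REST interface (the `LISTSTATUS` operation). Treat the details as assumptions on my part: the host name `quickstart.cloudera`, the NameNode HTTP port 50070 and the path `/user/cloudera` may differ on your VM, WebHDFS has to be enabled, and the code needs Java 11 or later for `java.net.http`.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class HdfsList {
    // Builds the WebHDFS URL that lists the contents of an HDFS directory.
    static URI listStatusUri(String host, int port, String hdfsPath) {
        return URI.create("http://" + host + ":" + port
                + "/webhdfs/v1" + hdfsPath + "?op=LISTSTATUS");
    }

    public static void main(String[] args) throws Exception {
        // Assumed values: check the real host and port in Cloudera Manager.
        URI uri = listStatusUri("quickstart.cloudera", 50070, "/user/cloudera");

        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The response body is JSON describing each file and directory.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

Run it from a machine that can reach the VM (or from inside the VM itself), with HDFS already started.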

I will be back with the next steps in Part 2.

