Skip to main content

Hadoop for beginners, by a beginner (Part 2) - Hadoop Architecture explained easy

If you have gone through part 1 of this series, you have used Cloudera's Hadoop Quickstart VM to setup a working instance of Hadoop and have running instance of various Hadoop services. Now it is a good time to go back to a little theory and see how different pieces fit with each other.

HortonWorks, another Hadoop distributor, has got an excellent tutorial for Hadoop and each of its accompanying services, which you see in below image (taken from the above URL). You can go to this page and read about Hadoop in detail.



However I am going to summarize and simplify some of the content and definitions which make it easy for a beginner to quickly understand and proceed.

OK, so this is the definition of Hadoop on HortonWorks site: Apache Hadoop® is an open source framework for distributed storage and processing of large sets of data on commodity hardware.

You will agree that biggest challenges in any computing are very basic: storage and processing. Processing could be any operation that you do on the data. It could be merely counting the words in text files or finding more complex patterns in them. And since you can't do all the processing with your data kept in RAM all the time, you need to store it on HDD.

Hadoop uses 'commodity hardware' for both these purposes. The usual machines used as 'commodity hardware' have generally better configuration than your average laptop which has 8 GB RAM, 5-7 core processors and 500 GB HDD. But the non-commodity hardware which Hadoop clusters tend to compete with, are something like this: Oracle Exadata and Exalogic machines having TBs of RAM and hundreds of processors. These are extremely well engineered machines with software and hardware carefully designed to complement each other, and they cost a bomb!

Such machines are an example of what is called 'vertical scaling': increasing RAM, processing power and storage. Hadoop clusters which simply add more and more machines are examples of 'horizontal scaling' and since there is virtually no limit of how many machines you can add to a cluster, there is no limit to the processing power and storage capacity you can achieve!

(Theoretically even Exadata machines can be clustered to create a super powerful cluster but it will be so costly that it won't be economically viable. Just imagine having a cluster where each machine costs a million dollars!)

Let's go back to the architecture diagram shown above.

As is very clear Data Management and Data Access modules form the core of Hadoop and are responsible for storage and processing of the data. HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator) form the crux of it.  HortonWorks site actually explains these technologies quite comprehensively so read it.

OK if you are reading this now you probably know how does HDFS work. Let's just see how it looks like. Go to Hue web UI. Click on 'File Browser'. What you see is the directory structure of the HDFS, which looks and behaves much like your local file system.  You can create and delete directories and files, change directories and alter permissions for them, just like files on your local files.


You can do all of this on terminal using shell commands as well.

hadoop fs -mkdir newDirectory //create a new directory on HDFS

hadoop fs -put /home/someFile /user/cloudera/newDirectory //copy a file from your local file system to HDFS directory

hadoop fs -get /user/cloudera/newDirectory /home/someFile //download a file from HDFS directory to local file system.

There are lot many commands which you can check here.

Comments

Popular posts from this blog

How to upload to Google Cloud Storage buckets using CURL

Signed URLs are pretty nifty feature given by Google Cloud Platform to let anyone access your cloud storage (bucket or any file in the bucket) without need to sign in. Official documentation gives step by step details as to how to read/write to the bucket using gsutil or through a program. This article will tell you how to upload a file to the bucket using curl so that any client which doesn't have cloud SDK installed can do this using a simple script. This command creates a signed PUT URL for your bucket. gsutil signurl -c 'text/plain' -m PUT serviceAccount.json gs://test_bucket_location Here is my URL: https://storage.googleapis.com/test_sl?GoogleAccessId=my-project-id@appspot.gserviceaccount.com&Expires=1490266627&Signature=UfKBNHWtjLKSBEcUQUKDeQtSQV6YCleE9hGG%2BCxVEjDOmkDxwkC%2BPtEg63pjDBHyKhVOnhspP1%2FAVSr%2B%2Fty8Ps7MSQ0lM2YHkbPeqjTiUcAfsbdcuXUMbe3p8FysRUFMe2dSikehBJWtbYtjb%2BNCw3L09c7fLFyAoJafIcnoIz7iJGP%2Br6gAUkSnZXgbVjr6wjN%2FIaudXIqA...

Running Apache Beam pipeline using Spark Runner on a local standalone Spark Cluster

The best thing about Apache Beam ( B atch + Str eam ) is that multiple runners can be plugged in and same pipeline can be run using Spark, Flink or Google Cloud Dataflow. If you are a beginner like me and want to run a simple pipeline using Spark Runner then whole setup may be tad daunting. Start with Beam's WordCount examples  which help you quickstart with running pipelines using different types of runners. There are code snippets for running the same pipeline using different types of runners but here the code is running on your local system using Spark libraries which is good for testing and debugging pipeline. If you want to run the pipeline on a Spark cluster you need to do a little more work! Let's start by setting up a simple standalone single-node cluster on our local machine. Extending the cluster is as easy as running a command on another machine, which you want to add to cluster. Start with the obvious: install spark on your machine! (Remember to have Java a...

Changing Eclipse Workspace Directory

Recently I moved my entire Eclipse installation directory but the workspace was still getting created in the older location only. And worst there was no option to select the Workspace directory in the Window->Options->Workspace menu. To change the workspace location in Eclipse do this. Goto ECLIPSE_HOME\configuration\.settings directory, edit the org.eclipse.ui.ide.prefs file and change the RECENT_WORKSPACES value to the desired location. If you want that Eclipse prompts you to select workspace when you start it, change the SHOW_WORKSPACE_SELECTION_DIALOG value to true. And you are done!