Skip to main content

Azure Data Lake Analytics: In short

Azure is a great cloud platform but its services have mostly terrible names and confusing, often heavily overlapping features between them.
And so another one in this series of confusing services is Azure Data Lake. I have tried to make sense of this service and what it does!

Azure Data Lake 

Azure Data Lake has two components, which are very tightly integrated:

a.     Azure Data Lake Analytics (ADLA)

b.     Azure Data Lake Storage (ADLS)

 

ADLA gives us the ability to run analytical jobs (query, extract, aggregate, transform output etc.) on data, which is stored in ADLS, in form of files.

 

Below ‘temp’ is the name of ADLA account. As we can see ‘tempdl’ is the name of associated ADLS account, which acts as the data source for this ADLA account.

 


As we can see below, the ‘tempdl’ has got following data stored in form of files.

ADLA lets us query, transform and do other operations on this data using USQL (properietary language, a combination of SQL and C#) and save the result data again in form of files in the ADLS account.

How does ADLA do that?

 

In form of jobs. Below we can see a job named ‘Query a TSV file’, which is a U-SQL type job.

 


Below we see details of the above-mentioned job. 


Below is the U-SQL script used to do this task of querying the file named ‘SearchLog.tsv’.


As mentioned in the script above, the output of the query is written to SearchLog_output.tsv file.


Here is what the output file looks like:


For any job there are four stages:

a.     Preparing: the script is compiled, and execution plan is created

b.     Queued: Resources are collected

c.     Running: Actual execution happens.

d.     Done: Any cleanup tasks are executed

 

More U-SQL concepts

Following script does data filtering on a file:

 

@searchlog =  

    EXTRACT UserId          int  

          , Start           DateTime  

          , Region          string  

          , Query           string  

          , Duration        int  

          , Urls            string  

          , ClickedUrls     string  

    FROM "/Samples/Data/SearchLog.tsv"  

    USING Extractors.Tsv();

 

@filter = SELECT UserId, Region, Query, Duration, Urls

FROM @searchlog WHERE Duration > 500;

 

OUTPUT @filter

TO "/Samples/Output/LongDurationSearches.tsv"

USING Outputters.Tsv();

 

 

Note that the language is case sensitive, unlike SQL.

 

Catalog, Databases and Tables

 

Whenever we create a new ADLS account, we get a U-SQL Catalog, which has got one or more databases. It has one Master database by default and many user-created databases. 


Data is stored in Tables. There is also concept of Views, Stored Procedures etc.

 

Below script creates a new database and table:

 

//Create Database SampleDBTutorials

CREATE DATABASE IF NOT EXISTS SampleDBTutorials;

 

//Create Table OlympicAthletes

CREATE TABLE IF NOT EXISTS SampleDBTutorials.dbo.SearchLog

(

        //Define schema of table

        UserId          int

        Start           DateTime

        Region          string

        Query           string

        Duration        int

        Urls            string

        ClickedUrls     string,

    INDEX idx1 //Name of index

    CLUSTERED (Region ASC//Column to cluster by

    DISTRIBUTED BY HASH (Region) //Column to partition by

);

 

 

As we can see below, running a job with above script produces no data files. 


Following script populates this newly created table using the data from SearchLog.tsv file:

 

//Read some data

@searchlog = 

    EXTRACT UserId          int

            Start           DateTime

            Region          string

            Query           string

            Duration        int

            Urls            string

            ClickedUrls     string

    FROM @"/Samples/Data/SearchLog.tsv"

    USING Extractors.Tsv();

 

//Insert it into a previously created table

INSERT INTO SampleDBTutorials.dbo.SearchLog

SELECT *

FROM @searchlog

WHERE Duration > 500;

 

 

Below is the Job Graph, along with Heat Map on the basis of ‘Written’ data. Heat Map highlights the step in the graph which has maximum value for given parameter. 


Note that output in this scenario is not a file, but a table.

But this table can’t be queried by just clicking on it. We need another job with a U-SQL query to view the data.


Following query queries the table and writes data to a file:

 

//Read from SearchLog table

@athletes =

    SELECT *

    FROM SampleDBTutorials.dbo.SearchLog;

 

//Write it to a file so we can look at it

OUTPUT @athletes

TO @"/Samples/Output/SearchLog_output.tsv"

USING Outputters.Tsv();


Here is the output file:

As we can see the analytics works very closely with the storage account. Great thing is that the data lake storage can store data files of any size and any number of files can be stored. Also, as we saw, structured (tables) and unstructured or semi-structured data can be stored in the ADLS and ADLA has capability to work on it.

 

Azure Data Lake Storage Gen 2


ADLS Gen 2 is not a separate service but is built on top of Blob Storage only.

 

Blob Storage stores data in blobs and doesn’t really have folders. ADLS stores data hierarchically in folders (which are first class citizens). ADLS Gen2 assimilates hierachical storage as well as other features of ADLS (made for analytical/Big Data services, large amount of any type of data etc.) with Blob Storage (massive availability, partner eco-system etc.). For any data/file stored in Gen 2, we can get two endpoints, one as an object store and other as a file store.


ADLA doesn’t work with Gen 2 yet. A list of Gen 1 vs Gen 2 capabilities.


Comments

Popular posts from this blog

How to upload to Google Cloud Storage buckets using CURL

Signed URLs are pretty nifty feature given by Google Cloud Platform to let anyone access your cloud storage (bucket or any file in the bucket) without need to sign in. Official documentation gives step by step details as to how to read/write to the bucket using gsutil or through a program. This article will tell you how to upload a file to the bucket using curl so that any client which doesn't have cloud SDK installed can do this using a simple script. This command creates a signed PUT URL for your bucket. gsutil signurl -c 'text/plain' -m PUT serviceAccount.json gs://test_bucket_location Here is my URL: https://storage.googleapis.com/test_sl?GoogleAccessId=my-project-id@appspot.gserviceaccount.com&Expires=1490266627&Signature=UfKBNHWtjLKSBEcUQUKDeQtSQV6YCleE9hGG%2BCxVEjDOmkDxwkC%2BPtEg63pjDBHyKhVOnhspP1%2FAVSr%2B%2Fty8Ps7MSQ0lM2YHkbPeqjTiUcAfsbdcuXUMbe3p8FysRUFMe2dSikehBJWtbYtjb%2BNCw3L09c7fLFyAoJafIcnoIz7iJGP%2Br6gAUkSnZXgbVjr6wjN%2FIaudXIqA...

Running Apache Beam pipeline using Spark Runner on a local standalone Spark Cluster

The best thing about Apache Beam ( B atch + Str eam ) is that multiple runners can be plugged in and same pipeline can be run using Spark, Flink or Google Cloud Dataflow. If you are a beginner like me and want to run a simple pipeline using Spark Runner then whole setup may be tad daunting. Start with Beam's WordCount examples  which help you quickstart with running pipelines using different types of runners. There are code snippets for running the same pipeline using different types of runners but here the code is running on your local system using Spark libraries which is good for testing and debugging pipeline. If you want to run the pipeline on a Spark cluster you need to do a little more work! Let's start by setting up a simple standalone single-node cluster on our local machine. Extending the cluster is as easy as running a command on another machine, which you want to add to cluster. Start with the obvious: install spark on your machine! (Remember to have Java a...

Changing Eclipse Workspace Directory

Recently I moved my entire Eclipse installation directory but the workspace was still getting created in the older location only. And worst there was no option to select the Workspace directory in the Window->Options->Workspace menu. To change the workspace location in Eclipse do this. Goto ECLIPSE_HOME\configuration\.settings directory, edit the org.eclipse.ui.ide.prefs file and change the RECENT_WORKSPACES value to the desired location. If you want that Eclipse prompts you to select workspace when you start it, change the SHOW_WORKSPACE_SELECTION_DIALOG value to true. And you are done!