5 Musts for a Cloud Computing Job

I was recently asked to list five musts for a cloud computing job. These can differ greatly depending on the role and the kind of industry you work in. Innovation and rapid delivery are a must for a software developer or architect, while managing virtual environments and data management skills are necessary for a data center manager. However, I think there are a few common factors that apply to all.

Understanding of IT as a Service

One of the advantages of the cloud offering is the service model, a shopping-cart kind of experience where IT managers shop for the service using their credit card instead of building it themselves. IT managers have to transition from owners of the service to providers of the service. They have to give up some of their control and focus more on negotiating service quality with appropriate vendors. Prospective candidates transitioning to a cloud domain must understand cloud categories and deployment models. Understanding the business beyond the technologies, and identifying ways and means to fulfill business requirements to maximize the productivity of their end users, is a common trait they should demonstrate. Analyst Peter Christy at the Internet Research Group refers to this as an inversion of enterprise IT from an application-centric to a people-centric structure.

Managing Virtual Environments

Be it a developer or an IT manager, at some point in your transition you will have to deal with virtualized environments. You should be proficient in designing and managing IT infrastructure using hosted services, and in managing policies and configurations within private or public clouds as needed. These skills will help an IT manager negotiate SLAs correctly and realistically.

Innovation and Rapid Delivery

Probably one of the most appealing aspects of cloud computing is the agility with which one can realize solutions. This is true for developers as well as IT managers: they can fulfill business requirements through innovation and integration of cloud services with agility. Gone are the days when one raised a purchase request and waited for approvals and the subsequent delivery of hardware and software.

Security

One of the drawbacks of not having complete control over the services that you offer is security. This can be a daunting task if the services are offered from a public cloud. One should know how to manage security and compliance and should be familiar with the various compliance requirements across verticals and locations. A good understanding of the applications delivered from the cloud, the way users access them, and the security implications of the whole model is a must-have trait for a cloud engineer. Analyst Mark Diodati at Gartner says that the shift to the cloud and the consumerization of IT have complicated the task of identity and access management in the enterprise security environment. Federation protocols, OAuth, REST, SCIM, BYO{D,I,A} etc. are a few keywords that candidates can research.

Data Management Skills

Most organizations will take a hybrid cloud approach, with data managed both on-premises and off-premises. Most medium and large enterprises prefer to keep critical data such as identity and IP-related information on-premises. Candidates opting for a cloud computing job must possess the skills to design systems that manage and integrate data between off-premises and on-premises environments. This also includes the ability to gain a deeper understanding of data through analytics and to manage it with big data technologies.

This is in no way a comprehensive list of skills. I believe the above items are the intersection of skills and technologies that every individual must possess before adopting cloud technologies or services.

Elastic Map Reduce – A Practical Approach

Amazon just reminded me that my AWS free tier expires tomorrow. I've been wanting to write about my EMR experiments for some time. I worked on this a couple of months back when I got a chance to experiment with Hadoop; we used Twitter feeds at that time. My objective was to run the same with a large log file from one of our products. I'm going to explain how EMR can be used in a very basic way, using data stored in S3 and scheduling an EMR job with a bunch of scripts.

As usual, I'm going to build the whole experiment over a number of steps. I believe it is easier to validate your approach in smaller steps, just like programming: it is always easier to test your program as you build it instead of trying to see how it works after a few hundred thousand lines are written.

Step 1: Upload your data and scripts

I'm going to use Amazon S3 as the storage for this example. There could be other methods, but I think S3 is a good option for up to a few gigabytes of data. As you can see, the pigdata bucket has all the input and output data folders for this example.

The objective of this exercise is to do a sentiment analysis on a number of tweets from various states in the USA. The result will be placed in the output folder once the EMR job is completed.
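If you prefer to script the upload rather than click through the S3 console, a minimal sketch with the boto3 SDK could look like this; the bucket name, local file names and key prefixes are placeholders, not my exact layout.

# Minimal sketch: upload the input data and the Pig script to S3 with boto3.
# Bucket and key names are placeholders; substitute your own.
import boto3

s3 = boto3.client("s3")
bucket = "pigdata"  # hypothetical bucket name

s3.upload_file("tweets.txt", bucket, "input/tweets.txt")         # raw tweet data
s3.upload_file("sentiment.pig", bucket, "scripts/sentiment.pig")  # Pig script

# Quick sanity check: list what landed in the bucket
for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
    print(obj["Key"], obj["Size"])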

Step 2: Create an EMR cluster

In this step, we create an EMR cluster. To start with, I leave logging on and use the S3 folder Logs as the location for log files; I always find logging helpful for troubleshooting teething problems. I disabled Termination Protection, as I couldn't sufficiently debug script issues when this feature was enabled and the cluster terminated automatically.

Amazon provides Hadoop 1.0.3 or 2.2.0 and Pig 0.11.1.1 (as of this writing). The EMR cluster will be launched on EC2 instances, optionally inside a VPC. Select the appropriate instance type based on your subscription level.

As this example needs only a basic Hadoop configuration, that is what was selected for Bootstrap Actions. The core of the setup is in the next part, Steps, where you select the Pig script that you uploaded as the starting step.

[Screenshot: EMR Steps configuration showing the Pig script and its S3 input/output locations]

Notice the S3 locations in the above image.  Select the files from the appropriate S3 folders.
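For completeness, the whole cluster, including the Pig step, can also be launched from code instead of the console. The sketch below uses boto3's run_job_flow; the release label, instance types, IAM roles and especially the Pig step arguments are assumptions to be checked against the current EMR documentation, not a record of my exact settings.

# Rough sketch: launch an EMR cluster with logging, no termination protection
# and a single Pig step, via boto3. Names, versions and S3 paths are placeholders.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="pig-sentiment-demo",
    LogUri="s3://pigdata/Logs/",   # same idea as the Logs folder above
    ReleaseLabel="emr-5.36.0",     # assumption; pick what AWS currently offers
    Applications=[{"Name": "Hadoop"}, {"Name": "Pig"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # let the cluster end after the step
        "TerminationProtected": False,         # as in the console setup above
    },
    Steps=[{
        "Name": "Sentiment Pig script",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # Assumed invocation for a Pig step; verify against the EMR docs
            "Jar": "command-runner.jar",
            "Args": ["pig-script", "--run-pig-script", "--args",
                     "-f", "s3://pigdata/scripts/sentiment.pig"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])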

You will be able to monitor the running cluster from your Cluster List once the cluster is created. Select one of the clusters to view its status and other configuration details.
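If you would rather poll the status from code than refresh the console, a small sketch with boto3:

# Small sketch: check EMR cluster status from code instead of the console.
import boto3

emr = boto3.client("emr")

# List clusters that are still starting, running or waiting
clusters = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
for c in clusters["Clusters"]:
    print(c["Id"], c["Name"], c["Status"]["State"])

# Drill into one cluster for configuration details
if clusters["Clusters"]:
    detail = emr.describe_cluster(ClusterId=clusters["Clusters"][0]["Id"])
    print(detail["Cluster"]["Status"]["State"], detail["Cluster"]["LogUri"])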

This example just uses a basic Pig script, which I modified for the Pig 0.11.1.1 that Amazon provides. You may need to call an external program from your Pig script to work on the data; Amazon provides a way to upload an additional JAR for this purpose.

Preparation

I would recommend testing your Pig script locally on test data before uploading it to EMR. EMR takes a while to get started and produce output, and the cycle repeats if there are any errors. I used the Hortonworks Hadoop VM for testing my data and scripts. Hortonworks provides the entire Hadoop stack as a preconfigured sandbox which is very easy to use; it also includes Apache Ambari for complete system monitoring. They have a number of easy-to-follow tutorials for anyone to get started quickly on Hadoop, Pig and Hive.

The initial data and scripts for this example came from Manaranjan.

 

Working with Big Data

I must say that the title is a little misleading if you are looking for a technical post on Hadoop and Pig. In the Making Sense of Data course, Google says that the data process consists of three steps:

Prepare >> Analyze >> Apply

I'm going to talk about the first step, i.e. Prepare.

A couple of weeks back, on a Friday afternoon, I got a call from one of my friends at IIMB. She sounded a little worried. The problem posed to me was to extract meaningful information from 100 GB of data they had just received from one of the retailers in Bangalore, for building a CLV or survival model. She had no clue what was in the zip file, as even opening or copying the file was a time-consuming operation.

Step 1: Identify the data

The immediate task was to identify what the 100 GB file contains. I like some of the handy Unix utilities in such situations, for example head and tail to get a quick peek at the data. Windows PowerShell has a similar option, Get-Content -TotalCount n, where n is the number of lines you would like to see from the file. I figured that the data was nothing but an SQL dump from an Oracle database, which kind of explains the size of the file.

The next task was to look at the records. Since it's an SQL dump, each record carries the column names, and it was easy to identify that each record has 70 columns with a mix of data types. Using wc, I also figured that the file has 67 million records. The objective was to extract data from these 67 million records.
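If head, tail and wc are not at hand, the same peek-and-count can be done with a few lines of Python; a small sketch, assuming one record per line and a placeholder file name:

# Small sketch: peek at the first few lines of a huge file and count records
# without loading it into memory. Assumes one record per line; the file name
# is a placeholder.
def peek(path, n=10):
    with open(path, "r", errors="replace") as f:
        for _ in range(n):
            line = f.readline()
            if not line:
                break
            print(line.rstrip())

def count_lines(path):
    with open(path, "r", errors="replace") as f:
        return sum(1 for _ in f)

peek("retail_dump.sql", n=5)             # rough equivalent of head -5
print(count_lines("retail_dump.sql"))    # rough equivalent of wc -l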

Step 2: Preparation for Data Extraction

I explored the following options to extract the data:

  • Import the data into a DB so that I can run simple SQL queries and retrieve data as and when required.
  • Parse the data and convert it to CSV format.

I chose the SQL option, as it gives more flexibility and manageability than the CSV format. The challenge was to recreate the database schema from the records, as the customer didn't give us the schema. We could identify the data types from the columns, but it was difficult to judge the constraints, since the number of records is huge and there was no way I could validate the schema against the data file. Anyway, I configured a MySQL DB on my late-2011 MBP and created a schema. MySQL Workbench is a handy tool to explore the database and make minor corrections.
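For illustration, the schema recreation boiled down to something like the following; every column name and type below is a placeholder guessed from the dump, not the retailer's real schema, and it assumes the mysql-connector-python package.

# Illustration only: build and run a CREATE TABLE for the 70-column table.
# Column names and types are placeholders, not the real schema.
import mysql.connector

columns = ["customer_id BIGINT", "item_code VARCHAR(32)",
           "purchase_date DATETIME", "amount DECIMAL(12,2)"]
# Pad out to the 70 columns identified in the records
columns += ["col%02d VARCHAR(255)" % i for i in range(len(columns) + 1, 71)]

ddl = "CREATE TABLE transactions (\n  " + ",\n  ".join(columns) + "\n)"

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="retail")
cursor = conn.cursor()
cursor.execute(ddl)
conn.close()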

I extracted test data with 1500 records (thanks to head) for validating the schema and quickly realized that I would have to clean the data, as there were extra characters in some of the records. So using the command-line mysql tool was ruled out, since I had to do the data cleansing inline with the import.

The easiest option was to use a Python script. That threw up another challenge, as there is no default MySQL connector for Python. It turned out that installing the MySQL connector for Python on Windows is easier than on OS X, though I finally got it running on OS X. Writing a 40-line Python script to read the data, validate the columns and write to the DB was easier than I thought. It took a few iterations to identify schema constraint violations and data issues before I could completely upload the data. I must say that it took around 5 hours to upload the 67 million records on my MBP, which produced a 30 GB database file.
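Roughly, the script did something along these lines. This is a reconstructed sketch rather than the original: the file and table names, the cleansing rule and the value parsing are placeholders, and it assumes the mysql-connector-python package and the hypothetical transactions table from the previous sketch.

# Reconstructed sketch of the load script: read the dump line by line, strip
# stray characters, validate the column count and batch-insert into MySQL.
# File, table and cleansing details are placeholders.
import re
import mysql.connector

EXPECTED_COLUMNS = 70
BATCH_SIZE = 10000

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="retail")
cursor = conn.cursor()
insert_sql = ("INSERT INTO transactions VALUES ("
              + ", ".join(["%s"] * EXPECTED_COLUMNS) + ")")

batch = []
with open("retail_dump.sql", "r", errors="replace") as dump:
    for line in dump:
        # Cleansing: drop the non-printable characters that broke a plain import
        line = re.sub(r"[^\x20-\x7E]", "", line).strip()
        if not line.upper().startswith("INSERT"):
            continue
        # Pull the value list out of "INSERT INTO ... VALUES (...);"
        match = re.search(r"VALUES\s*\((.*)\);?$", line, re.IGNORECASE)
        if not match:
            continue
        # Naive split; real data with embedded commas needs a proper parser
        values = [v.strip().strip("'") for v in match.group(1).split(",")]
        if len(values) != EXPECTED_COLUMNS:
            continue  # skip malformed records (or log them for inspection)
        batch.append(values)
        if len(batch) >= BATCH_SIZE:
            cursor.executemany(insert_sql, batch)
            conn.commit()
            batch = []

if batch:
    cursor.executemany(insert_sql, batch)
    conn.commit()
conn.close()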

Step 3: Data Extraction

Once the data is in the DB, it is easy to extract it through simple SQL queries. We wanted to find buying patterns for 3 different items, so I created indexes on certain fields to make the searches much faster. We were able to run the queries from MySQL Workbench and store the results as CSVs for further analysis in Excel or SAS.
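For reference, the indexing, query and CSV export can also be scripted instead of going through Workbench; a short sketch with made-up table, column and item names:

# Short sketch: index a lookup column, run a query and dump the result to CSV.
# Table, column and item names are made up for illustration.
import csv
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="retail")
cursor = conn.cursor()

# One-time index so item lookups don't scan all 67 million rows
cursor.execute("CREATE INDEX idx_item ON transactions (item_code)")

cursor.execute(
    "SELECT customer_id, item_code, purchase_date, amount "
    "FROM transactions WHERE item_code IN (%s, %s, %s)",
    ("ITEM_A", "ITEM_B", "ITEM_C"))

with open("buying_patterns.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor.fetchall())

conn.close()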

It was a good exercise to play around with this real-life data and figure out how to handle such large data in a reusable way. It was also a lesson that a good data scientist should know end-to-end methodologies and technologies, as one might spend a good amount of time preparing the data.