5 Musts for a Cloud Computing Job

I was recently asked to list 5 musts for a cloud computing job. The list can differ greatly by role and by the kind of industry you work in: innovation and rapid delivery are a must for a software developer or architect, while managing virtual environments and data management skills are necessary for a data center manager. However, I think there are a few common factors that apply to everyone.

Understanding of IT as a Service

One of the advantages of the cloud is its service model, the shopping-cart kind of experience where IT managers shop for a service using their credit card instead of building it themselves. IT managers have to transition from owners of the service to providers of the service: they have to give up some control and focus more on negotiating service quality with the appropriate vendors. Prospective candidates transitioning to the cloud domain must understand cloud categories and deployment models. Understanding the business beyond the technologies, and identifying ways and means to fulfill business requirements so as to maximize the productivity of their end users, is a common trait they should demonstrate. Analyst Peter Christy at the Internet Research Group refers to this as the inversion of enterprise IT from an application-centric to a people-centric structure.

Managing Virtual Environments

Be it a developer or an IT manager, at some point in your transition you will have to deal with virtualized environments. You should be proficient in designing and managing IT infrastructure using hosted services, and in managing policies and configurations within private or public clouds as needed. These skills will also help an IT manager negotiate SLAs correctly and realistically.

Innovation and Rapid Delivery

Probably one of the most appealing aspects of cloud computing is the agility with which one can realize solutions. This is particularly true for developers as well as IT managers: they can fulfill business requirements through innovation and integration of cloud services with agility. Gone are the days when one raised a purchase request and waited for approvals and the subsequent delivery of hardware and software.

Security

One of the drawbacks of not having complete control over the services you offer is security. This can be a daunting task if the services are offered from a public cloud. One should know how to manage security and compliance, and should be familiar with the various compliance requirements across verticals and locations. A good understanding of the applications delivered from the cloud, the way users access them, and the security implications of the whole model is a must-have trait for a cloud engineer. Analyst Mark Diodati at Gartner says that the shift to the cloud and the consumerization of IT have complicated the task of identity and access management in the enterprise security environment. Federation protocols, OAuth, REST, SCIM, BYO{D,I,A} etc. are a few keywords to research.

Data Management Skills

Most organizations will take a hybrid cloud approach, with data managed both on-premises and off-premises. Most medium and large enterprises prefer to keep critical data, such as identity and IP-related information, on-premises. Candidates opting for a cloud computing job must possess the skills to design systems that manage and integrate data across off-premises and on-premises environments. This also includes the ability to gain a deeper understanding of the data through analytics and to manage it with big data technologies.
This is by no means a comprehensive list of skills, but I believe the above items are the intersection of skills and technologies that every individual must possess before adopting cloud technologies or services.

Elastic Map Reduce – A Practical Approach

Amazon just reminded me that my AWS free tier is ending tomorrow, and I’ve been wanting to write about my EMR experiments for some time. I worked on this a couple of months back when I got a chance to experiment with Hadoop; we used Twitter feeds at that time, and my objective was to run the same analysis on a large log file from one of our products. I’m going to explain how EMR can be used in a very basic way, using data stored in S3 and scheduling an EMR job with a bunch of scripts.

As usual, I’m going to build the whole experiment over a number of steps. I believe it is easier to validate your approach in smaller steps, just as in programming: it is always easier to test your program as you build it than to see how it behaves after a few hundred thousand lines are written.

Step 1: Upload your data and scripts

I’m going to use Amazon S3 as the storage for this example. There are other options, but I think S3 is a good choice for up to a few gigabytes of data. As you can see from the bucket screenshot, the bucket pigdatbucket has all the input and output data folders for this example.

The objective of this exercise is to do a sentiment analysis on a number of tweets from various states in USA.  The result will be placed in the folder output once the EMR job is completed.
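
If you prefer the command line to the S3 console, the Step 1 uploads can also be done with the AWS CLI. This is only a sketch: the bucket name comes from this example, but the local file names are hypothetical.

    # Upload the input data and the Pig script to the bucket used in this example.
    # tweets.txt and sentiment.pig are hypothetical local file names.
    aws s3 cp tweets.txt s3://pigdatbucket/input/tweets.txt
    aws s3 cp sentiment.pig s3://pigdatbucket/scripts/sentiment.pig

    # Verify the bucket layout
    aws s3 ls s3://pigdatbucket/ --recursive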

Step 2: Create EMR cluster

In this step, we create the EMR cluster. To start with, I leave logging on and use the S3 folder Logs as the destination for log files; I always find logging helpful for troubleshooting teething problems. I also disabled Termination Protection, as I couldn’t sufficiently debug script issues with it enabled because the cluster terminates automatically.

Amazon provides Hadoop 1.0.3 or 2.2.0 and Pig 0.11.1.1 (as of this writing). The EMR cluster is launched on EC2 instances, optionally inside a VPC. Select the appropriate instance type based on your subscription level.

As this example needs only the basic Hadoop configuration, that was selected for the Bootstrap Actions. The core of the setup is in the next section, where you select the Pig script that you uploaded as the starting step.

[screenshot: EMR step configuration showing the Pig script and its S3 locations]

Notice the S3 locations in the above image.  Select the files from the appropriate S3 folders.
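
For reference, here is a rough command-line equivalent of the console wizard. This is a sketch using the current aws emr create-cluster syntax rather than the exact console flow described above; the S3 paths come from this example, while the instance type, instance count and key name are assumptions you should adjust for your account.

    # Sketch: create a cluster with a single Pig step, roughly equivalent to the
    # console setup above. Instance type/count and key name are assumptions.
    aws emr create-cluster \
      --name "pig-sentiment" \
      --release-label emr-5.36.0 \
      --applications Name=Pig \
      --log-uri s3://pigdatbucket/Logs/ \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --ec2-attributes KeyName=my-key \
      --auto-terminate \
      --steps Type=PIG,Name="Sentiment analysis",ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://pigdatbucket/scripts/sentiment.pig,-p,INPUT=s3://pigdatbucket/input,-p,OUTPUT=s3://pigdatbucket/output]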

You will be able to monitor the running cluster from your Cluster List once the cluster is created. Select one of the clusters to view its status and other configuration details.

This example uses a basic Pig script, which I modified for the Pig 0.11.1.1 that Amazon provides. You may need to call an external program from your Pig script to work on the data; Amazon provides a way to upload an additional JAR for this purpose.

Preparation

I would recommend testing your Pig script locally on test data before uploading it to EMR. EMR takes a while to get started and produce output, and the cycle repeats if there are any errors. I used the Hortonworks Hadoop VM for testing my data and scripts. Hortonworks provides the entire Hadoop stack as a preconfigured sandbox which is very easy to use, and the sandbox also includes Apache Ambari for complete system monitoring. They have a number of easy-to-follow tutorials for anyone to get started quickly with Hadoop, Pig and Hive.
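
If you have Pig available locally (it ships with the Hortonworks sandbox), you can run the script in local mode against a small sample before spending an EMR cycle on it. A minimal sketch; the script and sample file names are hypothetical:

    # Run the script in Pig's local mode against a small sample of the data
    pig -x local -param INPUT=sample.txt -param OUTPUT=out sentiment.pig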

The initial data and scripts for this example came from Manaranjan.

 

Working with Big Data

I must say that the title is a little deceiving if you are looking for a technical post on Hadoop and Pig.  In the Making Sense of Data course, Google says that the data process consists of three steps

Prepare >> Analyze >> Apply

I’m going to talk about the first step, i.e. Prepare.

A couple of weeks back, on a Friday afternoon, I got a call from one of my friends at IIMB. She sounded a little worried. The problem posed to me was to extract meaningful information from 100 GB of data that they had just received from one of the retailers in Bangalore, for building a CLV or survival model. She had no clue what was in the zip file they had received, as even opening or copying the file was a time-consuming operation.

Step 1: Identify the data

The immediate task was to identify what the 100 GB file contains. I like some of the handy Unix utilities in such situations, for example head and tail to get a quick peek at the data. Windows PowerShell has a similar cmdlet, Get-Content -TotalCount n, where n is the number of lines you would like to see from the file. I figured that the data was nothing but an SQL dump from an Oracle database, which kind of explains the size of the file.

The next task was to look at the records. Since it is an SQL dump, each record carries the column names, and it was easy to identify that each record has 70 columns with a mix of data types. Using wc, I also figured that the file has 67 million records. The objective was to extract data from these 67 million records.
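
For reference, these are the kinds of commands I mean (the dump file name below is hypothetical):

    # Peek at the first and last few lines without opening the whole file
    head -n 20 retailer_dump.sql
    tail -n 20 retailer_dump.sql

    # Count the lines (one record per line in this dump)
    wc -l retailer_dump.sql

    # Rough Windows PowerShell equivalent of head:
    #   Get-Content retailer_dump.sql -TotalCount 20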

Step 2: Preparation for Data Extraction

I explored the following options to extract the data:

  • Import the data into a db so that I can run simple SQL queries and retrieve data as and when required.
  • Parse the data and convert it to CSV format

I chose the SQL option, as it gives more flexibility and manageability than the CSV format. The challenge was to recreate the database schema from the records, since the customer didn’t give us the schema. We could identify the data types from the columns, but it was difficult to judge the constraints, as the number of records is large and there was no way I could validate the schema against the data file. Anyway, I configured a MySQL database on my late-2011 MBP and created a schema. MySQL Workbench is a handy tool to explore the database and make minor corrections.

I extracted test data with 1500 records (thanks to head) for validating the schema and quickly realized that I would have to clean the data, as there were extra characters in some of the records. So using the command-line mysql tool was ruled out, since I had to do the data cleansing inline with the import.

The easiest option was to use a Python script. That threw up another challenge, as there is no default MySQL connector for Python. It turned out that installing the MySQL connector for Python on Windows is easier than on OSX, though I finally got it running on OSX. Writing a 40-line Python script to read the data, validate the columns and write to the db was easier than I thought. It took a few iterations to identify schema constraint violations and data issues before I could completely upload the data. I must say that it took around 5 hours to upload the 67 million records on my MBP, which produced a 30 GB database file.

Step 3: Data Extraction

Once the data is in the db, it is easy to extract it through simple SQL queries. We wanted to find buying patterns for 3 different items, so I created indexes on certain fields to make the searches much faster. We were able to run the queries from MySQL Workbench and store the results as CSVs for further analysis in Excel or SAS.
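
The same steps can be scripted with the command-line client instead of Workbench. A rough sketch; the table and column names here are hypothetical, not the retailer's actual schema:

    # Index the fields used in the buying-pattern queries
    mysql -u root -p retail -e "CREATE INDEX idx_item ON transactions (item_code);"

    # Run a query in batch mode and turn the tab-separated output into a CSV
    mysql -u root -p retail --batch -e "SELECT customer_id, item_code, purchase_date, amount FROM transactions WHERE item_code IN ('A101','B202','C303');" | tr '\t' ',' > buying_patterns.csv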

It was a good exercise to play around with this real-life data and figure out how to handle such large data in a reusable way. It was also a lesson that a good data scientist should know end-to-end methodologies and technologies, as one might spend a good amount of time just preparing the data.

Experimenting with Oracle VirtualBox

I have been using VMware Fusion on my MBP for a while, and I noticed significant performance issues after upgrading to Mavericks. That is when I decided to try out Oracle VirtualBox. More importantly, some of the devops tools I was trying, such as Vagrant and Docker, had readily available VMs for VirtualBox. I never bothered to check out VirtualBox in the past, as I owned licenses for VMware Fusion and VMware Workstation, and staying with VMware was more productive since I could move VMs between my development environment and my office work environment.

Storage

The first step was getting all my existing VMs running on VirtualBox. I must say that running my SLES and Ubuntu VMs was easier than I thought: all I needed to do was create a new instance and use the same vmdk image from VMware. By default, VirtualBox uses a SATA/SCSI interface for the disk image. That worked well for Unix/Linux virtual machines, but for Windows I had to force the IDE interface. Do the following for Windows images (I tested with 7.x and 8.1):

  • Once the VM is created, go to Settings and then Storage
  • Delete the SCSI instance associated with your vmdk file
  • Add an IDE interface and choose the same vmdk file (a VBoxManage equivalent is sketched below)
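
If you prefer the command line, the same change can be made with VBoxManage. This is a sketch that assumes a VM named "Win7" and a controller named "SATA"; check VBoxManage showvminfo for the actual names in your setup.

    # Detach the vmdk from the default SATA/SCSI controller
    VBoxManage storageattach "Win7" --storagectl "SATA" --port 0 --device 0 --medium none

    # Add an IDE controller and attach the same vmdk to it
    VBoxManage storagectl "Win7" --name "IDE" --add ide
    VBoxManage storageattach "Win7" --storagectl "IDE" --port 0 --device 0 --type hdd --medium /path/to/windows7.vmdk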

Networking

The next configuration required is networking. I normally use a NAT’d environment with a specific CIDR for all my development VMs, and I can access this private network from my host on VMware Workstation or Fusion. It appears that the only way to access services running on a VirtualBox image on a private interface is through port forwarding; even to SSH to the guest OS, you need to forward a host port to port 22 on the guest. Thankfully, the network configuration dialog in the VM settings provides an option to do that. There is also an experimental NAT Network service in VirtualBox, but I haven’t been able to get that working on my OSX host yet.
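
The forwarding rule can also be added from the command line. A sketch, assuming a VM named "devbox" with its NAT adapter in slot 1 and an arbitrary host port of 2222:

    # Forward host port 2222 to port 22 on the guest's first (NAT) adapter
    VBoxManage modifyvm "devbox" --natpf1 "guestssh,tcp,,2222,,22"

    # SSH to the guest through the forwarded port
    ssh -p 2222 user@127.0.0.1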

Shared Folder

Shared folder concepts are a little convoluted in VirtualBox. Apparently the ability to create symbolic links in a shared folder is disabled for some bizarre security reasons. You need to enable it manually for each shared folder in each VM, and more importantly, you need to restart the VirtualBox application after enabling it. Given below is the syntax for enabling the creation of symbolic links on a given volume.

VBoxManage setextradata VM_NAME VBoxInternal2/SharedFoldersEnableSymlinksCreate/SHARE_NAME 1

The SHARE_NAME at the end of the parameter should be a full path to the shared folder on your host.

Headless Mode

One of the features I like in VirtualBox is the headless mode: you can run a VM in the background without any UI elements. This saves some memory on your host, and typically you can run any Linux instance at runlevel 3 this way. Hold the Shift key while clicking the Start button, or use the VBoxManage command-line tool, to start a VM in headless mode.
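
For example (the VM name is hypothetical):

    # Start a VM without any UI
    VBoxManage startvm "devbox" --type headless

    # Save its state when done (poweroff or acpipowerbutton also work)
    VBoxManage controlvm "devbox" savestate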

Overall I find the performance of VirtualBox better than Fusion for my workload.  I’m also liking the command line tools and programmability via its rich set of APIs.  Tune in for more of my VirtualBox experiments.

Blogo – Updated Blog Editor

I mentioned in my previous post that I was waiting for a private beta invitation from Blogo, one of the blog editors on OSX. Well, as everyone complains, OSX is missing Windows Live Writer, one of my favourite editors.

I received the beta invitation today and I am posting this via Blogo. These are my initial impressions. Let me state the good parts first:

  • The editor is very polished and minimalistic. I liked the overall layout.
  • Adding my WordPress based blog was a breeze.
  • Preview is decent enough

Now come my irritations. It is still an early beta, and hence most of these pitfalls are probably temporary. Also, remember that I like Windows Live Writer and Microsoft OneNote; Microsoft has some of the best editors possible in its applications.

  • Even basic formatting is very erratic. Many times I literally had to reformat certain sections to keep the indentation right
  • The editor is very minimal. All you can do is basic formatting and lists. There is no support for indentation or regular blog features like predefined header styles, horizontal rules, tables etc.
  • No image support. I just couldn’t find a way to insert an image, though they claim full image editing support
  • Preview requires you to be online. There is no way to download your stylesheet and render the page locally, which makes it difficult to author posts offline
  • The editor couldn’t prefetch my tags and categories at times; I had to type them in manually most of the time

What would I like to see? Oh, it’s very simple: give me Windows Live Writer on OSX :-)

 

Organizing a Hackathon

We are trying to organize a hackathon at the office this month. Thinking around this has been going on for a while; we finally decided to pick a month when people would be available and release pressure would be moderate, and then work backwards to arrive at a schedule.
As usual, we need to be clear about the following while organizing a hackathon.

Agree upon the goals
This is the first and most important step. You are asking employees to leave their regular work and spend one or more days on something they are passionate about. You will also have to articulate the goals very well to get support from management. Employees planning to participate in the hackathon should know what they should achieve in these days and what they can demonstrate. I call this their acceptance criteria.

Pick a Date
Many teams have releases scheduled continuously, so it’s important to choose a convenient date to ensure maximum participation; as always, the more the merrier. We looked at various release schedules, local and remote holiday schedules, and other local events such as school holidays and community events before choosing a date.
Once the date is chosen, work backwards to organize logistics such as:

  • The organizing team needs time to review and finalize proposals and to arrange T-shirts and other goodies.
  • Pick a demo day. It could be the last day of the hackathon or the next day, depending on convenience and the number of participants.
  • Give your admin and IT teams enough time to organize logistics such as hackers’ dens, networking and power infrastructure, and lots of food and caffeine.

Market the Event
Though this is an office event, it is important to market it sufficiently to build momentum. We have been sending out flyers frequently, putting up posters and encouraging managers and other senior members of the teams to help their team members come up with good ideas. We are also planning short videos that can be streamed to the monitors in the common areas, in which authors talk about their ideas and how they add value to our customers.

Event Rules
Participants should be well aware of the selection criteria. Though the general practice is to encourage everyone to participate (it’s a community event), there have to be some rules laid down for the judging to be effective. There will be people participating for the spirit of the event and others who are serious about their contributions. Most probably the judging criteria will include:

  • Impact to our customers
  • Innovation
  • Achievements

Schedule

The next important step is the schedule. This includes a detailed schedule for the hacking days and the other milestones leading up to them. It has to be clear to everyone, both to manage the logistics and to maintain the sanctity of the event.

Our hackathon is planned for 20th and 21st of this month. Watch this space for more details of the event.

It’s Been A While

I just noticed that my blog has been dormant for a while. I’m not sure about the reasons. It could be that I moved to an MBP and was really missing Windows Live Writer. Oh boy, I really fell in love with that little application that Microsoft gave away for free; it was just a breeze to publish from WLW. I often searched for similar apps on OSX, and the popular suggestion was to use MarsEdit. Honestly, I was a little intimidated by that editor and couldn’t manage even a single post with it. I used WLW in a virtual machine for the last two posts.

Days and months passed by. Then it occurred to me that I must write. Isn’t writing the best way to express yourself? So, fresh searches for a better editor, with the hope that people still prefer to write regular blogs even though everyone can express themselves in 140 characters today. I came across PixelPumper and Blogo. I’m writing this using PixelPumper and kind of liking it for the simple editing. I’m hoping that the Blogo team will sign me up for the beta testing. Anyway, I’m glad that I started writing again!

Never Say Good Bye

We parted our ways yesterday

[photo: scorpio-1]

He came to me 9 years back. He has been part of the family ever since.

[photo: Scan-120311-0006]

You were like this when I got you…

He took me to places.  We made our own roads.

He saved me from road rages, accidents..

[photos: Scan-120311-0003, Scan-120311-0004, Scan-120311-0005]

He came with me for Tsunami Relief work. He was happy to be part of the team.

[photo: Scan-120311-0002]

Dear Scorpio, Donnie misses you the most. He liked to watch other dogs on the road from your height.  He liked the wind blowing in his face through your window.

[photo: donnie-scorpio-2]

We will remember you always…

Linux Advanced Routing: Setting up a Mixed Public-Private Network

Recently I had a unique need for a mix of public and private networks on a particular server for some testing. A number of services were already configured on the public interface. I had to test a particular feature using a NAT environment, and the easiest approach I could think of was to configure the same server with a NAT interface in the VMware environment and point that feature at this private interface. The challenge was setting up the routes so that I could reach the server either through the public interface or through the router’s port forwarding via the NAT interface.

[diagram: mixed public/private network layout]

My networking requirement is something like this. As the diagram suggests, 164.99.89.77 is the public interface (eth1) and 172.17.2.80 (eth0) is the private interface. vmnet5 provides the NAT environment with the network 172.17.2.0. My requirement was to reach the guest via eth0 or eth1 from the 164.99 network. The host (164.99.89.74) also provides port forwarding so that I can connect to the guest via the private interface.

I realized that I need to make sure that all answers to traffic coming in on a particular interface get answered from that interface. 

After a little research on Linux advanced routing, I stumbled upon this page.

I designed my routing table based on the recommendations from there.  I’m listing the steps I followed for future reference.

  1. Disable reverse-path filtering for both interfaces (see the sysctl sketch below).  When packets arrive on one interface but the route back to their source points out of a different interface, the Linux kernel drops them as potentially spoofed; this is called reverse-path filtering.
  2. Create two additional routing tables, say T1 and T2 in /etc/iproute2/rt_tables.   This file will look something like this

    [screenshot: /etc/iproute2/rt_tables with the two new entries, T1 and T2]

  3. Then populate these tables as given below

    ip route add 164.99.0.0/16 dev eth1 src 164.99.89.77 table T1
    ip route add default via 164.99.89.254 table T1
    ip route add 172.17.2.0/24 dev eth0 src 172.17.2.80 table T2
    ip route add default via 172.17.2.2 table T2

    164.99.0.0/16 => public network
    164.99.89.77 => IP address of the public interface
    164.99.89.254 => Gateway address for the public network
    172.17.2.0/24 => private network
    172.17.2.80 => IP address of the private interface
    172.17.2.2 => Gateway address for the private network

  4. Set up the main routing table.

    ip route add 164.99.0.0/16 dev eth1 src 164.99.89.77
    ip route add 172.17.2.0/24 dev eth0 src 172.17.2.80

  5. Then a preferred default route

    ip route add default via 172.17.2.2

  6. Next set up the routing rules

    ip rule add from 164.99.89.77 table T1
    ip rule add from 172.17.2.80 table T2

The above rules make sure that all answers to traffic coming in on a particular interface get answered from that interface.
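
For step 1, reverse-path filtering can be turned off with sysctl. This is a sketch using the interface names from this setup; add the equivalent entries to /etc/sysctl.conf if you want them to survive a reboot.

    # Disable reverse-path filtering on both interfaces (step 1 above)
    sysctl -w net.ipv4.conf.eth0.rp_filter=0
    sysctl -w net.ipv4.conf.eth1.rp_filter=0

    # Some kernels also consult the "all" switch, so turn that off too
    sysctl -w net.ipv4.conf.all.rp_filter=0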

My routing table looks something like this with the above changes

[screenshot: output of ip route after the changes]

There are a few more desirable routing additions mentioned here.

With these changes, I can connect to the server via the public interface or via the private interface with the port forwarding in the router.


Introducing IWM at NMAMIT

NMAM Institute of Technology, Nitte, held an international-level conference on “Computer Architecture, Networking and Applications” (IC-CANA 2011) at Mangalore on 7th and 8th January, in association with Penn State University, Harrisburg, USA. The conference was co-sponsored by Nitte University, Nitte Education Trust, ISTE New Delhi, CSI Division-V (E&R), VTU Belgaum, TCS, EMC, Veriguide and Robosoft Technologies. I was invited as one of the speakers in the tutorial session and presented Intelligent Workload Management in the cloud track on the first day. I co-chaired the thesis presentations along with Dr. Swarnalatha on the second day and participated in a panel discussion on “Relevance of Industry-Institute Interaction in the Global Education Scenario”.
