Working with Big Data

I must say that the title is a little deceiving if you are looking for a technical post on Hadoop and Pig. In the Making Sense of Data course, Google says that the data process consists of three steps:

Prepare >> Analyze >> Apply

What I’m going to talk about is the first step, i.e. Prepare.

A couple of weeks back, on a Friday afternoon, I got a call from one of my friends at IIMB. She sounded a little worried. The problem posed to me was to extract meaningful information from 100GB of data they had just received from one of the retailers in Bangalore, for building a CLV or survival model. She had no clue what was in the zip file they had just received, as even opening or copying the file was a time-consuming operation.

Step 1: Identify the data

The immediate task was to identify what the 100GB file contains. I like some of the handy Unix utilities in such situations, for example head and tail, to get a quick peek at the data. I think Windows PowerShell has a similar cmdlet, Get-Content -TotalCount n, where n is the number of lines you would like to see from the file. I figured out that the data was nothing but an SQL dump from an Oracle database, which kind of explains the size of the file.

The next task was to look at the records. Since it’s an SQL dump, each record carries the column names, and it was easy to identify that each record has 70 columns with a mix of data types. Using wc, I also figured out that the file has 67 million records. The objective was to extract data from these 67 million records.
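If the shell utilities aren’t at hand, the same peek-and-count can be sketched in a few lines of Python. This is just a rough stand-in for head, tail, and wc -l; the chunked read keeps memory flat even for a 100GB file, and the file name is whatever your dump happens to be called:

```python
from collections import deque
from itertools import islice

def peek_lines(path, n=5, from_end=False):
    """Rough equivalent of `head -n` (or `tail -n` with from_end=True)."""
    with open(path, "r", errors="replace") as f:
        if not from_end:
            return list(islice(f, n))
        # deque with maxlen keeps only the last n lines while streaming
        return list(deque(f, maxlen=n))

def count_lines(path):
    """Rough equivalent of `wc -l`: count newlines in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")
    return count
```

Streaming in chunks is the point here: anything that tries to slurp the whole file (a plain read() or readlines()) is exactly the trap with data this size.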

Step 2: Preparation for Data Extraction

I explored the following options to extract the data:

  • Import the data into a db so that I can run simple SQL queries and retrieve data as and when required.
  • Parse the data and convert it to CSV format

I chose the SQL option as it gives more flexibility and manageability than the CSV format. The challenge was to recreate the database schema from the records, as the customer didn’t give us the schema. We could identify the data types from the columns, but it was difficult to judge the constraints, since the number of records is large and there was no way I could validate the schema against the whole data file. Anyway, I configured a MySQL database on my late-2011 MBP and created a schema. MySQL Workbench is a handy tool to explore the database and make minor corrections.
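Recreating a schema from raw records mostly boils down to guessing a column type from sample values. A minimal sketch of that guesswork in Python (the type ladder here, integer then float then varchar, and the fallback widths are my assumptions, not the retailer’s actual schema):

```python
def guess_sql_type(values):
    """Guess a MySQL column type from sample string values (a sketch, not exhaustive)."""
    values = [v for v in values if v not in ("", "NULL")]
    if not values:
        return "VARCHAR(255)"  # no evidence either way, pick a safe default
    # all-digit samples (allowing a leading minus) look like integers
    if all(v.lstrip("-").isdigit() for v in values):
        return "BIGINT"
    # anything float() accepts across the board looks numeric
    try:
        for v in values:
            float(v)
        return "DOUBLE"
    except ValueError:
        pass
    # otherwise size a VARCHAR generously off the longest sample seen
    return "VARCHAR(%d)" % max(64, max(len(v) for v in values))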

I extracted test data of 1,500 records (thanks to head) for validating the schema, and quickly realized that I would have to clean the data, as there were extra characters in some of the records. So using the command-line mysql tool was ruled out, since I had to do the data cleansing inline with the import.

The easiest option was to use a Python script. That threw up another challenge, as there is no default MySQL connector for Python. It turned out that installing the MySQL connector for Python on Windows is easier than on OS X, though I finally got it running on OS X. Writing a 40-line Python script to read the data, validate the columns, and write to the database was easier than I thought. It took a few iterations to iron out schema constraint violations and data issues before I could completely upload the data. I must say that it took around 5 hours to upload 67 million records on my MBP, producing a 30GB database file.
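The shape of such a script is roughly a cleaning step plus a bulk insert. This is only a sketch, not the actual script: the cleaning rule (stripping NULs and enforcing the column count), the comma split, and the table name are illustrative, and the bulk insert assumes MySQL Connector/Python:

```python
def clean_row(raw, expected_cols=70):
    """Strip stray control characters from each field; reject rows with the
    wrong number of columns by returning None (caller skips or logs them)."""
    fields = [f.strip().replace("\x00", "") for f in raw.split(",")]
    if len(fields) != expected_cols:
        return None
    return fields

def load_rows(rows, db_config, table="transactions"):
    """Bulk-insert a batch of cleaned rows. The table name is hypothetical;
    a real script would batch every few thousand rows, not hold all 67M."""
    import mysql.connector  # deferred so the cleaning half works without the driver
    conn = mysql.connector.connect(**db_config)
    cur = conn.cursor()
    placeholders = ", ".join(["%s"] * len(rows[0]))
    # note: interpolating the table name is fine for a one-off script,
    # but never do this with untrusted input
    cur.executemany("INSERT INTO %s VALUES (%s)" % (table, placeholders), rows)
    conn.commit()
    conn.close()
```

Batching through executemany rather than inserting row by row is what keeps a 67-million-record load down to hours instead of days.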

Step 3: Data Extraction

Once the data was in the database, it was easy to extract through simple SQL queries. We wanted to find out the buying patterns for 3 different items, so I created indexes on certain fields to make the searches much faster. We were able to run the queries from MySQL Workbench and store the results as CSVs for further analysis in Excel or SAS.
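The index-then-query-then-CSV flow can be sketched like this. I’m using SQLite here purely as a self-contained stand-in for MySQL (the idea is identical), and the purchases table with its customer_id/item_id/purchase_date columns is made up for illustration:

```python
import csv
import sqlite3

def export_item_pattern(conn, item_id, out_csv):
    """Index the item column, pull purchase rows for one item, write them to CSV.
    Table and column names are illustrative, not the retailer's real schema."""
    cur = conn.cursor()
    # an index on the filtered column turns each per-item query into an index scan
    cur.execute("CREATE INDEX IF NOT EXISTS idx_item ON purchases(item_id)")
    cur.execute(
        "SELECT customer_id, purchase_date FROM purchases WHERE item_id = ?",
        (item_id,),
    )
    rows = cur.fetchall()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "purchase_date"])
        writer.writerows(rows)
    return len(rows)
```

Creating the index once and reusing it across the 3 item queries is the whole trick; without it, every query is a full scan over 67 million rows.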

It was a good exercise to play around with this real-life data and figure out how to handle such large data in a reusable way. It was also a lesson that a good data scientist should know end-to-end methodologies and technologies, as one might spend a good amount of time just preparing the data.

Intel Buys McAfee: Security in the chip vs. virtual environment

This was the big news I read yesterday. Everyone talked about the price Intel is paying – $48 per share – which is a 60% premium over McAfee’s previous-day close of $29.93. What is Intel going to get out of this deal? I found a not-so-encouraging analysis from Forrester. One reason everyone talked about is that Intel wants to embed security into the chip. While that makes perfect sense for the desktop-dominated market we used to have a few years back, I can’t agree that it’s valid now, when we are moving to virtual desktops and data-center virtualization. Why would I have my security at the chip level in a virtualized world? Why not take care of that at the virtual-machine level? Intelligent Workload Management (IWM) is one such effort, where we try to integrate identity and security at the virtual-machine level instead of the hardware level. Software security is another dimension that can be added at the IWM framework level.

Hospitality Suite: Day 3

It’s more like the meet-the-expert day of BrainShare. Each vendor gets a room where they can showcase their products and talk to customers; you can attract more people if the food and drinks are better in your room!

Presentations also started today. I must admit that the sessions were targeted mostly at CTOs and CFOs. These were not very technical in nature, but strategic and direction-centric. There were case studies from organizations and early adopters of the cloud. The day started with a number of presentations from the Burton Group on application architecture for the cloud. A few interesting points were:

  • Consumers should build their applications by consuming various services available in the cloud rather than building everything themselves
  • Simply migrating applications is not a Cloud Application Architecture
  • Lack of clarity and hype slow the adoption of cloud and increase skepticism
  • Bill Peter of Intercontinental Hotels Group said SLAs are not sufficient in the public cloud, one of the main reasons they went ahead with a private cloud
  • Eli Lilly presented the challenges they faced while moving to the cloud

The vendor hospitality suites are one of the attractions of this conference. Vendors are given a chance to demonstrate their technologies and talk to customers here. Novell was unveiling its Cloud Security Services; Novell was given a room and a demo spot in the interoperability lab, where it demonstrated SSO to PivotLink through NCS.

Novell NCS Kiosk

Tom Cecere talking about NCS in the vendor lightning round

Tom Cecere explaining the product to a number of potential customers

Dale Olds and Mike from PivotLink talking to some of our Quest friends

Young at heart: How to be an innovator for life

Here is a podcast from the Entrepreneurial Thought Leader series, delivered by Tom Kelly, author of “The Art of Innovation” and co-founder and GM of IDEO, the world-famous design firm. Research shows that you need 5±2 mental or behavioral habits to help you be an innovator for life. He talks about the following 5 habits:

  • Think like a traveler
  • Treat life as an experiment
  • Cultivate an attitude of wisdom
  • Use your whole brain/Tortoise mind
  • Follow your passion

Tom Kelly is a terrific writer and presenter. I personally like his books and presentations. Listen to the audio or watch the video when you get time.

Today’s visit to Valley School

It’s been a while since I visited Valley School, so I decided to try my luck this morning. I met Basavaraj on the way to Valley School and could spot a few birds along the road. The success rate was near zero inside the school itself. Here are some photos from today’s trip.


Creating Nokia playlist

I’ve been exploring ways to encode videos and create playlists for my Nokia N96 without using Nokia’s PC Suite. I use openSUSE 11.1 and hardly ever boot into my dual-boot Windows. The Nokia implements USB mass storage, so E:\ gets mounted as a USB volume in Linux, and you can copy or move music and video files easily once the volume is mounted. I use the following script to create a playlist once the music files are copied. I’m still trying to figure out a way for this playlist to appear in Nokia’s Music player; currently, I use the file manager, browse to the playlist, and play it from there.

#!/usr/bin/perl -w

# Utility to create a Nokia playlist (.m3u)
# Usage: .....
# Not much error checking. Make sure the first argument is always the playlist file.
# Also, this script needs to be executed from the directory where the mp3 files are present.
# The playlist name can contain a directory name, but the mp3 files should always be in the current directory.
# This script assumes that all files are in a directory called Music on your Nokia's Mass Storage drive (E:\)

use Cwd;
use Encode qw(decode);

# Get the current working directory, decode any UTF-8 escape sequences,
# then swap the local mount path for the phone's drive letter.
sub getCurrentWorkingDirectory {
    my $dir = getcwd;
    if ($dir =~ m{=[0-9A-F]{2}}) {
        my $s = eval { decode('utf-8', $dir); };
        $dir = $s unless ($@);
    }
    # Note: change the following line if you have a different directory or drive structure.
    # Keeps only the last path component, e.g. /media/NOKIA/Music -> E:\Music
    $dir =~ s{^.*/}{E:\\};
    print "Directory is $dir\n";
    return $dir;
}

print "Number of arguments: ", scalar(@ARGV), "\n";

my @files;
my $playlist = $ARGV[0];

if ( ! -e $playlist ) {
    print "Playlist file $playlist does not exist. Creating new...\n";
    open(PLFH, ">", $playlist) or die "Cannot create $playlist: $!";
    print PLFH "#EXTM3U\n";
    close(PLFH);
}
open(PLFH, ">>", $playlist) or die "Cannot open $playlist: $!";

# Each remaining argument is either an existing file name or a glob pattern like '*.mp3'
foreach my $argnum (1 .. $#ARGV) {
    if ( -e $ARGV[$argnum] ) { push(@files, $ARGV[$argnum]); }
    else { push(@files, glob($ARGV[$argnum])); }
}

my $pwd = getCurrentWorkingDirectory();
foreach my $file (@files) {
    print PLFH "#EXTINF:0, -\n";
    print "...Adding $pwd\\$file\n";
    print PLFH "$pwd\\$file\n";
}
close(PLFH);


Looking back

It has been a while since I blogged anything constructive. I know it doesn’t take much time; it’s just sheer laziness about jotting down what I have been doing.

Many things have changed in the last 3-4 months. Some of the important events in my life (that I think are worth sharing) are:

I know that I have to write about each of these events. Just wait and watch this space. There is more to come….

Weekend at Kabini

Anu and I have been planning for a long time to visit Kabini. We had heard a lot about Orange County and JLR. Finally, we decided to take off to JLR Kabini for a weekend. It was also Anu’s birthday.

The route to Kabini is very easy. You hit Mysore Road and, just before Mysore city, take the Ring Road (a right turn while coming from Bangalore) and turn towards HD Kote. The route all the way to Kabini is good. There is a deviation once you hit the forest range; this stretch runs around 15 km, all the way to JLR. We took around 3 hours 30 minutes to reach JLR from Bangalore.

Everything is pretty well organized at JLR. They will tell you which jeep to take for the safari and what time it starts; you will miss the safari if you are not on time! The roads inside the forest are horrible, and you will never enjoy the safari unless you get a good jeep (try to get one of the Tata pickups). In the evening, one can take a jeep safari or a safari on the river in a motor boat. We decided to take the jeep safari, as wildlife is not that active near the water holes at this time of year.

Though we stayed for two days, our trips inside the forest were not very productive. We hardly noticed any animals except a lot of spotted deer and the occasional elephant. The right time to visit this place would be summer.

Fact file: JLR Kabini is around 230 km from Bangalore
Route: Have a look at my route map.
Here are some photographs from Kabini