I must say that the title is a little deceiving if you are looking for a technical post on Hadoop and Pig. In the Making Sense of Data course, Google says that the data process consists of three steps:
Prepare >> Analyze >> Apply
What I’m going to talk about here is the first step, i.e. Prepare.
A couple of weeks back, on a Friday afternoon, I got a call from one of my friends at IIMB. She sounded a little worried. The problem posed to me was to extract meaningful information from 100 GB of data that they had just received from one of the retailers in Bangalore, for building a CLV or survival model. She had no clue what was in the zip file they had received, since even opening or copying a file of that size is a time-consuming operation.
Step 1: Identify the data
The immediate task was to identify what the 100 GB file contains. I like some of the handy Unix utilities in such situations, e.g. tail to get a quick peek at the data. Windows PowerShell has a handy cmdlet too, Get-Content -TotalCount n, where n is the number of lines you would like to see from the file. I figured out that the data was nothing but an SQL dump from an Oracle database, which kind of explains the size of the file.
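A quick peek at such a file looks like this (the dump file name here is a stand-in; the sample file is created only so the commands have something to read):

```shell
# dump.sql is a stand-in for the actual 100 GB file.
printf 'INSERT INTO orders VALUES (1);\nINSERT INTO orders VALUES (2);\nINSERT INTO orders VALUES (3);\n' > dump.sql

head -n 2 dump.sql   # first 2 lines, without reading the whole file
tail -n 2 dump.sql   # last 2 lines

# PowerShell equivalent of head:
#   Get-Content dump.sql -TotalCount 2
```

Both head and tail read only as much of the file as they need, which is why they stay fast even on a 100 GB file.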
The next task was to look at the records. Since it’s an SQL dump, each record carries the column names, and it was easy to identify that each record has 70 columns with a mix of data types. Using wc, I also figured out that the file has 67 million records. The objective was to extract data from these 67 million records.
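Counting records is a single pass over the file (slow on 100 GB, but straightforward; the file name is again a stand-in):

```shell
# Stand-in for the real dump file
printf 'row1\nrow2\nrow3\n' > dump.sql

# -l counts lines; with input redirection, wc prints only the number
wc -l < dump.sql   # prints the line count (3 here)
```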
Step 2: Preparation for Data Extraction
I explored the following options to extract the data:
- Import the data into a db so that I can run simple SQL queries and retrieve data as and when required.
- Parse the data and convert them to csv format
I chose the SQL option as it gives more flexibility and manageability than the CSV format. The challenge was to recreate the database schema from the records, as the customer didn’t give us the schema. We could identify the data types from the columns, but it was difficult to judge the constraints, given the number of records and the fact that there was no way to validate the schema against the data file. Anyway, I configured a MySQL database on my late-2011 MBP and created a schema. MySQL Workbench is a handy tool to explore the database and make minor corrections.
I extracted test data with 1500 records (thanks to head) for validating the schema, and quickly realized that I would have to clean the data, as there were extra characters in some of the records. So using the command-line mysql tool was ruled out, as I had to do the data cleansing in line with the import.
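Pulling a small sample for schema validation is one line with head (file names are stand-ins):

```shell
# Stand-in for the real dump: 2000 numbered lines
seq 1 2000 > dump.sql

# Take the first 1500 records as a test file
head -n 1500 dump.sql > sample.sql

wc -l < sample.sql   # 1500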
The easiest option was to use a Python script. That threw up another challenge, as there is no default MySQL connector for Python. It turned out that installing the MySQL connector for Python is easier on Windows than on OS X, though I finally got it running on OS X. Writing a 40-line Python script to read the data, validate columns and write to the db was easier than I thought. It took a few iterations to identify schema constraint violations and data issues before I could completely upload the data. I must say it took around 5 hours to upload the 67 million records on my MBP, producing a 30 GB database file.
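The post doesn’t show the script itself; a minimal sketch of the cleaning and validation step might look like this (the field delimiter, the helper names, and the control-character cleanup are assumptions — the real script additionally wrote each valid record to MySQL via the Python connector):

```python
import re

EXPECTED_COLUMNS = 70  # each record in the dump had 70 columns


def clean_field(field: str) -> str:
    """Strip stray control characters (the 'extra characters' that broke the import)."""
    return re.sub(r"[\x00-\x08\x0b-\x1f]", "", field).strip()


def parse_record(line: str):
    """Split a dump record into cleaned fields; return None if malformed."""
    fields = [clean_field(f) for f in line.split(",")]
    if len(fields) != EXPECTED_COLUMNS:
        return None  # skip (or log) records that violate the schema
    return fields

# In the real script, each valid record would then be inserted through the
# MySQL connector, roughly:
#   cursor.execute("INSERT INTO sales VALUES (%s, %s, ...)", fields)
```

Returning None for malformed records makes it easy to count and inspect rejects across iterations, which is how the schema constraint violations surfaced.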
Step 3: Data Extraction
Once the data was in the db, it was easy to extract through simple SQL queries. We wanted to find out buying patterns for 3 different items, so I created indexes on certain fields to make the searches much faster. We were able to run the queries from MySQL Workbench and store the results as CSVs for further analysis in Excel or SAS.
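As a rough illustration of this extraction step — using Python’s built-in sqlite3 as a stand-in for MySQL, with a hypothetical purchases table — the index-then-query-then-CSV flow looks like this:

```python
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, item TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, "soap", 2), (2, "soap", 1), (1, "rice", 5)],
)

# Index the search field so lookups don't scan all 67 million rows
conn.execute("CREATE INDEX idx_item ON purchases (item)")

# Extract the buying pattern for one item and dump it to CSV
rows = conn.execute(
    "SELECT customer_id, qty FROM purchases WHERE item = ? ORDER BY customer_id",
    ("soap",),
).fetchall()
with open("soap_purchases.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "qty"])
    writer.writerows(rows)
```

In MySQL Workbench the same thing is done interactively: run the SELECT, then export the result grid as CSV.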
It was a good exercise to play around with this real-life data and figure out how to handle such a large dataset in a reusable way. It was also a lesson that a good data scientist should know end-to-end methodologies and technologies, as one might spend a good amount of time preparing the data.