TL;DR: Step by step guide on how to replicate your Mysql database to Apache Hive (Sql on hadoop) to run offline jobs to generate reports/insights
Background:
In every startup, as time goes by the database size keeps increasing and daily cron jobs to calculate daily reports and do offline tasks keep getting slower and slower until one day you raise your hand and say - this is f**ked up and I better fix this shit.
You open your cron jobs and it’s filled with complex joins and running them on mysql is going to take hours everyday which might be fine today but when you are growing 2x every month, you will again hit the limit soon.
Mysql Sharding-Memcache combo works great for front-end. For backend analytics, you need a specialist. His name is Hive. Apache Hive. From wikipedia:
Apache Hive is a data warehouse infrastructure built on top of Hadoop.
Think of this, all your GBs or TBs of data is saved in Hadoop. Even though the data is saved in text files, to you it will appear as a database composed of multiple tables on which you can run as complex queries as you want. Want to take a join of 3 tables of 1 GB each, no problem. Queries are taking longer than you want them to take, add ten more servers to the cluster (horizontal scaling).
In short, Hive takes your sql queries, converts into map-reduce jobs, runs them and gives you the final answer. Remember Hive is for offline processing only as even a simple query will take few seconds.
This is exactly what we needed :).
Architecture:
In our current architecture, we have cron jobs which run daily, query mysql for data and generate the reports which are sent to clients and analysts.

In our new architecture, we will have all our data in both hive and mysql. So the cron jobs can query mysql for simple queries like which reports need to be generated but for complex queries, it will query hive.

Step 1 - Installation:
To keep the guide simple, lets assume your cluster consists of only one server. So lets get started:
i) Install Hadoop and all 5 daemon packages (Hadoop Installation - Cloudera)
ii) Install hive and set hive to use mysql as metastore (Hive Installation - Cloudera)
iii) chmod g+w /user/hive/warehouse (this is where hive is going save all your data)
iv) Define the following environment variables
HADOOP_HOME=”/usr/lib/hadoop-0.20”
HIVE_HOME=”/usr/lib/hive/”
HIVE_PORT=10000
Type “hive” on terminal and you will get a hive prompt. That means everything went well.
Step 2 - Let’s Hive:
To learn to play around with hive lets insert the movie lens dataset into hive and run some queries. (Source : Apache Hive - Getting Started)
i) First, download and extract the movie lens data set:
$ wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
$ tar xvzf ml-data.tar+0.gz
ii) Create a table with tab-delimited text file format:
hive> CREATE TABLE u_data ( userid INT, movieid INT, rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
iii) And load it into the table that was just created:
hive> LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;
iv) Count the number of rows in table u_data:
hive> SELECT COUNT(*) FROM u_data;
You might want to do try some complex queries like Join, Group By etc.
Step 3 - Where’s my data?
We have all our data in mysql. To copy complete data into hive we will use Sqoop by Cloudera (These guys rock!).

i) Lets first install sqoop (Sqoop Installation - Cloudera) ii) Now you can copy everything to hive by one simple command:
$ sqoop import-all-tables --hive-import --connect jdbc:mysql://mysqlserver/databasename --username mysqluser
iii) But let’s write a script to copy only the tables we need and do it one by one. The script will be:
i) Get list of tables from mysql (hint: show tables)
ii) For each table, check if you want to not copy this table and skip it
iii) Drop the table on hive if it exists (hint: hive -e)
iv) Copy the table data to sqoop (hint: use sqoop import)
iv) Insert this script in crontab and execute it once everyday
Note that we are copying everything to hive from scratch. This is in-efficient. There must be a better way. Yes, there is!
Step 4 - Thou shall copy only what is updated
Instead of copying everything, we will only copy the rows which have been updated. To each of the tables in mysql add a “updated_at” column (timestamp) which is updated each time the row is updated and create an index on it. We will now save the timestamp each time we import data and then next time only copy the rows which have been updated after that timestamp. This is made even easier by using sqoop-jobs. Read more about sqoop-jobs here. (Tip: always have a created_at and updated_at in all tables. You will need it so many times for debugging and specially when a bug is pushed to production)

We will now have two scripts. One called hiveFirstRun which will create all jobs and copy everything from scratch. Second script called hiveUpdate which re-run the jobs and copy only the updated data. Lets write hiveFirstRun first:
For each table we want to copy
i) Delete job and drop hive table
ii) Create empty table (hint: use sqoop create-hive-table)
iii) Create job which takes the data from mysql and writes to a file (hint: use "--incremental lastmodified --check-column updated_at")
iv) Execute job and write data to temp file
v) Make a copy of temp file to "tableName_lastDump" vi) Load the temp file into hive table (hint: LOAD DATA INPATH)
vii) Generate Java class and source file for each table which we will be using in merge (hint: use sqoop codegen)
You might be thinking why we wrote to a file first and then copied to hive. We will be using “sqoop merge” to merge the new data with older data which operates on files afaik. Also, we created a copy of file in step (v) because load data will move the file into hadoop and we want to keep a copy to use in next merge. To merge the data, we will be using primary key of each table (id column). Also, you need to set “sqoop.metastore.client.record.password” to true in sqoop config file to save mysql password for each job. Lets write hiveUpdate now:
For each table we want to copy
i) Execute job and write the data to "updated_data" file
ii) Run "sqoop merge" to merge "tableName_lastDump" and "updated_data", and write the merged data to temp file (hint: you will need the java files created in (vii) of hiveFirstRun)
iii) Make a copy of temp file and overwrite "tableName_lastDump"
iv) Load the temp file into hive
Add hiveUpdate to your crontab and execute it every night.
Step 5 - Not so corner cases
i) When your database schema changes, updated_at is not changed
In hiveFirstRun, I save the schema of each table (hint: use mysql desc) and in hiveUpdate I check if schema is same. If it’s not same, I run hiveFirstRun for that table again which regenerates it from scratch. A better approach would be execute “Alter table” on hive also when it’s updated on mysql. Planning to incorporate this into our database schema versioning system soon
ii) While creating table in hive, use “mysql-delimiters” option. Also, before loading the file into hive, you will have rewrite the file and escape commas. I plan to look into this more and find a better solution
iii) In hive, I had created few de-normalized tables which were joins of multiple tables. These table were used by multiple reports so it made sense to pre-compute them instead of each report computing it
Final Words:
So now we have all our data in hive which is updated daily and at the end of day. But how do we use this data. There are two approaches and we are actually using both:
i) To run hive queries from your code, use hive clients which uses thrift. Code samples for all major languages available here.
ii) For manual one time queries - install Cloudera Hue. It gives a nice web interface to run queries and export results in csv or excel file. Think of hue as phpmyadmin for Hive.
If you have questions or something is not clear, post a comment here or search on google, stackoverflow and cloudera forums. Almost every issue I ran into, somebody else had already encountered and posted a solution.
p.s. If you love numbers and writing algorithms to play with numbers, drop me an email at himanshu.baweja@gmail.com . I will be more than happy to refer you to Facebook (which initially developed Hive) and/or to start-up for which I built the above system as a consultant. Both are growing like crazy and need people like you :).