How much equity should I get as an employee in startup when taking a pay cut?

Few days back, there was an interesting discussion on Hacker News on are the first few employees of startup getting screwed and how much equity should be given to them when taking a pay cut. After having been on both sides of tables (getting hired and hiring people), here is my 2 cents.

Underlying Basis:

Be it a small startup hiring employee No. 1 or a 100 people company, always pay equal to or more than the market salary. If you don’t, sooner and later he will realize and then he will lose his motivation and leave or worse still, stick around but not be as productive.

Basic Equation:

Cash compensation + Equity = Market Rate

Let’s use an example to understand it:

Two of you started a company and you just raised a seed round of 150k at 1 million post-money valuation and now its time to bring in employee No. 1. He is a really good developer and if you had lot of money in the bank you would have paid him like 100k per year. But given that you just have 150k in bank, you both feel 60k per year cash is more than enough for a single person to live in Bay Area. So, the new equation is:

60k (Cash) + 40k(Equity) = 100k (Market Rate)

How do we convert 60k in equity to % of company. Isn’t it simple, you and your investor just agreed to 1 million as company valuation. But!! Equity is granted over four years. So how do know our valuation for next 4 years. You can either use projections (realistic projections and not your pitch deck ones :) ) or use a simplification like startup valuation grows 3x a year from Paul Graham’s: The Equity Equation:

Year Valuation Equity ($) %
1 (Current) 1 million 40k 4.00
2 3 million 40k 1.33
3 9 million 40k 0.44
4 27 million 40k 0.15
 
So in effect we are paying the employee, 60k per year based on fair value of equity that year. So in total 5.93% for employee #1 at a seed stage company. I can hear all the Founder CEOs screaming 6% to an employee #1, are you kidding? Numbers don’t lie.
 
Founder : But he will learn so much. He is being compensated for experience and network he is building. Where does that fit into the equation?
A: The part of the equation where he works 100 hours a week and not ask for overtime
 
Founder : I want to calculate value of equity based on our 200m exit 4 years down the line. And I am being generous by giving him 1%.
A: Value of anything is what other’s are willing to pay for it. You and a guy who knows more about valuations than me called an angel investor/VC, decided the company is worth 1 million right now and I trust you guys :). Just in case you don’t know, I should mention, probability of that happening is ~2%, hence the current valuation.
 
Founder: What if he is not good and I have to let him go after 6 months. He owns 2% of my company!
A: Did you read the offer letter. That is 1 year cliff is for. And if you can’t figure out if a person is good or bad in 1 year, this startup thing is not for you. You better join Computer Associates.
 
Founder: But he should be doing it for Love and not for Money.
A: He is still doing it for Love. Trust me there are lot of Banks and big companies willing to pay at least 150k per year if not not more. Love is what makes him work with you at 100k.
 
Founder: How do I determine valuation if my last round was a long time back?
A: Sit with your advisors and your board members and figure it out. It is not that difficult.
 
Employee: Where do all the benefits I get at a big company go? Who will pay for my laundry?
A: Aren’t you are doing it for the Love and not Money :). Would you rather be a no-name at a big company or pay for your own laundry and shoot for the moon.
 
Founder & Employee: How do we determine market rate?
A: Come on! you are starting a company. Ask your friends/network or at least try to Google.
 
 



Quote: Why not to do consensus based hiring

Quote of the day by Ben Horowitz:

“You hired for lack of weakness rather than strength—This is especially common when you run a consensus-based hiring process. The group will often find the candidate’s weaknesses, but they won’t place a high enough value on the areas where you need the executive to be world class. As a result, you hire an executive with no sharp weaknesses, but who is mediocre where you need her to be great. If you don’t have world-class strengths where you need them, you won’t be a world-class company.”

Complete post is about “Preparing To Fire an Executive”:

http://bhorowitz.com/2011/08/24/preparing-to-fire-an-executive/




Analytics: Reached mysql limit? Let’s hive

TL;DR: Step by step guide on how to replicate your Mysql database to Apache Hive (Sql on hadoop) to run offline jobs to generate reports/insights

Background:

In every startup, as time goes by the database size keeps increasing and daily cron jobs to calculate daily reports and do offline tasks keep getting slower and slower until one day you raise your hand and say - this is f**ked up and I better fix this shit.

You open your cron jobs and it’s filled with complex joins and running them on mysql is going to take hours everyday which might be fine today but when you are growing 2x every month, you will again hit the limit soon.

Mysql Sharding-Memcache combo works great for front-end. For backend analytics, you need a specialist. His name is Hive. Apache Hive. From wikipedia:

Apache Hive is a data warehouse infrastructure built on top of Hadoop.

Think of this, all your GBs or TBs of data is saved in Hadoop. Even though the data is saved in text files, to you it will appear as a database composed of multiple tables on which you can run as complex queries as you want. Want to take a join of 3 tables of 1 GB each, no problem. Queries are taking longer than you want them to take, add ten more servers to the cluster (horizontal scaling).

In short, Hive takes your sql queries, converts into map-reduce jobs, runs them and gives you the final answer. Remember Hive is for offline processing only as even a simple query will take few seconds.

This is exactly what we needed :).

Architecture:

In our current architecture, we have cron jobs which run daily, query mysql for data and generate the reports which are sent to clients and analysts.

Current Architecture

In our new architecture, we will have all our data in both hive and mysql. So the cron jobs can query mysql for simple queries like which reports need to be generated but for complex queries, it will query hive. 

Hive New Architecture

Step 1 - Installation:

To keep the guide simple, lets assume your cluster consists of only one server. So lets get started:

i) Install Hadoop and all 5 daemon packages (Hadoop Installation - Cloudera)

ii) Install hive and set hive to use mysql as metastore (Hive Installation - Cloudera)

iii) chmod g+w /user/hive/warehouse (this is where hive is going save all your data)

iv) Define the following environment variables

HADOOP_HOME=”/usr/lib/hadoop-0.20”
HIVE_HOME=”/usr/lib/hive/”
HIVE_PORT=10000

Type “hive” on terminal and you will get a hive prompt. That means everything went well.

Step 2 - Let’s Hive:

To learn to play around with hive lets insert the movie lens dataset into hive and run some queries. (Source : Apache Hive - Getting Started)

i) First, download and extract the movie lens data set:

$ wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
$ tar xvzf ml-data.tar+0.gz

ii) Create a table with tab-delimited text file format:

hive> CREATE TABLE u_data ( userid INT, movieid INT, rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

iii) And load it into the table that was just created:

hive> LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;

iv) Count the number of rows in table u_data:

hive> SELECT COUNT(*) FROM u_data;

You might want to do try some complex queries like Join, Group By etc.

Step 3 - Where’s my data?

We have all our data in mysql. To copy complete data into hive we will use Sqoop by Cloudera (These guys rock!).

Sqoop

i) Lets first install sqoop (Sqoop Installation - Cloudera) ii) Now you can copy everything to hive by one simple command:

$ sqoop import-all-tables --hive-import --connect jdbc:mysql://mysqlserver/databasename --username mysqluser

iii) But let’s write a script to copy only the tables we need and do it one by one. The script will be:

i) Get list of tables from mysql (hint: show tables)
ii) For each table, check if you want to not copy this table and skip it
iii) Drop the table on hive if it exists (hint: hive -e)
iv) Copy the table data to sqoop (hint: use sqoop import)
iv) Insert this script in crontab and execute it once everyday

Note that we are copying everything to hive from scratch. This is in-efficient. There must be a better way. Yes, there is!

Step 4 - Thou shall copy only what is updated

Instead of copying everything, we will only copy the rows which have been updated. To each of the tables in mysql add a “updated_at” column (timestamp) which is updated each time the row is updated and create an index on it. We will now save the timestamp each time we import data and then next time only copy the rows which have been updated after that timestamp. This is made even easier by using sqoop-jobs. Read more about sqoop-jobs here. (Tip: always have a created_at and updated_at in all tables. You will need it so many times for debugging and specially when a bug is pushed to production)

Hive Merge Updated Data

We will now have two scripts. One called hiveFirstRun which will create all jobs and copy everything from scratch. Second script called hiveUpdate which re-run the jobs and copy only the updated data. Lets write hiveFirstRun first:

For each table we want to copy
i) Delete job and drop hive table
ii) Create empty table (hint: use sqoop create-hive-table)
iii) Create job which takes the data from mysql and writes to a file (hint: use "--incremental lastmodified --check-column updated_at")
iv) Execute job and write data to temp file
v) Make a copy of temp file to "tableName_lastDump" vi) Load the temp file into hive table (hint: LOAD DATA INPATH)
vii) Generate Java class and source file for each table which we will be using in merge (hint: use sqoop codegen)

You might be thinking why we wrote to a file first and then copied to hive. We will be using “sqoop merge” to merge the new data with older data which operates on files afaik. Also, we created a copy of file in step (v) because load data will move the file into hadoop and we want to keep a copy to use in next merge. To merge the data, we will be using primary key of each table (id column). Also, you need to set “sqoop.metastore.client.record.password” to true in sqoop config file to save mysql password for each job. Lets write hiveUpdate now:

For each table we want to copy
i) Execute job and write the data to "updated_data" file
ii) Run "sqoop merge" to merge "tableName_lastDump" and "updated_data", and write the merged data to temp file (hint: you will need the java files created in (vii) of hiveFirstRun)
iii) Make a copy of temp file and overwrite "tableName_lastDump"
iv) Load the temp file into hive

Add hiveUpdate to your crontab and execute it every night.

Step 5 - Not so corner cases

i) When your database schema changes, updated_at is not changed

In hiveFirstRun, I save the schema of each table (hint: use mysql desc) and in hiveUpdate I check if schema is same. If it’s not same, I run hiveFirstRun for that table again which regenerates it from scratch. A better approach would be execute “Alter table” on hive also when it’s updated on mysql. Planning to incorporate this into our database schema versioning system soon

ii) While creating table in hive, use “mysql-delimiters” option. Also, before loading the file into hive, you will have rewrite the file and escape commas. I plan to look into this more and find a better solution

iii) In hive, I had created few de-normalized tables which were joins of multiple tables. These table were used by multiple reports so it made sense to pre-compute them instead of each report computing it

Final Words:

So now we have all our data in hive which is updated daily and at the end of day. But how do we use this data. There are two approaches and we are actually using both:

i) To run hive queries from your code, use hive clients which uses thrift. Code samples for all major languages available here

ii) For manual one time queries - install Cloudera Hue. It gives a nice web interface to run queries and export results in csv or excel file. Think of hue as phpmyadmin for Hive.

If you have questions or something is not clear, post a comment here or search on google, stackoverflow and cloudera forums. Almost every issue I ran into, somebody else had already encountered and posted a solution.

p.s. If you love numbers and writing algorithms to play with numbers, drop me an email at himanshu.baweja@gmail.com . I will be more than happy to refer you to Facebook (which initially developed Hive) and/or to start-up for which I built the above system as a consultant. Both are growing like crazy and need people like you :).




Moving to tumblr

Finally decided to shift my blog to tumblr from wordpress. So this will be my new home. Lets hope I become regular poster instead of posting once every few months.

Tags: personal