Hilary Mason asked a really interesting question yesterday:
Data people: What is the very first thing you do when you get your hands on a new data set?
— Hilary Mason (@hmason) June 12, 2014
You should really consider reading the whole discussion here; it is amazing. It also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet; there was way too much good stuff to include it all.
Step 0: Figure out what I’m trying to do with the data
At least for me, a new data set arrives in one of three ways: (1) I made it myself; (2) a collaborator created it with a specific question in mind; or (3) a collaborator created it and just wants to explore it. In cases (1) and (2) I already know what the question is, although in case (2) I still spend a little extra time making sure I understand the question before diving in. @visualisingdata and I think alike here:
@hmason this will sound textbooky but I stop, look and think about “what’s it about (phenomena, activity, entity etc). Look before see.
— Andy Kirk (@visualisingdata) June 12, 2014
Usually this involves figuring out what the variables mean like @_jden does:
@hmason try to figure out what the fields mean and how it’s coded — :sandwich emoji: (@_jden) June 12, 2014
If I’m working with a collaborator I do what @evanthomaspaul does:
@hmason Interview the source, if possible, to know all of the problems with the data, use limitations, caveats, etc. — Evan Thomas Paul (@evanthomaspaul) June 12, 2014
If the data don’t have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can’t. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar:
@hmason figure out the format & how to read it. Then ask myself, what can be learned from this data? — Jacob (@japerk) June 12, 2014
Step 1: Learn about the elephant
Unless the data are something I’ve analyzed a lot before, I usually feel like the blind men and the elephant.
So the first thing I do is fool around a bit to try to figure out what the data set “looks” like. Like @jasonpbecker, I look at the types of variables I have and at what the first few and last few observations look like:
@hmason sapply(df, class); head(df); tail(df) — Jason Becker (@jasonpbecker) June 12, 2014
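For what it’s worth, my version of that first pass in R looks roughly like this (assuming the data have already been read into a data frame I’m calling df):

```r
# A quick first look at a data frame called df (the name is just a placeholder)
sapply(df, class)   # what type is each variable?
dim(df)             # how many rows and columns?
head(df)            # first few observations
tail(df)            # last few observations
str(df)             # compact overview of the structure
```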
If it is medical/social data I usually use this to look for personally identifiable information and then do what @peteskomoroch does:
@hmason remove PII and burn it with fire — Peter Skomoroch (@peteskomoroch) June 12, 2014
If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively, like @richardclegg:
@hmason unless it is big data in which case sample then import to R and look for NAs… — Richard G. Clegg (@richardclegg) June 12, 2014
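When I do subsample, it looks something like this sketch (big_df and the 10,000-row cutoff are just placeholders; the right size depends on the data):

```r
# Take a reproducible random subsample of a large data frame for interactive work
set.seed(20140612)
sub_df <- big_df[sample(nrow(big_df), size = 1e4), ]
```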
After doing that I look for weird quirks, like missing values or outliers, like @feralparakeet (a quick R sketch of these checks follows the tweets below):
@hmason ALL THE DESCRIPTIVES. Well, after reviewing the codebook, of course. — Vickie Edwards (@feralparakeet) June 12, 2014
and like @cpwalker07
@hmason count # rows, read every column header — Chris Walker (@cpwalker07) June 12, 2014
and like @toastandcereal
@hmason @mispagination jot down number of rows. That way I can assess right away whether I’ve done something dumb later on. — Jessica Balsam (@toastandcereal) June 12, 2014
and like @cld276
@hmason run a bunch of count/groupby statements to gauge if I think it’s corrupt. — Carol Davidsen (@cld276) June 12, 2014
and @adamlaiacano
@hmason summary() — Adam Laiacano (@adamlaiacano) June 12, 2014
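Pulling those together, the quick-descriptives pass might look something like this in R (df and the grouping column group_var are placeholders):

```r
# Basic counts, summaries, and group-by checks on a data frame called df
nrow(df); ncol(df)                    # jot down the dimensions for later sanity checks
summary(df)                           # ALL THE DESCRIPTIVES
colSums(is.na(df))                    # missing values per column
table(df$group_var, useNA = "ifany")  # count/group-by style check for odd categories
```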
Step 2: Clean/organize
I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a tidy data set. This includes fixing up missing value encodings, like @chenghlee:
.@hmason Often times, “fix” various codings, esp. for missing data (e.g., mixed strings & ints for coded vals; decide if NAs, “” are equiv.) — Cheng H. Lee (@chenghlee) June 12, 2014
or, more generically, like @RubyChilds:
@hmason clean it — Ruby ˁ˚ᴥ˚ˀ (@RubyChilds) June 12, 2014
I usually do a fair amount of this too, like @the_turtle:
@hmason Spend the next two days swearing because nobody cleaned it. — The Turtle (@the_turtle) June 12, 2014
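The encoding fixes @chenghlee describes might look something like this (the column name income and the sentinel values are made up for illustration):

```r
# Recode common "missing" sentinels to NA, then convert to numeric;
# the column was read in as character because of the mixed codings
df$income[df$income %in% c("", "NA", "-99")] <- NA
df$income <- as.numeric(df$income)
```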
When I’m done I do a bunch of sanity checks and data integrity checks like @deaneckles, and if things are screwed up I go back and fix them:
@treycausey @hmason Test really boring hypotheses. Like num_mobile_comments <= num_comments. — Dean Eckles (@deaneckles) June 12, 2014
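In R these “really boring hypotheses” can be written as assertions; the comment-count columns come from the tweet above, and the others are made-up examples:

```r
# Integrity checks: assert things that must be true of a clean data set
stopifnot(all(df$num_mobile_comments <= df$num_comments, na.rm = TRUE))
stopifnot(all(df$age >= 0, na.rm = TRUE))   # hypothetical column
stopifnot(!any(duplicated(df$id)))          # hypothetical unique identifier
```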
Step 3: Plot. That. Stuff.
After getting a handle on the data with mostly text-based tables and output (things that don’t require a graphics device) and cleaning things up a bit, I start plotting everything, like @hspter:
@hmason usually head(data) then straight to visualization. Have been working on some “unit tests” for data as well https://t.co/6Qd3URmzpe — Hilary Parker (@hspter) June 12, 2014
At this stage my goal is to get the maximum amount of information about the data set in the minimum amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I make histograms and jittered one-dimensional plots to look at variables one by one, like @FisherDanyel:
@TwoHeadlines @hmason After looking at a few hundred random rows? Histograms & scatterplots of columns to understand what I have. — Danyel Fisher (@FisherDanyel) June 12, 2014
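For the one-variable-at-a-time look, something like this works in base R (df and the column x are placeholders):

```r
# Histogram and a jittered one-dimensional plot of a single variable
hist(df$x, breaks = 50, col = "grey", main = "", xlab = "x")
stripchart(df$x, method = "jitter", pch = 19, xlab = "x")
```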
To compare the distributions of variables I usually use overlaid density plots, like @sjwhitworth:
@hmason density plot all the things!
— Stephen Whitworth (@sjwhitworth) June 12, 2014
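An overlaid density plot for comparing a variable across two groups might look like this sketch (df, x, and group are placeholder names):

```r
# Overlay the densities of x for two groups on the same axes
d1 <- density(df$x[df$group == "A"], na.rm = TRUE)
d2 <- density(df$x[df$group == "B"], na.rm = TRUE)
plot(d1, col = "blue", main = "", xlab = "x", ylim = range(d1$y, d2$y))
lines(d2, col = "red")
```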
I make tons of scatterplots to look at relationships between variables, like @wduyck:
@hmason plot scatterplots and distributions
— Wouter Duyck (@wduyck) June 12, 2014
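A quick way to get lots of scatterplots at once is a pairs plot of the numeric columns (with many columns I’d pick a subset first); a rough sketch:

```r
# Scatterplot matrix of all numeric columns in df
pairs(df[, sapply(df, is.numeric)], pch = 19, cex = 0.5)
```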
I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.
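Here is a sketch of what that last step might look like, with placeholder column names; in practice I’d drop or impute missing values first, since prcomp, dist, and hclust won’t handle NAs here:

```r
# Color a scatterplot by a third variable to look for confounding
plot(df$x, df$y, col = as.factor(df$z), pch = 19)

# Principal components and hierarchical clustering on the scaled numeric columns
num <- scale(df[, sapply(df, is.numeric)])
pc  <- prcomp(num)
plot(pc$x[, 1], pc$x[, 2], pch = 19)   # first two principal components
hc  <- hclust(dist(num))
plot(hc)                               # dendrogram of the observations
```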
Step 4: Get a quick and dirty answer to the question from Step 1
After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn’t gone wrong in the data set.
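For a simple continuous outcome, that quick-and-dirty pass might look something like this (df and outcome are placeholders; for a binary outcome I’d swap in glm and a different error measure):

```r
# 60% training / 40% test split and a basic linear model as a first look
set.seed(20140612)
in_train <- sample(nrow(df), size = floor(0.6 * nrow(df)))
train <- df[in_train, ]
test  <- df[-in_train, ]

fit  <- lm(outcome ~ ., data = train)
pred <- predict(fit, newdata = test)
sqrt(mean((pred - test$outcome)^2))    # rough out-of-sample error
```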