What I do when I get a new data set as told through tweets
13 Jun 2014
Hilary Mason asked a really interesting question yesterday:
Data people: What is the very first thing you do when you get your hands on a new data set?
— Hilary Mason (@hmason) June 12, 2014
You should really consider reading the whole discussion; it is amazing. It also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet; there was way too much good stuff to get them all.
Step 0: Figure out what I’m trying to do with the data
At least for me, a new data set arrives in one of three ways: (1) I made it myself, (2) a collaborator created a data set with a specific question in mind, or (3) a collaborator created a data set and just wants to explore it. In the first and second cases I already know what the question is, although in case (2) I sometimes spend a little more time making sure I understand the question before diving in. @visualisingdata and I think alike here:
@hmason this will sound textbooky but I stop, look and think about "what's it about (phenomena, activity, entity etc). Look before see.
— Andy Kirk (@visualisingdata) June 12, 2014
Usually this involves figuring out what the variables mean, like @_jden does.
If I’m working with a collaborator I do what @evanthomaspaul does.
If the data don’t have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can’t. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar.
Step 1: Learn about the elephant
Unless the data is something I’ve analyzed a lot before, I usually feel like the blind men and the elephant.
So the first thing I do is fool around a bit to figure out what the data set “looks” like. Like @jasonpbecker, I check what types of variables I have and what the first few and last few observations look like.
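As a rough illustration of that first look, here is a minimal sketch using only the Python standard library. The `peek` helper, the column names, and the tiny inline data set are all made up for the example; an interactive session in R or pandas would do the same job in a line or two.

```python
import csv
import io

def peek(rows, n=3):
    """Return the first n rows, the last n rows, and a guessed type per column."""
    def guess_type(values):
        # Try int, then float; fall back to str. Empty strings are ignored.
        present = [v for v in values if v != ""]
        for caster, name in ((int, "int"), (float, "float")):
            try:
                for v in present:
                    caster(v)
                return name
            except ValueError:
                continue
        return "str"

    columns = list(rows[0].keys()) if rows else []
    types = {c: guess_type([r[c] for r in rows]) for c in columns}
    return rows[:n], rows[-n:], types

# Made-up example data standing in for a real file.
raw = io.StringIO("id,age,group\n1,34,a\n2,,b\n3,51.5,a\n4,29,c\n")
rows = list(csv.DictReader(raw))

head, tail, types = peek(rows, n=2)
print(types)
print(head)
print(tail)
```

The type guess is deliberately crude: it only separates integers, floats, and everything else, which is usually enough to spot a numeric column that was read in as text.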
If it is medical/social data I usually use this to look for personally identifiable information and then do what @peteskomoroch does.
If the data set is really big, I usually take a carefully chosen random subsample so that I can explore it interactively, like @richardclegg.
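One way to draw such a subsample without loading everything into memory is reservoir sampling. The sketch below is an assumption about how one might do it, not a description of anyone's actual workflow; `reservoir_sample` is a hypothetical helper, and a plain uniform sample like this can miss rare groups, so "carefully chosen" may mean stratifying first.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(row)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# A range stands in for rows streamed from a large file.
subsample = reservoir_sample(range(1_000_000), k=1000)
print(len(subsample))
```

Fixing the seed keeps the exploration reproducible, which matters when a quirk spotted in the subsample needs to be found again later.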
After doing that I look for weird quirks, such as missing values or outliers, like @feralparakeet, @cpwalker07, @toastandcereal, and @cld276.
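A minimal sketch of that kind of quirk hunting, assuming missing values have already been read in as `None`. The helper names, the toy rows, and the choice of Tukey's 1.5 × IQR rule for flagging outliers are illustrative assumptions; other rules may suit a given data set better.

```python
import statistics

def missing_counts(rows):
    """Count None values per column in a list of dicts."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return counts

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]

# Made-up rows; 440 is a plausible data-entry error for an age of 44.
pairs = [(34, 70.0), (29, None), (41, 82.5), (38, 64.0), (440, 77.0),
         (None, 69.5), (36, 71.0), (31, 90.0), (45, 58.5), (33, 75.0)]
rows = [{"age": a, "weight": w} for a, w in pairs]

print(missing_counts(rows))
print(iqr_outliers([r["age"] for r in rows]))
```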
Step 2: Clean/organize
I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a tidy data set. This includes fixing up missing value encoding like @chenghlee, or more generically like @RubyChilds.
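Fixing missing value encoding often means mapping a handful of sentinel codes to a single missing marker. The sketch below assumes the values arrive as strings; the set of codes is hypothetical, and the right set depends on the data dictionary for the data set in question.

```python
# Hypothetical sentinel codes; check the data dictionary for the real ones.
MISSING_CODES = {"", "NA", "N/A", "null", "-999", "9999"}

def recode_missing(rows, missing_codes=MISSING_CODES):
    """Return new rows with whitespace stripped and sentinel strings set to None."""
    cleaned = []
    for row in rows:
        cleaned.append({
            col: None if value is None or value.strip() in missing_codes
            else value.strip()
            for col, value in row.items()
        })
    return cleaned

rows = [{"age": "34", "income": "-999"}, {"age": " NA ", "income": "52000"}]
cleaned = recode_missing(rows)
print(cleaned)
```

Returning new rows rather than editing in place keeps the raw data around, so a recoding mistake (say, a legitimate value of 9999) can be undone.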
I usually do a fair amount of this, like @the_turtle too.
When I’m done I do a bunch of sanity checks and data integrity checks like @deaneckles, and if things are screwed up I go back and fix them.
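What counts as a sanity check depends on the data, but two common ones are a unique identifier and plausible value ranges. Here is a minimal sketch under those assumptions; `integrity_problems`, the column names, and the age bounds are invented for the example.

```python
def integrity_problems(rows, key, ranges):
    """Return a list of human-readable problems found in rows.

    key: column expected to be unique and non-missing.
    ranges: {column: (low, high)} plausible bounds, inclusive.
    """
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        k = row.get(key)
        if k is None:
            problems.append(f"row {i}: missing {key}")
        elif k in seen:
            problems.append(f"row {i}: duplicate {key}={k}")
        else:
            seen.add(k)
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: {col}={value} outside [{low}, {high}]")
    return problems

rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 440},
    {"id": 2, "age": 29},
]
problems = integrity_problems(rows, key="id", ranges={"age": (0, 120)})
for problem in problems:
    print(problem)
```

Collecting every problem instead of stopping at the first one makes it easier to go back and fix them in a single pass.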
Step 3: Plot. That. Stuff.
After getting a handle on the data with mostly text-based tables and output (things that don’t require a graphics device) and cleaning things up a bit, I start plotting everything like @hspter.
At this stage my goal is to get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one-dimensional plots to look at variables one by one, like @FisherDanyel.
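In the quick-and-ugly spirit of exploratory graphics, even a text histogram can show the shape of a variable. The sketch below is a dependency-free stand-in for a real plotting library, with equal-width bins and made-up values; the function names are hypothetical.

```python
def histogram_counts(values, bins=10):
    """Count values in equal-width bins spanning [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

def print_histogram(values, bins=10, bar_width=40):
    """Print one line per bin: left edge, a bar of '#' characters, and the count."""
    edges, counts = histogram_counts(values, bins)
    tallest = max(counts)
    for left, count in zip(edges, counts):
        bar = "#" * round(bar_width * count / tallest)
        print(f"{left:10.2f} | {bar} {count}")

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
print_histogram(values, bins=5)
```

With these values almost everything lands in the first bin and a single point sits alone in the last one, which is exactly the kind of thing a one-variable plot is meant to surface.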
To compare the distributions of variables I usually use overlaid density plots, like @sjwhitworth:
@hmason density plot all the things!
— Stephen Whitworth (@sjwhitworth) June 12, 2014
I make tons of scatterplots to look at relationships between variables, like @wduyck:
@hmason plot scatterplots and distributions
— Wouter Duyck (@wduyck) June 12, 2014
I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.
Step 4: Get a quick and dirty answer to the question from Step 0
After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn’t gone wrong in the data set.
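To make the 60% training / 40% test idea concrete, here is a minimal sketch with a one-variable least-squares fit. The data are simulated with a known slope of 2, and the helper names are invented for the example; a real analysis would swap in the actual data and whatever simple model fits the question.

```python
import random

def train_test_split(rows, train_fraction=0.6, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(pairs, intercept, slope):
    """Average squared prediction error on (x, y) pairs."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pairs) / len(pairs)

# Simulated data with a known signal (slope 2) plus Gaussian noise.
rng = random.Random(1)
data = [(x, 3.0 + 2.0 * x + rng.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data)
intercept, slope = fit_line(train)
test_mse = mean_squared_error(test, intercept, slope)
print(f"slope={slope:.2f}, test MSE={test_mse:.2f}")
```

Comparing the held-out error with the spread of the outcome gives a first read on whether the signal is huge, medium, or subtle.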