What I do when I get a new data set as told through tweets

Jeff Leek

Hilary Mason asked a really interesting question yesterday:

Data people: What is the very first thing you do when you get your hands on a new data set?

— Hilary Mason (@hmason) June 12, 2014

You should really consider reading the whole discussion here; it is amazing. But it also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet; there was way too much good stuff to get it all.

Step 0: Figure out what I’m trying to do with the data

At least for me, a new data set arrives in one of three ways: (1) I made it myself, (2) a collaborator created a data set with a specific question in mind, or (3) a collaborator created a data set and just wants to explore it. In the first and second cases I already know what the question is, although in case (2) I sometimes still spend a little more time making sure I understand the question before diving in. @visualisingdata and I think alike here:

@hmason this will sound textbooky but I stop, look and think about “what’s it about (phenomena, activity, entity etc). Look before see.

— Andy Kirk (@visualisingdata) June 12, 2014

Usually this involves figuring out what the variables mean, like @_jden does:

@hmason try to figure out what the fields mean and how it’s coded — :sandwich emoji: (@_jden) June 12, 2014

If I’m working with a collaborator I do what @evanthomaspaul does:

@hmason Interview the source, if possible, to know all of the problems with the data, use limitations, caveats, etc. — Evan Thomas Paul (@evanthomaspaul) June 12, 2014

If the data don’t have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can’t. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar:

@hmason figure out the format & how to read it. Then ask myself, what can be learned from this data? — Jacob (@japerk) June 12, 2014

Step 1: Learn about the elephant

Unless the data are something I've analyzed a lot before, I usually feel like the blind men and the elephant.

So the first thing I do is fool around a bit to try to figure out what the data set "looks" like. I do things like what @jasonpbecker does: looking at the types of variables I have and at what the first few and last few observations look like.

@hmason sapply(df, class); head(df); tail(df) — Jason Becker (@jasonpbecker) June 12, 2014
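That R one-liner carries over directly to other tools; here is a rough pandas equivalent (a sketch using a made-up toy data frame, not any particular data set):

```python
import pandas as pd

# Toy data frame standing in for a freshly loaded data set (hypothetical values)
df = pd.DataFrame({
    "age": [34, 51, 29, 44],
    "group": ["a", "b", "a", "b"],
    "score": [0.7, 0.3, 0.9, 0.5],
})

# Equivalent of sapply(df, class): the type of every column
print(df.dtypes)

# Equivalent of head(df) / tail(df): first and last few observations
print(df.head(2))
print(df.tail(2))
```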

If it is medical/social data, I usually use this step to look for personally identifiable information and then do what @peteskomoroch does:

@hmason remove PII and burn it with fire — Peter Skomoroch (@peteskomoroch) June 12, 2014

If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively, like @richardclegg does:

@hmason unless it is big data in which case sample then import to R and look for NAs… :-) — Richard G. Clegg (@richardclegg) June 12, 2014
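A minimal sketch of that sampling step in pandas (the frame and sample size are hypothetical; fixing the random seed keeps the exploration reproducible):

```python
import pandas as pd

# Hypothetical "big" data set; in practice this would be millions of rows
big = pd.DataFrame({"x": range(100_000)})

# Carefully chosen random subsample, small enough to explore interactively;
# a fixed seed makes the subsample reproducible
sample = big.sample(n=1_000, random_state=42)

# Then look for NAs, as in the tweet
print(sample.isna().sum())
```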

After doing that, I look for weird quirks, such as missing values or outliers, like @feralparakeet does:

@hmason ALL THE DESCRIPTIVES. Well, after reviewing the codebook, of course. — Vickie Edwards (@feralparakeet) June 12, 2014

and like @cpwalker07

@hmason count # rows, read every column header — Chris Walker (@cpwalker07) June 12, 2014

and like @toastandcereal

@hmason @mispagination jot down number of rows. That way I can assess right away whether I’ve done something dumb later on. — Jessica Balsam (@toastandcereal) June 12, 2014

and like @cld276

@hmason run a bunch of count/groupby statements to gauge if I think it’s corrupt. — Carol Davidsen (@cld276) June 12, 2014

and @adamlaiacano

@hmason summary() — Adam Laiacano (@adamlaiacano) June 12, 2014
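Rolled together, those suggestions (row counts, summary()-style descriptives, and count/groupby statements) look roughly like this in pandas, using a toy frame:

```python
import pandas as pd

# Toy frame; the column names and values are made up for illustration
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, None, 5.0],
})

# Jot down the number of rows, to catch dumb mistakes later
n_rows = len(df)

# Rough equivalent of R's summary(): per-column descriptives
desc = df.describe()

# count/groupby statements to gauge whether the data look corrupt
counts = df.groupby("group").size()
print(n_rows, counts.to_dict())
```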

Step 2: Clean/organize

I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a tidy data set. This includes fixing up missing value encodings, like @chenghlee does:

.@hmason Often times, “fix” various codings, esp. for missing data (e.g., mixed strings & ints for coded vals; decide if NAs, “” are equiv.) — Cheng H. Lee (@chenghlee) June 12, 2014
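A small sketch of that kind of fix in pandas, assuming a hypothetical column where missing values were coded as both "" and "NA" and the rest are numbers stored as strings:

```python
import numpy as np
import pandas as pd

# Hypothetical raw column with mixed codings: numbers as strings, "", and "NA"
raw = pd.Series(["12", "7", "", "NA", "19"])

# Decide that "" and "NA" are both missing, then coerce the rest to numbers
clean = pd.to_numeric(raw.replace({"": np.nan, "NA": np.nan}))
print(clean)
```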

or, more generically, like @RubyChilds does:

@hmason clean it — Ruby ˁ˚ᴥ˚ˀ (@RubyChilds) June 12, 2014

I usually do a fair amount of this, like @the_turtle too:

@hmason Spend the next two days swearing because nobody cleaned it. — The Turtle (@the_turtle) June 12, 2014

When I’m done, I do a bunch of sanity checks and data integrity checks like @deaneckles does, and if things are screwed up I go back and fix them:

@treycausey @hmason Test really boring hypotheses. Like num_mobile_comments <= num_comments. — Dean Eckles (@deaneckles) June 12, 2014
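A boring-hypothesis check like that is just an assertion on the data; here is a pandas sketch with hypothetical column names mirroring the tweet's example:

```python
import pandas as pd

# Toy post-level data; column names are made up, mirroring the tweet
posts = pd.DataFrame({
    "num_comments": [10, 4, 0],
    "num_mobile_comments": [3, 4, 0],
})

# A "really boring hypothesis": an integrity constraint that must hold.
# If it fails, go back and fix the data before analyzing anything else.
ok = bool((posts["num_mobile_comments"] <= posts["num_comments"]).all())
assert ok, "mobile comments exceed total comments somewhere"
print("integrity checks passed")
```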

Step 3: Plot. That. Stuff.

After getting a handle on the data with mostly text-based tables and output (things that don't require a graphics device) and cleaning things up a bit, I start plotting everything, like @hspter does:

@hmason usually head(data) then straight to visualization. Have been working on some “unit tests” for data as well https://t.co/6Qd3URmzpe — Hilary Parker (@hspter) June 12, 2014

At this stage my goal is to get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one-dimensional plots to look at variables one by one, like @FisherDanyel does:

@TwoHeadlines @hmason After looking at a few hundred random rows? Histograms & scatterplots of columns to understand what I have. — Danyel Fisher (@FisherDanyel) June 12, 2014

To compare the distributions of variables I usually use overlaid density plots, like @sjwhitworth does:

@hmason density plot all the things!

— Stephen Whitworth (@sjwhitworth) June 12, 2014

I make tons of scatterplots to look at relationships between variables, like @wduyck does:

@hmason plot scatterplots and distributions

— Wouter Duyck (@wduyck) June 12, 2014

I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.
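Since no one tweeted it, here is a minimal principal components sketch via the SVD in NumPy, on toy data with a deliberately hidden dependency between the first two variables (everything here is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multivariate data: 100 observations, 5 variables, with the first two
# variables strongly correlated (a hidden dependency to discover)
z = rng.normal(size=(100, 1))
x = np.hstack([z, z + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 3))])

# Principal components via the SVD of the centered data matrix
xc = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(xc, full_matrices=False)
var_explained = s**2 / np.sum(s**2)

# A first component that dominates is a hint of strong multivariate structure
print(var_explained.round(2))
```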

Step 4: Get a quick and dirty answer to the question from Step 1

After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn’t gone wrong in the data set.
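A quick-and-dirty version of that last step, sketched in NumPy on simulated data (the 60% training / 40% test split and the basic least-squares model stand in for whatever the real question calls for):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with a huge signal: y is mostly a linear function of x
n = 500
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(size=n)

# Quick and dirty 60% training / 40% test split
idx = rng.permutation(n)
train, test = idx[: int(0.6 * n)], idx[int(0.6 * n):]

# Really basic regression model on the training set (least squares with intercept)
A = np.column_stack([np.ones(len(train)), x[train]])
coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)

# Gauge the signal on the held-out 40%: r2 near 1 means a huge signal,
# near 0 means subtle or none
pred = coef[0] + coef[1] * x[test]
r2 = 1 - np.var(y[test] - pred) / np.var(y[test])
print(round(float(r2), 2))
```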