Jan de Leeuw owns the Internet

One of the best things to happen on the Internet recently is that Jan de Leeuw has decided to own the Twitter/Facebook universe. If you do not already, you should be following him. Among his many accomplishments, he founded the Department of Statistics at UCLA (my alma mater), which is currently thriving. On the occasion of the Department's 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he "knew everyone and knew everything". Pretty accurate description, in my opinion.

Jan's been tweeting quite a bit of late, but recently had this gem:

followed by

I'm not sure what Jan's thinking behind the first tweet was, but I think many in statistics would consider it a "good thing" to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?

This is a situation where I think there is a large disconnect between what "should be" and what "is reality". What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don't necessarily have one (primarily, people working on the border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.

The reality, though, is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as "not statistics". This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is that statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn't fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.

Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won't be if we don't try.

Posted in Uncategorized | Leave a comment

Piketty in R markdown - we need some help from the crowd

Thomas Piketty's book Capital in the 21st Century was a surprise best seller and the subject of intense scrutiny. A few weeks ago the Financial Times claimed that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the London School of Economics posted a similar call to make the data open and machine readable, saying:

None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.

A few friends of Simply Stats had started on a project to translate his work from the Excel files where the original analysis resides into R. The people who helped were Alyssa Frazee, Aaron Fisher, Bruce Swihart, Abhinav Nellore, Hector Corrada Bravo, John Muschelli, and me. We haven't finished translating all chapters, so we are asking anyone who is interested to help translate the book's technical appendices into R markdown documents. If you are interested, please send pull requests to the gh-pages branch of this GitHub repo.

As a way to entice you to participate, here is one interesting thing we found. We don't know enough economics to know if what we are finding is "right" or not, but the x-axes in the Excel files are really distorted. For example, here is Figure 1.1 from the Excel files, where the ticks on the x-axis are separated by 20, 50, 43, 37, 20, 20, and 22 years.



Here is the same plot with an equally spaced x-axis.


I'm not sure if it makes any difference, but it is interesting. It sounds like, on balance, the Piketty analysis was mostly reproducible and reasonable. But having the data available in a more readily analyzable format will allow for more concrete discussion based on the data. So consider contributing to our GitHub repo.
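To make the axis issue concrete, here is a small Python sketch. It uses only the tick gaps quoted above and compares where each tick lands when the ticks are drawn the same distance apart (as in the Excel chart) with where it would land on an axis where distance is proportional to elapsed time. The function name and the choice to express positions as fractions of the axis width are my own assumptions for illustration; this is not code from the repo.

```python
from itertools import accumulate

# Gaps in years between consecutive x-axis ticks, as quoted in the post.
gaps = [20, 50, 43, 37, 20, 20, 22]

def tick_positions(gaps):
    """Return (even, proportional) tick positions as fractions of axis width."""
    n = len(gaps)
    # Category-style axis: every tick is drawn the same distance apart.
    even = [i / n for i in range(n + 1)]
    # Time-proportional axis: distance reflects the number of years elapsed.
    elapsed = [0] + list(accumulate(gaps))
    total = elapsed[-1]
    proportional = [e / total for e in elapsed]
    return even, proportional

even, proportional = tick_positions(gaps)
for i, (e, p) in enumerate(zip(even, proportional)):
    print(f"tick {i}: drawn at {e:.2f}, time-proportional at {p:.2f}")
```

The bigger the gap between the two columns, the more the drawn axis stretches or squeezes that stretch of time.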


Privacy as a function of sample size

The U.S. Supreme Court just made a unanimous ruling in Riley v. California making it clear that police officers must get a warrant before searching through the contents of a cell phone obtained incident to an arrest. The message was put pretty clearly in the decision:

 Our answer to the question of what police must do before searching a cell phone seized incident to an arrest is accordingly simple — get a warrant.

But I was more fascinated by this quote:

The sum of an individual’s private life can be reconstructed through a thousand photographs labeled with dates, locations, and descriptions; the same cannot be said of a photograph or two of loved ones tucked into a wallet.

So n = 2 is not enough to recreate a private life, but n = 1,000 (with associated annotation) is enough. I wonder what the minimum sample size needed is to officially violate someone's privacy. I'd be curious to get Cathy O'Neil's opinion on that question; she seems to have thought very hard about the relationship between data and privacy.

This is another case where I think that, to some extent, the Supreme Court made a decision on the basis of a statistical concept. Last time it was correlation, this time it is inference. As I read the opinion, part of the argument hinged on how much information do you get by searching a cell phone versus a wallet? Importantly, how much can you infer from those two sets of data?

If any of the Supremes want a primer in statistics, I'm available.


New book on implementing reproducible research

I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, Implementing Reproducible Research, has just been published by CRC Press. Although it is technically in their "R Series", the chapters contain information on a wide variety of useful tools, not just R-related tools.

There is also a supplementary web site hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.


The difference between data hype and data hope

I was reading one of my favorite stats blogs, StatsChat, where Thomas points to this article in the Atlantic and highlights this quote:

Dassault Systèmes is focusing on that level of granularity now, trying to simulate propagation of cholesterol in human cells and building oncological cell models. "It's data science and modeling," Charlès told me. "Coupling the two creates a new environment in medicine."

I think that is a perfect example of data hype. This is a cool idea and, if it worked, would be completely revolutionary. But the reality is we are not even close to this. In very simple model organisms we can predict very high-level phenotypes some of the time with whole-cell modeling. We aren't anywhere near the resolution we'd need to model the behavior of human cells, let alone the complex genetic, epigenetic, genomic, and environmental components that likely contribute to complex diseases. It is awesome that people are thinking about the future, and the fastest way to the future of science is usually through science fiction, but this is way overstating the power of current or even currently achievable data science.

So does that mean data science for improving clinical trials right now should be abandoned?


No. There is a ton of currently applicable, real-world data science being done in sequential analysis, adaptive clinical trials, and dynamic treatment regimes. These are important contributions that are impacting clinical trials right now, where advances can reduce costs, reduce patient harm, and speed the implementation of clinical trials. I think that is the hope of data science: using statistics and data to make steady, realizable improvement in the way we treat patients.


Heads up if you are going to submit to the Journal of the National Cancer Institute

Update (6/19/14): The folks at JNCI and OUP have kindly confirmed that they will consider manuscripts that have been posted to preprint servers. 

I just got this email about a paper we submitted to JNCI:

Dear Dr. Leek:

I am sorry that we will not be able to use the above-titled manuscript. Unfortunately, the paper was published online on a site called bioRXiv, The Preprint Server for Biology, hosted by Cold Spring Harbor Lab. JNCI does not publish previously published work.

Thank you for your submission to the Journal.

I have to say I'm not totally surprised, but I am a little disappointed. The future of academic publishing is definitely not evenly distributed.


The future of academic publishing is here, it just isn't evenly distributed

Academic publishing has always been a slow process. Typically you would submit a paper for publication and then wait a few months to more than a year (statistics journals can be slow!) for a review. Then you'd revise the paper in a process that would take another couple of months, resubmit it and potentially wait another few months while this second set of reviews came back.

Lately statistics and statistical genomics have been doing more of what math does and posting papers to arXiv or to bioRxiv. I don't know if it is just me, but using this process has led to a massive speedup in the rate at which my academic work gets used/disseminated. Here are a few examples of how crazy it is out there right now.

I started a post on giving talks on GitHub. It was tweeted before I even finished!

I really appreciate the compliment, especially coming from someone whose posts I read all the time, but it was wild to me that I hadn't even finished the post yet (still haven't) and it was already public.

Another example is that we have posted several papers on bioRxiv and they all get tweeted/read. When we posted the Ballgown paper it was rapidly discussed. The day after it was posted, there were already blog posts about the paper up.

We also have been working on another piece of software on GitHub that hasn't been published yet, but we have already had multiple helpful contributions from people outside our group.

While all of this is going on, we have a paper out to review that we have been waiting to hear about for multiple months. So while open science is dramatically speeding up the rate at which we disseminate our results, the speed isn't evenly distributed.


What I do when I get a new data set as told through tweets

Hilary Mason asked a really interesting question yesterday:

You should really consider reading the whole discussion here; it is amazing. But it also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet; there was way too much good stuff to get them all.

Step 0: Figure out what I'm trying to do with the data

At least for me, I come to a new data set in one of three ways: (1) I made it myself, (2) a collaborator created a data set with a specific question in mind, or (3) a collaborator created a data set and just wants to explore it. In the first and second cases I already know what the question is, although sometimes in case (2) I still spend a little more time making sure I understand the question before diving in. @visualisingdata and I think alike here:

Usually this involves figuring out what the variables mean like @_jden does:

If I'm working with a collaborator I do what @evanthomaspaul does:

If the data don't have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can't. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar:

Step 1: Learn about the elephant

Unless the data is something I've analyzed a lot before, I usually feel like the blind men and the elephant.

So the first thing I do is fool around a bit to try to figure out what the data set "looks" like. I do things like what @jasonpbecker does: looking at the types of variables I have and at what the first few and last few observations look like.
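As a rough illustration of that first look, here is a minimal Python sketch using only the standard library. The toy data, the column names, and the `guess_type` helper are all hypothetical stand-ins I made up for this example; with a real file you would open it instead of using `io.StringIO`.

```python
import csv
import io

# Hypothetical toy data standing in for a newly received file.
raw = io.StringIO(
    "id,age,group,score\n"
    "1,34,a,0.52\n"
    "2,51,b,0.61\n"
    "3,NA,a,0.47\n"
    "4,29,b,0.90\n"
    "5,43,a,0.38\n"
)
rows = list(csv.DictReader(raw))

def guess_type(values):
    """Guess a column type from its string values: int, float, or str."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"

column_types = {col: guess_type([r[col] for r in rows]) for col in rows[0]}
print("shape:", len(rows), "rows x", len(rows[0]), "columns")
print("types:", column_types)
print("first 2:", rows[:2])
print("last 2:", rows[-2:])
```

A numeric-looking column that comes back as `str` (here `age`, because of the `NA` entry) is often the first hint of a missing-value code that needs attention later.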

If it is medical/social data I usually use this to look for personally identifiable information and then do what @peteskomoroch does:

If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively like @richardclegg
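One simple way to get such a subsample without loading everything into memory is reservoir sampling. To be clear, this is a technique I am swapping in for illustration (a plain uniform sample, nothing more "carefully chosen" than that), and the function name, the seed, and the sizes below are assumptions.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a random reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample

# Usage with a generator standing in for a file too big to load at once.
big = (n for n in range(100_000))
subset = reservoir_sample(big, k=1000)
print(len(subset), min(subset), max(subset))
```

If the data have structure that matters (rare groups, time order), a stratified scheme may be a better fit than this uniform one.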

After doing that I look for weird quirks, like if there are missing values or outliers like @feralparakeet

and like @cpwalker07

and like @toastandcereal

and like @cld276

and @adamlaiacano
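Here is a minimal sketch of that kind of quirk hunting in Python: count the missing values in a column and flag points outside the usual 1.5 x IQR fences. The values, the use of `None` for missing, and the choice of the IQR rule are illustrative assumptions, not a recommendation for every data set.

```python
import statistics

# Hypothetical column with missing entries (None) and one suspicious value.
values = [4.1, 3.9, None, 4.3, 4.0, 41.0, None, 3.8, 4.2]

def quirks(values, k=1.5):
    """Count missing values and flag points outside the Tukey fences."""
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)
    # Quartiles of the observed values (statistics.quantiles default method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in present if v < low or v > high]
    return n_missing, outliers

n_missing, outliers = quirks(values)
print("missing:", n_missing, "possible outliers:", outliers)
```

A flagged point is only a prompt to go look; 41.0 might be a typo for 4.1 or a real measurement.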

Step 2: Clean/organize

I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a tidy data set. This includes fixing up missing value encoding like @chenghlee

or more generically like: @RubyChilds

I usually do a fair amount of this, like @the_turtle too:

When I'm done I do a bunch of sanity checks and data integrity checks like @deaneckles, and if things are screwed up I go back and fix them:
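The cleaning and checking loop might look something like the sketch below. Everything specific in it is hypothetical: the list of missing-value codes, the `id` and `age` columns, and the plausible age range are assumptions for the example, and the real ones should come from the codebook or the collaborator.

```python
# Codes that this hypothetical data set uses to mean "missing" (an assumption;
# check the codebook or ask the collaborator for the real list).
MISSING_CODES = {"", "NA", "N/A", "-999", "."}

def clean_value(raw):
    """Map any missing-value code to None and strip stray whitespace."""
    text = raw.strip()
    return None if text in MISSING_CODES else text

def check_rows(rows):
    """Return a list of human-readable problems found in the cleaned rows."""
    problems = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate ids")
    for r in rows:
        age = r["age"]
        if age is not None and not 0 <= int(age) <= 120:
            problems.append(f"id {r['id']}: implausible age {age}")
    return problems

raw_rows = [
    {"id": "1", "age": " 34"},
    {"id": "2", "age": "-999"},
    {"id": "3", "age": "NA"},
    {"id": "3", "age": "430"},
]
rows = [{k: clean_value(v) for k, v in r.items()} for r in raw_rows]
problems = check_rows(rows)
print(rows)
print(problems)
```

Returning a list of problems instead of stopping at the first one makes it easier to fix everything in one pass and then rerun the checks.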

Step 3: Plot. That. Stuff.

After getting a handle on the data with mostly text-based tables and output (things that don't require a graphics device) and cleaning things up a bit, I start plotting everything like @hspter

At this stage my goal is to get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one-dimensional plots to look at variables one by one like @FisherDanyel
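In the spirit of fast and ugly, here is a sketch of both ideas in plain Python. It prints a text histogram instead of opening a graphics device, and it computes the jittered coordinates you would hand to whatever plotting tool you use. The data, the bin width, and the jitter amount are all made up for illustration.

```python
import random
from collections import Counter

def text_histogram(values, bin_width):
    """Print a rough histogram; return the counts keyed by bin lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    for edge in sorted(counts):
        print(f"{edge:>6} | {'#' * counts[edge]}")
    return counts

def jitter(values, amount=0.2, seed=0):
    """Pair each value with a small random offset for a one-dimensional plot."""
    rng = random.Random(seed)
    return [(v, rng.uniform(-amount, amount)) for v in values]

ages = [23, 25, 25, 31, 34, 34, 34, 38, 41, 47, 52, 52, 79]  # hypothetical
counts = text_histogram(ages, bin_width=10)
points = jitter(ages)  # (value, vertical offset) pairs, so ties don't overlap
```

The jitter only spreads tied points apart so they can be seen; the offsets carry no information.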

To compare the distributions of variables I usually use overlaid density plots like @sjwhitworth

I make tons of scatterplots to look at relationships between variables like @wduyck

I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.

Step 4: Get a quick and dirty answer to the question from Step 0

After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn't gone wrong in the data set.
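A quick and dirty version of this, sketched in Python with simulated data, could look like the following. The 60/40 split is from the description above; the simulated signal, the seeds, the single-predictor least squares fit, and the use of test-set R^2 as the yardstick are assumptions chosen to keep the example short.

```python
import random

def split_60_40(rows, seed=0):
    """Shuffle and split rows into 60% training and 40% test."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]

def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(pairs, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y for _, y in pairs) / len(pairs)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    return 1 - ss_res / ss_tot

# Simulated data with a known signal (slope 2) plus noise; purely illustrative.
rng = random.Random(1)
data = [(x, 1.0 + 2.0 * x + rng.gauss(0, 1)) for x in range(50)]
train, test = split_60_40(data)
intercept, slope = fit_line(train)
test_r2 = r_squared(test, intercept, slope)
print(f"slope={slope:.2f}, test R^2={test_r2:.2f}")
```

Scoring on the held-out 40% rather than the training rows is what tells you whether the signal is huge, medium, or subtle, and not just an artifact of the fit.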


The Real Reason Reproducible Research is Important

Reproducible research has been on my mind a bit these days, partly because it has been in the news with the Piketty stuff, and also perhaps because I just published a book on it and I'm teaching a class on it as we speak (as well as next month and the month after...).

However, as I watch and read many discussions over the role of reproducibility in science, I often feel that many people miss the point. Now, just to be clear, when I use the word "reproducibility" or say that a study is reproducible, I do not mean "independent verification" as in a separate investigator conducted an independent study and came to the same conclusion as the original study (that is what I refer to as "replication"). By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results as the original study. In essence, it is the notion that the data analysis can be successfully repeated. Reproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.
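In code terms, this definition amounts to something like the following Python sketch: the same data run through the same code (including any random seed) should give the same results, which can be checked by comparing a fingerprint of the output. The stand-in analysis, the data, and the use of a SHA-256 digest are all assumptions made for the example.

```python
import hashlib
import json
import random
import statistics

def analysis(data, seed=42):
    """A stand-in analysis: bootstrap the mean and report summary numbers."""
    rng = random.Random(seed)  # seeded so the resampling is repeatable
    boot_means = [
        statistics.fmean(rng.choices(data, k=len(data))) for _ in range(200)
    ]
    return {
        "mean": round(statistics.fmean(data), 6),
        "boot_se": round(statistics.stdev(boot_means), 6),
    }

def digest(results):
    """Fingerprint of the results, for comparing two runs."""
    payload = json.dumps(results, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7]  # hypothetical "original data"
original = digest(analysis(data))
rerun = digest(analysis(data))  # someone else rerunning the same code and data
print("reproducible:", original == rerun)
```

Matching fingerprints say only that the analysis was repeated successfully; they say nothing about whether the conclusion drawn from it is right.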

Many people seem to conflate the ideas of reproducibility and correctness, but they are not the same thing. One must always remember that a study can be reproducible and still be wrong. By "wrong", I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think "sugar causes cancer"), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.

Then why is reproducibility so important? Reproducibility is important because it is the only thing that an investigator can guarantee about a study.

Contrary to what most press releases would have you believe, an investigator cannot guarantee that the claims made in a study are correct (unless they are purely descriptive). This is because in the history of science, no meaningful claim has ever been proven by a single study. (The one exception might be mathematics, where they are literally proving things in their papers.) So reproducibility is important not because it ensures that the results are correct, but rather because it ensures transparency and gives us confidence in understanding exactly what was done.

These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.


Post-Piketty Lessons

The latest crisis in data analysis comes to us (once again) from the field of Economics. Thomas Piketty, a French economist, recently published a book titled Capital in the 21st Century that has been a best-seller. I have not read the book, but based on media reports, it appears to make the claim that inequality has increased in recent years and will likely increase into the future. The book argues that this increase in inequality is driven by capitalism’s tendency to reward capital more than labor. This is my non-economist’s understanding of the book, but the specific claims of the book are not what I want to discuss here (there is much discussion elsewhere).

An interesting aspect of Piketty’s work, from my perspective, is that he has made all of his data and analysis available on the web. From what I can tell, his analysis was not trivial—data were collected and merged from multiple disparate sources and adjustments were made to different data series to account for various incompatibilities. To me, this sounds like a standard data analysis, in the sense that all meaningful data analyses are complicated. As noted by Nate Silver, data do not arise from a “virgin birth”, and in any example worth discussing, much work has to be done to get the data into a state in which statistical models can be fit, or even more simply, plots can be made.

Chris Giles, a journalist for the Financial Times, recently published a column (unfortunately blocked by paywall) in which he claimed that much of the analysis that Piketty had done was flawed or incorrect. In particular, he claimed that based on his (Giles’) analysis, inequality was not growing as much over time as Piketty claimed. Among other points, Giles claims that numerous errors were made in assembling the data and in Piketty’s original analysis.

This episode smacked of the recent Reinhart-Rogoff kerfuffle in which some fairly basic errors were discovered in those economists' Excel spreadsheets. Some of those errors only made small differences to the results, but a critical methodological component, in which the data were weighted in a special way, appeared to have a significant impact on the results if alternate approaches were taken.

Piketty has since responded forcefully to the FT's column, defending all of the work he has done and addressing the criticisms one by one. To me, the most important result of the FT analysis is that Piketty’s work appears to be largely reproducible. Piketty made his data available, with reasonable documentation (in addition to his book), and Giles was able to come up with the same numbers Piketty came up with. This is a good thing. Piketty’s work was complex, and the only way to communicate the entirety of it was to make the data and code available.

The other aspects of Giles’ analysis are, from an academic standpoint, largely irrelevant to me, particularly because I am not an economist. The reason I find them irrelevant is that the objections are largely over whether Piketty is correct or not. This is an obviously important question, but in any field, no single study or even synthesis can be determined to be "correct" at that instant. Time will tell, and if his work is "correct", his predictions will be borne out by nature. It's not so satisfying to have to wait many years to know if you are correct, but that's how science works.

In the meantime, economists will have a debate over the science and the appropriate methods and data used for analysis. This is also how science works, and it is only (really) possible because Piketty made his work reproducible. Otherwise, the debate would be largely uninformed.
