Academic statisticians: there is no shame in developing statistical solutions that solve just one problem

I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy into solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be very generally applied across problems and data types. There is a reason for this of course:  historically statisticians have had enormous influence by developing general theory/methods/concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming more and more rare while data scientists are becoming increasingly influential in their respective areas of applications by solving important context-specific problems. The success of Money Ball and the prediction of election results are two recent widely publicized examples.

A survey of papers published in our flagship journals make it quite clear that context-agnostic methodology are valued much more than detailed descriptions of successful solutions to specific problems. These applied papers tend to get published in subject matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic position away from collaborations that require substantial time and energy investments in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.

It is of course understandable that to remain a discipline academic statisticians can’t devote all our effort to solving specific problems and none to trying to the generalize these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.

But what should the balance be?

Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none.  By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.

Posted in Uncategorized | 7 Comments

Jan de Leeuw owns the Internet

One of the best things to happen on the Internet recently is that Jan de Leeuw has decided to own the Twitter/Facebook universe. If you do not already, you should be following him. Among his many accomplishments, he founded the Department of Statistics at UCLA (my alma mater), which is currently thriving. On the occasion of the Department's 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he "knew everyone and knew everything". Pretty accurate description, in my opinion.

Jan's been tweeting quite a bit of late, but recently had this gem:

followed by

I'm not sure what Jan's thinking behind the first tweet was, but I think many in statistics would consider it a "good thing" to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?

This is a situation where I think there is a large disconnect between what "should be" and what "is reality". What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don't necessarily have one (primarily, people working not he border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.

The reality though is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as "not statistics". This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is because statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn't fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.

Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won't be if we don't try.

Posted in Uncategorized | 1 Comment

Piketty in R markdown - we need some help from the crowd

Thomas Piketty's book Capital in the 21st Century was a surprise best seller and the subject of intense scrutiny. A few weeks ago the Financial Times claimed that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the London School of economics posted a similar call to make the data open and machine readable saying.

None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.

A few friends of Simply Stats  had started on a project to translate his work from the excel files where the original analysis resides into R. The people that helped were Alyssa Frazee, Aaron Fisher, Bruce Swihart, Abhinav Nellore, Hector Corrada Bravo, John Muschelli, and me. We haven't finished translating all chapters, so we are asking anyone who is interested to help contribute to translating the book's technical appendices into R markdown documents. If you are interested, please send pull requests to the gh-pages branch of this Github repo.

As a way to entice you to participate, here is one interesting thing we found. We don't know enough economics to know if what we are finding is "right" or not, but one interesting thing I found is that the x-axes in the excel files are really distorted. For example here is Figure 1.1 from the Excel files where the ticks on the x-axis are separated by 20, 50, 43, 37, 20, 20, and 22 years.



Here is the same plot with an equally spaced x-axis.


I'm not sure if it makes any difference but it is interesting. It sounds like on measure, the Piketty analysis was mostly reproducible and reasonable.  But having the data available in a more readily analyzable format will allow for more concrete discussion based on the data. So consider contributing to our github repo.

Posted in Uncategorized | 1 Comment

Privacy as a function of sample size

The U.S. Supreme Court just made a unanimous ruling in Riley v. California making it clear that police officers must get a warrant before searching through the contents of a cell phone obtained incident to an arrest. The message was put pretty clearly in the decision:

 Our answer to the question of what police must do before searching a cell phone seized incident to an arrest is accordingly simple — get a warrant.

But I was more fascinated by this quote:

The sum of an individual’s private life can be reconstructed through a thousand photographs labeled with dates, locations, and descriptions; the same cannot be said of a photograph or two of loved ones tucked into a wallet.

So n = 2 is not enough to recreate a private life, but n = 2,000 (with associated annotation) is enough.  I wonder what the minimum sample size needed is to officially violate someone's privacy. I'd be curious get Cathy O'Neil's opinion on that question, she seems to have thought very hard about the relationship between data and privacy.

This is another case where I think that, to some extent, the Supreme Court made a decision on the basis of a statistical concept. Last time it was correlation, this time it is inference. As I read the opinion, part of the argument hinged on how much information do you get by searching a cell phone versus a wallet? Importantly, how much can you infer from those two sets of data?

If any of the Supreme's want a primer in statistics, I'm available.

Posted in Uncategorized | Leave a comment

New book on implementing reproducible research

9781466561595I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, Implementing Reproducible Research, has just been published by CRC Press. Although it is technically in their "R Series", the chapters contain information on a wide variety of useful tools, not just R-related tools. 

There is also a supplementary web site hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.

Posted in Uncategorized | Leave a comment

The difference between data hype and data hope

I was reading one of my favorite stats blogs, StatsChat, where Thomas points to this article in the Atlantic and highlights this quote:

Dassault Systèmes is focusing on that level of granularity now, trying to simulate propagation of cholesterol in human cells and building oncological cell models. "It's data science and modeling," Charlès told me. "Coupling the two creates a new environment in medicine."

I think that is a perfect example of data hype. This is a cool idea and if it worked would be completely revolutionary. But the reality is we are not even close to this. In very simple model organisms we can predict very high level phenotypes some of the time with whole cell modeling. We aren't anywhere near the resolution we'd need to model the behavior of human cells, let alone the complex genetic, epigenetic, genomic, and environmental components that likely contribute to complex diseases. It is awesome that people are thinking about the future and the fastest way to science future is usually through science fiction, but this is way overstating the power of current or even currently achievable data science.

So does that mean data science for improving clinical trials right now should be abandoned?


There is tons of currently applicable and real world data science being done in sequential analysis,  adaptive clinical trials, and dynamic treatment regimes. These are important contributions that are impacting clinical trials right now and where advances can reduce costs, save patient harm, and speed the implementation of clinical trials. I think that is the hope of data science - using statistics and data to make steady, realizable improvement in the way we treat patients.

Posted in Uncategorized | Leave a comment

Heads up if you are going to submit to the Journal of the National Cancer Institute

Update (6/19/14): The folks at JNCI and OUP have kindly confirmed that they will consider manuscripts that have been posted to preprint servers. 

I just got this email about a paper we submitted to JNCI

Dear Dr. Leek:

I am sorry that we will not be able to use the above-titled manuscript. Unfortunately, the paper was published online on a site called bioRXiv, The Preprint Server for Biology, hosted by Cold Spring Harbor Lab. JNCI does not publish previously published work.

Thank you for your submission to the Journal.

I have to say I'm not totally surprised, but I am a little disappointed, the future of academic publishing is definitely not evenly distributed.

Posted in Uncategorized | 10 Comments

The future of academic publishing is here, it just isn't evenly distributed

Academic publishing has always been a slow process. Typically you would submit a paper for publication and then wait a few months to more than a year (statistics journals can be slow!) for a review. Then you'd revise the paper in a process that would take another couple of months, resubmit it and potentially wait another few months while this second set of reviews came back.

Lately statistics and statistical genomics have been doing more of what math does and posting papers to the arxiv or to biorxiv. I don't know if it is just me, but using this process has led to a massive speedup in the rate that my academic work gets used/disseminated. Here are a few examples of how crazy it is out there right now.

I started a post on giving talks on Github. It was tweeted before I even finished!

I really appreciate the compliment, especially coming from someone whose posts I read all the time, but it was wild to me that I hadn't even finished the post yet (still haven't) and it was already public.

Another example is that we have posted several papers on biorxiv and they all get tweeted/read. When we posted the Ballgown paper it was rapidly discussed. The day after it was posted, there were already blog posts about the paper up.

We also have been working on another piece of software on Github that hasn't been published yet, but have already had multiple helpful contributions from people outside our group.

While all of this is going on, we have a paper out to review that we have been waiting to hear about for multiple months. So while open science is dramatically speeding up the rate at which we disseminate our results, the speed isn't evenly distributed.

Posted in Uncategorized | Leave a comment

What I do when I get a new data set as told through tweets

Hilary Mason asked a really interesting question yesterday:

You should really consider reading the whole discussion here it is amazing. But it also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet, there was way too much good stuff to get them all.

Step 0: Figure out what I'm trying to do with the data

At least for me I come to a new data set in one of three ways: (1) I made it myself, (2) a  collaborator created a data set with a specific question in mind, or (3) a collaborator created a data set and just wants to explore it. In the first case and the second case I already know what the question is, although sometimes in case (2) I still spend a little more time making sure I understand the question before diving in. @visualisingdata and I think alike here:

  Usually this involves figuring out what the variables mean like @_jden does:

If I'm working with a collaborator I do what @evanthomaspaul does:

If the data don't have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can't. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar:

Step 1: Learn about the elephant Unless the data is something I've analyzed a lot before, I usually feel like the blind men and the elephant.

So the first thing I do is fool around a bit to try to figure out what the data set "looks" like by doing things like what @jasonpbecker does looking at the types of variables I have, what the first few observations and last few observations look like.

If it is medical/social data I usually use this to look for personally identifiable information and then do what @peteskomoroch does:

If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively like @richardclegg

After doing that I look for weird quirks, like if there are missing values or outliers like @feralparakeet

and like @cpwalker07

and like @toastandcereal

and like @cld276

and @adamlaiacano

Step 2: Clean/organize I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a tidy data set. This includes fixing up missing value encoding like @chenghlee

or more generically like: @RubyChilds

I usually do a fair amount of this, like @the_turtle too:

When I'm done I do a bunch of sanity checks and data integrity checks like @deaneckles and if things are screwed up I got back and fix them:

 Step 3: Plot. That. Stuff. After getting a handle with mostly text based tables and output (things that don't require a graphics device) and cleaning things up a bit I start with plotting everything like @hspter

At this stage my goal is to get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one d plots to look at variables one by one like @FisherDanyel

To compare the distributions of variables I usually use overlayed density plots like @sjwhitworth

I make tons of scatterplots to look at relationships between variables like @wduyck

I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.

Step 4: Get a quick and dirty answer to the question from Step 1

After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn't gone wrong in the data set.

Posted in Uncategorized | 4 Comments

The Real Reason Reproducible Research is Important

Reproducible research has been on my mind a bit these days, partly because it has been in the news with the Piketty stuff, and also perhaps because I just published a book on it and I'm teaching a class on it as we speak (as well as next month and the month after...).

However, as I watch and read many discussions over the role of reproducibility in science, I often feel that many people miss the point. Now, just to be clear, when I use the word "reproducibility" or say that a study is reproducible, I do not mean "independent verification" as in a separate investigator conducted an independent study and came to the same conclusion as the original study (that is what I refer to as "replication"). By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results of the original study. In essence, it is the notion that the data analysis can be successfully repeatedReproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.

Many people seem to conflate the ideas of reproducible and correctness, but they are not the same thing. One must always remember that a study can be reproducible and still be wrong. By "wrong", I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think "sugar causes cancer"), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.

Then why is reproducibility so important? Reproducibility is important because it is the only thing that an investigator can guarantee about a study.

Contrary to what most press releases would have you believe, an investigator cannot guarantee that the claims made in a study are correct (unless they are purely descriptive). This is because in the history of science, no meaningful claim has ever been proven by a single study. (The one exception might be mathematics, whether they are literally proving things in their papers.) So reproducibility is important not because it ensures that the results are correct, but rather because it ensures transparency and gives us confidence in understanding exactly what was done.

These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.

Posted in Uncategorized | 11 Comments