Heads up if you are going to submit to the Journal of the National Cancer Institute

Update (6/19/14): The folks at JNCI and OUP have kindly confirmed that they will consider manuscripts that have been posted to preprint servers. 

I just got this email about a paper we submitted to JNCI

Dear Dr. Leek:

I am sorry that we will not be able to use the above-titled manuscript. Unfortunately, the paper was published online on a site called bioRXiv, The Preprint Server for Biology, hosted by Cold Spring Harbor Lab. JNCI does not publish previously published work.

Thank you for your submission to the Journal.

I have to say I'm not totally surprised, but I am a little disappointed. The future of academic publishing is definitely not evenly distributed.

Posted in Uncategorized | 10 Comments

The future of academic publishing is here, it just isn't evenly distributed

Academic publishing has always been a slow process. Typically you would submit a paper for publication and then wait a few months to more than a year (statistics journals can be slow!) for a review. Then you'd revise the paper in a process that would take another couple of months, resubmit it and potentially wait another few months while this second set of reviews came back.

Lately statistics and statistical genomics have been doing more of what math does and posting papers to arXiv or to bioRxiv. I don't know if it is just me, but using this process has led to a massive speedup in the rate at which my academic work gets used/disseminated. Here are a few examples of how crazy it is out there right now.

I started a post on giving talks on Github. It was tweeted before I even finished!

I really appreciate the compliment, especially coming from someone whose posts I read all the time, but it was wild to me that I hadn't even finished the post yet (still haven't) and it was already public.

Another example is that we have posted several papers on biorxiv and they all get tweeted/read. When we posted the Ballgown paper it was rapidly discussed. The day after it was posted, there were already blog posts about the paper up.

We have also been working on another piece of software on Github that hasn't been published yet, but we have already had multiple helpful contributions from people outside our group.

While all of this is going on, we have a paper out to review that we have been waiting to hear about for multiple months. So while open science is dramatically speeding up the rate at which we disseminate our results, the speed isn't evenly distributed.


What I do when I get a new data set as told through tweets

Hilary Mason asked a really interesting question yesterday:

You should really consider reading the whole discussion here; it is amazing. But it also inspired me to write a post about what I do, as told by other people on Twitter. I apologize in advance if I missed your tweet; there was way too much good stuff to get them all.

Step 0: Figure out what I'm trying to do with the data

At least for me I come to a new data set in one of three ways: (1) I made it myself, (2) a collaborator created a data set with a specific question in mind, or (3) a collaborator created a data set and just wants to explore it. In the first case and the second case I already know what the question is, although sometimes in case (2) I still spend a little more time making sure I understand the question before diving in. @visualisingdata and I think alike here:

Usually this involves figuring out what the variables mean like @_jden does:

If I'm working with a collaborator I do what @evanthomaspaul does:

If the data don't have a question yet, I usually start thinking right away about what questions can actually be answered with the data and what can't. This prevents me from wasting a lot of time later chasing trends. @japerk does something similar:

Step 1: Learn about the elephant

Unless the data is something I've analyzed a lot before, I usually feel like the blind men and the elephant.

So the first thing I do is fool around a bit to figure out what the data set "looks" like. Like @jasonpbecker, I look at the types of variables I have and at what the first few and last few observations look like.

If it is medical/social data I usually use this to look for personally identifiable information and then do what @peteskomoroch does:

If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively like @richardclegg

After doing that I look for weird quirks, like if there are missing values or outliers like @feralparakeet

and like @cpwalker07

and like @toastandcereal

and like @cld276

and @adamlaiacano
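Here is a minimal sketch of what this kind of first look might involve, using only the Python standard library. Everything in it is an assumption made for illustration: the toy CSV, the set of strings treated as missing, and the helper names `reservoir_sample`, `guess_type` and `summarize` do not come from any of the tweets. The subsample uses reservoir sampling, which is one way to draw a uniform random sample from a file too big to load at once.

```python
import csv
import io
import random
from collections import Counter

# Assumed encodings for missing values; a real data set needs its own list.
MISSING = {"", "NA", "N/A", "NaN", "null"}

# Toy data standing in for a real file (made up for this example).
RAW = """id,age,city
1,34,Baltimore
2,NA,Boston
3,51,
4,29,Seattle
5,forty,Denver
"""


def reservoir_sample(rows, k, seed=0):
    """Draw k rows uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def guess_type(value):
    """Crude type guess for one raw string field."""
    if value.strip() in MISSING:
        return "missing"
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"


def summarize(rows):
    """Count the guessed types in each column."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            summary.setdefault(name, Counter())[guess_type(value)] += 1
    return summary


rows = list(csv.DictReader(io.StringIO(RAW)))
sample = reservoir_sample(rows, k=3, seed=1)
summary = summarize(rows)

print("first rows:", rows[:2])
print("last rows:", rows[-2:])
for column, counts in summary.items():
    print(column, dict(counts))
```

A column that is mostly numeric with a stray text value (like `age` here) is exactly the sort of quirk worth chasing before doing anything else.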

Step 2: Clean/organize

I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a tidy data set. This includes fixing up missing value encoding like @chenghlee

or more generically like: @RubyChilds

I usually do a fair amount of this, like @the_turtle too:

When I'm done I do a bunch of sanity checks and data integrity checks like @deaneckles and if things are screwed up I go back and fix them:
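As a rough sketch of the cleaning and checking described in this step, the following assumes a table held as a list of dictionaries of raw strings. The missing-value codes, the column names and the plausible range for `age` are all hypothetical choices made for the example; they are not taken from the tweets.

```python
# Assumed missing-value encodings (hypothetical; adjust for the real data).
MISSING_CODES = {"", "NA", "N/A", "null", "-999"}


def clean_value(raw):
    """Map any assumed missing-value code to None, otherwise strip whitespace."""
    value = raw.strip()
    return None if value in MISSING_CODES else value


def clean_rows(rows):
    """Apply clean_value to every field of every row."""
    return [{name: clean_value(value) for name, value in row.items()} for row in rows]


def sanity_checks(rows, key, numeric_ranges):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Integrity check: the key column should identify rows uniquely.
        if row[key] in seen:
            problems.append(f"row {i}: duplicate {key}={row[key]}")
        seen.add(row[key])
        # Range check: numeric columns should fall in a plausible interval.
        for column, (low, high) in numeric_ranges.items():
            if row[column] is None:
                continue
            try:
                x = float(row[column])
            except ValueError:
                problems.append(f"row {i}: {column}={row[column]!r} is not numeric")
                continue
            if not low <= x <= high:
                problems.append(f"row {i}: {column}={x} outside [{low}, {high}]")
    return problems


# Made-up rows with a padded NA, a sentinel code, a duplicate id and an impossible age.
raw_rows = [
    {"id": "1", "age": "34"},
    {"id": "2", "age": " NA "},
    {"id": "2", "age": "-999"},
    {"id": "4", "age": "430"},
]

cleaned = clean_rows(raw_rows)
problems = sanity_checks(cleaned, key="id", numeric_ranges={"age": (0, 120)})
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to see everything that needs fixing in one pass.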

Step 3: Plot. That. Stuff.

After getting a handle on things with mostly text-based tables and output (things that don't require a graphics device) and cleaning things up a bit, I start with plotting everything like @hspter

At this stage my goal is to get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one-dimensional plots to look at variables one by one like @FisherDanyel

To compare the distributions of variables I usually use overlaid density plots like @sjwhitworth

I make tons of scatterplots to look at relationships between variables like @wduyck

I usually color/size the dots in the scatterplots by other variables to see if I can identify any confounding relationships that might screw up analyses downstream. Then, if the data are multivariate, I do some dimension reduction to get a feel for high dimensional structure. Nobody mentioned principal components or hierarchical clustering in the Twitter conversation, but I end up using these a lot to just figure out if there are any weird multivariate dependencies I might have missed.

Step 4: Get a quick and dirty answer to the question from Step 0

After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis. I also often check this against the intuition of the person who generated the data to make sure something hasn't gone wrong in the data set.
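A quick and dirty check of this kind might look like the sketch below. It simulates data with a known linear signal (an assumption made purely so the example is self-contained), splits it 60% training and 40% test, fits a one-variable least squares line on the training set, and compares test error against a baseline that just predicts the training mean. The helper names are hypothetical.

```python
import random


def train_test_split(pairs, train_frac=0.6, seed=0):
    """Shuffle a copy of the data and split it into training and test sets."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(pairs, intercept, slope):
    """Root mean squared prediction error of the line on the given pairs."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in pairs]
    return (sum(squared) / len(squared)) ** 0.5


# Simulated data: y = 2 + 0.5 * x plus Gaussian noise (not real data).
noise = random.Random(42)
data = [(float(x), 2.0 + 0.5 * x + noise.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data, train_frac=0.6)
intercept, slope = fit_line(train)

train_mean = sum(y for _, y in train) / len(train)
baseline_error = rmse(test, train_mean, 0.0)  # predict the training mean everywhere
model_error = rmse(test, intercept, slope)

print(f"slope={slope:.3f} baseline RMSE={baseline_error:.2f} model RMSE={model_error:.2f}")
```

How far the model error falls below the baseline gives a first read on whether the signal is huge, medium or subtle.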


The Real Reason Reproducible Research is Important

Reproducible research has been on my mind a bit these days, partly because it has been in the news with the Piketty stuff, and also perhaps because I just published a book on it and I'm teaching a class on it as we speak (as well as next month and the month after...).

However, as I watch and read many discussions over the role of reproducibility in science, I often feel that many people miss the point. Now, just to be clear, when I use the word "reproducibility" or say that a study is reproducible, I do not mean "independent verification" as in a separate investigator conducted an independent study and came to the same conclusion as the original study (that is what I refer to as "replication"). By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results as the original study. In essence, it is the notion that the data analysis can be successfully repeated. Reproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.

Many people seem to conflate the ideas of reproducibility and correctness, but they are not the same thing. One must always remember that a study can be reproducible and still be wrong. By "wrong", I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think "sugar causes cancer"), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.

Then why is reproducibility so important? Reproducibility is important because it is the only thing that an investigator can guarantee about a study.

Contrary to what most press releases would have you believe, an investigator cannot guarantee that the claims made in a study are correct (unless they are purely descriptive). This is because in the history of science, no meaningful claim has ever been proven by a single study. (The one exception might be mathematics, where they are literally proving things in their papers.) So reproducibility is important not because it ensures that the results are correct, but rather because it ensures transparency and gives us confidence in understanding exactly what was done.

These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.


Post-Piketty Lessons

The latest crisis in data analysis comes to us (once again) from the field of Economics. Thomas Piketty, a French economist, recently published a book titled Capital in the 21st Century that has been a best-seller. I have not read the book, but based on media reports, it appears to make the claim that inequality has increased in recent years and will likely increase into the future. The book argues that this increase in inequality is driven by capitalism’s tendency to reward capital more than labor. This is my non-economist’s understanding of the book, but the specific claims of the book are not what I want to discuss here (there is much discussion elsewhere).

An interesting aspect of Piketty’s work, from my perspective, is that he has made all of his data and analysis available on the web. From what I can tell, his analysis was not trivial—data were collected and merged from multiple disparate sources and adjustments were made to different data series to account for various incompatibilities. To me, this sounds like a standard data analysis, in the sense that all meaningful data analyses are complicated. As noted by Nate Silver, data do not arise from a “virgin birth”, and in any example worth discussing, much work has to be done to get the data into a state in which statistical models can be fit, or even more simply, plots can be made.

Chris Giles, a journalist for the Financial Times, recently published a column (unfortunately blocked by paywall) in which he claimed that much of the analysis that Piketty had done was flawed or incorrect. In particular, he claimed that based on his (Giles’) analysis, inequality was not growing as much over time as Piketty claimed. Among other points, Giles claims that numerous errors were made in assembling the data and in Piketty’s original analysis.

This episode smacked of the recent Reinhart-Rogoff kerfuffle in which some fairly basic errors were discovered in those economists' Excel spreadsheets. Some of those errors only made small differences to the results, but a critical methodological component, in which the data were weighted in a special way, appeared to have a significant impact on the results if alternate approaches were taken.

Piketty has since responded forcefully to the FT's column, defending all of the work he has done and addressing the criticisms one by one. To me, the most important result of the FT analysis is that Piketty’s work appears to be largely reproducible. Piketty made his data available, with reasonable documentation (in addition to his book), and Giles was able to come up with the same numbers Piketty came up with. This is a good thing. Piketty’s work was complex, and the only way to communicate the entirety of it was to make the data and code available.

The other aspects of Giles’ analysis are, from an academic standpoint, largely irrelevant to me, particularly because I am not an economist. The reason I find them irrelevant is because the objections are largely over whether he is correct or not. This is an obviously important question, but in any field, no single study or even synthesis can be determined to be "correct" at that instance. Time will tell, and if his work is "correct", his predictions will be borne out by nature. It's not so satisfying to have to wait many years to know if you are correct, but that's how science works.

In the meantime, economists will have a debate over the science and the appropriate methods and data used for analysis. This is also how science works, and it is only (really) possible because Piketty made his work reproducible. Otherwise, the debate would be largely uninformed.


The Big in Big Data relates to importance not size

In the past couple of years several non-statisticians have asked me "what is Big Data exactly?" or "How big is Big Data?". My answer has been "I think Big Data is much more about 'data' than 'big'". I explain below.

[Figures: Google Trends screenshots]

Since 2011 Big Data has been all over the news. The New York Times, The Economist, Science, Nature, etc. have told us that the Big Data Revolution is upon us (see Google Trends figure above). But was this really a revolution? What happened to the Massive Data Revolution (see figure above)? For this to be called a revolution, there must be some drastic change, a discontinuity, or a quantum leap of some kind. So has there been such a discontinuity in the rate of growth of data? Although this may be true for some fields (for example in genomics, next generation sequencing did introduce a discontinuity around 2007), overall, data size seems to have been growing at a steady rate for decades. For example, in the graph below (see this paper for source) note the trend in internet traffic data (which btw dwarfs genomics data). There does seem to be a change of rate, but it happened during the 1990s, which brings me to my main point.

[Figure: internet data traffic]

Although several fields (including Statistics) are having to innovate to keep up with growing data size, I don't see this as something that new. But I do think that we are in the midst of a Big Data revolution. Although the media only noticed it recently, it started about 30 years ago. The discontinuity is not in the size of data, but in the percent of fields (across academia, industry and government) that use data. At some point in the 1980s with the advent of cheap computers, data were moved from the file cabinet to the disk drive. Then in the 1990s, with the democratization of the internet, these data started to become easy to share. All of a sudden, people could use data to answer questions that were previously answered only by experts, theory or intuition.

In this blog we like to point out examples but let me review a few. Credit card companies started using purchase data to detect fraud. Baseball teams started scraping data and evaluating players without ever seeing them. Financial companies started analyzing stock market data to develop investment strategies. Environmental scientists started to gather and analyze data from air pollution monitors. Molecular biologists started quantifying outcomes of interest into matrices of numbers (as opposed to looking at stains on nylon membranes) to discover new tumor types and develop diagnostic tools. Cities started using crime data to guide policing strategies. Netflix started using customer ratings to recommend movies. Retail stores started mining bonus card data to deliver targeted advertisements. Note that all the data sets mentioned were tiny in comparison to, for example, sky survey data collected by astronomers. But I still call this phenomenon Big Data because the percent of people using data was in fact Big.


I borrowed the title of this talk from a very nice presentation by Diego Kuonen


10 things statistics taught us about big data analysis

In my previous post I pointed out that a major problem with big data is that applied statistics has been left out. But many cool ideas in applied statistics are really relevant for big data analysis. So I thought I'd try to answer the second question in my previous post: "When thinking about the big data era, what are some statistical ideas we've already figured out?" Because the internet loves top 10 lists I came up with 10, but there are more if people find this interesting. Obviously mileage may vary with these recommendations, but I think they are generally not a bad idea.

  1. If the goal is prediction accuracy, average many prediction models together. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize blend multiple models together. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without increasing bias. One of the earliest descriptions of this idea was of a much simplified version based on bootstrapping samples and building multiple prediction functions - a process called bagging (short for bootstrap aggregating). Random forests, another incredibly successful prediction algorithm, is based on a similar idea with classification trees.
  2. When testing many hypotheses, correct for multiple testing. This comic points out the problem with standard hypothesis testing when many tests are performed. Classic hypothesis tests are designed to call a set of data significant 5% of the time, even when the null is true (i.e. nothing is going on). One really common choice for correcting for multiple testing is to use the false discovery rate to control the rate at which things you call significant are false discoveries. People like this measure because you can think of it as the rate of noise among the signals you have discovered. Benjamini and Hochberg gave the first definition of the false discovery rate and provided a procedure to control the FDR. There is also a really readable introduction to FDR by Storey and Tibshirani.
  3. When you have data measured over space, distance, or time, you should smooth. This is one of the oldest ideas in statistics (regression is a form of smoothing and Galton popularized that a while ago). I personally like locally weighted scatterplot smoothing a lot. This paper is a good one by Cleveland about loess. Here it is in a gif. But people also like smoothing splines, Hidden Markov Models, moving averages and many other smoothing choices.
  4. Before you analyze your data with computers, be sure to plot it. A common mistake made by amateur analysts is to immediately jump to fitting models to big data sets with the fanciest computational tool. But you can miss pretty obvious things like this if you don't plot your data. There are too many plots to talk about individually, but one example of an incredibly important plot is the Bland-Altman plot (called an MA-plot in genomics), used when comparing measurements from multiple technologies. R provides tons of graphics for a reason and ggplot2 makes them pretty.
  5. Interactive analysis is the best way to really figure out what is going on in a data set. This is related to the previous point; if you want to understand a data set you have to be able to play around with it and explore it. You need to make tables, make plots, identify quirks, outliers, missing data patterns and problems with the data. To do this you need to interact with the data quickly. One way to do this is to analyze the whole data set at once using tools like Hive, Hadoop, or Pig. But an often easier, better, and more cost effective approach is to use random sampling. As Robert Gentleman put it "make big data as small as possible as quick as possible".
  6. Know what your real sample size is. It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not (hence vector graphics). Similarly in genomics, the number of reads you measure (which is a main determinant of data size) is not the sample size, it is the number of individuals. In social networks, the number of people in the network may not be the sample size. If the network is very dense, the sample size might be much less. In general, the bigger the sample size the better, but sample size and data size aren't always tightly correlated.
  7. Unless you ran a randomized trial, potential confounders should keep you up at night. Confounding is maybe the most fundamental idea in statistical analysis. It is behind the spurious correlations like these and the reason why nutrition studies are so hard. It is very hard to hold people to a randomized diet and people who eat healthy diets might be different than people who don't in other important ways. In big data sets confounders might be technical variables about how the data were measured or they could be differences over time in Google search terms. Any time you discover a cool new result, your first thought should be, "what are the potential confounders?"
  8. Define a metric for success up front. Maybe the simplest idea, but one that is critical in statistics and decision theory. Sometimes your goal is to discover new relationships and that is great if you define that up front. One thing that applied statistics has taught us is that changing the criteria you are going for after the fact is really dangerous. So when you find a correlation, don't assume you can predict a new result or that you have discovered which way a causal arrow goes.
  9. Make your code and data available and have smart people check it. As several people pointed out about my last post, the Reinhart and Rogoff problem did not involve big data. But even in this small data example, there was a bug in the code used to analyze them. With big data and complex models this is even more important. Mozilla Science is doing interesting work on code review for data analysis in science. But in general if you just get a friend to look over your code it will catch a huge fraction of the problems you might have.
  10. Problem first not solution backward. One temptation in applied statistics is to take a tool you know well (regression) and use it to hit all the nails (epidemiology problems). There is a similar temptation in big data to get fixated on a tool (hadoop, pig, hive, nosql databases, distributed computing, gpgpu, etc.) and ignore the actual problem: can we infer that x relates to y or that x predicts y?
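To make point 2 a bit more concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The p-values are made up for illustration, and the function name `benjamini_hochberg` is my own; in practice you would more likely reach for an existing implementation in your statistics package of choice.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


# Made-up p-values from ten hypothetical tests (not real results).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

rejected = benjamini_hochberg(p_values, alpha=0.05)
naive = [p < 0.05 for p in p_values]

print("significant at p < 0.05 without correction:", sum(naive))
print("significant after Benjamini-Hochberg:", sum(rejected))
```

With these toy numbers, five tests pass the uncorrected 0.05 threshold but only the two smallest p-values survive the correction, which is the gap the false discovery rate is meant to expose.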

Why big data is in trouble: they forgot about applied statistics

This year the idea that statistics is important for big data has exploded into the popular media. Here are a few examples, starting with the Lazer et al. paper in Science that got the ball rolling on this idea.

All of these articles warn about issues that statisticians have been thinking about for a very long time: sampling populations, confounders, multiple testing, bias, and overfitting. In the rush to take advantage of the hype around big data, these ideas were ignored or not given sufficient attention.

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.

The prime example in the press is Google Flu Trends. Google Flu Trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google search terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process has led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated why the search terms were predictive and tried to understand the likely reason that Google Flu Trends was working.

As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set, misapplied statistical methods and ignored major data integrity problems. The lead author and the editors who handled this paper didn't have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.

Similarly, two economists, Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model was questioned as being unrealistic and leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data analytic assumptions that any well-trained applied statistician would have performed.
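To illustrate what a sensitivity analysis to weighting can look like, here is a toy sketch. The numbers are entirely made up and the country labels are hypothetical; this is not the Reinhart and Rogoff data or their model. It only shows how the same observations can give summaries of opposite sign depending on whether each country or each country-year gets equal weight.

```python
from statistics import mean

# Made-up growth rates (percent per year) for hypothetical countries. NOT real data.
growth_by_country = {
    "A": [2.0, 2.5, 3.0, 2.5],  # four years observed
    "B": [1.0, 1.5],            # two years observed
    "C": [-4.0],                # a single year observed
}


def equal_country_weights(groups):
    """Average the per-country means, so each country counts once."""
    return mean(mean(values) for values in groups.values())


def equal_observation_weights(groups):
    """Pool all country-years, so each observation counts once."""
    return mean(x for values in groups.values() for x in values)


by_country = equal_country_weights(growth_by_country)
by_observation = equal_observation_weights(growth_by_country)

print(f"each country weighted equally:      {by_country:.2f}")
print(f"each country-year weighted equally: {by_observation:.2f}")
```

When two defensible weighting schemes disagree this much, that disagreement is itself a result worth reporting alongside the headline estimate.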

Statistical thinking has also been conspicuously absent from major public big data efforts so far. Here are some examples:

One example of this kind of thinking is this insane table from the alumni magazine of the University of California which I found from this amazing talk by Terry Speed (via Rafa, go watch his talk right now, it gets right to the heart of the issue).  It shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.

[Figure: the table from the University of California alumni magazine]

All of this leads to two questions:

  1. Given the importance of statistical thinking why aren't statisticians involved in these initiatives?
  2. When thinking about the big data era, what are some statistical ideas we've already figured out?


JHU Data Science: More is More

Today Jeff Leek, Brian Caffo, and I are launching 3 new courses on Coursera as part of the Johns Hopkins Data Science Specialization. These courses are

I'm particularly excited about Reproducible Research, not just because I'm teaching it, but because I think it's essentially the first of its kind being offered in a massive open format. Given the rich discussions about reproducibility that have occurred over the past few years, I'm happy to finally be able to offer this course for free to a large audience.

These courses are launching in addition to the first 3 courses in the sequence: The Data Scientist's Toolbox, R Programming, and Getting and Cleaning Data, which are also running this month in case you missed your chance in April.

All told we have 6 of the 9 courses in the Specialization available as of today. We're really looking forward to next month, when we will be launching the final 3 courses: Regression Models, Practical Machine Learning, and Developing Data Products. We also have some exciting announcements coming soon regarding the Capstone Projects.

Every course will be available every month, so don't worry about missing a session. You can always come back next month.


Confession: I sometimes enjoy reading the fake journal/conference spam

I've spent a considerable amount of time setting up filters to avoid getting spam from fake journals and conferences. Unfortunately, they are exceptionally good at thwarting my defenses. This does not annoy me as much as I pretend because, secretly, I enjoy reading some of these emails. Here are three of my favorites.

1) Over-the-top robot:

It gives us immense pleasure to invite you and your research allies to submit a manuscript for the journal “REDACTED”. The expertise of you in the never ending field of Gene Technology is highly appreciable. The level of intricacy shown by you in your work makes us even more proud, and we believe that your works should be known to mankind of science.

2) Sarcastic robot?

First of all, congratulations on the publication of your highly cited original article < The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores > in the field of colon cancer, which has been cited more than 1 times and is in the world's top one percent of papers. Such high number of citations reflects the high quality and influence of your paper.

3) Intimidating robot:

This is Rocky.... Recently we have mailed you about the details of the conference. But we still have not received your response. So today we contact you again.

NB: Although I am joking in this post, I do think these fake journals and conferences are a very serious problem. The fact that they are still around means enough money (mostly taxpayer money) is being spent to keep them in business. If you want to learn more, this blog does a good job of reporting on them and includes a list of culprits.
