Simply Statistics


What statistics should do about big data: problem forward not solution backward

There has been a lot of discussion among statisticians about big data and what statistics should do to get involved. Recently Steve M. and Larry W. took up the same issue on their blog. I have been thinking about this for a while, since I work in genomics, which almost always comes with "big data". It is also one area of big data where statistics and statisticians have played a huge role.

A question that naturally arises is, "why have statisticians been so successful in genomics?" I think a major reason is the phrase I borrowed from Brian C. (who may have borrowed it from Ron B.):

problem first, not solution backward

One of the reasons that "big data" is even a term is that data are less expensive than they were a few years ago. One example is the dramatic drop in the price of DNA sequencing. But there are many, many more examples. The quantified-self movement and Fitbits, Google Books, social network data from Twitter, etc. are all areas where data that cost us a huge amount to collect 10 years ago can now be collected and stored very cheaply.

As statisticians we look for generalizable principles; I would say that you have to zoom pretty far out to generalize from social networks to genomics, but here are two:

  1. The data can't easily be analyzed in an R session on a simple laptop (say, low gigabytes to terabytes)
  2. The data are generally quirky and messy (unstructured text, json files with lots of missing data, fastq files with quality metrics, etc.)

So how does one end up at the "leading edge" of big data? By being willing to deal with the schlep and work out the nitty-gritty of how to apply even standard methods to data sets where taking the mean takes hours. Or by taking the time to learn all the kinks specific to, say, processing a microarray, and then taking the time to fix them. This is why statisticians were so successful in genomics: they focused on the practical problems, and this gave them access to data no one else had or could use properly.
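The "taking the mean takes hours" problem is concrete enough to sketch. Here is a minimal illustration (in Python, with a generator standing in for a file far too large to load at once) of a one-pass chunked mean, the kind of unglamorous schlep described above; the chunk source is hypothetical, not any particular tool's API:

```python
def streaming_mean(chunks):
    """One-pass mean over an iterable of numeric chunks.

    Only one chunk is ever in memory, so this works for data sets
    far larger than RAM (the whole point when a laptop won't do)."""
    total, n = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        n += len(chunk)
    return total / n if n else float("nan")

def fake_chunks():
    """Simulated 'big' data delivered 100,000 values at a time."""
    for start in range(0, 1_000_000, 100_000):
        yield list(range(start, start + 100_000))

print(streaming_mean(fake_chunks()))  # mean of 0..999999 = 499999.5
```

The statistics here is trivial; the work is in arranging the computation so it touches the data one piece at a time.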

Doing these things requires a lot of effort that isn't elegant. It also isn't "statistics" by the definition that only mathematical methodology is statistics. Steve alludes to this in his post when he says:

Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning).

I think this is a view shared by many statisticians: since there isn't a new elegant theory yet, there aren't "new ideas" in big data. That focus is solution backward. We want an elegant theory that we can then apply to specific problems if they happen to come up.

The alternative is problem forward. The fact that we can collect data so cheaply means we can measure and study things we never could before. Computer scientists, physicists, genome biologists, and others are leading in big data precisely because they aren't thinking about the statistical solution. They are thinking about solving an important scientific problem and are willing to deal with all the dirty details to get there. This allows them to work on data sets and problems that haven't been considered by other people.

In genomics, this has happened before. In that case, the invention of microarrays revolutionized the field and statisticians jumped on board, working closely with scientists, handling the dirty details, and building software so others could too. As a discipline, if we want to be part of the "big data" revolution, I think we need to focus on the scientific problems and let methodology come second. That requires rethinking what it means to do statistics. Things like parallel computing, data munging, reproducibility, and software development have to be accepted as equally important to methods development.

The good news is that there is plenty of room for statisticians to bring our unique skills in dealing with uncertainty to these new problems; but we will only get a seat at the table if we are willing to deal with the mess that comes with doing real science.

I'll close by listing a few things I'd love to see:

  1. A Bioconductor-like project for social network data. Tyler M. and Ali S. have a paper that would make for an awesome package for this project. 
  2. Statistical pre-processing for fMRI and other brain imaging data. Keep an eye on our SMART group for that.
  3. Data visualization for translational applications, dealing with all the niceties of human-data interfaces. See healthvis or the stuff Miriah Meyer is doing.
  4. Most importantly, starting with specific, unsolved scientific problems: seeking novel ways to collect cheap data and analyzing them, even with known and straightforward statistical methods, to deepen our understanding of ourselves or the universe.

Sunday data/statistics link roundup (5/19/2013)

  1. This is a ridiculously good post on 20th versus 21st century problems and the rise of the importance of empirical science. I particularly like the discussion of what it means to be a "solved" problem and how that has changed.
  2. A discussion in Science about the (arguably) most important statistics among academics, the impact factor and h-index. This comes on the heels of the San Francisco Declaration of Research Assessment. I like the idea that we should focus on evaluating science for its own merit rather than focusing on summaries like impact factor. But I worry that the "gaming" people are worried about with quantitative numbers like IF will be replaced with "politicking" if it becomes too qualitative. (via Rafa)
  3. A write-up about a survey in Britain that suggests people don't believe statistics (surprise!). I think this is symptomatic of a bigger issue that is being raised over and over: in an era when scientific problems don't have deterministic solutions, how do we determine whether a problem has been solved? There is no good answer yet, and it threatens to undermine a major fraction of the scientific enterprise going forward.
  4. Businesses are confusing data analysis and big data. This is so important and true. Big data infrastructure is often critical for creating/running data products. But discovering new ideas from data often happens on much smaller data sets with good intuition and interactive data analysis.
  5. Really interesting article about how the baseball card numbering system matters and how changing it can upset collectors (via Chris V.).

When does replication reveal fraud?

Here's a little thought experiment for your weekend pleasure. Consider the following:

Joe Scientist decides to conduct a study (call it Study A) to test the hypothesis that a parameter D > 0 vs. the null hypothesis that D = 0. He designs a study, collects some data, conducts an appropriate statistical analysis and concludes that D > 0. This result is published in the Journal of Awesome Results along with all the details of how the study was done.

Jane Scientist finds Joe's study very interesting and tries to replicate his findings. She conducts a study (call it Study B) that is similar to Study A but completely independent of it (and does not communicate with Joe). In her analysis she does not find strong evidence that D > 0 and concludes that she cannot rule out the possibility that D = 0. She publishes her findings in the Journal of Null Results along with all the details.

From these two studies, which of the following conclusions can we make?

  1. Study A is obviously a fraud. If the truth were that D > 0, then Jane should have concluded that D > 0 in her independent replication.
  2. Study B is obviously a fraud. If Study A were conducted properly, then Jane should have reached the same conclusion.
  3. Neither Study A nor Study B was a fraud, but the result for Study A was a Type I error, i.e., a false positive.
  4. Neither Study A nor Study B was a fraud, but the result for Study B was a Type II error, i.e., a false negative.

I realize that there are a number of subtle details concerning why things might happen, but I've purposely left them out. My question is: based on the information you actually have about the two studies, what would you consider to be the most likely case? What further information would you like to know beyond what was given here?
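For readers who want to poke at option 4 numerically, here is a small simulation sketch (the effect size and sample size are made-up numbers for illustration, and the one-sided z-test is just a stand-in for "an appropriate statistical analysis"): even when D > 0 is real, two honest, independent studies with modest power will frequently disagree, with Study A detecting the effect and Study B missing it.

```python
import random

def chance_of_detection(true_d, n, sims=10_000, seed=1):
    """Monte Carlo estimate of the chance that a one-sided z-test at
    the 5% level concludes D > 0, given n unit-variance observations
    with true mean true_d."""
    rng = random.Random(seed)
    z_crit = 1.645  # one-sided 5% critical value
    hits = 0
    for _ in range(sims):
        xbar = sum(rng.gauss(true_d, 1) for _ in range(n)) / n
        if xbar * n ** 0.5 > z_crit:
            hits += 1
    return hits / sims

# With a real but modest effect and a small sample, detection is far
# from guaranteed, so A-succeeds-B-fails is a perfectly ordinary outcome:
power = chance_of_detection(true_d=0.4, n=25)
print(round(power, 2))                # roughly 0.6, not 1
print(round(power * (1 - power), 2))  # chance A "finds" D > 0 and B doesn't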


The bright future of applied statistics

In 2013, the Committee of Presidents of Statistical Societies (COPSS) celebrates its 50th anniversary. As part of its celebration, COPSS will publish a book, with contributions from past recipients of its awards, titled "Past, Present, and Future of Statistical Science". Below is my contribution, titled "The bright future of applied statistics".

When I was asked to contribute to this issue, titled Past, Present, and Future of Statistical Science, I contemplated my career while deciding what to write about. One aspect that stood out was how much I benefited from the right circumstances. I came to one clear conclusion: it is a great time to be an applied statistician. I decided to describe the aspects of my career that I have thoroughly enjoyed in the past and present, and to explain why this has led me to believe that the future is bright for applied statisticians.

I became an applied statistician while working with David Brillinger on my PhD thesis. When searching for an advisor I visited several professors and asked them about their interests. David asked me what I liked and all I came up with was "I don't know. Music?", to which he responded "That's what we will work on". Apart from the necessary theorems to get a PhD from the Statistics Department at Berkeley, my thesis summarized my collaborative work with researchers at the Center for New Music and Audio Technology. The work
involved separating and parameterizing the harmonic and non-harmonic components of musical sound signals [5]. The sounds had been digitized into data. The work was indeed fun, but I also had my first glimpse into the incredible potential of statistics in a world becoming more and more data-driven.

Despite having expertise only in music, and a thesis that required a CD player to hear the data, fitted models, and residuals, I was hired by the Department of Biostatistics at Johns Hopkins School of Public Health. Later I realized what was probably obvious to the School's leadership: that regardless of the subject matter of my thesis, my time series expertise could be applied to several public health applications [1, 2, 8]. The public health and biomedical challenges surrounding me were simply too hard to resist, and my new department knew this. It was inevitable that I would quickly turn into an applied biostatistician.

Since the day I arrived at Hopkins 15 years ago, Scott Zeger, the department chair, fostered and encouraged faculty to leverage their statistical expertise to make a difference and to have an immediate impact in science. At that time, we were in the midst of a measurement revolution that was transforming several scientific fields into data-driven ones. Being located in a School of Public Health and next to a medical school, we were surrounded by collaborators working in such fields, including environmental science, neuroscience, cancer biology, genetics, and molecular biology. Much of my work was motivated by collaborations with biologists who, for the first time, were collecting large amounts of data. Biology was changing from a data-poor discipline to a data-intensive one.

A specific example came from the measurement of gene expression. Gene expression is the process by which DNA, the blueprint for life, is copied into RNA, the template for the synthesis of proteins, the building blocks of life. Before microarrays were invented in the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure 1A below). With microarrays, this suddenly changed to sifting through tens of thousands of numbers (see Figure 1B). Biologists went from using their eyes to categorize results to having thousands (and now millions) of measurements per sample to analyze. Furthermore, unlike genomic DNA, which is static, gene expression is a dynamic quantity: different tissues express different genes at different levels and at different times. The complexity was exacerbated by unpolished technologies that made measurements much noisier than anticipated. This complexity and level of variability made statistical thinking an important aspect of the analysis. The biologists who used to say, "if I need statistics, the experiment went wrong" were now seeking out our help. The results of these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays, making it possible to identify patients at risk of distant recurrence following surgery [9].



Figure 1: Illustration of gene expression data before and after microarrays.

When biologists at Hopkins first came to our department for help with their microarray data, Scott put them in touch with me because I had experience with (what were then) large datasets (digitized music signals are represented by 44,100 points per second). The more I learned about the scientific problems and the more data I explored, the more motivated I became. The potential for statisticians to have an impact in this nascent field was clear, and my department was encouraging me to take the plunge. This institutional encouragement and support was crucial, as successfully working in this field made it harder to publish in the mainstream statistical journals, an accomplishment that had traditionally been heavily weighted in the promotion process. The message was clear: having an immediate impact on specific scientific fields would be rewarded as much as mathematically rigorous methods with general applicability.

As with my thesis applications, it was clear that to solve some of the challenges posed by microarray data I would have to learn all about the technology. For this I organized a sabbatical with Terry Speed's group in Melbourne, where they helped me accomplish this goal. During this visit I reaffirmed my preference for attacking applied problems with simple statistical methods, as opposed to overcomplicated ones, rather than developing new techniques. Learning that devising clever ways of putting the existing statistical toolbox to work was good enough for an accomplished statistician like Terry gave me the necessary confidence to continue working this way. More than a decade later, this continues to be my approach to applied statistics. It has been instrumental in some of my current collaborative work; in particular, it led to important new biological discoveries made together with Andy Feinberg's lab [7].

During my sabbatical we developed preliminary solutions that improved precision and aided in the removal of systematic biases for microarray data [6]. I was aware that hundreds, if not thousands, of other scientists were facing the same problematic data and searching for solutions, so I was also thinking hard about ways to share whatever solutions I developed. During this time I received an email from Robert Gentleman asking if I was interested in joining a new software project for the delivery of statistical methods for genomics data. This collaboration eventually became the Bioconductor project, which to this day continues to grow its user and developer base [4]. Bioconductor was the perfect vehicle for having the impact that my department had encouraged me to seek. With Ben Bolstad and others, we wrote an R package that has been downloaded tens of thousands of times [3]. Without the availability of software, the statistical method would not have received nearly as much attention. This lesson served me well throughout my career, as developing software packages has greatly helped disseminate my statistical ideas. The fact that my department and school rewarded software publications provided important support.

The impact statisticians have had in genomics is just one example of our field's accomplishments in the 21st century. In academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. Outside of academia, sabermetrics has become a standard approach in several sports (not just baseball) and inspired the Hollywood movie Moneyball. A PhD statistician led the team that won the million-dollar Netflix Prize. Nate Silver proved the pundits wrong by once again using statistical models to predict election results almost perfectly. R has become a widely used programming language. It is no surprise that statistics majors at Harvard have more than quadrupled since 2000 and that statistics MOOCs are among the most popular.

The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. Scientific fields that have traditionally relied upon simple data analysis techniques have been turned on their heads by these technologies. Furthermore, advances such as these have brought about a shift from hypothesis to discovery-driven research. However, interpreting information extracted from these massive and complex datasets requires sophisticated statistical skills as one can easily be fooled by patterns that arise by chance. This has greatly elevated the importance of our discipline in biomedical research.

I think that the data revolution is just getting started. Datasets are currently being, or have already been, collected that contain, hidden in their complexity, important truths waiting to be discovered. These discoveries will increase the scientific understanding of our world. Statisticians should be excited and ready to play an important role in the new scientific renaissance driven by the measurement revolution.


[1]   NE Crone, L Hao, J Hart, D Boatman, RP Lesser, R Irizarry, and
B Gordon. Electrocorticographic gamma activity during word production
in spoken and sign language. Neurology, 57(11):2045–2053, 2001.

[2]   Janet A DiPietro, Rafael A Irizarry, Melissa Hawkins, Kathleen A
Costigan, and Eva K Pressman. Cross-correlation of fetal cardiac and
somatic activity as an indicator of antenatal neural development. American
journal of obstetrics and gynecology, 185(6):1421–1428, 2001.

[3]   Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A
Irizarry. affy: analysis of Affymetrix GeneChip data at the probe level.
Bioinformatics, 20(3):307–315, 2004.

[4]   Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad,
Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao
Ge, Jeff Gentry, et al. Bioconductor: open software development for
computational biology and bioinformatics. Genome biology, 5(10):R80, 2004.

[5]   Rafael A Irizarry. Local harmonic estimation in musical sound signals.
Journal of the American Statistical Association, 96(454):357–367, 2001.

[6]   Rafael A Irizarry, Bridget Hobbs, Francois Collin, Yasmin D
Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P Speed.
Exploration, normalization, and summaries of high density oligonucleotide
array probe level data. Biostatistics, 4(2):249–264, 2003.

[7]   Rafael A Irizarry, Christine Ladd-Acosta, Bo Wen, Zhijin Wu, Carolina
Montano, Patrick Onyango, Hengmi Cui, Kevin Gabo, Michael Rongione,
Maree Webster, et al. The human colon cancer methylome shows similar
hypo- and hypermethylation at conserved tissue-specific CpG island shores.
Nature Genetics, 41(2):178–186, 2009.

[8]   Rafael A Irizarry, Clarke Tankersley, Robert Frank, and Susan
Flanders. Assessing homeostasis through circadian patterns. Biometrics,
57(4):1228–1237, 2001.

[9]   Laura J van't Veer, Hongyue Dai, Marc J Van De Vijver, Yudong D
He, Augustinus AM Hart, Mao Mao, Hans L Peterse, Karin van der Kooy,
Matthew J Marton, Anke T Witteveen, et al. Gene expression profiling
predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, 2002.


Sunday data/statistics link roundup (5/12/2013, Mother's Day!)

  1. A tutorial on deep learning. I really enjoyed reading it, but I'm still trying to figure out how this is different from non-linear logistic regression to estimate features, followed by supervised prediction using those features. Or maybe I'm just naive...
  2. Rafa on political autonomy for science for a blog in PR called 80 grados.  He writes about Rep. Lamar Smith and then focuses more closely on issues related to the University of Puerto Rico. A very nice read. (via Rafa)
  3. Highest paid employees by state. I should have coached football...
  4. Newton took the mean. It warms my empirical heart to hear about how the theoretical result was backed up by averaging (via David S.)
  5. Reinhart and Rogoff publish a correction but stand by their original claims. I'm not sure whether this is a good or a bad thing. But it definitely is an overall win for reproducibility.
  6. Statesy folks are getting some much-deserved attention. Terry Speed is a Fellow of the Royal Society, Peter Hall is a foreign associate of the NAS, Gareth Roberts is also a Fellow of the Royal Society (via Peter H.)
  7. Statisticians go to the movies and the hot hand analysis makes the NY Times (via Dan S.)

Bonus link! Karl B.'s GitHub tutorial is awesome and every statistician should be required to read it. I only ask why he gives all the love to Nacho's admittedly awesome Clickme package and no love to healthvis; we are on GitHub too!


A Shiny web app to find out how much medical procedures cost in your state.

Today the front page of the Huffington Post featured the new data available from the CMS that show the cost of many popular procedures broken down by hospital. We here at Simply Statistics think you should be able to explore these data more easily, so we asked John Muschelli to help us build a Shiny app that allows you to interact with them. You can choose your state and your procedure and see how much the procedure costs at hospitals in your state. It takes a second to load because there is a lot of data...

Here is the link to the app.

Here are some screenshots for intracranial hemorrhage for the US and for Idaho.

[Screenshots: Screen Shot 2013-05-08 at 4.57.56 PM; Screen Shot 2013-05-08 at 4.58.09 PM]

The R code is here if you want to tweak/modify.


Why the current over-pessimism about science is the perfect confirmation bias vehicle and we should proceed rationally

Recently there have been some high-profile flameouts in scientific research. Examples include the Duke saga, the replication issues in the social sciences, p-value hacking, fabricated data, not enough open-access publication, and on and on.

Some of these results have had major non-scientific consequences, which is the reason they have drawn so much attention both inside and outside of the academic community. For example, the Duke saga led to the end of in-progress clinical trials, the lack of replication has led to high-profile arguments between scientists in Discover and Nature, among other outlets, and the whole of austerity is under question (sometimes comically) because of a lack of reproducibility.

The result of this high-profile attention is that there is a movement afoot to "clean up science". As has been pointed out, there is a group of scientists making names for themselves primarily as critics of what is wrong with the scientific process. The good news is that these key players are calling attention to issues (reproducibility, replicability, and open access, among others) that are critically important for the scientific enterprise.

I too am concerned about these issues and have altered my own research process to try to address them for my own research group. I also think that the solutions others have proposed on a larger scale, like PLoS, are great advances for the scientific community.

I am also very worried that people are using a few high-profile cases to hyperventilate about the real, solvable, and recognized problems in the scientific process. These people get credit and a lot of attention for pointing out how science is "failing". But they aren't giving proportional time to all of the incredible success stories we have had, both in performing research and in reforming research with reproducibility, open access, and replication initiatives.

We should recognize that science is hard, and even dedicated, diligent, and honest scientists will make mistakes, perform irreproducible or irreplicable studies, or publish in closed-access journals. Sometimes this is because of ignorance of good research principles, sometimes it is because people are new to working in a world where data and computation are major players, and sometimes it is because it is legitimately, really hard to make real advances in science. I think people who participate in real science recognize these problems and are eager to solve them. I have also noticed that real scientists generally try to propose a solution when they complain about these issues.

But it seems like sometimes people use these high-profile mistakes out of context to push their own scientific pet peeves. For example:

  1. I don't like p-values, and there are lots of results that fail to replicate, so it must be the fault of p-values. Many studies fail to replicate not because the researchers used p-values, but because they performed studies that were either weak or based on poorly understood scientific mechanisms.
  2. I don't like not being able to access people's code so lack of reproducibility is causing science to fail. Even in the two most infamous cases (Potti and Reinhart - Rogoff) the problem with the science wasn't reproducibility - it was that the analysis was incorrect/flawed. Reproducibility compounded the problem but wasn't the root cause of the problem.
  3. I don't like not being able to access scientific papers, so closed-access journals are evil. For whatever reason (I don't know if I fully understand why) it is expensive to publish journals, so either publishing open access is expensive for the author or the closed-access journal is expensive for the reader. If I'm a junior researcher, I'll definitely post my preprints online, but I also want papers in "good" journals and don't have a ton of grant money, so sometimes I'll choose closed access.
  4. I don't like these crazy headlines from social psychology (substitute other field here), and there have been some that haven't replicated, so none must replicate. Of course some papers won't replicate, including even high-profile papers. If you are doing statistics, then by definition some papers won't replicate, since you have to make decisions on noisy data.
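To spell out the arithmetic behind that last point, here is a tiny sketch with made-up numbers: if a field tests many truly null hypotheses at the 5% level, a steady trickle of false positives (and hence failed replications) is guaranteed even when everyone behaves honestly.

```python
import random

# Made-up numbers for illustration: 900 truly null hypotheses, each
# tested at the 5% level. Under the null, a p-value is uniform on
# [0, 1], so each test "finds" an effect with probability alpha.
rng = random.Random(42)
m_null, alpha = 900, 0.05
false_positives = sum(rng.random() < alpha for _ in range(m_null))
print(false_positives)  # near m_null * alpha = 45 in expectation
```

Roughly 45 chance "findings" out of 900 honest tests: not one of them will replicate, and no amount of methodological virtue changes that.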

These are just a few examples where I feel like a basic, fixable flaw in science has been used to justify a hugely pessimistic view of science in general. I'm not saying it is all rainbows and unicorns; of course we want to improve the process. But I'm worried that, with enough hyperbole, the real and solvable problems we have will make it look like the sky is falling on the scientific process, leaving the door open for individuals like Rep. Lamar Smith to come in and turn the scientific process into a political one.

P.S. Andrew Gelman posted on a similar topic yesterday as well. He argues the case for less optimism and for making sure we don't stay complacent. He added a P.S. and mentioned two points on which we can agree: (1) science is hard and is a human system, and we are working to fix the flaws inherent in such systems, and (2) it is still easier to publish a splashy claim than to publish a correction. I definitely agree with both. I think Gelman would also likely agree that we need to be careful about reciprocity with these issues: if earnest scientists work hard to address reproducibility, replicability, open access, etc., then people who criticize them should have to work just as hard to justify their critiques. Just because it is a critique doesn't mean it should automatically get the same treatment as the original paper.


Talking about MOOCs on MPT Direct Connection

[Video: Direct Connection, Maryland Public Television, aired Monday, April 29, 2013 on PBS.]

I appeared on Maryland Public Television's Direct Connection with Jeff Salkin last Monday to talk about MOOCs (along with our Dean Mike Klag).


Reproducibility at Nature

Nature has jumped on the reproducibility bandwagon and has announced a new approach to improving the reproducibility of submitted papers. The new effort focuses primarily on methodology, including statistics, and on making sure that it is clear what an author has done.

To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.

To this end they have created a checklist for highlighting key aspects that need to be clear in the manuscript. A number of these points are statistical, and two specifically highlight data deposition and computer code availability. I think an important change is the following:

To allow authors to describe their experimental design and methods in as much detail as necessary, the participating journals, including Nature, will abolish space restrictions on the methods section.

I think this is particularly important because of the message it sends. Most journals have overall space limitations and some journals even have specific limits on the Methods section. This sends a clear message that "methods aren't important, results are". Removing space limits on the Methods section will allow people to just say what they actually did, rather than figure out some tortured way to summarize years of work into a smattering of key words.

I think this is a great step forward by a leading journal. The next step will be for Nature to stick to it and make sure that authors live up to their end of the bargain.