Simply Statistics


Reproducibility and reciprocity

One element of the entire discussion about reproducible research that I haven't seen talked about very much is the potential for a lack of reciprocity. I think even if scientists were not concerned about the possibility of getting scooped by others after making their data/code available, this issue alone would be sufficient to give people pause about making their work reproducible.

What do I mean by reciprocity? Consider the following (made up) scenario:

  1. I conduct a study (say, a randomized controlled trial, for concreteness) that I register beforehand, specifying details of the study like the design, purpose, and primary and secondary outcomes.
  2. I rigorously conduct the study, ensuring safety and privacy of subjects, collect the data, and analyze the data.
  3. I publish the results for the primary and secondary outcomes in the peer-reviewed literature where I describe how the study was conducted and the statistical methods that were used. For the sake of concreteness, let's say the results were "significant" by whatever definition of significant you care to use and that the paper was highly influential.
  4. Along with publishing the paper I make the analytic dataset and computer code available so that others can look at what I did and, if they want, reproduce the result.

So far so good right? It seems this would be a great result for any study. Now consider the following possible scenarios:

  1. Someone obtains the data and the code from the web site where it is hosted, analyzes it, and then publishes a note claiming that the intervention negatively affected a different outcome not described in the original study (i.e. not one of the primary or secondary outcomes).
  2. A second person obtains the data, analyzes it, and then publishes a note on the web claiming that the intervention was ineffective for the primary outcome in the subset of participants who were male.
  3. A third person obtains the data, analyzes the data, and then publishes a note on the web saying that the study is flawed and that the original results of the paper are incorrect. No code, data, or details of their methods are given.

Now, how should one react to the follow-up note claiming the study was flawed? It's easy to imagine a spectrum of possible responses ranging from accusations of fraud to staunch defenses of the original study. Because the original study was influential, there is likely to be a kerfuffle either way.

But what's the problem with the three follow-up scenarios described? The one thing they have in common is that none of the three responding people were held to the standards to which the original investigator (me) was subjected. I was required to register my trial and state the outcomes in advance. In an ideal world you might argue I should have stated my hypotheses in advance too. That's fine, but the point is that the people analyzing the data subsequently were not required to do any of this. Why should they be held to a lower standard of scrutiny?

The first person analyzed a different outcome that was not a primary or secondary outcome. How many outcomes did they test before they came to that one negatively significant one? The second person examined a subset of the participants. Was the study designed (or powered) to look at this subset? Probably not. The third person claims the study is flawed, but does not provide any details of what they did.

I think it's easy to take care of the third person--just require that they make their work reproducible too. That way we can all see what they did and verify whether the study was in fact flawed. But the first two people are a little more difficult. If there are no barriers to obtaining the data, then they can just get the data and run a bunch of analyses. If the results don't go their way, they can just move on and no one will be the wiser. If the results do go their way, they can try to publish something.

What I think a good reproducibility policy should have is a type of "viral" clause. For example, the GNU General Public License (GPL) is an open source software license that requires, among other things, that anyone who writes their own software but links to or integrates software covered under the GPL must publish their software under the GPL too. This "viral" requirement ensures that people cannot make use of the efforts of the open source community without also giving back to that community. There have been numerous heated discussions in the software community regarding the pros and cons of such a clause, with (large) commercial software developers often coming down against it. Open source developers have largely been skeptical of the arguments of large commercial developers, claiming that those companies simply want to "steal" open source software and/or maintain their dominance.

I think it is important that if we are going to make reproducibility the norm in science, that we have analogous "viral" clauses to ensure that everyone is held to the same standard. This is particularly important in policy-relevant or in politically sensitive subject areas where there are often parties involved who have essentially no interest (and are in fact paid to have no interest) in holding themselves to the same standard of scientific conduct.

Richard Stallman was right to assume that, without the copyleft clause in the GPL, large commercial interests would simply usurp the work of the free software community and essentially crush it before it got started. Reproducibility needs its own version of copyleft, or else scientists will be left to defend themselves against unscrupulous individuals who are not held to the same standard.


Sunday data/statistics link roundup (4/28/2013)

  1. What it feels like to be bad at math. My personal experience like this culminated in some difficulties with Green's functions back in my early days at USU. I think almost everybody who does enough math eventually runs into a situation where they don't understand what is going on and it stresses them out.
  2. An article about companies that are using data to try to identify people for jobs (via Rafa).
  3. Google trends for predicting the market. I'm not sure that "predicting" is the right word here. I think a better word might be "explaining/associating". I also wonder if this could go off the rails.
  4. This article is ridiculously useful in terms of describing the ways that you can speed up R code. My favorite part of it is that it starts with the "why". Exactly. Premature optimization is the root of all evil.
  5. A discussion of data science at Tumblr. The author/speaker also has a great blog.

Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse

Yesterday, and bleeding over into today, quantile normalization (QN) was being discussed on Twitter. This is the tweet that started the whole thing off. The conversation went a bunch of different directions and then this happened:

well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a "good" pvalue. And then end

So Jeff and I felt it was important to respond - since we are biostatisticians who work in genomics. We felt a couple of points were worth making:

1. Most statisticians we know, including us, know QN's limitations and are always nervous about using QN. But with most datasets we see, unwanted variability is overwhelming and we are left with no choice but to normalize in order to extract anything useful from the data. In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove batch effects.

2. We would be curious to know which biostatisticians were being referred to. We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or other transforms) are critical to finding results that validate and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.

3. Assuming that the data you get (sequences, probe intensities, etc.) from high-throughput technologies are a direct measurement of abundance is incorrect. Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross hybridization or alignment artifacts, ozone effects, etc.
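For readers unfamiliar with the mechanics: quantile normalization forces every sample to share the same empirical distribution by ranking the values within each sample and replacing each value with the average, across samples, of the values at that rank. Here is a minimal sketch in Python with made-up numbers and no handling of ties - real implementations (e.g. in Bioconductor's preprocessCore) do more:

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length samples (lists of values).

    Each value is replaced by the mean, across samples, of the values
    sharing its within-sample rank, so afterwards every sample has
    exactly the same empirical distribution.
    """
    n = len(samples[0])
    # Reference distribution: mean of the k-th smallest value across samples
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(ss[k] for ss in sorted_samples) / len(samples)
                 for k in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices, smallest to largest
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Three hypothetical arrays measuring the same four genes,
# with different overall intensity scales
raw = [[2.0, 5.0, 4.0, 3.0],
       [4.0, 14.0, 8.0, 9.0],
       [6.0, 8.0, 12.0, 10.0]]
normalized = quantile_normalize(raw)
```

After normalization, every sample is a permutation of the same reference values - which is exactly why QN is both powerful (it removes global distributional differences) and dangerous (it would also remove any real global differences, if they existed).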

To go into a little more detail about the reasons that normalization may be important in many cases, I have written a bit more below, with data, if you are interested.



Interview at Yale Center for Environmental Law & Policy

Interview with Roger Peng from YCELP on Vimeo.

A few weeks ago I sat down with Angel Hsu of the Yale Center for Environmental Law and Policy to talk about some of their work on air pollution indicators.

(Note: I haven't moved--I still work at the Johns Hopkins School of Public Health.)


Nevins-Potti, Reinhart-Rogoff

There's an interesting parallel between the Nevins-Potti debacle (a true debacle, in my mind) and the recent Reinhart-Rogoff kerfuffle. Both were exposed via some essentially small detail that had nothing to do with the real problem.

In the case of Reinhart-Rogoff, the Excel error was what made them look ridiculous, but it was in fact the "unconventional weighting" of the data that had the most dramatic effect. Furthermore, academic economists had been debating and challenging the paper's conclusions from the get-go. Even when legitimate scientific concerns were raised, policy-makers and other academics were not convinced. As soon as the Excel error was revealed, though, everything needed to be re-examined.

In the Nevins-Potti debacle, Baggerly and Coombes wrote article after article pointing out all the problems and, for the most part, no one in a position of power really cared. The Nevins-Potti errors were real zingers too (e.g. switching the labels between people with disease and people without disease), not some trivial Excel error. But in the end, it took Potti's false claim of being a Rhodes Scholar to bring him down. Clearly, the years of academic debate beforehand were meaningless compared to lying on a CV.

In the Reinhart-Rogoff case, reproducibility was an issue and if the data had been made available earlier, the problems would have been discovered earlier and perhaps that would have headed off years of academic debate (for better or for worse). In the Nevins-Potti example, reproducibility was not an issue--the original Nature Medicine study was done using public data and so was reproducible (although it would have been easier if code had been made available). The problem there is that no one listened.

One has to wonder if the academic system is working in this regard. In both cases, it took a minor, but personal failing, to bring down the entire edifice. But the protestations of reputable academics, challenging the research on the merits, were ignored. I'd say in both cases the original research conveniently said what people wanted to hear (debt slows growth, personalized gene signatures can predict response to chemotherapy), and so no amount of research would convince people to question the original findings.

One also has to wonder whether reproducibility is of any help here. I certainly don't think it hurts, but in the case of Nevins-Potti, where the errors were shockingly obvious to anyone paying attention, the problems were deemed merely technical (i.e. statistical). The truth is, reproducibility will be most necessary in highly technical and complex analyses where it's often not obvious how an analysis is done. If you can show a flaw in an analysis that is complicated, what's the use if your work will be written off as merely concerned with technical details (as if those weren't important)? Most of the news articles surrounding Reinhart-Rogoff characterized the problems as complex and statistical (i.e. not important) and not concerned with fundamental questions of interest.

In both cases, I think science was used to push an external agenda, and when the science was called into question, it was difficult to back down. I'll write more in a future post about these kinds of situations and what, if anything, we can do to improve matters.


Podcast #7: Reinhart, Rogoff, Reproducibility

Jeff and I talk about the recent Reinhart-Rogoff reproducibility kerfuffle and how it turns out that data analysis is really hard no matter how big the dataset.


I wish economists made better plots

I'm seeing lots of traffic on a big-time economics article that failed to reproduce, and here are my quick thoughts. You can read a pretty good summary here by Mike Konczal.

Quick background: Carmen Reinhart and Kenneth Rogoff wrote an influential paper that was used by many to justify the need for austerity measures taken by governments to reduce debts relative to GDP. Yesterday, Thomas Herndon, Michael Ash, and Robert Pollin (HAP) released a paper in which they reproduced the Reinhart-Rogoff (RR) analysis and noted a few irregularities or errors. In their abstract, HAP claim that they "find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics [in the RR analysis] lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period."

It appears there were three points made by HAP: (1) RR excluded some important data from their final analysis; (2) RR weighted countries in a manner that was not proportional to the number of years they contributed to the dataset (RR used equal weighting of countries); and (3) there was an error in RR's Excel formula which resulted in them inadvertently leaving out five countries from their final analysis.
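To see why point (2) matters, here is a toy illustration - with made-up numbers, not RR's actual data - of how equal-country weighting and country-year weighting can give very different answers when one country contributes only a single, unusual year:

```python
# Hypothetical annual real GDP growth rates (%) for countries in a
# high-debt category. Made-up numbers for illustration only.
growth = {
    "A": [-7.9],                    # contributes a single bad year
    "B": [2.6, 2.5, 2.4, 2.5],      # contributes four ordinary years
    "C": [1.0, 1.2],
}

# Equal-country weighting (RR's choice): average each country's own mean,
# so country A's one year counts as much as country B's four.
country_means = [sum(years) / len(years) for years in growth.values()]
equal_country_avg = sum(country_means) / len(country_means)

# Country-year weighting (HAP's choice): pool every observation,
# so each year counts once.
all_years = [g for years in growth.values() for g in years]
pooled_avg = sum(all_years) / len(all_years)

print(round(equal_country_avg, 2), round(pooled_avg, 2))  # prints: -1.43 0.61
```

The same seven observations yield a negative average growth rate under one scheme and a positive one under the other. Neither weighting is "wrong" in the abstract, which is exactly why the choice needs to be stated and defended.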

The bottom line is shown in HAP's Figure 1, which I reproduce below (on the basis of fair use):

HAP Analysis

From the plot you can see that HAP's adjusted analysis (circles) more or less coincides with RR's analysis (diamonds), except for the last category, countries with debt/GDP ratios over 90%. In that category, RR's analysis shows a large drop in growth, whereas HAP's analysis shows a more or less smooth decline (but still positive growth).

To me, it seems that the incorrect Excel formula is a real error, but easily fixed. It also seemed to have the least impact on the final analysis. The other two problems, which had far bigger impacts, might have some explanation that I'm not aware of. I am not an economist so I await others to weigh in. RR apparently do not comment on the exclusion of certain data points or on the weighting scheme so it's difficult to say what the thinking was, whether it was inadvertent or purposeful.

In summary, so what? Here's what I think:

  1. Is there some fishiness? Sure, but this is not the Potti-Nevins scandal a la economics. I suppose it's possible RR manipulated the analysis to get the answer austerity hawks were looking for, but we don't have the evidence yet and this just doesn't feel like that kind of thing.
  2. What's the counterfactual? Or, what would have happened if the analysis had been done the way HAP propose? Would the world have embraced pro-growth policies by taking on a greater debt burden? My guess is no. Austerity hawks would have found some other study that supported their claims (and in fact there was at least one other).
  3. RR's original analysis did not contain a plot like Figure 1 in HAP's analysis, which I personally find very illuminating. From HAP's figure, you can see that there's quite a bit of variation across countries and perhaps an overall downward trend. I'm not sure I would have dramatically changed my conclusion if I had done the HAP analysis instead of the RR analysis. My point is that plots like this, which show the variability, are very important.
  4. People see what they want to see. I would not be surprised to see some claim that HAP's analysis supports the austerity conclusion because growth under high debt loads is much lower (almost 50%!) than under low debt loads.
  5. If RR's analysis had been correct, should they have even made the conclusions they made? RR indicated that there was a "threshold" at 90% debt/GDP. My experience is that statements about thresholds are generally very hard to make, even with good data. I wonder what other more knowledgeable people think of the original conclusions.
  6. If the data had been made available sooner, this problem would have been fixed sooner. But in my opinion, that's all that would have happened.

The vibe on the Internets seems to be that if only this problem had been identified sooner, the world would be a better place. But my cynical mind says, uh, no. You can toss this incident in the very large bucket of papers with some technical errors that are easily fixed. Thankfully, someone found these errors and fixed them, and that's a good thing. Science moves on.

UPDATE: Reinhart-Rogoff respond.

UPDATE 2: Reinhart-Rogoff more detailed response.


Data science only poses a threat to (bio)statistics if we don't adapt

We have previously mentioned on this blog how statistics needs better marketing. Recently, Karl B. has suggested that "Data science is statistics" and Larry W. has wondered if "Data science is the end of statistics?" I think there are a couple of types of data science and that each has a different relationship to the discipline of academic statistics:

  1. Data science as marketing tool. Data analytics, data science, big data, etc. are terms that companies that already did something (IT infrastructure, consulting, database management, etc.) throw around to make it sound like they are doing the latest and greatest thing. These marketers are dabblers in what I would call the real "science of data", or maybe deal with just one part of the data pipeline. I think they pose no threat to the statistics community, other than by generating backlash by overpromising on the potential of data science or diluting the term to the point of being almost nonsensical.
  2. Data science as business analytics. Another common use of "data science" is to describe the exact same set of activities that used to be performed by business analytics people, maybe allowing for some growth in the size of the data sets. This might be a threat to folks who do statistics in business schools - although more likely it will be beneficial to those programs, as there is growth in the need for business-oriented statisticians.
  3. Data science as big data engineering. Sometimes data science refers to people who do stuff with huge amounts of data. Larry refers to this in his post when he talks about people working on billions of data points. Most classically trained statisticians aren't comfortable with data of this size. But at places like Google - where big data sets are routine - the infrastructure is built so that statisticians can access and compress the parts of the data that they need to do their jobs. I don't think this is necessarily a threat to statistics, but we should definitely be integrating data access into our curriculum.
  4. Data science as replacement for statistics. Some people (and I think it is the minority) are referring to exactly the things that statisticians do when they talk about data science. This means manipulating, collecting, and analyzing data, then making inferences about a population or predictions about what will happen next. This is, of course, a threat to statisticians. Some places, like NC State and Columbia, are tackling this by developing centers/institutes/programs with data science in the name. But I think that is a little dangerous. The data don't matter - it is the problem you can solve with the data. So the key thing is that these institutes need to focus on solving real problems - not just churning out people who know a little R, a little SQL, and a little Python.

So why is #4 happening? I think one reason is reputation. Larry mentions that a statistician produces an estimate and a confidence interval, and maybe the confidence interval is too wide. I think he is on to something there, but I think it is a bigger problem. As Roger has pointed out - statisticians often see themselves as referees rather than as scientists/business people. So a lot of people have the experience of going to a statistician and feeling like they have been criticized for bad experimental design, too small a sample size, etc. These issues are hugely important - but sometimes you have to make do with what you have. I think data scientists in category 4 are taking advantage of a cultural tendency of statisticians to avoid making concrete decisions.

A second reason is that some statisticians have avoided getting their hands dirty. "Hands clean" statisticians don't get the data from the database, worry about the data munging, or match identifiers. They wait until the data are nicely formatted in a matrix to apply their methods. To stay competitive, we need to produce more "hands dirty" statisticians who are willing to go beyond schlep blindness and handle all aspects of a data analysis. In academia, we can encourage this by incorporating more of those issues into our curriculum.

Finally, I think statisticians' focus on optimality hurts us. Our field grew up in an era when data were sparse and we had to squeeze every last ounce of information out of what little data we had. Those constraints led to a cultural focus on optimality to a degree that is no longer necessary when data are abundant. In fact, an abundance of data is often unreasonably effective even with suboptimal methods. "Data scientists" understand this and shoot for the 80% solution that is good enough in most cases.

In summary I don't think statistics will be killed off by data science. Most of the hype around data science is actually somewhat removed from our field (see above). But I do think that it is worth considering some potential changes that reposition our discipline as the most useful for answering questions with data. Here are some concrete proposals:

  1. Remove some theoretical requirements and add computing requirements to statistics curricula.
  2. Focus on statistical writing, presentation, and communication as a main part of the curriculum.
  3. Focus on positive interactions with collaborators (being a scientist) rather than immediately going to the referee attitude.
  4. Add a unit on translating scientific problems to statistical problems.
  5. Add a unit on data munging and getting data from databases.
  6. Integrate real and live data analyses into our curricula.
  7. Make all our students create an R package (a data product) before they graduate.
  8. Most important of all, have a "big tent" attitude about what constitutes statistics.



Sunday data/statistics link roundup (4/14/2013)

  1. The most influential data scientists on Twitter, featuring Amy Heineike, Hilary Mason, and a few other familiar names to readers of this blog. In other news, I love reading lists of the "Top K _____" as much as the next person. I love them even more when they are quantitative (the list above isn't) - even when the quantification is totally bogus. (via John M.)
  2. Rod Little and our own Tom Louis over at the Huffington Post talking about the ways in which the U.S. Census supports our democracy. It is a very good piece, and I think it highlights the critical importance that statistics and data play in keeping government open and honest.
  3. An article about the growing number of fake academic journals and their potential predatory practices. I think I've been able to filter out the fake journals/conferences pretty well (if they've invited 30 Nobel Laureates - probably fake). But this poses big societal problems; how do we tell what is real science from what is fake if you don't have inside knowledge about which journals are real? (via John H.)
  4. A ton of data on the DC Capitol Bikeshare. One of my favorite things is when a government organization just opens up its data. The best part is that the files are formatted as CSVs. Clearly someone who knows that the best data formats are open, free, and easy to read into statistical software. In other news, I think one of the most important classes that could be taught is "How to share data 101" (via David B.)
  5. A slightly belated link to a remembrance of George Box. He was the one who said, "All models are wrong, but some are useful."  An absolute titan of our field.
  6. Check out these cool logotypes for famous scientists. I want one! Also, see the article on these awesome minimalist posters celebrating legendary women in science. I want the Sally Ride poster on a t-shirt.
  7. As an advisor, I aspire to treat my students/postdocs like this. (@hunterwalk). I'm not always so good at it, but those are some good ideals to try to live up to.

Great scientist - statistics = lots of failed experiments

E.O. Wilson is a famous evolutionary biologist. He is currently an emeritus professor at Harvard and just this last week dropped this little gem in the Wall Street Journal. In the piece, he suggests that knowing mathematics is not important for becoming a great scientist. Wilson goes even further, suggesting that you can be mathematically semi-literate and still be an amazing scientist. There are two key quotes in the piece that I think deserve special attention:

Fortunately, exceptional mathematical fluency is required in only a few disciplines, such as particle physics, astrophysics and information theory. Far more important throughout the rest of science is the ability to form concepts, during which the researcher conjures images and processes by intuition.

I agree with this quote in general, as does Paul Krugman. Many scientific areas don't require advanced measure theory, differential geometry, or number theory to make big advances. It seems like this is the kind of mathematics to which E.O. Wilson is referring, and on that point I think there is probably universal agreement that you can have a hugely successful scientific career without knowing about measurable spaces.

Wilson doesn't stop there, however. He goes on to paint a much broader picture about how one can pursue science without the aid of even basic mathematics or statistics and this is where I think he goes off the rails a bit:

Ideas in science emerge most readily when some part of the world is studied for its own sake. They follow from thorough, well-organized knowledge of all that is known or can be imagined of real entities and processes within that fragment of existence. When something new is encountered, the follow-up steps usually require mathematical and statistical methods to move the analysis forward. If that step proves too technically difficult for the person who made the discovery, a mathematician or statistician can be added as a collaborator.

I see two huge problems with this statement:

  1. Poor design of experiments is one of the most common reasons, if not the most common, for an experiment to fail. It is so important that Fisher said, "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." Wilson is suggesting that with careful conceptual thought and some hard work you can do good science, but without a fundamental understanding of basic math, statistics, and study design, even the best conceived experiments are likely to fail.
  2. While armchair science was likely the norm when Wilson was in his prime, huge advances have been made in both science and technology. Scientifically, it is difficult to synthesize and understand everything that has been done without some basic understanding of the statistical quality of previous experiments. Similarly, as data collection has evolved, statistics and computation are playing a more and more central role. As Rafa has pointed out, people in positions of power who don't understand statistics are a big problem for science.

More importantly, as we live in an increasingly data-rich environment - both in the sciences and in the broader community - basic statistical and numerical literacy are becoming more and more important. While I agree with Wilson that we should try not to discourage people who have a difficult first encounter with math from pursuing careers in science, I think it is both disingenuous and potentially disastrous to downplay the importance of quantitative skill at the exact moment in history that those skills are most desperately needed.

As a counterproposal to Wilson's idea that we should encourage people to disregard the quantitative sciences, I propose that we build a better infrastructure for ensuring that all people interested in the sciences are able to improve their quantitative skills and literacy. Here at Simply Stats we are all about putting our money where our mouth is, and we have already started by creating free, online versions of our quantitative courses. Maybe Wilson should take one...