Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse

Yesterday, and bleeding over into today, quantile normalization (QN) was being discussed on Twitter. This is the tweet that started the whole thing off. The conversation went a bunch of different directions and then this happened:

well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a "good" pvalue. And then end

So Jeff and I felt it was important to respond - since we are biostatisticians who work in genomics. We felt a couple of points were worth making:

1. Most statisticians we know, including us, know QN's limitations and are always nervous about using QN. But with most datasets we see, unwanted variability is overwhelming and we are left with no choice but to normalize in order to extract anything useful from the data. In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove batch effects.

2. We would be curious to know which biostatisticians were being referred to. We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or another transform) is critical to finding results that validate, and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.

3. It is incorrect to assume that the data you get (sequences, probe intensities, etc.) from high-throughput technologies are direct measurements of abundance. Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross hybridization or alignment artifacts, ozone effects, etc.
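
For readers who have not seen QN written out, here is a minimal sketch of the basic idea, assuming equal-length samples and breaking ties by position. The function name `quantile_normalize` and the intensities are made up for illustration; this is not the implementation used in any particular genomics package.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length samples (one list of values per sample).

    Each sample's k-th smallest value is replaced by the mean of the k-th
    smallest values across all samples, so every sample ends up with the
    same empirical distribution. Ties are broken by position, which is a
    simplification.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same length")

    # Reference distribution: mean of the sorted values at each rank.
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(col) / len(samples) for col in zip(*sorted_samples)]

    normalized = []
    for s in samples:
        # order[k] is the index of the k-th smallest value in this sample.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three arrays measuring the same four probes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for row in normalized:
    print(row)
```

After the call, every array contains the same set of values and only the within-array ranks of the probes are kept. That is exactly why QN can remove real global differences along with the unwanted ones.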

I have written up a little more detail below, with data, on why normalization may be important in many cases, if you are interested.


Interview at Yale Center for Environmental Law & Policy

Interview with Roger Peng from YCELP on Vimeo.

A few weeks ago I sat down with Angel Hsu of the Yale Center for Environmental Law and Policy to talk about some of their work on air pollution indicators.

(Note: I haven't moved--I still work at the Johns Hopkins School of Public Health.)


Nevins-Potti, Reinhart-Rogoff

There's an interesting parallel between the Nevins-Potti debacle (a true debacle, in my mind) and the recent Reinhart-Rogoff kerfuffle. Both were exposed via some essentially small detail that had nothing to do with the real problem.

In the case of Reinhart-Rogoff, the Excel error was what made them look ridiculous, but it was in fact the "unconventional weighting" of the data that had the most dramatic effect. Furthermore, ever since the paper came out, academic economists had been debating and challenging its conclusions. Even when legitimate scientific concerns were raised, policy-makers and other academics were not convinced. But as soon as the Excel error was revealed, everything needed to be re-examined.

In the Nevins-Potti debacle, Baggerly and Coombes wrote article after article pointing out all the problems and, for the most part, no one in a position of power really cared. The Nevins-Potti errors were real zingers too (e.g., switching the labels between people with disease and people without disease), not some trivial Excel error. But in the end, it took Potti's claim of being a Rhodes Scholar to bring him down. Clearly, the years of academic debate beforehand were meaningless compared to lying on a CV.

In the Reinhart-Rogoff case, reproducibility was an issue and if the data had been made available earlier, the problems would have been discovered earlier and perhaps that would have headed off years of academic debate (for better or for worse). In the Nevins-Potti example, reproducibility was not an issue--the original Nature Medicine study was done using public data and so was reproducible (although it would have been easier if code had been made available). The problem there is that no one listened.

One has to wonder if the academic system is working in this regard. In both cases, it took a minor but personal failing to bring down the entire edifice, while the protestations of reputable academics, challenging the research on the merits, were ignored. I'd say in both cases the original research conveniently said what people wanted to hear (debt slows growth, personalized gene signatures can predict response to chemotherapy), and so no amount of research would convince people to question the original findings.

One also has to wonder whether reproducibility is of any help here. I certainly don't think it hurts, but in the case of Nevins-Potti, where the errors were shockingly obvious to anyone paying attention, the problems were deemed merely technical (i.e. statistical). The truth is, reproducibility will be most necessary in highly technical and complex analyses where it's often not obvious how an analysis is done. If you can show a flaw in an analysis that is complicated, what's the use if your work will be written off as merely concerned with technical details (as if those weren't important)? Most of the news articles surrounding Reinhart-Rogoff characterized the problems as complex and statistical (i.e. not important) and not concerned with fundamental questions of interest.

In both cases, I think science was used to push an external agenda, and when the science was called into question, it was difficult to back down. I'll write more in a future post about these kinds of situations and what, if anything, we can do to improve matters.


Podcast #7: Reinhart, Rogoff, Reproducibility

Jeff and I talk about the recent Reinhart-Rogoff reproducibility kerfuffle and how it turns out that data analysis is really hard no matter how big the dataset.


I wish economists made better plots

I'm seeing lots of traffic on a big-time economics article that failed to reproduce, so here are my quick thoughts. You can read a pretty good summary here by Mike Konczal.

Quick background: Carmen Reinhart and Kenneth Rogoff wrote an influential paper that was used by many to justify the need for austerity measures taken by governments to reduce debts relative to GDP. Yesterday, Thomas Herndon, Michael Ash, and Robert Pollin (HAP) released a paper where they reproduced the Reinhart-Rogoff (RR) analysis and noted a few irregularities or errors. In their abstract, HAP claim that they "find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics [in the RR analysis] lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period."

It appears there were three points made by HAP: (1) RR excluded some important data from their final analysis; (2) RR weighted countries in a manner that was not proportional to the number of years they contributed to the dataset (RR used equal weighting of countries); and (3) there was an error in RR's Excel formula which resulted in them inadvertently leaving out five countries from their final analysis.
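
To see why the weighting in point (2) can matter so much, here is a small sketch comparing the two schemes. The growth rates below are made up for illustration and are not the RR or HAP data; the function names are hypothetical as well.

```python
from statistics import mean

# Hypothetical growth rates (%) for the country-years that fall in one
# debt/GDP category. Made-up numbers, NOT the RR or HAP data.
growth_by_country = {
    "A": [-7.0],                     # one bad year in the category
    "B": [2.0, 3.0, 2.5, 2.5],       # four years
    "C": [1.0, 2.0, 3.0, 2.0, 2.0],  # five years
}


def country_weighted_mean(data):
    """Average each country first, then weight the country means equally."""
    return mean(mean(years) for years in data.values())


def year_weighted_mean(data):
    """Pool all country-years, so each country counts in proportion to its years."""
    return mean(g for years in data.values() for g in years)


print("equal weight per country:", round(country_weighted_mean(growth_by_country), 2))
print("equal weight per year:   ", round(year_weighted_mean(growth_by_country), 2))
```

With these made-up numbers, a single bad year in country A drags the equally weighted country average below zero, while the pooled country-year average stays positive. Neither scheme is automatically the right one, but the choice clearly needs to be stated and justified.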

The bottom line is shown in HAP's Figure 1, which I reproduce below (on the basis of fair use):

HAP Analysis

From the plot you can see that HAP's adjusted analysis (circles) more or less coincides with RR's analysis (diamonds) except for the last category of countries, those with debt/GDP ratios over 90%. In that category RR's analysis shows a large drop in growth whereas HAP's analysis shows a more or less smooth decline (but still positive growth).

To me, it seems that the incorrect Excel formula is a real error, but easily fixed. It also seemed to have the least impact on the final analysis. The other two problems, which had far bigger impacts, might have some explanation that I'm not aware of. I am not an economist, so I'll wait for others to weigh in. RR apparently do not comment on the exclusion of certain data points or on the weighting scheme, so it's difficult to say what the thinking was, whether it was inadvertent or purposeful.

In summary, so what? Here's what I think:

  1. Is there some fishiness? Sure, but this is not the Potti-Nevins scandal a la economics. I suppose it's possible RR manipulated the analysis to get the answer austerity hawks were looking for, but we don't have the evidence yet and this just doesn't feel like that kind of thing.
  2. What's the counterfactual? Or, what would have happened if the analysis had been done the way HAP propose? Would the world have embraced pro-growth policies by taking on a greater debt burden? My guess is no. Austerity hawks would have found some other study that supported their claims (and in fact there was at least one other).
  3. RR's original analysis did not contain a plot like Figure 1 in HAP's analysis, which I personally find very illuminating. From HAP's figure, you can see that there's quite a bit of variation across countries and perhaps an overall downward trend. I'm not sure I would have dramatically changed my conclusion if I had done the HAP analysis instead of the RR analysis. My point is that plots like this, which show the variability, are very important.
  4. People see what they want to see. I would not be surprised to see some claim that HAP's analysis supports the austerity conclusion because growth under high debt loads is much lower (almost 50%!) than under low debt loads.
  5. If RR's analysis had been correct, should they have even made the conclusions they made? RR indicated that there was a "threshold" at 90% debt/GDP. My experience is that statements about thresholds are generally very hard to make, even with good data. I wonder what other more knowledgeable people think of the original conclusions.
  6. If the data had been made available sooner, this problem would have been fixed sooner. But in my opinion, that's all that would have happened.

The vibe on the Internets seems to be that if only this problem had been identified sooner, the world would be a better place. But my cynical mind says, uh, no. You can toss this incident in the very large bucket of papers with some technical errors that are easily fixed. Thankfully, someone found these errors and fixed them, and that's a good thing. Science moves on.

UPDATE: Reinhart-Rogoff respond.

UPDATE 2: Reinhart-Rogoff more detailed response.


Data science only poses a threat to (bio)statistics if we don't adapt

We have previously mentioned on this blog how statistics needs better marketing. Recently, Karl B. has suggested that "Data science is statistics" and Larry W. has wondered if "Data science is the end of statistics?" I think there are a couple of types of data science and that each has a different relationship to the discipline of academic statistics:

  1. Data science as marketing tool. Data analytics, data science, big data, etc. are terms that companies who already did something (IT infrastructure, consulting, database management, etc.) throw around to make them sound like they are doing the latest and greatest thing. These marketers are dabblers in what I would call the real "science of data" or maybe deal with just one part of the data pipeline. I think they pose no threat to the statistics community other than by generating backlash by overpromising on the potential of data science or diluting the term to the point of being almost nonsensical.
  2. Data science as business analytics. Another common use of "data science" is to describe the exact same set of activities that used to be performed by business analytics people, maybe allowing for some growth in the size of the data sets. This might be a threat to folks who do statistics in business schools - although more likely it will be beneficial to those programs as there is growth in the need for business-oriented statisticians.
  3. Data science as big data engineer. Sometimes data science refers to people who do stuff with huge amounts of data. Larry refers to this in his post when he talks about people working on billions of data points. Most classically trained statisticians aren't comfortable with data of this size. But at places like Google - where big data sets are routine - the infrastructure is built so that statisticians can access and compress the parts of the data that they need to do their jobs. I don't think this is necessarily a threat to statistics, but we should definitely be integrating data access into our curriculum.
  4. Data science as replacement for statistics. Some people (and I think it is the minority) are referring to exactly the things that statisticians do when they talk about data science. This means manipulating, collecting, and analyzing data, then making inferences to a population or predictions about what will happen next. This is, of course, a threat to statisticians. Some places, like NC State and Columbia, are tackling this by developing centers/institutes/programs with data science in the name. But I think that is a little dangerous. The data don't matter - it is the problem you can solve with the data. So the key thing is that these institutes need to focus on solving real problems - not just churning out people who know a little R, a little SQL, and a little Python.

So why is #4 happening? I think one reason is reputation. Larry mentions that a statistician produces an estimate and a confidence interval and maybe the confidence interval is too wide. I think he is on to something there, but I think it is a bigger problem. As Roger has pointed out - statisticians often see themselves as referees - rather than scientists/business people. So a lot of people have the experience of going to a statistician and feeling like they have been criticized for bad experimental design, too small a sample size, etc. These issues are hugely important - but sometimes you have to make do with what you have. I think data scientists in category 4 are taking advantage of a cultural tendency of statisticians to avoid making concrete decisions.

A second reason is that some statisticians have avoided getting their hands dirty. "Hands clean" statisticians don't get the data from the database, or worry about the data munging, or match identifiers, etc. They wait until the data are nicely formatted in a matrix to apply their methods. To stay competitive, we need to produce more "hands dirty" statisticians who are willing to go beyond schlep blindness and handle all aspects of a data analysis. In academia, we can encourage this by incorporating more of those issues into our curriculum.

Finally, I think statisticians' focus on optimality hurts us. Our field grew up in an era where data was sparse and we had to squeeze every last ounce of information out of what little data we had. Those constraints led to a cultural focus on optimality to a degree that is no longer necessary when data are abundant. In fact, an abundance of data is often unreasonably effective even with suboptimal methods. "Data scientists" understand this and shoot for the 80% solution that is good enough in most cases.
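
Here is a toy simulation of that last point, under assumptions I am choosing purely for illustration: the data are standard normal, the sample mean plays the role of the "optimal" method, and the sample median stands in for a "suboptimal" one. The sample sizes, seed, and helper name `rmse` are arbitrary.

```python
import random
from statistics import mean, median


def rmse(estimator, n, reps, rng):
    """Root-mean-square error of an estimator of the mean of N(0, 1) data."""
    sq_errors = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sq_errors.append(estimator(sample) ** 2)  # the true mean is 0
    return mean(sq_errors) ** 0.5


rng = random.Random(2013)  # fixed seed so the run is repeatable

rmse_mean_small = rmse(mean, n=50, reps=200, rng=rng)        # optimal method, sparse data
rmse_median_large = rmse(median, n=5000, reps=200, rng=rng)  # suboptimal method, abundant data

print("mean,   n = 50:  ", round(rmse_mean_small, 4))
print("median, n = 5000:", round(rmse_median_large, 4))
```

For normal data the median is a less efficient estimator than the mean at any fixed sample size, but with 100 times more observations its error should come out far smaller than that of the mean on the small sample. That is the sense in which plentiful data can make up for a method that is not optimal.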

In summary, I don't think statistics will be killed off by data science. Most of the hype around data science is actually somewhat removed from our field (see above). But I do think that it is worth considering some potential changes that reposition our discipline as the most useful for answering questions with data. Here are some concrete proposals:

  1. Remove some theoretical requirements and add computing requirements to statistics curricula.
  2. Focus on statistical writing, presentation, and communication as a main part of the curriculum.
  3. Focus on positive interactions with collaborators (being a scientist) rather than immediately going to the referee attitude.
  4. Add a unit on translating scientific problems to statistical problems.
  5. Add a unit on data munging and getting data from databases.
  6. Integrate real and live data analyses into our curricula.
  7. Make all our students create an R package (a data product) before they graduate.
  8. Most important of all, have a "big tent" attitude about what constitutes statistics.

 


Sunday data/statistics link roundup (4/14/2013)

  1. The most influential data scientists on Twitter, featuring Amy Heineike, Hilary Mason, and a few other familiar names to readers of this blog. In other news, I love reading lists of the "Top K _____" as much as the next person. I love them even more when they are quantitative (the list above isn't) - even when the quantification is totally bogus. (via John M.)
  2. Rod Little and our own Tom Louis over at the Huffington Post talking about the ways in which the U.S. Census supports our democracy. It is a very good piece and I think highlights the critical importance that statistics and data play in keeping government open and honest.
  3. An article about the growing number of fake academic journals and their potential predatory practices. I think I've been able to filter out the fake journals/conferences pretty well (if they've invited 30 Nobel Laureates - probably fake). But this poses big societal problems; how do we tell what is real science from what is fake if you don't have inside knowledge about which journals are real? (via John H.)
  4. A ton of data on the DC Capital Bikeshare. One of my favorite things is when a government organization just opens up its data. The best part is that the files are formatted as CSVs. Clearly this is someone who knows that the best data formats are open, free, and easy to read into statistical software. In other news, I think one of the most important classes that could be taught is "How to share data 101" (via David B.)
  5. A slightly belated link to a remembrance of George Box. He was the one who said, "All models are wrong, but some are useful." An absolute titan of our field.
  6. Check out these cool logotypes for famous scientists. I want one! Also, see the article on these awesome minimalist posters celebrating legendary women in science. I want the Sally Ride poster on a t-shirt.
  7. As an advisor, I aspire to treat my students/postdocs like this. (@hunterwalk). I'm not always so good at it, but those are some good ideals to try to live up to.

Great scientist - statistics = lots of failed experiments

E.O. Wilson is a famous evolutionary biologist. He is currently an emeritus professor at Harvard and just this last week dropped this little gem in the Wall Street Journal. In the piece, he suggests that knowing mathematics is not important for becoming a great scientist. Wilson goes even further, suggesting that you can be mathematically semi-literate and still be an amazing scientist. There are two key quotes in the piece that I think deserve special attention:

Fortunately, exceptional mathematical fluency is required in only a few disciplines, such as particle physics, astrophysics and information theory. Far more important throughout the rest of science is the ability to form concepts, during which the researcher conjures images and processes by intuition.

I agree with this quote in general, as does Paul Krugman. Many scientific areas don't require advanced measure theory, differential geometry, or number theory to make big advances. It seems like this is the kind of mathematics to which E.O. Wilson is referring, and on that point I think there is probably universal agreement that you can have a hugely successful scientific career without knowing about measurable spaces.

Wilson doesn't stop there, however. He goes on to paint a much broader picture about how one can pursue science without the aid of even basic mathematics or statistics and this is where I think he goes off the rails a bit:

Ideas in science emerge most readily when some part of the world is studied for its own sake. They follow from thorough, well-organized knowledge of all that is known or can be imagined of real entities and processes within that fragment of existence. When something new is encountered, the follow-up steps usually require mathematical and statistical methods to move the analysis forward. If that step proves too technically difficult for the person who made the discovery, a mathematician or statistician can be added as a collaborator.

I see two huge problems with this statement:

  1. Poor design of experiments is one of the most common reasons, if not the most common, for an experiment to fail. It is so important that Fisher said, "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." Wilson is suggesting that with careful conceptual thought and some hard work you can do good science, but without a fundamental understanding of basic math, statistics, and study design even the best conceived experiments are likely to fail.
  2. While armchair science was likely the norm when Wilson was in his prime, huge advances have been made in both science and technology. Scientifically, it is difficult to synthesize and understand everything that has been done without some basic understanding of the statistical quality of previous experiments. Similarly, as data collection has evolved, statistics and computation are playing a more and more central role. As Rafa has pointed out, people in positions of power who don't understand statistics are a big problem for science.

More importantly, as we live in an increasingly data rich environment both in the sciences and in the broader community - basic statistical and numerical literacy are becoming more and more important. While I agree with Wilson that we should try not to discourage people who have a difficult first encounter with math from pursuing careers in science, I think it is both disingenuous and potentially disastrous to downplay the importance of quantitative skill at the exact moment in history that those skills are most desperately needed.

As a counterproposal to Wilson's idea that we should encourage people to disregard quantitative sciences, I propose that we build a better infrastructure for ensuring all people interested in the sciences are able to improve their quantitative skills and literacy. Here at Simply Stats we are all about putting our money where our mouth is and we have already started by creating free, online versions of our quantitative courses. Maybe Wilson should take one....


Climate Science Day on Capitol Hill

A few weeks ago I participated in the fourth annual Climate Science Day organized by the ASA and a host of other professional and scientific societies. There's a nice write up of the event written by Steve Pierson over at Amstat News. There were a number of statisticians there besides me, but the vast majority of people were climate modelers, atmospheric scientists, agronomists, and the like. Below is our crack team of scientists outside the office of (Dr.) Andy Harris. Might be the only time you see me wearing a suit.

[Photo: our team outside the office of (Dr.) Andy Harris]

The basic idea behind the day is to get scientists who do climate-related research into the halls of Congress to introduce themselves to members of Congress and make themselves available for scientific consultations. I was there (with Brooke Anderson, the other JHU rep) because of some of my work on the health effects of heat. I was paired up with Tony Broccoli, a climate modeler at Rutgers, as we visited the various offices of New Jersey and Maryland legislators. We also talked to staff from the Senate Health, Education, Labor, and Pensions (HELP) committee.

Here are a few things I learned:

  • It was fun. I'd never been to Congress before so it was interesting for me to walk around and see how people work. Everyone (regardless of party) was super friendly and happy to talk to us.
  • The legislature appears to be run by women. Seriously, I think every staffer we met with (but one) was a woman. Might have been a coincidence, but I was not expecting that. We only met with one actual member of Congress, and that was (Dr.) Andy Harris from Maryland's first district.
  • Climate change is not really on anyone's radar. Oh well, we were there 3 days before the sequester hit so there were understandably other things on their minds. Waxman-Markey was the most recent legislation taken up by the House and it went nowhere in the Senate.
  • The Senate HELP committee has PhDs working on its staff. Didn't know that.
  • Staffers are working on like 90 things at once, probably none of which are related to each other. That's got to be a tough job.
  • I used more business cards on this one day than in my entire life.
  • Senate offices are way nicer than House offices.
  • The people who write our laws are around 22 years old. Maybe 25 if they went to law school. I'm cool with that, I think.

NIH is looking for an Associate Director for Data Science: Statisticians should consider applying

NIH understands the importance of data and several months ago they announced this new position. Here is an excerpt from the ad:

The ADDS will focus on the urgent need and increased opportunities for capitalizing on the expanding collections of biomedical data to advance NIH’s mission. In doing so, the incumbent will provide programmatic NIH-wide leadership for areas of data science that relate to data emanating from many areas of study (e.g., genomics, imaging, and electronic health records). This will require knowledge about multiple domains of study as well as familiarity with approaches for integrating data from these various domains.

In my opinion, the person holding this job should have hands-on experience with data analysis and programming. The nuances of what a data analyst needs to successfully do his/her job shouldn't be underestimated. This knowledge will help this director make the right decisions when it comes to choosing what data to make available and how to make it available. When it comes to creating data resources, good intentions don't always translate into usable products.

In this new era of data-driven science, this position will be highly influential, making this job quite attractive. If you know of a statistician who you think might be interested, please pass along the information.
