Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse

Yesterday, and bleeding over into today, quantile normalization (QN) was being discussed on Twitter. This is the Yesterday, and bleeding over into today, quantile normalization (QN) was being discussed on Twitter. This is the that started the whole thing off. The conversation went a bunch of different directions and then this happened:

well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a “good” pvalue. And then end

So Jeff and I felt it was important to respond - since we are biostatisticians that work in genomics. We felt a couple of points were worth making:

  1. Most statisticians we know, including us, know QN’s limitations and are always nervous about using QN. But with most datasets we see, unwanted variability is overwhelming  and we are left with no choice but to normalize in orde to extract anything useful from the data.  In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove batch effects.

2. We would be curious to know which biostatisticians were being referred to. We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or other transforms) are critical to finding results that validate and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.

3. Assuming the data that you get (sequences, probe intensities, etc.) from high-throughput tech = direct measurement of abundance is incorrect. Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross hybridization or alignment artifacts, ozone effects, etc…

To go into a little more detail about the reasons that normalization may be important in many cases, so I have written a little more detail below with data if you are interested.

Please save the unsolicited R01s

Editor’s note: With the sequestration deadline hours away, the career of many young US scientists is on the line. In this guest post, our colleague Steven Salzberg , an avid _Editor’s note: With the sequestration deadline hours away, the career of many young US scientists is on the line. In this guest post, our colleague Steven Salzberg , an avid  and its peer review process, tells us why now more than ever the NIH should prioritize funding R01s over other project grants .

The scientific reasons it is not helpful to study the Newtown shooter's DNA

The Connecticut Medical Examiner has asked to sequence and study the DNA of the recent Newtown shooter. I’ve been seeing this pop up over the last few days on a lot of popular media sites, where they mention some objections scientists (or geneticists) may have to this “scientific” study. But I haven’t seen the objections explicitly laid out anywhere. So here are mine. Ignoring the fundamentals of the genetics of complex disease: If the violent behavior of the shooter has any genetic underpinning, it is complex.

Sunday data/statistics link roundup 12/23/12

A cool data visualization for blood glucose levels for diabetic individuals. This kind of interactive visualization can help people see where/when major health issues arise for chronic diseases. This was a class project by Jeff Heer’s Stanford CS448B students Ben Rudolph and Reno Bowen (twitter @RenoBowen). Speaking of interactive visualizations, I also got this link from Patrick M. It looks like a way to build interactive graphics and my understanding is it is compatible with R data frames, worth checking out (plus, Dex is a good name).

Top-down versus bottom-up science: data analysis edition

In our most recent video, Steven Salzberg discusses the ENCODE project. Some of the advantages and disadvantages of top-down science are described. Here, top-down refers to big coordinated projects like the Human Genome Project (HGP). In contrast, the approach of funding many small independent projects, via the R01 mechanism, is referred to as bottom-up. Note that for the cost of HGP we could have funded thousands of R01s. However it is not clear that without the HGP we would have had public sequence data as early as we did.

Sunday Data/Statistics Link Roundup (9/2/2012)

Just got back from IBC 2012 in Kobe Japan. I was in an awesome session (organized by the inimitable Lieven Clement) with great talks by Matt McCall, Djork-Arne Clevert, Adetayo Kasim, and Willem Talloen. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I saw this hanging in the airport.

Replication and validation in -omics studies - just as important as reproducibility

The psychology/social psychology community has made replication a huge focus over the last year. One reason is the recent, public blow-up over a famous study that did not replicate. There are also concerns about the experimental and conceptual design of these studies that go beyond simple lack of replication. In genomics, a similar scandal occurred due to what amounted to “data fudging”. Although, in the genomics case, much of the blame and focus has been on lack of reproducibility or data availability.

Follow up on "Statistics and the Science Club"

I agree with Roger’s latest post: “we need to expand the tent of statistics and include people who are using their statistical training to lead the new science”. I am perhaps a bit more worried than Roger. Specifically, I worry that talented go-getters interested in leading science via data analysis will achieve this without engaging our research community. A quantitatively trained person (engineers , computer scientists, physicists, etc..) with strong computing skills (knows python, C, and shell scripting), that reads, for example, “Elements of Statistical Learning” and learns R, is well on their way.

"How do we evaluate statisticians working in genomics? Why don't they publish in stats journals?" Here is my answer

During the past couple of years I have been asked these questions by several department chairs and other senior statisticians interested in hiring or promoting faculty working in genomics. The main difficulty stems from the fact that we (statisticians working in genomics) publish in journals outside the mainstream statistical journals. This can be a problem during evaluation because a quick-and-dirty approach to evaluating an academic statistician is to count papers in the Annals of Statistics, JASA, JRSS and Biometrics.

Sample mix-ups in datasets from large studies are more common than you think

If you have analyzed enough high throughput data you have seen it before: a male sample that is really a female, a liver that is a kidney, etc… As the datasets I analyze get bigger I see more and more sample mix-ups. When I find a couple of samples for which sex is incorrectly annotated (one can easily see this from examining data from X and Y chromosomes) I can’t help but wonder if there are more that are undetectable (e.

People in positions of power that don't understand statistics are a big problem for genomics

I finally got around to reading the IOM report on translational omics and it is very good. The report lays out problems with current practices and how these led to undesired results such as the now infamous Duke trials and the growth in retractions in the scientific literature. Specific recommendations are provided related to reproducibility and validation. I expect the report will improve things. Although I think bigger improvements will come as a result of retirements.