26 Apr

Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse


Yesterday, and bleeding over into today, quantile normalization (QN) was being discussed on Twitter. This is the tweet that started the whole thing off. The conversation went a bunch of different directions and then this happened:

well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a "good" pvalue. And then end

So Jeff and I felt it was important to respond - since we are biostatisticians that work in genomics. We felt a couple of points were worth making:

1. Most statisticians we know, including us, know QN's limitations and are always nervous about using QN. But with most datasets we see, unwanted variability is overwhelming and we are left with no choice but to normalize in order to extract anything useful from the data. In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove batch effects.

2. We would be curious to know which biostatisticians were being referred to. We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or other transforms) are critical to finding results that validate and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.

3. Assuming the data that you get (sequences, probe intensities, etc.) from high-throughput tech = direct measurement of abundance is incorrect. Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross hybridization or alignment artifacts, ozone effects, etc...

To go into a little more detail about why normalization is important in many cases, I have written some more below, with data, if you are interested.

Most, if not all, the high throughput data we have analyzed needs some kind of normalization. This applies to both microarrays and next-gen sequencing. To demonstrate why, below I include 5 boxplots of log intensities from 5 microarrays that were hybridized to the same RNA (technical replicates).

[Figure: boxplots of log intensities from 5 microarrays hybridized to the same RNA]

See the problem? If we took the data at face value we would conclude that there is a large (almost 2 fold) global change in expression when comparing, say, samples C and E. But they are technical replicates so the observed difference is not biologically driven. Discrepancies like these are the rule rather than the exception. Biologists seem to underestimate the amount of unwanted variability present in the data they produce. Look at enough data and you will quickly learn that, in most cases, unwanted experimental variability dwarfs the biological differences we are interested in discovering. Normalization is the statistical technique that saves biologists millions of dollars a year by fixing this problem in silico rather than redoing the experiment.

For the data above you might be tempted to simply standardize the data by subtracting the median. But the problem is more complicated than that, as shown in the plot below. This plot shows the log ratio (M) versus the average of the log intensities (A) for two technical replicates in which 16 probes (red dots) have been "spiked-in" to have true fold changes of 2. The other ~20,000 probesets (blue streak) are supposed to be unchanged (M=0). See the curvature of the genes that are supposed to be at 0? Taken at face value, thousands of the lowly expressed probes exhibit larger differential expression than the only 16 that are actually different. That's a problem. And standardizing by subtracting the median won't fix it. Non-linear biases such as this one are also quite common.

[Figure: MA plots of the two technical replicates, before and after quantile normalization]
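To make the M and A in the plot concrete, here is a small numpy sketch (simulated intensities, not the spike-in experiment described above) that computes the MA coordinates and shows why median-centering cannot remove intensity-dependent curvature: it subtracts the same constant from every M, leaving any dependence of M on A untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated raw intensities for two technical replicates (illustrative only).
x = 2 ** rng.normal(8, 2, size=20000)
y = x * 2 ** rng.normal(0, 0.2, size=20000)   # pure measurement noise, no real changes

# MA-plot coordinates: log ratio (M) vs. average log intensity (A).
M = np.log2(x) - np.log2(y)
A = 0.5 * (np.log2(x) + np.log2(y))

# Median-centering subtracts one constant from every M ...
M_centered = M - np.median(M)
# ... so any dependence of M on A (the curvature) is left untouched.
```

Plotting `M` (or `M_centered`) against `A` reproduces the shape of an MA plot; only a transformation that is allowed to vary with A can straighten out the curvature.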

QN offers one solution to this problem if you can assume that the true distribution of what you are measuring is roughly the same across samples. Briefly, QN forces each sample to have the same distribution. The "after" picture above is the result of QN. It removes the curvature but preserves most of the real differences.
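For concreteness, here is a minimal numpy sketch of the basic QN recipe (my own illustration, not any particular package's implementation): sort each sample, average across samples to form a reference distribution, then hand each value back to each sample according to its rank.

```python
import numpy as np

def quantile_normalize(X):
    """Force every column (sample) of X to share the same distribution.

    X: genes x samples matrix of (log) intensities. Each column is
    replaced by the across-sample mean quantile profile, mapped back
    through that column's own ranks.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value within its column
    sorted_cols = np.sort(X, axis=0)                   # each column sorted
    mean_quantiles = sorted_cols.mean(axis=1)          # reference distribution
    return mean_quantiles[ranks]                       # reassign by rank

# Technical replicates plus an artificial global shift in one sample:
rng = np.random.default_rng(0)
truth = rng.normal(8, 2, size=(1000, 1))
X = truth + rng.normal(0, 0.1, size=(1000, 3))
X[:, 2] += 1.0                        # unwanted global shift, as in the boxplots above
Xn = quantile_normalize(X)            # after QN, all columns share one distribution
```

After normalization every column contains exactly the same multiset of values, so the global shift is gone while the gene-to-gene ordering within each sample is preserved.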

So why should we be nervous? QN and other normalization techniques risk throwing the baby out with the bath water. What if there is a real global difference? If there is, and you use QN, you will miss it and you may introduce artifacts. But the assumptions are no secret and it's up to the biologists to decide if they are reasonable. At the same time, we have to be very careful about interpreting large scale changes given that we see large scale changes when we know there are none. Other than cases where global differences are forced or simulated, I have yet to see a good example in which QN causes more harm than good. I'm sure there are some real data examples out there, so if you have one please share, as I would love to use it as an example in class.

Also note that statisticians (including me) are working hard at devising ways to normalize without the need for such strong assumptions. Although in their first incarnation they were useless, current control probes/transcripts techniques are promising. We have used them in the past to normalize methylation data (a similar approach was used here for gene expression data). And then there is subset quantile normalization. I am sure there are others and more to come. So, biologists, don't worry: we have your backs and serve at your pleasure. In the meantime don't be so afraid of QN: at least give it a try before you knock it.
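As a toy illustration of the control-probe idea (a hypothetical sketch, far simpler than the published methods referenced above): estimate each sample's shift from the controls alone, so that a genuine global change among the non-control probes is not washed out.

```python
import numpy as np

def control_normalize(X, is_control):
    """Shift each sample so its control probes line up (toy sketch).

    X: probes x samples matrix of log intensities.
    is_control: boolean mask marking probes assumed unchanged across samples.
    Only the controls inform the per-sample shift, so real global
    differences among the remaining probes survive normalization.
    """
    ctrl_medians = np.median(X[is_control], axis=0)   # one median per sample
    shift = ctrl_medians - ctrl_medians.mean()        # deviation from the common level
    return X - shift

# Example: a real 1.5-unit global change plus a 0.7-unit technical shift.
rng = np.random.default_rng(2)
X = rng.normal(8, 1, size=(2000, 2))
is_control = np.zeros(2000, dtype=bool)
is_control[:200] = True
X[~is_control, 1] += 1.5     # true global biology in sample 2
X[:, 1] += 0.7               # technical artifact affecting every probe
Xn = control_normalize(X, is_control)
```

Unlike QN, this only removes the technical 0.7-unit shift (as estimated from the controls); the real 1.5-unit global difference remains in the data.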

  • Titus Brown

    I would appreciate a technical discussion of how to decide whether or not QN is applicable; in many of our RNAseq data sets, we find that the distributions between samples are different enough that QN seems inappropriate.

    • Rafael Irizarry

      Is the statement "we find that the distributions between samples are different enough that QN seems inappropriate" based on data? If it is, then I would argue that it is very hard to distinguish between biology and experimental artifact.

      My own approach to deciding whether QN is appropriate is discussing the biology with the expert. I have some ideas for using data-driven approaches but they are a bit too complicated for a comment section. Maybe I'll write something up.

      • Titus Brown

        Largely based on eyeballing the distributions of the mRNAseq expression levels. Since we are trying to compare across different enrichment techniques or time points, I would *expect* the distributions to be different.

        Perhaps I'm reading into your response, but you seem to be saying that if you *can't* use QN, then you're stuck. I find that scientifically troubling...

        • Rafael Irizarry

          If you accept that global differences can be due to experimental artifacts you will have to come up with some way to normalize that doesn't remove the real global differences you assume are there. I've used control probes when I've encountered this problem. Many of us are working on this problem. See the end of the post for a couple of references.

          • Titus Brown

            I accept the general point, of course. But I dislike the reliance on QN. Maybe I'm just not sufficiently grokking that there aren't alternatives?

          • Rafael Irizarry

            I should have pointed out that there are many other choices that work just as well as QN. But many (most) rely on similarly strong assumptions that wash out global changes.

  • http://twitter.com/michelebusby Michele Busby

    I think that in most cases normalizing microarrays is a very different animal than normalizing RNA-seq expression measurements, because microarrays have issues with fold-change compression at the upper and lower tails of the measurements, and RNA-seq is closer to a direct measurement of abundance.

    In RNA-Seq if you have well-behaved libraries you should be seeing something like a binomial sampling from two different but similar distributions. If you plot the gene counts against one another you often find a straight line that goes through the center, which I think justifies a scaling approach. In microarrays the line isn't straight so you need quantile normalization, or something like it.

    If the libraries are so different and you can't find a straight line...I think I would probably use spike ins and try to find a straight line before I tried quantile normalization.

    You do have problems when your library complexities vary a lot in that the median isn't the line through the center. That would be two lognormal distributions with a different variance. Does scaling work then? I'd have to do some simulations to figure that out but maybe you know off the top of your head.

    I agree that you can't do blind normalization.
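The scaling approach described in this last comment can be sketched in a few lines. This is the median-of-ratios idea (the intuition behind size-factor normalization in tools like DESeq, written here from scratch rather than taken from any package): ratios are formed against a geometric-mean pseudo-reference, and the median ratio per sample estimates the slope of the "straight line through the center", assuming most genes are unchanged.

```python
import numpy as np

def size_factors(counts):
    """Per-sample scaling factors via the median-of-ratios idea (sketch).

    counts: genes x samples matrix of RNA-seq read counts.
    Each sample's factor is the median of its genewise ratios to a
    geometric-mean pseudo-reference; if most genes are unchanged, the
    median is a robust estimate of the sample's scaling.
    """
    keep = (counts > 0).all(axis=1)            # genes observed in every sample
    logc = np.log(counts[keep].astype(float))
    ref = logc.mean(axis=1)                    # log of geometric-mean reference
    return np.exp(np.median(logc - ref[:, None], axis=0))

# Sample 2 sequenced exactly twice as deeply as sample 1:
counts = np.array([[100, 200],
                   [ 50, 100],
                   [ 10,  20],
                   [400, 800]])
sf = size_factors(counts)
scaled = counts / sf          # columns now directly comparable
```

When the only difference between libraries is sequencing depth, the factors recover it exactly; when library complexities differ, the median ratio may no longer sit on the true line, which is the concern raised in the comment above.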