Tag: genomics


Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse

Yesterday, and bleeding over into today, quantile normalization (QN) was being discussed on Twitter. This is the tweet that started the whole thing off. The conversation went a bunch of different directions and then this happened:

well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a "good" pvalue. And then end

So Jeff and I felt it was important to respond - since we are biostatisticians that work in genomics. We felt a couple of points were worth making:

1. Most statisticians we know, including us, know QN's limitations and are always nervous about using QN. But with most datasets we see, unwanted variability is overwhelming and we are left with no choice but to normalize in order to extract anything useful from the data. In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove batch effects.

2. We would be curious to know which biostatisticians were being referred to. We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or other transforms) are critical to finding results that validate and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.

3. Assuming that the data you get (sequences, probe intensities, etc.) from high-throughput technology are a direct measurement of abundance is incorrect. Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross-hybridization or alignment artifacts, ozone effects, and more.

To go into a little more detail about why normalization is important in many cases, I have written up an example with data below, if you are interested.
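For readers who have not seen the transformation under discussion, here is a minimal sketch of quantile normalization in plain Python. The function name and the toy list-of-lists layout (one list per sample) are my own for illustration; production implementations also handle ties and missing values, which this sketch does not.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length samples (lists of numbers).

    Each value is replaced by the mean, across all samples, of the values
    sharing its rank, which forces every sample onto the same empirical
    distribution.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean of the k-th smallest value across samples, for each rank k.
    rank_means = [sum(vals) / len(samples) for vals in zip(*sorted_samples)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # ranks within this sample
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized
```

Note that after this step every sample has exactly the same distribution of values, which is precisely why QN can mask real global shifts - the limitation acknowledged in point 1 above.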



Please save the unsolicited R01s

Editor's note: With the sequestration deadline hours away, the careers of many young US scientists are on the line.  In this guest post, our colleague Steven Salzberg, an avid defender of NIH and its peer review process, tells us why now more than ever the NIH should prioritize funding R01s over other project grants.

First let's get the obvious facts out of the way: the federal budget is a mess, and Congress is completely dysfunctional.  When it comes to NIH funding, this is not a good thing.

Hidden within the larger picture, though, is a serious menace to our decades-long record of incredibly successful research in the United States.  The investigator-driven, basic research grant is in even worse shape than the overall NIH budget.  A recent analysis by FASEB, shown in the figure here, reveals that the number of new R01s reached its peak in 2003 - ten years ago! - and has been steadily declining since.  In 2003, 7,430 new R01s were awarded.  In 2012, that number had dropped to 5,437, a 27% decline.


For those who might not be familiar with the NIH system, the R01 grant is the crown jewel of research grants.  R01s are awarded to individual scientists to pursue all varieties of biomedical research, from very basic science to clinical research.  For R01s, NIH doesn't tell the scientists what to do: we propose the ideas, we write them up, and then NIH organizes a rigorous peer review (which isn't perfect, but it's the best system anyone has).  Only the top-scoring proposals get funded.

This process has gotten much tougher over the years.  In 1995, the success rate for R01s was 25.9%.  Today it is 18.4% and falling.  This includes applications from everyone, even the most experienced and proven scientists.  Thus no matter who you are, you can expect that there is more than an 80% chance that your grant application will be turned down.  In some areas it is even worse: NIAID's website announced that it is currently funding only 6% of R01s.

Why are R01s declining?  Not for lack of interest: the number of applications last year was 29,627, an all-time high.  Besides the overall budget problem, another problem is growing: the fondness of the NIH administration for big, top-down science projects, often with the letters "ome" or "omics" attached.

Yes, the human genome was a huge success.  Maybe the human microbiome will be too.  But now NIH is pushing gigantic, top-down projects: ENCODE, 1000 Genomes, the Cancer Genome Anatomy Project (CGAP), The Cancer Genome Atlas (TCGA), a new "brain-ome" project, and more. The more money is allocated to these big projects, the fewer R01s NIH can fund. For example, NIAID, with its 6% R01 success rate, has been spending tens of millions of dollars per year on 3 large Microbial Genome Sequencing Center contracts and tens of millions more on 5 large Bioinformatics Resource Center contracts.  As far as I can tell, no one uses these bioinformatics resource centers for anything - in fact, virtually no one outside the centers even knows they exist. Furthermore, these large, top-down sequencing projects don't address specific scientific hypotheses, but they produce something that the NIH administration seems to love: numbers.  It's impressive to see how many genomes they've sequenced, and it makes for nice press releases.  But very often we simply don't need these huge, top-down projects to answer scientific questions.  Genome sequencing is cheap enough that we can include it in an R01 grant, if only NIH will stop pouring all its sequencing money into these huge, monolithic projects.

I'll be the first person to cheer if Congress gets its act together and funds NIH at a level that allows reasonable growth.  But whether or not that happens, the growth of big science projects, often created and run by administrators at NIH rather than scientists who have successfully competed for R01s, represents a major threat to the scientist-driven research that has served the world so well for the past 50 years.  Many scientists are afraid to speak out against this trend, because by doing so we (yes, this includes me) are criticizing those same NIH administrators who manage our R01s.  But someone has to say something.  A 27% decline in the number of R01s over the past decade is not a good thing.  Maybe it's time to stop the omics train.


The scientific reasons it is not helpful to study the Newtown shooter's DNA

The Connecticut Medical Examiner has asked to sequence and study the DNA of the recent Newtown shooter. I've been seeing this pop up over the last few days on a lot of popular media sites, where they mention some objections scientists (or geneticists) may have to this "scientific" study. But I haven't seen the objections explicitly laid out anywhere. So here are mine.

Ignoring the fundamentals of the genetics of complex disease: If the violent behavior of the shooter has any genetic underpinning, it is complex. If you only look at one person's DNA, without a clear behavior definition (violent? mental disorder? etc.?), it is impossible to assess important complications such as penetrance, epistasis, and gene-environment interactions, to name a few. These make statistical analysis incredibly complicated even in huge, well-designed studies.

Small Sample Size: One person hit on the issue that is maybe the biggest reason this is a waste of time and likely to lead to incorrect results. You can't draw a reasonable conclusion about any population by looking at only one individual. This is actually a fundamental component of statistical inference. The goal of statistical inference is to take a small, representative sample and use data from that sample to say something about the bigger population. In this case, there are two reasons that the usual practice of statistical inference can't be applied: (1) only one individual is being considered, so we can't measure anything about how variable (or accurate) the data are, and (2) we've picked one incredibly high-profile, and almost certainly not representative, individual to study.

Multiple testing/data dredging: The small sample size problem is compounded by the fact that we aren't looking at just one or two of the shooter's genes, but rather the whole genome. To see why making statements about violent individuals based on only one person's DNA is a bad idea, think about the 20,000 genes in a human body. Let's suppose that only one of the genes causes violent behavior (it is definitely more complicated than that) and that there is no environmental cause to the violent behavior (clearly false). Furthermore, suppose that if you have the bad version of the violent gene you will do something violent in your life (almost definitely not a sure thing).

Now, even with all these simplifying (and incorrect) assumptions, for each gene you flip a coin with a different chance of coming up heads. The violent gene turned up tails, but so did a large number of other genes. If we compare the set of genes that came up tails to those of another individual, they will have a huge number in common in addition to the violent gene. So based on this information, you would have no idea which gene causes violence, even in this hugely simplified scenario.
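The thought experiment above is easy to simulate. Here is a sketch in Python using only the standard library; the gene count is roughly right, but the per-gene variant frequency and the identity of the "violent gene" are made-up illustrative assumptions.

```python
import random

random.seed(1)

N_GENES = 20_000      # roughly the number of human genes
P_BAD_VARIANT = 0.5   # illustrative per-gene chance of carrying the "bad" version
VIOLENT_GENE = 1234   # the one hypothetical causal gene in this toy model

# One individual's genome: True means they carry the "bad" version of that gene.
genome = [random.random() < P_BAD_VARIANT for _ in range(N_GENES)]
genome[VIOLENT_GENE] = True  # by assumption, the shooter carries the bad version

# With only one genome to look at, every gene carrying the "bad" version
# is an equally plausible suspect - there is nothing to distinguish them.
suspects = [g for g in range(N_GENES) if genome[g]]
print(len(suspects))  # prints the number of equally plausible "suspect" genes
```

With these assumptions, roughly half the genome ends up on the suspect list, and the single causal gene is indistinguishable from the thousands of innocent ones.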

Heavy reliance on prior information/intuition: This is a supposedly scientific study, but the small sample size/multiple testing problems mean any conclusions from the data will be very very weak. The only thing you could do is take the set of genes you found and then rely on previous studies to try to determine which one is the "violence gene". But now you are being guided by intuition, guesswork, and a bunch of studies that may or may not be relevant. The result is that more than likely you'd end up on the wrong gene.

The result is that it is highly likely that no solid statistical information will be derived from this experiment. Just because the technology exists to run an experiment doesn't mean that experiment will teach us anything.


Sunday data/statistics link roundup 12/23/12

  1. A cool data visualization for blood glucose levels for diabetic individuals. This kind of interactive visualization can help people see where/when major health issues arise for chronic diseases. This was a class project by Jeff Heer's Stanford CS448B students Ben Rudolph and Reno Bowen (twitter @RenoBowen). Speaking of interactive visualizations, I also got this link from Patrick M. It looks like a way to build interactive graphics, and my understanding is that it is compatible with R data frames, so it's worth checking out (plus, Dex is a good name).
  2. Here is an interesting review of Nate Silver's book. The interesting thing about the review is that it doesn't criticize the statistical content, but criticizes the belief that people only use data analysis for good. This is an interesting theme we've seen before. Gelman also reviews the review.
  3. It's a little late now, but this tool seems useful for folks who want to know whatdoineedonmyfinal?
  4. A list of the best open data releases of 2012. I particularly like the rat sightings in New York and the Baltimore fixed speed cameras (which I have a habit of running afoul of).
  5. A map of data scientists on Twitter.  Unfortunately, since we don't have "data scientist" in our Twitter description, Simply Statistics does not appear. I'm sure we would have been central....
  6. Here is an interesting paper where some investigators developed a technology that directly reads out a bar chart of the relevant quantities. They mention this means there is no need for statistical analysis. I wonder if the technology also reads out error bars.

Top-down versus bottom-up science: data analysis edition

In our most recent video, Steven Salzberg discusses the ENCODE project. Some of the advantages and disadvantages of top-down science are described.  Here, top-down refers to big coordinated projects like the Human Genome Project (HGP). In contrast, the approach of funding many small independent projects, via the R01 mechanism, is referred to as bottom-up. Note that for the cost of HGP we could have funded thousands of R01s. However it is not clear that without the HGP we would have had public sequence data as early as we did. As Steven points out, when it comes to data generation the economies of scale make big projects more efficient. But the same is not necessarily true for data analysis.

Big projects like ENCODE and 1000 Genomes include data analysis teams that work in coordination with the data producers.  It is true that very good teams are assembled and very good tools developed. But what if, instead of holding the data under embargo until the first analysis is done and a paper (or 30) is published, the data were made publicly available with no restrictions and the scientific community were challenged to compete for data analysis and biological discovery R01s? I have no evidence that this would produce better science, but my intuition is that, at least in the case of data analysis, better methods would be developed. Here is my reasoning. Think of the best 100 data analysts in academia and consider the following two approaches:

1- Pick the best among the 100 and have a small group carefully coordinate with the data producers to develop data analysis methods.

2- Let all 100 take a whack at it and see what falls out.

In scenario 1 the selected group has artificial protection from competing approaches and there are fewer brains generating novel ideas. In scenario 2 the competition would be fierce and after several rounds of sharing ideas (via publications and conferences), groups would borrow from others and generate even better methods.

Note that the big projects do make the data available and R01s are awarded to develop analysis tools for these data. But this only happens after giving the consortium’s group a substantial head start. 

I have not participated in any of these consortia and perhaps I am being naive. So I am very interested to hear the opinions of others.


Sunday Data/Statistics Link Roundup (9/2/2012)

  1. Just got back from IBC 2012 in Kobe Japan. I was in an awesome session (organized by the inimitable Lieven Clement) with great talks by Matt McCall, Djork-Arne Clevert, Adetayo Kasim, and Willem Talloen. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I saw this hanging in the airport. A fitting welcome back to the states. Although, as we talked about in our first podcast, I wonder how long the Big Data hype will last…
  2. Simina B. sent this link along for a masters program in analytics at NC State. Interesting because it looks a lot like a masters in statistics program, but with a heavier emphasis on data collection/data management. I wonder what role the stat department down there is playing in this program and if we will see more like it pop up? Or if programs like this with more data management will be run by stats departments other places. Maybe our friends down in Raleigh have some thoughts for us. 
  3. If one set of weekly links isn’t enough to fill your procrastination quota, go check out NextGenSeek’s weekly stories. A bit genomics focused, but lots of cool data/statistics links in there too. Love the “extreme Venn diagrams”. 
  4. This seems almost like the fast statistics journal I proposed earlier. Can’t seem to access the first issue/editorial board either. Doesn’t look like it is open access, so it’s still not perfect. But I love the sentiment of fast/single round review. We can do better though. I think Yihui X. has some really interesting ideas on how. 
  5. My wife taught for a year at Grinnell in Iowa and loved it there. They just released this cool data set with a bunch of information about the college. If all colleges did this, we could really dig in and learn a lot about the American secondary education system (link via Hilary M.). 
  6. From the way-back machine, a rant from Rafa about meetings. Stay tuned this week for some Simply Statistics data about our first year on the series of tubes.

Replication and validation in -omics studies - just as important as reproducibility

The psychology/social psychology community has made replication a huge focus over the last year. One reason is the recent, public blow-up over a famous study that did not replicate. There are also concerns about the experimental and conceptual design of these studies that go beyond a simple lack of replication. In genomics, a similar scandal occurred due to what amounted to “data fudging”, although in the genomics case much of the blame and focus has been on lack of reproducibility and data availability.

I think one of the reasons that the field of genomics has focused more on reproducibility is that replication is already more consistently performed in genomics. There are two forms of this replication: validation and independent replication. Validation generally refers to a replication experiment performed by the same research lab or group - with a different technology or a different data set. On the other hand, independent replication of results is usually performed by an outside laboratory. 

Validation is by far the more common form of replication in genomics. In this article in Science, Ioannidis and Khoury point out that validation has different meaning depending on the subfield of genomics. In GWAS studies, it is now expected that every significant result will be validated in a second large cohort with genome-wide significance for the identified variants.

In gene expression/protein expression/systems biology analyses, there has been no similar definition of the “criteria for validation”. Generally the experiments are performed and if a few/a majority/most of the results are confirmed, the approach is considered validated. My colleagues and I just published a paper where we define a new statistical sampling approach for validating lists of features in genomics studies that is somewhat less ambiguous. But I think this is only a starting point. Just like in psychology, we need to focus not just on reproducibility, but also replicability of our results, and we need new statistical approaches for evaluating whether validation/replication have actually occurred. 


Follow up on "Statistics and the Science Club"

I agree with Roger’s latest post: “we need to expand the tent of statistics and include people who are using their statistical training to lead the new science”. I am perhaps a bit more worried than Roger. Specifically, I worry that talented go-getters interested in leading science via data analysis will achieve this without engaging our research community. 

A quantitatively trained person (an engineer, computer scientist, physicist, etc.) with strong computing skills (knows Python, C, and shell scripting), who reads, for example, “Elements of Statistical Learning” and learns R, is well on their way. Eventually, many of these users of statistics will become developers, and if we don’t keep up, then what do they need from us? Our already-written books may be enough. In fact, in genomics, I know several people like this who are already developing novel statistical methods. I want these researchers to be part of our academic departments. Otherwise, I fear we will not be in touch with the problems and data that lead to, quoting Roger, “the most exciting developments of our lifetime.” 


"How do we evaluate statisticians working in genomics? Why don't they publish in stats journals?" Here is my answer

During the past couple of years I have been asked these questions by several department chairs and other senior statisticians interested in hiring or promoting faculty working in genomics. The main difficulty stems from the fact that we (statisticians working in genomics) publish in journals outside the mainstream statistical journals. This can be a problem during evaluation because a quick-and-dirty approach to evaluating an academic statistician is to count papers in the Annals of Statistics, JASA, JRSS and Biometrics. The evaluators feel safe counting these papers because they trust the fellow-statistician editors of these journals. However, statisticians working in genomics tend to publish in journals like Nature Genetics, Genome Research, PNAS, Nature Methods, Nucleic Acids Research, Genome Biology, and Bioinformatics. In general, these journals do not recruit statistical referees, and a considerable number of papers with questionable statistics do get published in them. However, when a paper’s main topic is a statistical method, or when it relies heavily on statistical methods, statistical referees are used. So, if the statistician is the corresponding or last author and it’s a stats paper, it is OK to assume the statistics are fine, and you should go ahead and be impressed by the impact factor of the journal… it’s not easy getting statistics papers into these journals. 

But we really should not be counting papers blindly. Instead we should be reading at least some of them. But here again the evaluators get stuck as we tend to publish papers with application/technology specific jargon and show-off by presenting results that are of interest to our potential users (biologists) and not necessarily to our fellow statisticians. Here all I can recommend is that you seek help. There are now a handful of us that are full professors and most of us are more than willing to help out with, for example, promotion letters.

So why don’t we publish in statistical journals? The fear of getting scooped due to the slow turnaround of stats journals is only one reason. New technologies that quickly became widely used (microarrays in 2000 and nextgen sequencing today) created a need for data analysis methods among large groups of biologists. Journals with large readerships and high impact factors, typically not interested in straight statistical methodology work, suddenly became amenable to publishing our papers, especially if they solved a data analytic problem faced by many biologists. The possibility of publishing in widely read journals is certainly seductive. 

While in several other fields, data analysis methodology development is restricted to the statistics discipline, in genomics we compete with other quantitative scientists capable of developing useful solutions: computer scientists, physicists, and engineers were also seduced by the possibility of gaining notoriety with publications in high impact journals. Thus, in genomics, the competition for funding, citation and publication in the top scientific journals is fierce. 

Then there is funding. Note that while most biostatistics methodology NIH proposals go to the Biostatistical Methods and Research Design (BMRD) study section, many of the genomics related grants get sent to other sections such as the Genomics, Computational Biology and Technology (GCAT) and Biodata Management and Analysis (BDMA) study sections. BDMA and GCAT are much more impressed by Nature Genetics and Genome Research than by JASA and Biometrics. They also look for citations and software downloads. 

To be considered successful by our peers in genomics, those who referee our papers and review our grant applications, our statistical methods need to be delivered as software and garner a user base. Publications in statistical journals, especially those not appearing in PubMed, are not rewarded. This lack of incentive combined with how time consuming it is to produce and maintain usable software, has led many statisticians working in genomics to focus solely on the development of practical methods rather than generalizable mathematical theory. As a result, statisticians working in genomics do not publish much in the traditional statistical journals. You should not hold this against them, especially if they are developers and maintainers of widely used software.


Sample mix-ups in datasets from large studies are more common than you think

If you have analyzed enough high throughput data you have seen it before: a male sample that is really a female, a liver that is a kidney, etc… As the datasets I analyze get bigger, I see more and more sample mix-ups. When I find a couple of samples for which sex is incorrectly annotated (one can easily see this by examining data from the X and Y chromosomes), I can’t help but wonder if there are more that are undetectable (e.g. swaps between samples of the same sex). Datasets that include two types of measurements, for example genotypes and gene expression, make it possible to detect sample swaps more generally. I recently attended a talk by Karl Broman on this topic (one of the best talks I’ve seen… check out the slides here). Karl reports an example in which it looks as if whoever was pipetting skipped a sample and kept on going, introducing an off-by-one error for over 50 samples. As I sat through the talk, I wondered how many of the large GWAS studies have mix-ups like this.
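The sex check mentioned above is simple enough to sketch: average expression of Y-chromosome genes separates males from females, so a sample whose Y expression disagrees with its annotated sex is a likely mix-up. Here is a toy version in Python; the function name, the input layout, and the threshold are my own illustrative assumptions (in practice the threshold would be chosen by looking at the data, which typically separate into two clean clusters).

```python
def flag_sex_mixups(samples, y_threshold=1.0):
    """Flag samples whose annotated sex disagrees with Y-gene expression.

    samples: list of (sample_id, annotated_sex, mean_y_expression) tuples,
    where annotated_sex is "M" or "F". High mean expression across
    Y-chromosome genes predicts male; low predicts female.
    """
    flagged = []
    for sample_id, annotated, y_expr in samples:
        predicted = "M" if y_expr > y_threshold else "F"
        if predicted != annotated:
            flagged.append(sample_id)  # annotation and data disagree
    return flagged
```

For example, with made-up values, `flag_sex_mixups([("s1", "M", 3.2), ("s2", "F", 0.1), ("s3", "F", 2.9)])` would flag `"s3"`: a sample annotated female but expressing Y-chromosome genes at male-like levels. As noted above, this check cannot catch swaps between samples of the same sex, which is where methods combining genotype and expression data come in.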

A recent paper (gated) published by Lude Franke and colleagues describes MixupMapper: a method for detecting and correcting mix-ups. They examined several public datasets and discovered mix-ups in all of them. The worst performing study, published in PLoS Genetics, was reported to have 23% of its samples swapped. I was surprised that the MixupMapper paper was not published in a higher impact journal. Turns out PLoS Genetics rejected the paper. I think this was a big mistake on their part: the paper is clear and well written, reports a problem with a PLoS Genetics paper, and describes a solution to a problem that should have us all quite worried. I think it’s important that everybody learn about this problem, so I was happy to see that, eight months later, Nature Genetics published a paper reporting mix-ups (gated)… but they didn’t cite the MixupMapper paper! Sorry Lude, welcome to the reverse-scooped club.