Simply Statistics

24 Jun

How public relations and the media are distorting science

Throughout history, engineers, medical doctors and other applied scientists have helped convert basic science discoveries into products, public goods and policy that have greatly improved our quality of life. With rare exceptions, it has taken years, if not decades, to establish these discoveries. And even the exceptions stand on the shoulders of incremental contributions. The researchers who produce this knowledge go through a slow and painstaking process to reach these achievements.

In contrast, most science-related media reports that grab the public's attention fall into three categories:

  1. The exaggerated big discovery: Recent examples include the discovery of the bubonic plague in the NYC subway, liquid water on Mars, and the infidelity gene.
  2. Over-promising: These try to explain a complicated basic science finding and, in the case of biomedical research, then speculate without much explanation that the finding will "lead to a deeper understanding of diseases and new ways to treat or cure them."
  3. Science is broken: These tend to report an anecdote about an allegedly corrupt scientist, maybe cite the "Why Most Published Research Findings Are False" paper, and then extrapolate speculatively.

In my estimation, despite the attention-grabbing headlines, the great majority of the subject matter included in these reports will not have an impact on our lives and will not even make it into scientific textbooks. So does science still have anything to offer? Reports of the third category have even scientists particularly worried. I, however, remain optimistic. First, I do not see any empirical evidence showing that the negative effects of the lack of reproducibility are worse now than 50 years ago. Furthermore, although not widely reported in the lay press, I continue to see bodies of work built by several scientists over several years or decades with much promise of leading to tangible improvements to our quality of life. Recent advances that I am excited about include topological insulators, PD-1 pathway inhibitors, clustered regularly interspaced short palindromic repeats (CRISPR), advances in solar energy technology, and prosthetic robotics.

However, there is one general aspect of science that I do believe has become worse: the extent to which scientists jockey for media attention, even if it's short-lived. Instead of striving to have a sustained impact on our field, which may take decades to achieve, an increasing number of scientists seem to place more value on appearing in the New York Times, giving a TED talk or having a blog post or tweet go viral. As a consequence, too many of us end up working on superficial short-term challenges that, with the help of a professionally crafted press release, may result in an attention-grabbing media report. NB: I fully support science communication efforts, but not when the primary purpose is garnering attention rather than educating.

My concern spills over to funding agencies and philanthropic organizations as well. Consider the following two options. Option 1: be the funding agency representative tasked with organizing a big science project with a well-oiled PR machine. Option 2: be the funding agency representative in charge of several small projects, one of which may with low, but non-negligible, probability result in a Nobel Prize 30 years down the road. In the current environment, I see a preference for option 1.

I am also concerned about how this atmosphere may negatively affect societal improvements within science. Publicly shaming transgressors on Twitter or expressing one's outrage in a blog post can garner many social media clicks. However, these may have a smaller positive impact than mundane activities such as serving on a committee that, after several months of meetings, implements incremental, yet positive, changes. Time and energy spent on trying to increase internet clicks is time and energy we don't spend on the tedious administrative activities that are needed to actually effect change.

Because so many of the scientists who thrive in this atmosphere of short-lived media reports are disproportionately rewarded, I imagine investigators starting their careers feel some pressure to garner media attention of their own. Furthermore, their view of how they are evaluated may be highly biased because evaluators who ignore media reports and focus on the specifics of the scientific content tend to be less visible. So if you want to spend your academic career slowly building a body of work with the hope of being appreciated decades from now, you should not think that it is hopeless based on what is, perhaps, a distorted view of how we are currently being evaluated.

Update: changed topological insulators links to these two. Here is one more. Via David S.

09 Jun

Batch effects are everywhere! Deflategate edition

In my opinion, batch effects are the biggest challenge faced by genomics research, especially in precision medicine. As we point out in this review, they are everywhere among high-throughput experiments. But batch effects are not specific to genomics technology. In fact, in this 1972 paper (paywalled), WJ Youden describes batch effects in the context of measurements made by physicists. Check out this plot of speed-of-light estimates with confidence intervals (red and green are from the same lab).

[Figure: speed-of-light estimates with confidence intervals; red and green are from the same lab]
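To see how a batch effect can produce this pattern, here is a minimal R sketch (simulated measurements with hypothetical lab shifts, not Youden's data): each lab's replicates agree tightly with one another, yet the labs disagree with each other and with the truth.

set.seed(1)
truth <- 299792.458 # speed of light in km/s
lab <- factor(rep(1:5, each = 4)) # 5 labs, 4 replicate measurements each
lab_shift <- rnorm(5, mean = 0, sd = 3)[lab] # per-lab systematic error: the batch effect
y <- truth + lab_shift + rnorm(length(lab), sd = 0.5) # measurement noise is much smaller
boxplot(y ~ lab, xlab = "lab", ylab = "estimate") # labs are precise but not accurate
abline(h = truth, lty = 2)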

Sometimes you find batch effects where you least expect them, for example in the deflategate debate. Here is a quote from the New England Patriots' deflategate rebuttal (written with help from Nobel Prize winner Roderick MacKinnon):

in other words, the Colts balls were measured after the Patriots balls and had warmed up more. For the above reasons, the Wells Report conclusion that physical law cannot explain the pressures is incorrect.

Here is another one:

In the pressure measurements physical conditions were not very well-defined and major uncertainties, such as which gauge was used in pre-game measurements, affect conclusions.

So, NFL, please read our paper before you accuse a player of cheating.

Disclaimer: I live in New England but I am a Ravens fan.

20 May

Is it species or is it batch? They are confounded, so we can't know

In a 2005 OMICS paper, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that "any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue". Note that this was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc... Two follow-up papers (here and here) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different, and thus measurement platform and species were completely confounded. In a 2010 paper, we confirmed that once this technical variability was accounted for, the number of genes expressed in common between the same tissue across the two species was much higher than the number expressed in common between different tissues within the same species (see Figure 2 here).

So what is confounding and why is it a problem? This topic has been discussed broadly. We wrote a review some time ago. But based on recent discussions I've participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use

  • Xi to represent the gene expression measurements for human tissue i,
  • aX to represent the level of expression that is specific to humans and
  • bX to represent the batch effect introduced by the use of the human microarray platform.
  • Therefore Xi = aX + bX + ei, with ei the tissue i effect and other uninteresting sources of variability.

Similarly, we will use:

  • Yi to represent the measurements for mouse tissue i
  • aY  to represent the mouse specific level and
  • bY the batch effect introduced by the use of the mouse microarray platform.
  • Therefore Yi = aY + bY + fi, with fi the tissue i effect and other uninteresting sources of variability.

If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:

 aY - aX

Naively, we would think that we can estimate this quantity using the observed differences between the species, which cancel out the tissue effect. We observe a difference for each tissue: Y1 - X1, Y2 - X2, etc... The problem is that aX and bX always appear together, as do aY and bY. We say that the batch effect bX is confounded with the species effect aX. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:

Yi - Xi = (aY - aX) + (bY - bX) + other sources of variability

and then estimate the unknown quantities of interest, (aY - aX) and (bY - bX), from the observed data Y1 - X1, Y2 - X2, etc... The problem is that we can estimate the aggregate effect (aY - aX) + (bY - bX), but, mathematically, we can't tease apart the two differences. To see this, note that if we are using least squares, the estimates (aY - aX) = 7, (bY - bX) = 3 will fit the data exactly as well as (aY - aX) = 3, (bY - bX) = 7 since

{(Yi - Xi) - (7 + 3)}^2 = {(Yi - Xi) - (3 + 7)}^2.

In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is trying to find a unique solution to the single equation m + n = 0: any pair with n = -m satisfies it. If batch and species are not confounded, then we can tease apart the differences, just as if we were given a second equation: m + n = 0 and m - n = 2 have the unique solution m = 1, n = -1. You can learn more about this in this linear models course.
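A minimal R sketch (simulated data, using the hypothetical effect sizes 7 and 3 from above) makes the aliasing concrete: because species and platform split the samples identically, lm() can only estimate the sum of the two effects and reports NA for the aliased coefficient.

set.seed(1)
n_tissues <- 10
species <- factor(rep(c("human", "mouse"), each = n_tissues))
platform <- factor(rep(c("arrayH", "arrayM"), each = n_tissues)) # identical split: perfect confounding
y <- 7 * (species == "mouse") + 3 * (platform == "arrayM") + rnorm(2 * n_tissues)
coef(lm(y ~ species + platform))
## speciesmouse estimates the aggregate 7 + 3 = 10; platformarrayM is NA (aliased)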

Note that the above derivation applies to each gene affected by the batch effect. In practice we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species, we may see large differences even when there is no species effect. This is because the bY - bX differences for each gene are squared and added up.
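The same kind of simulation (hypothetical per-gene batch differences, no species effect at all) shows how these squared differences accumulate into a large distance:

set.seed(1)
n_genes <- 1000
b_diff <- rnorm(n_genes, sd = 2) # per-gene batch difference bY - bX
human <- rnorm(n_genes) # expression profile on the human platform
mouse <- human + b_diff # identical biology, different batch
sqrt(sum((mouse - human)^2)) # ~ 63, driven entirely by the batch effect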

In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a 2010 Nature Genetics Review about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (ComBat, SVA, RUV, etc...) can save you from a situation with perfect confounding. Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach.
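As a sketch of what randomization looks like in practice (hypothetical sample sheet and lane names), samples can be assigned to processing batches at random within each species, so that batch is no longer confounded with the variable of interest:

set.seed(1)
samples <- data.frame(id = 1:20, species = rep(c("human", "mouse"), each = 10))
## stratified randomization: within each species, half the samples go to each lane
samples$batch <- unlist(lapply(split(samples$id, samples$species),
                               function(i) sample(rep(c("lane1", "lane2"), length(i) / 2))))
table(samples$species, samples$batch) # balanced: 5 samples per species in each lane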

Almost a decade after the OMICS paper was published, the same surprising conclusion was reached in this PNAS paper: "tissues appear more similar to one another within the same species than to the comparable organs of other species". This time RNA-seq was used for both species, and therefore the different-platform issue was not considered*. The authors thus implicitly assumed that (bY - bX) = 0. However, in a recent F1000Research publication, Gilad and Mizrahi-Man describe an exercise in forensic bioinformatics that led them to discover that the mouse and human samples were run in different lanes or on different instruments. The confounding was near perfect (see Figure 1). As pointed out by these authors, with this experimental design we can't simply accept that (bY - bX) = 0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a linear model (ComBat) to account for the batch/species effect and find that samples cluster almost perfectly by tissue. However, they correctly note that, due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect. Unfortunately, given the experimental design, it will be hard or impossible to determine whether it's batch or species. More data and more analyses are needed.

Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiment, learn about randomization.

* The fact that RNA-seq was used does not necessarily mean there is no platform effect. The species have different genomes with different sequences, which can lead to different biases during the experimental protocols.

Update: Shin Lin has repeated a small version of the experiment described in the PNAS paper. The new experimental design does not confound lane/instrument with species. The new data confirm their original results, indicating that lane/instrument does not explain the clustering by species. You can see his response in the comments here.

13 Apr

Genomics Case Studies Online Courses Start in Two Weeks (4/27)

The last month of the HarvardX Data Analysis for Genomics series starts on 4/27. We will cover case studies on RNA-seq, variant calling, ChIP-seq and DNA methylation. Faculty include Shirley Liu, Mike Love, Oliver Hofmann and the HSPH Bioinformatics Core. Although taking the previous courses in the series will help, the four case-study courses were developed as stand-alone units and you can obtain a certificate for each one without taking any other course.

Each course is presented over two weeks but will remain open until June 13 to give students an opportunity to take them all if they wish. For more information follow the links listed below.

  1. RNA-seq data analysis will be led by Mike Love
  2. Variant Discovery and Genotyping will be taught by Shannan Ho Sui, Oliver Hofmann, Radhika Khetani and Meeta Mistry (from the HSPH Bioinformatics Core)
  3. ChIP-seq data analysis will be led by Shirley Liu
  4. DNA methylation data analysis will be led by Rafael Irizarry

24 Mar

Introduction to Bioconductor HarvardX MOOC starts this Monday March 30

Bioconductor is one of the most widely used open source toolkits for biological high-throughput data. In this four-week course, co-taught with Vince Carey and Mike Love, we will introduce you to Bioconductor's general infrastructure and then focus on two specific technologies: next generation sequencing and microarrays. The lectures and assessments will be annotated in case you want to focus on only one of these two technologies, although if you plan to be a bioinformatician we recommend you learn both.

Topics covered include:

  • A short introduction to molecular biology and measurement technology
  • An overview on how to leverage the platform and genome annotation packages and experimental archives
  • GenomicRanges: the infrastructure for storing, manipulating and analyzing next generation sequencing data (see the short sketch after this list)
  • Parallel computing and cloud concepts
  • Normalization, preprocessing and bias correction
  • Statistical inference in practice: including hierarchical models and gene set enrichment analysis
  • Building statistical analysis pipelines of genome-scale assays including the creation of reproducible reports
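For a taste of the GenomicRanges material, here is a minimal sketch (hypothetical coordinates) of creating and querying a GRanges object:

library(GenomicRanges)
## three hypothetical reads on chromosome 1
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(100, 200, 300), width = 50),
              strand = c("+", "-", "+"))
width(gr) # width of each range
subsetByOverlaps(gr, GRanges("chr1", IRanges(180, 260))) # ranges overlapping a query window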

Throughout the class we will be using data examples from both next generation sequencing and microarray experiments.

We will assume basic knowledge of Statistics and R.

For more information visit the course website.

14 Mar

π day special: How to use Bioconductor to find empirical evidence in support of π being a normal number

Editor's note: Today 3/14/15 at some point between 9:26:53 and 9:26:54 it was the most π day of them all. Below is a repost from last year.

Happy π day everybody!

I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits is such that any given m-digit pattern appears with limiting frequency 10^-m. For example, using the Poisson approximation, we can predict that the pattern "123456789" should show up between 0 and 3 times in the first billion digits of π (the expected count is 10^9 × 10^-9 = 1, and a Poisson(1) variable falls in [0, 3] about 98% of the time; it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).

To test our hypothesis, let Y1, ..., Y100 be the number of occurrences of "00", "01", ..., "99" in the first billion digits of π. If π is in fact normal, then the Ys should be approximately IID binomial with N = 1 billion and p = 0.01. In the qq-plot below I show the z-scores (Y - 10,000,000) / √9,900,000, which appear to follow a normal distribution, as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3-, 4-, 5-, 6-, and 7-digit patterns (for 5, 6 and 7 I sampled 10,000 patterns). Note that we can also perform a chi-square test against the uniform distribution. For patterns of size 1, 2, 3 and 4 the p-values were 0.84, 0.89, 0.92, and 0.99.

[Figure: qq-plots of observed z-scores for 2- to 7-digit pattern counts]

Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.

[Figure: qq-plots of z-scores for per-segment pattern counts]

These observed counts should also be independent, and to explore this we can look at autocorrelation plots:

[Figure: autocorrelation plots of the per-segment counts]

To do this in about an hour and with just a few lines of code (included below), I used the Bioconductor Biostrings package to match strings and the foreach function to parallelize.


library(Biostrings)
library(doParallel)
registerDoParallel(cores = 48)

## read in the first billion digits of pi
x <- scan("pi-billion.txt", what = "c")
x <- substr(x, 3, nchar(x)) ## remove the leading "3."
x <- BString(x)
n <- length(x)

## qq-plots for 2- to 7-digit patterns
par(mfrow = c(2, 3))
for (d in 2:7) {
  if (d < 5) {
    patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1)) ## all patterns
  } else {
    patterns <- sprintf(paste0("%0", d, "d"), sample(10^d, 10^4) - 1) ## a sample of patterns
  }
  res <- foreach(pat = patterns, .combine = c) %dopar% countPattern(pat, x)
  p <- 1 / (10^d)
  z <- (res - n * p) / sqrt(n * p * (1 - p))
  qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores",
         main = paste(d, "digits"))
  abline(0, 1)
  ## chi-square test for uniformity
  ## correction: original post had length(res)
  if (d < 5) print(1 - pchisq(sum((res - n * p)^2 / (n * p)), length(res) - 1))
}

### now count occurrences in non-overlapping segments
d <- 1
m <- 10^4 ## segment length: 100,000 segments of length 10,000
patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
res <- foreach(pat = patterns, .combine = cbind) %dopar% {
  tmp <- start(matchPattern(pat, x))
  tmp2 <- floor((tmp - 1) / m) ## segment index of each match
  tabulate(tmp2 + 1, nbins = n / m) ## counts per segment
}

## qq-plots of per-segment counts
par(mfrow = c(2, 5))
p <- 1 / (10^d)
for (i in 1:ncol(res)) {
  z <- (res[, i] - m * p) / sqrt(m * p * (1 - p))
  qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores",
         main = paste(i - 1))
  abline(0, 1)
}

## ACF plots
par(mfrow = c(2, 5))
for (i in 1:ncol(res)) acf(res[, i])

NB: A normal number has the above-stated property in any base. The examples above are for base 10.

02 Mar

Advanced Statistics for the Life Sciences MOOC Launches Today

In this four-week course we will teach statistical techniques that are commonly used in the analysis of high-throughput data, along with their R implementations.

  • Week 1: We will explain inference in the context of high-throughput data and introduce the concept of error controlling procedures. We will describe the strengths and weaknesses of the Bonferroni correction, FDR and q-values, and show how to implement these in cases in which thousands of tests are conducted, as is typically done with genomics data.
  • Week 2: We will introduce the concept of mathematical distance and how it is used in exploratory data analysis, clustering, and machine learning. We will describe how techniques such as principal component analysis (PCA) and the singular value decomposition (SVD) can be used for dimension reduction in high dimensional data.
  • Week 3: We will describe confounding, latent variables and factor analysis in the context of high dimensional data and how this relates to batch effects. We will show how to implement methods such as SVA to perform inference on data affected by batch effects.
  • Week 4: We will show how statistical modeling, and empirical Bayes modeling in particular, are powerful techniques that greatly improve precision in high-throughput data.

We will use R code to explain concepts throughout the course, and exploratory data analysis and data visualization to motivate the techniques we teach each week.
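As a small preview of the Week 1 material, here is a sketch (simulated p-values; the 0.05 cutoff is only for illustration) comparing the Bonferroni correction with the Benjamini-Hochberg FDR adjustment using base R's p.adjust:

set.seed(1)
p <- c(runif(9000), rbeta(1000, 0.5, 10)) # 9,000 nulls and 1,000 signals
bonf <- p.adjust(p, method = "bonferroni")
bh <- p.adjust(p, method = "BH")
sum(bonf < 0.05) # conservative: controls the family-wise error rate
sum(bh < 0.05) # more discoveries: controls the false discovery rate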

13 Feb

Introduction to Linear Models and Matrix Algebra MOOC starts this Monday Feb 16

Matrix algebra is the language of modern data analysis. We use it to develop and describe statistical and machine learning methods, and to code efficiently in languages such as R, MATLAB and Python. Concepts such as principal component analysis (PCA) are best described with matrix algebra, and it is particularly useful for describing linear models.

Linear models are everywhere in data analysis. ANOVA, linear regression, limma, edgeR, DESeq, most smoothing techniques, and batch correction methods such as SVA and ComBat are all based on linear models. In this two-week MOOC we will describe the basics of matrix algebra, demonstrate how linear models are used in the life sciences, and show how to implement them efficiently in R.
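As a preview, here is a minimal sketch (simulated data) of the connection the course develops: the least-squares solution computed directly with matrix algebra matches the output of lm():

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
X <- cbind(1, x) # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # least-squares estimate
cbind(beta_hat, coef(lm(y ~ x))) # the two columns agree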

Update: Here is the link to the class

21 Jan

Data as an antidote to aggressive overconfidence

A recent NY Times op-ed reminded us of the many biases faced by women at work. A follow-up op-ed gave specific recommendations for how to conduct ourselves in meetings. In general, I found these very insightful, but I don't necessarily agree with the recommendation that women should "Practice Assertive Body Language". Instead, we should make an effort to judge ideas by their content and not be impressed by body language. More generally, it is a problem that many of the characteristics that help advance careers contribute nothing to intellectual output. One of these is what I call aggressive overconfidence.

Here is an example (based on a true story). A data scientist finds a major flaw in the data analysis performed by a prominent data-producing scientist's lab. Both are part of a large collaborative project. A meeting is held among the project leaders to discuss the disagreement. The data producer is very self-confident in defending his approach. The data scientist, who is not nearly as aggressive, is interrupted so much that she barely gets her point across. The project leaders decide that this seems to be simply a difference of opinion and, for all practical purposes, ignore the data scientist. I imagine this story sounds familiar to many. While in many situations the story ends here, when the results are data driven we can actually fact-check opinions that are pronounced as fact. In this example, the data is public and anybody with the right expertise can download it and corroborate the flaw in the analysis. This is typically quite tedious, but it can be done. Because the key flaws are rather complex, the project leaders, lacking expertise in data analysis, can't make this determination. But eventually, a chorus of fellow data analysts will be too loud to ignore.

That aggressive overconfidence is generally rewarded in academia is a problem. And if this trait is highly correlated with being male, then a manifestation of this is a worsened gender gap. My experience (including reading internet discussions among scientists on controversial topics) has convinced me that this trait is in fact correlated with gender. But the solution is not to help women become more aggressively overconfident. Instead we should continue to strive to judge work based on content rather than style. I am optimistic that more and more, data, rather than who sounds more sure of themselves, will help us decide who wins a debate.

20 Jan

Gorging ourselves on "free" health care: Harvard's dilemma

Editor's note: This is a guest post by Laura Hatfield. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in Biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets @bioannie

I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be writing about my benefits package. Then a vocal and aggrieved group of faculty rebelled against health benefits changes for 2015, and commentators responded by gleefully skewering entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.

Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y axis is on the log scale. More than half of people spend $1000 or less, but a few very unfortunate folks top out near half a million.

[Figure: distribution of annual health care spending among people with employer-sponsored insurance]

Source: Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.

If, instead of contributing to my premiums, Harvard gave me the $1000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).

At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 in premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective health care expenses were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m ok with it. I still prefer risk pooling.

Here’s the problem: moral hazard. It’s a word I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid through premiums or desserts at an all-you-can-eat buffet. Just look at this array—how much cake do *you* want to eat for $9.99?

[Figure: dessert buffet]

Source: https://www.flickr.com/photos/jimmonk/5687939526/in/photostream/

One way to mitigate moral hazard is to expose people to more of their cost of care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible, $250 per person or $750 max for a family. This is meant to encourage faculty to use their health care more efficiently, but it still affords good protection against catastrophic costs. The out-of-pocket max remains low at $1500 per individual or $4500 per family, with recent announcements to further protect individuals who pay more than 3% of salary in out-of-pocket health costs through a reimbursement program.

The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.

What about the argument that it is better for an employer to “pay” workers by health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation in the form of tax-free health benefits vs taxed wages, the university, like all employers, is looking ahead to the Cadillac tax provision of the ACA. So they have to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably expect to make up that difference in higher wages. The empirical evidence is complicated and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.

As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types (into a high-deductible plan, but that’s a topic for another post!), so I can’t directly compare Harvard’s contributions to the same plan type this year and last. Peter Ubel argues that if the faculty *had* seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges), and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.