Simply Statistics


Residual expertise - or why scientists are amateurs at most of science

Editor's note: I have been unsuccessfully attempting to finish a book I started 3 years ago about how and why everyone should get pumped about reading and understanding scientific papers. I've adapted part of one of the chapters into this blogpost. It is pretty raw but hopefully gets the idea across. 

An episode of The Daily Show with Jon Stewart featured physicist Lisa Randall, an incredible physicist and noted scientific communicator, as the invited guest.

Near the end of the interview, Stewart asked Randall why, with all the scientific progress we have made, that we have been unable to move away from fossil fuel-based engines. The question led to the exchange:

Randall: “So this is part of the problem, because I’m a scientist doesn’t mean I know the answer to that question.”

Stewart: ”Oh is that true? Here’s the thing, here’s what’s part of the answer. You could say anything and I would have no idea what you are talking about.”

Professor Randall is a world leading physicist, the first woman to achieve tenure in physics at Princeton, Harvard, and MIT, and a member of the National Academy of Sciences.2 But when it comes to the science of fossil fuels, she is just an amateur. Her response to this question is just perfect - it shows that even brilliant scientists can just be interested amateurs on topics outside of their expertise. Despite Professor Randall’s over-the-top qualifications, she is an amateur on a whole range of scientific topics from medicine, to computer science, to nuclear engineering. Being an amateur isn’t a bad thing, and recognizing where you are an amateur may be the truest indicator of genius. That doesn’t mean Professor Randall can’t know a little bit about fossil fuels or be curious about why we don’t all have nuclear-powered hovercrafts yet. It just means she isn’t the authority.

Stewart’s response is particularly telling and indicative of what a lot of people think about scientists. It takes years of experience to become an expert in a scientific field - some have suggested as many as 10,000 hours of dedicated time. Professor Randall is a scientist - so she must have more information about any scientific problem than an informed amateur like Jon Stewart. But of course this isn’t true, Jon Stewart (and you) could quickly learn as much about fossil fuels as a scientist if the scientist wasn't already an expert in the area. Sure a background in physics would help, but there are a lot of moving parts in our dependence on fossil fuels, including social, political, economic problems in addition to the physics involved.

This is an example of "residual expertise" - when people without deep scientific training are willing to attribute expertise to scientists even if it is outside their primary area of focus. It is closely related to the logical fallacy behind the argument from authority:

A is an authority on a particular topic

A says something about that topic

A is probably correct

the difference is that with residual expertise you assume that since A is an authority on a particular topic, if they say something about another, potentially related topic, they will probably be correct. This idea is critically important, it is how quacks make their living. The logical leap of faith from "that person is a doctor" to "that person is a doctor so of course they understand epidemiology, or vaccination, or risk communication" is exactly the leap empowered by the idea of residual expertise. It is also how you can line up scientific experts against any well established doctrine like evolution or climate change. Experts in the field will know all of the relevant information that supports key ideas in the field and what it would take to overturn those ideas. But experts outside of the field can be lined up and their residual expertise used to call into question even the most supported ideas.

What does this have to do with you?

Most people aren't necessarily experts in scientific disciplines they care about. But becoming a successful amateur requires a much smaller time commitment than becoming an expert, but can still be incredibly satisfying, fun, and useful. This book is designed to help you become a fired-up amateur in the science of your choice. Think of it like a hobby, but one where you get to learn about some of the coolest new technologies and ideas coming out in the scientific literature. If you can ignore the way residual expertise makes you feel silly for reading scientific papers you don't fully understand - you can still learn a ton and have a pretty fun time doing it.




The tyranny of the idea in science

There are a lot of analogies between startups and academic science labs. One thing that is definitely very different is the relative value of ideas in the startup world and in the academic world. For example, Paul Graham has said:

Actually, startup ideas are not million dollar ideas, and here's an experiment you can try to prove it: just try to sell one. Nothing evolves faster than markets. The fact that there's no market for startup ideas suggests there's no demand. Which means, in the narrow sense of the word, that startup ideas are worthless.

In academics, almost the opposite is true. There is huge value to being first with an idea, even if you haven't gotten all the details worked out or stable software in place. Here are a couple of extreme examples illustrated with Nobel prizes:

  1. Higgs Boson - Peter Higgs postulated the Boson in 1964, he won the Nobel Prize in 2013 for that prediction, in between tons of people did follow on work, someone convinced Europe to build one of the most expensive pieces of scientific equipment ever built and conservatively thousands of scientists and engineers had to do a ton of work to get the equipment to (a) work and (b) confirm the prediction.
  2. Human genome - Watson and Crick postulated the structure of DNA in 1953, they won the Nobel Prize in  medicine in 1962 for this work. But the real value of the human genome was realized when the largest biological collaboration in history sequenced the human genome, along with all of the subsequent work in the genomics revolution.

These are two large scale examples where the academic scientific community (as represented by the Nobel committee, mostly because it is a concrete example) rewards the original idea and not the hard work to achieve that idea. I call this, "the tyranny of the idea." I notice a similar issue on a much smaller scale, for example when people don't recognize software as a primary product of science. I feel like these decisions devalue the real work it takes to make any scientific idea a reality. Sure the ideas are good, but it isn't clear that some ideas wouldn't be discovered by someone else - but surely we aren't going to build another large hadron collider. I'd like to see the scales correct back the other way a little bit so we put at least as much emphasis on the science it takes to follow through on an idea as on discovering it in the first place.


Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously

Joe Pickrell has an interesting new paper out about Mendelian randomization. He discusses some of the interesting issues that come up with these studies and performs a mini-review of previously published studies using the technique.

The basic idea behind Mendelian Randomization is the following. In a simple, randomly mating population Mendel's laws tell us that at any genomic locus (a measured spot in the genome) the allele (genetic material you got) you get is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast). A very famous example of this was an experiment performed by Leonid Kruglyak's group where they took two strains of yeast and repeatedly mated them, then measured genetics and gene expression data. The experimental design looked like this:



If you look at the allele inherited from the two parental strains (BY, RM)  at two separate genes on different chromsomes in each of the 112 segregants (yeast offspring)  they do appear to be random and independent:

Screen Shot 2015-05-07 at 11.20.46 AM



So this is a randomized trial in yeast where the yeast were each randomized to many many genetic "treatments" simultaneously. Now this isn't strictly true, since genes on the same chromosomes near each other aren't exactly random and in humans it is definitely not true since there is population structure, non-random mating and a host of other issues. But you can still do cool things to try to infer causality from the genetic "treatments" to downstream things like gene expression ( and even do a reasonable job in the model organism case).

In my mind this raises a potentially interesting study design for clinical trials. Suppose that there are 10 treatments for a disease that we know about. We design a study where each of the patients in the trial was randomized to receive treatment or placebo for each of the 10 treatments. So on average each person would get 5 treatments.  Then you could try to tease apart the effects using methods developed for the Mendelian randomization case. Of course, this is ignoring potential interactions, side effects of taking multiple drugs simultaneously, etc. But I'm seeing lots of interesting proposals for new trial designs (which may or may not work), so I thought I'd contribute my own interesting idea.


Rafa's citations above replacement in statistics journals is crazy high.

Editor's note:  I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously. 

I downloaded data on the 400 most cited papers between 2000-2010 in some statistical journals from Web of Science. Here is a boxplot of the average number of citations per year (from publication date - 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.




There are several interesting things about this graph right away. One is that JASA has the highest median number of citations, but has fewer "big hits" (papers with 100+ citations/year) than Annals of Statistics, JMLR, or JRSS-B. Another thing is how much of a lottery developing statistical methods seems to be. Most papers, even among the 400 most cited, have around 3 citations/year on average. But a few lucky winners have 100+ citations per year. One interesting thing for me is the papers that get 10 or more citations per year but aren't huge hits. I suspect these are the papers that solve one problem well but don't solve the most general problem ever.

Something that jumps out from that plot is the outlier for the journal Biostatistics. One of their papers is cited 367.85 times per year. The next nearest competitor is 67.75 and it is 19 standard deviations above the mean! The paper in question is: "Exploration, normalization, and summaries of high density oligonucleotide array probe level data", which is the paper that introduced RMA, one of the most popular methods for pre-processing microarrays ever created. It was written by Rafa and colleagues. It made me think of the statistic "wins above replacement" which quantifies how many extra wins a baseball team gets by playing a specific player in place of a league average replacement.

What about a "citations /year above replacement" statistic where you calculate for each journal:

Median number of citations to a paper/year with Author X - Median number of citations/year to an average paper in that journal

Then average this number across journals. This attempts to quantify how many extra citations/year a person's papers generate compared to the "average" paper in that journal. For Rafa the numbers look like this:

  • Biostatistics: Rafa = 15.475, Journal = 1.855, CAR/Y =  13.62
  • JASA: Rafa = 74.5, Journal = 5.2, CAR/Y = 69.3
  • Biometrics: Rafa = 4.33, Journal = 3.38, CAR/Y = 0.95

So Rafa's citations above replacement is (13.62 + 69.3 + 0.95)/3 =  27.96! There are a couple of reasons why this isn't a completely accurate picture. One is the low sample size, the second is the fact that I only took the 400 most cited papers in each journal. Rafa has a few papers that didn't make the top 400 for journals like JASA - which would bring down his CAR/Y.



Data analysis subcultures

Roger and I responded to the controversy around the journal that banned p-values today in Nature. A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:

Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time 'panel data', to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as 'longitudinal data', and often go at it with generalized estimating equations.

I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see methods like randomized trials re-discovered across multiple disciplines.

But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they'd give you back very different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:

  • Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper).
  • In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.
  • This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.
  • Psychology has a history of using parametric statistics, genomics is big into empirical Bayes, and you see a lot of Bayesian statistics in climate studies.
  • You see homoskedasticity tests used a lot in econometrics, but that is hardly ever done through formal hypothesis testing in biostatistics.
  • Training sets and test sets are used in machine learning for prediction, but rarely used for inference.

This is just a partial list I thought of off the top of my head, there are a ton more. These decisions matter a lot in a data analysis.  The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we'd like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is because, "I saw it in a widely-cited paper in journal XX from my field" it is likely that little progress will be made on resolving the statistical problems in science.


A blessing of dimensionality often observed in high-dimensional data sets

Tidy data sets have one observation per row and one variable per column.  Using this definition, big data sets can be either:

  1. Wide - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.
  2. Tall - a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis.

The curse of dimensionality tells us that estimating some quantities gets harder as the number of dimensions of a data set increases - as the data gets taller or wider. An example of this was nicely illustrated by my student Prasad (although it looks like his quota may be up on Rstudio).

For wide data sets there is also a blessing of dimensionality. The basic reason for the blessing of dimensionality is that:

No matter how many new measurements you take on a small set of observations, the number of observations and all of their characteristics are fixed.

As an example, suppose that we make measurements on 10 people. We start out by making one measurement (blood pressure), then another (height), then another (hair color) and we keep going and going until we have one million measurements on those same 10 people. The blessing occurs because the measurements on those 10 people will all be related to each other. If 5 of the people are women and 5 or men, then any measurement that has a relationship with sex will be highly correlated with any other measurement that has a relationship with sex. So by knowing one small bit of information, you can learn a lot about many of the different measurements.

This blessing of dimensionality is the key idea behind many of the statistical approaches to wide data sets whether it is stated explicitly or not. I thought I'd make a very short list of some of these ideas:

1. Idea: De-convolving mixed observations from high-dimensional data. 

How the blessing plays a role: The measurements for each observation are assumed to be a mixture of values measured from different observation types. The proportion of each observation type is assumed to be fixed across measurements, so you can take advantage of the multiple measurements to estimate the mixing percentage and perform the deconvolution. (Wenyi Wang came and gave an excellent seminar on this idea at JHU a couple of days ago, which inspired this post).

2. Idea: The two groups model for false discovery rates.

How the blessing plays a role:  The models assume that a hypothesis test is performed for each observation and that the probability any observation is drawn from the null, the null distribution, and the alternative distributions are common across observations. If the null is assumed known, then it is possible to use the known null distribution to estimate the common probability that an observation is drawn from the null.


3. Idea: Empirical Bayes variance shrinkage for linear models

How the blessing plays a role:  A linear model is fit for each observation and the means and variances of the log ratios calculated from the model are assumed to follow a common distribution across observations. The method estimates the hyper-parameters of these common distributions and uses them to adjust any individual measurement's estimates.


4. Idea: Surrogate variable analysis

How the blessing plays a role:  Each observation is assumed to be influenced by a single variable of interest (a primary variable) and multiple unmeasured confounders. Since the observations are fixed, the values of the unmeasured confounders are the same for each measurement and a supervised PCA can be used to estimate surrogates for the confounders. (see my JHU job talk for more on the blessing)


The blessing of dimensionality I'm describing here is related to the idea that Andrew Gelman refers to in this 2004 post.  Basically, since increasingly large number of measurements are made on the same observations there is an inherent structure to those observations. If you take advantage of that structure, then as the dimensionality of your problem increases you actually get better estimates of the structure in your high-dimensional data - a nice blessing!


Teaser trailer for the Genomic Data Science Specialization on Coursera


We have been hard at work in the studio putting together our next specialization to launch on Coursera. It will be called the "Genomic Data Science Specialization" and includes a spectacular line up of instructors: Steven Salzberg, Ela Pertea, James Taylor, Liliana Florea, Kasper Hansen, and me. The specialization will cover command line tools, statistics, Galaxy, Bioconductor, and Python. There will be a capstone course at the end of the sequence featuring an in-depth genomic analysis. If you are a grad student, postdoc, or principal investigator in a group that does genomics this specialization is for you. If you are a person looking to transition into one of the hottest areas of research with the new precision medicine initiative this is for you. Get pumped and share the teaser-trailer with your friends!


A surprisingly tricky issue when using genomic signatures for personalized medicine

My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.

An example of this normalization is to mean-center the measurements for each gene in the testing/application stage, then apply the prediction rule. The problem is that if you use a different set of samples when calculating the mean you can get a totally different prediction function. The basic problem is illustrated in this graphic.


Screen Shot 2015-03-19 at 12.58.03 PM


This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set. In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.

Screen Shot 2015-03-19 at 1.02.25 PM


This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.


A simple (and fair) way all statistics journals could drive up their impact factor.


If every method in every stats journal was implemented in a corresponding R package (easy), was required to have a  companion document that was a tutorial on how to use the software (easy), included a reference to how to cite the paper if you used the software (easy) and the paper/tutorial was posted to the relevant message boards for the communities of interest (easy) that journal would see a dramatic bump in its impact factor.


Data science done well looks easy - and that is a big problem for data scientists

Data science has a ton of different definitions. For the purposes of this post I'm going to use the definition of data science we used when creating our Data Science program online. Data science is:

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.

In general the data science process is iterative and the different components blend together a little bit. But for simplicity lets discretize the tasks into the following 7 steps:

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results
  7. Make your analysis reproducible

A good data science project answers a real scientific or business analytics question. In almost all of these experiments the vast majority of the analyst's time is spent on getting and cleaning the data (steps 2-3) and communication and reproducibility (6-7). In most cases, if the data scientist has done her job right the statistical models don't need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer. One option is to spend a huge amount of time trying to tune a statistical model to try to answer the question but serious data scientist's usually instead try to go back and get the right data.

The result of this process is that most well executed and successful data science projects don't (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I've evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.

It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren't quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn't know about the data collection and cleaning process or doesn't care.

This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think "oh that is easy" or "that is just a trivial statistical/machine learning model" and don't see all of the work that goes into solving the real problems in data science. A concrete example of this is in academic statistics. It is customary for people to show theorems in their talks and maybe even some of the proof. This gives people working on theoretical projects an opportunity to "show their stuff" and demonstrate how good they are. The equivalent for a data scientist would be showing how they found and cleaned multiple data sets, merged them together, checked for biases, and arrived at a simplified data set. Showing the "proof" would be equivalent to showing how they matched IDs. These things often don't look nearly as impressive in talks, particularly if the audience doesn't have experience with how incredibly delicate real data analysis is. I imagine versions of this problem play out in industry as well (candidate X did a good analysis but it wasn't anything special, candidate Y used Hadoop to do BIG DATA!).

The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really "hard" and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc.  In the meantime, don't be fooled by what looks like simple data science - it can often be pretty effective.


Editor's note: If you like this post, you might like my pay-what-you-want book Elements of Data Analytic Style: