28
Oct

## Why I support statisticians and their resistance to hype

Despite Statistics being the most mature data related discipline, statisticians have not fared well in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. Just to give one example (Jeff and Terry Speed give many more) the White House Big Data Partners Workshop  had 19 members of which 0 were statisticians. The statistical community is clearly worried about this predicament and there is widespread consensus that we need to be better at marketing. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.

This week, after reading Mike Jordan's reddit ask me anything, I was reminded of exactly how much I admire this quality in statisticians. From reading the interview one learns about instances where hype has led to confusion, how getting past this confusion helps us better understand and consequently appreciate the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet, you won't find a hyped-up press release coming out of his lab.  In fact when a journalist tried to hype up Jordan's critique of hype, Jordan called out the author.

Assessing the current situation with data initiatives it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline.  First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of FDR, which helps control false discoveries in, for example,  high-throughput experiments. Second, many of us continue to work in the interface with real world applications placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today's hyped up press releases are long forgotten.

26
Oct

## Return of the sunday links! (10/26/14)

New look for the blog and bringing back the links. If you have something that you'd like included in the Sunday links, email me and let me know. If you use the title of the message "Sunday Links" you'll be more likely for me to find it when I search my gmail.

1. Thomas L. does a more technical post on semi-parametric efficiency, normally I'm a data n' applications guy, but I love these in depth posts, especially when the papers remind me of all the things I studied at my alma mater.
2. I am one of those people who only knows a tiny bit about Docker, but hears about it all the time. That being said, after I read about Rocker, I got pretty excited.
3. Hadley W.'s favorite tools, seems like that dude likes R Studio for some reason....(me too)
4. A cool visualization of chess piece survival rates.
5. A short movie by 538 about statistics and the battle between Deep Blue and Gary Kasparov. Where's the popcorn?
6. Twitter engineering released an R package for detecting outbreaks. I wonder how circular binary segmentation would do?

24
Oct

## An interactive visualization to teach about the curse of dimensionality

I recently was contacted for an interview about the curse of dimensionality. During the course of the conversation, I realized how hard it is to explain the curse to a general audience. One of the best descriptions I could come up with was trying to describe sampling from a unit line, square, cube, etc. and taking samples with side length fixed. You would capture fewer and fewer points. As I was saying this, I realized it is a pretty bad way to explain the curse of dimensionality in words. But there was potentially a cool data visualization that would illustrate the idea. I went to my student Prasad, our resident interactive viz design expert to see if he could build it for me. He came up with this cool Shiny app where you can simulate a number of points (n) and then fix a side length for 1-D, 2-D, 3-D, and 4-D and see how many points you capture in a cube of that length in that dimension. You can find the full app here or check it out on the blog here:

22
Oct

## Vote on simply statistics new logo design

As you can tell, we have given the Simply Stats blog a little style update. It should be more readable on phones or tablets now. We are also about to get a new logo. We are down to the last couple of choices and can't decide. Since we are statisticians, we thought we'd collect some data. Here is the link to the poll. Let us know

16
Oct

## Creating the field of evidence based data analysis - do people know what a p-value looks like?

In the medical sciences, there is a discipline called "evidence based medicine". The basic idea is to study the actual practice of medicine using experimental techniques. The reason is that while we may have good experimental evidence about specific medicines or practices, the global behavior and execution of medical practice may also matter. There have been some success stories from this approach and also backlash from physicians who don't like to be told how to practice medicine. However, on the whole it is a valuable and interesting scientific exercise.

Roger introduced the idea of evidence based data analysis in a previous post. The basic idea is to study the actual practice and behavior of data analysts to identify how analysts behave. There is a strong history of this type of research within the data visualization community starting with Bill Cleveland and extending forward to work by Diane Cook, , Jeffrey Heer, and others.

Today we published a large-scale evidence based data analysis randomized trial. Two of the most common data analysis tasks (for better or worse) are exploratory analysis and the identification of statistically significant results. Di Cook's group calls this idea "graphical inference" or "visual significance" and they have studied human's ability to detect significance in the context of linear regression and how it associates with demographics and visual characteristics of the plot.

We performed a randomized study to determine if data analysts with basic training could identify statistically significant relationships. Or as the first author put it in a tweet:

What we found was that people were pretty bad at detecting statistically significant results, but that over multiple trials they could improve. This is a tentative first step toward understanding how the general practice of data analysis works. If you want to play around and see how good you are at seeing p-values we also built this interactive Shiny app. If you don't see the app you can also go to the Shiny app page here.

15
Oct

## Dear Laboratory Scientists: Welcome to My World

Consider the following question: Is there a reproducibility/replication crisis in epidemiology?

I think there are only two possible ways to answer that question:

1. No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.
2. Yes, there is a replication crisis in epidemiology, and it started in 1854 when John Snow inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.

If you chose (2), then I don't think you are allowed to call it a "crisis" because I think by definition, a crisis cannot last 160 years. In that case, it's more of a chronic disease.

I had an interesting conversation last week with a prominent environmental epidemiologist over the replication crisis that has been reported about extensively in the scientific and popular press. In his view, he felt this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results because of their general inability to conduct controlled experiments.

Given the observational nature of most environmental epidemiological studies, it's generally accepted in the community that no single study can be considered causal, and that many replications of a finding are need to establish a causal connection. Even the popular press knows now to include the phrase "correlation does not equal causation" when reporting on an observational study. The work of Sir Austin Bradford Hill essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.

So if "correlation does not equal causation", it begs the question, what does equal causation? Many would argue that a controlled experiment, whether it's a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are not controlled.

For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a "broken" randomized trial). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.

The correct statement is not "correlation does not equal causation" but rather "no single study equals causation", regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against an hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.

13
Oct

## I declare the Bayesian vs. Frequentist debate over for data scientists

In a recent New York Times article the "Frequentists versus Bayesians" debate was brought up once again. I agree with Roger:

Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this:  "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, Jeff and I agree on is that this debate is not constructive. As Rob Kass suggests it's time to move on to pragmatism. Here I follow up Jeff's recent post by sharing related thoughts brought about by two decades of practicing applied statistics and hope it helps put this unhelpful debate to rest.

Applied statisticians help answer questions with data. How should I design a roulette so my casino makes \$? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don't even use probability. The Casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players that have simply been lucky.

It is also important to remember that good applied statisticians also *think*. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment as the one depicted in this popular XKCD cartoon.

Only someone that does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately we do have such people analyzing data. But their choice of technique is not the problem, it's their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes rule and will adapt the Bayesian approach when appropriate. In the above XCKD example, any respectful applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do this we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can't afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematically properties.

Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In this popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of Biologists search for differential expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn't really matter.

For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without
the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion were needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper been enough? Would using priors helped given the "expert knowledge" at the time (see below)?

And how would the Bayesian analysis performed by tabacco companies shape the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as opposed to frequentist confidence intervals. Going forward I hope applied statisticians continue to be free to use whatever techniques they see fit and that critical thinking about data continues to be what distinguishes us. Imposing Bayesian or frequentists philosophy on us would be a disaster.

09
Oct

## Data science can't be point and click

As data becomes cheaper and cheaper there are more people that want to be able to analyze and interpret that data.  I see more and more that people are creating tools to accommodate folks who aren't trained but who still want to look at data right now. While I admire the principle of this approach - we need to democratize access to data - I think it is the most dangerous way to solve the problem.

The reason is that, especially with big data, it is very easy to find things like this with point and click tools:

US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation (http://www.tylervigen.com/view_correlation?id=1597)

The danger with using point and click tools is that it is very hard to automate the identification of warning signs that seasoned analysts get when they have their hands in the data. These may be spurious correlation like the plot above or issues with data quality, or missing confounders, or implausible results. These things are much easier to spot when analysis is being done interactively. Point and click software is also getting better about reproducibility, but it still a major problem for many interfaces.

Despite these issues, point and click software are still all the rage. I understand the sentiment, there is a bunch of data just laying there and there aren't enough people to analyze it expertly. But you wouldn't want me to operate on you using point and click surgery software. You'd want a surgeon who has practiced on real people and knows what to do when she has an artery in her hand. In the same way, I think point and click software allows untrained people to do awful things to big data.

The ways to solve this problem are:

1. More data analysis training
2. Encouraging people to do their analysis interactively

I have a few more tips which I have summarized in this talk on things statistics taught us about big data.

08
Oct

## The Leek group guide to genomics papers

Leek group guide to genomics papers

When I was a student, my advisor, John Storey, made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.

• It got me caught up on the field of computational genomics
• It was expertly curated, so it filtered a lot of papers I didn't need to read
• It gave me my first set of ideas to try to pursue as I was reading the papers

I have often thought I should make a similar list for folks who may want to work wtih me (or who want to learn about statistial genomics). So this is my first attempt at that list. I've tried to separate the papers into categories and I've probably missed important papers. I'm happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.

06
Oct

## An economic model for peer review

I saw this tweet the other day:

It reminded me that a few years ago I had a paper that went through the peer review wringer. It drove me completely bananas. One thing that drove me so crazy about the process was how long the referees waited before reviewing and how terrible the reviews were after that long wait. So I started thinking about the "economics of peer review". Basically, what is the incentive for scientists to contribute to the system.

To get a handle on this idea, I designed a "peer review game" where there are a fixed number of players N. The players play the game for a fixed period of time. During that time, they can submit papers or they can review papers. For each person, their final score at the end of the time is $S_i = \sum {\rm Submitted \; Papers \; Accepted}$.

Based on this model, under closed peer review, there is one Nash equilibrium under the strategy that no one reviews any papers. Basically, no one can hope to improve their score by reviewing, they can only hope to improve their score by submitting more papers (sound familiar?). Under open peer review, there are more potential equilibria, based on the relative amount of goodwill you earn from your fellow reviewers by submitting good reviews.

We then built a model system for testing out our theory. The system involved having groups of students play a "peer review game" where they submitted solutions to SAT problems like:

Each solution was then randomly assigned to another player to review. Those players could (a) review it and reject it, (b) review it and accept it, or (c) not review it. The person with the most points at the end of the time (one hour) won.

We found some cool things:

1. In closed review, reviewing gave no benefit.
2. In open review, reviewing gave a small positive benefit.
3. Both systems gave comparable accuracy
4. All peer review increased the overall accuracy of responses

The paper is here and all of the data and code are here.