Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Thinking like a statistician: this is not the election for progressives to vote third party

Democratic elections permit us to vote for whoever we expect to do best on the issues we care about. Let’s simplify and assume we can quantify how satisfied we are with an elected official’s performance. Denote this quantity with X. Because when we cast our vote we still don’t know for sure how the candidate will perform, we base our decision on what we expect, denoted here with E(X). Thus we try to maximize E(X). However, both political theory and data tell us that in US presidential elections only two parties have a non-negligible probability of winning. This implies that E(X) is practically 0 for the remaining candidates, no matter how large X could potentially be. So what we are really doing is deciding whether E(X-Y) is positive or negative, with X representing one viable candidate and Y the other.
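
A minimal numeric sketch may make the arithmetic concrete. The win probabilities and satisfaction scores below are made up purely for illustration; they are not the perceived values plotted in the figures below.

```r
# Made-up win probabilities and satisfaction scores, purely for illustration.
p_win        <- c(dem = 0.50, rep = 0.49, green = 0.001)  # perceived probability of winning
satisfaction <- c(dem = 6,    rep = -4,   green = 9)       # satisfaction if that candidate governs

# E(X) = P(candidate wins) * expected satisfaction given that they govern
expected <- p_win * satisfaction
round(expected, 3)
#>   dem    rep  green
#> 3.000 -1.960  0.009

# The vote reduces to the sign of E(X - Y) between the two viable candidates
unname(expected["dem"] - expected["rep"])
#> [1] 4.96
```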

In past elections some progressives have argued that the difference between candidates is negligible and have therefore supported the Green Party ticket. The 2000 election is a notable example: George W. Bush won by just five electoral votes. In Florida, which carried 25 electoral votes, Bush beat Al Gore by just 537 votes, while Green Party candidate Ralph Nader obtained 97,488 votes there. Many progressive voters were OK with this outcome because they perceived E(X-Y) to be practically 0.

In contrast, in 2016, I suspect few progressives think that E(X-Y) is anywhere near 0. In the figures below I attempt to quantify a progressive’s pre-election perception of the consequences for the last five contests. The first figure shows E(X) and E(Y) and the second shows E(X-Y). Note that despite E(X) being the lowest of the past five elections, E(X-Y) is by far the largest. So if these figures accurately depict your perception and you think like a statistician, it becomes clear that this is not the election to vote third party.

[Figure: perceived E(X) and E(Y) for the last five presidential elections]

[Figure: perceived E(X-Y) for the last five presidential elections]

Facebook and left censoring

From the Wall Street Journal:

Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.

A classic case of left censoring (in this case, by “accident”).

Also this:

Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.

What does this information tell us about the actual time spent watching Facebook videos?
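
To get a feel for how large this kind of distortion can be, here is a minimal simulation sketch. The exponential viewing-time distribution and its 4-second mean are assumptions made purely for illustration, not Facebook’s actual data, though they happen to produce an inflation in the 60% to 80% range cited above.

```r
# Toy simulation of the problem: views shorter than 3 seconds were simply
# dropped from the average, which inflates the reported mean watch time.
set.seed(1)
watch_time <- rexp(1e5, rate = 1 / 4)             # hypothetical viewing times, mean ~4 seconds

true_avg     <- mean(watch_time)                  # average over all views
reported_avg <- mean(watch_time[watch_time > 3])  # old metric: only views longer than 3 seconds

c(true = true_avg, reported = reported_avg)
reported_avg / true_avg - 1   # relative inflation; roughly 0.75 for this toy distribution
```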

Not So Standard Deviations Episode 22 - Number 1 Side Project

Hilary and I celebrate our one-year anniversary of doing the podcast together by discussing whether some cities are better than others for data scientists, reproducible research, and the professionalization of data science.

Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!

If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes or Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Show Notes:

Download the audio for this episode.


Mastering Software Development in R

Today I’m happy to announce that we’re launching a new specialization on Coursera titled Mastering Software Development in R. This is a 5-course sequence developed with Sean Kross and Brooke Anderson.

This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing software. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.

The first course, The R Programming Environment, launches today. In the following months, we will launch the remaining courses:

  • Advanced R Programming
  • Building R Packages
  • Building Data Visualization Tools

In addition to the courses, we have a companion textbook that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they become available). We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.

Interview With a Data Sucker

A few months ago Jill Sederstrom from ASH Clinical News interviewed me for this article on the data sharing editorial published by The New England Journal of Medicine (NEJM) and the debate it generated. The article presented a nice summary, but I thought the original, comprehensive set of questions was very good too. So, with permission from ASH Clinical News, I am sharing them here along with my answers.

Before I answer the questions below, I want to make an important remark. When writing these answers I am reflecting on data sharing in general. Nuances arise in different contexts that need to be discussed on an individual basis. For example, there are different considerations to keep in mind when sharing publicly funded data in genomics (my field) and sharing privately funded clinical trials data, just to name two examples.

In your opinion, what do you see as the biggest pros of data sharing?

The biggest pro of data sharing is that it can accelerate and improve the scientific enterprise. This can happen in a variety of ways. For example, competing experts may apply an improved statistical analysis that finds a hidden discovery the original data generators missed. Furthermore, examination of data by many experts can help correct errors missed by the analyst of the original project. Finally, sharing data facilitates merging datasets from different sources, which allows discoveries that would not be possible with any single study.

Note that data sharing is not a radical idea. For example, thanks to an organization called the MGED Society, most journals require all published microarray gene expression data to be deposited in one of two public repositories: GEO or ArrayExpress. This has been an incredible success, leading to new discoveries, new databases that combine studies, and the development of widely used statistical methods and software built with these data as practice examples.
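
To make the mechanics concrete, here is a sketch of how anyone can pull one of these public datasets into R using the Bioconductor package GEOquery; the accession number is just an example, and any public series would do.

```r
# Sketch: downloading a public gene expression dataset from GEO with GEOquery.
# GSE2034 is used only as an example accession.
# install.packages("BiocManager"); BiocManager::install("GEOquery")
library(GEOquery)

gse  <- getGEO("GSE2034")   # returns a list of ExpressionSet objects
eset <- gse[[1]]

expr <- exprs(eset)         # genes x samples expression matrix
dim(expr)
head(pData(eset)[, 1:3])    # sample-level annotation
```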

The NEJM editorial expressed concern that a new generation of researchers will emerge: those who had nothing to do with collecting the data but who will use it for their own ends. It referred to these researchers as “research parasites.” Is this a real concern?

Absolutely not. If our goal is to facilitate scientific discoveries that improve our quality of life, I would be much more concerned about “data hoarders” than “research parasites”. If an important nugget of knowledge is hidden in a dataset, don’t you want the best data analysts competing to find it? Restricting the researchers who can analyze the data to those directly involved with the generators cuts out the great majority of experts.

To further illustrate this, let’s consider a very concrete example with real-life consequences. Imagine a loved one has a disease with a high mortality rate. Finding a cure is possible, but only after analyzing a very, very complex genomic assay. If some of the best data analysts in the world want to help, does it make any sense at all to restrict the pool of analysts to, say, a freshly minted master’s-level statistician working for the genomics core that generated the data? Furthermore, what would be the harm of having someone double-check that analysis?

The NEJM editorial also presented several other concerns it had with data sharing, including the possibility that researchers would compare data across clinical trials that are not in fact comparable, and the failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?

If such mistakes are made, good peer reviewers will catch the error. If it escapes peer review, we point it out in post-publication discussions. Science is constantly self-correcting.

Regarding attribution, this is a legitimate but, in my opinion, minor concern. Those of us who develop open source statistical methods and software see our methods used without attribution quite often. We survive. But, as I elaborate below, we can do things to alleviate this concern.

Is data stealing a real worry? Have you ever heard of it happening before?

I can’t say I can recall any case of data being stolen. But let’s remember that most published data is paid for by taxpayers. They are the actual owners. So there is an argument to be made that the public’s data is being held hostage.

Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?

I think symbiotic sharing is the most effective approach to the repurposing of data. But no, I don’t think we need to force it to happen this way. Competition is one of the key ingredients of the scientific enterprise. Having many groups competing almost always beats out a small group of collaborators. And note that the data generators won’t necessarily have time to collaborate with all the groups interested in the data.

In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place to help guide the data sharing process?

I think you are referring to a post by Jeff Leek, but I am happy to answer. For data to be generated, we need to incentivize the endeavor. Guidelines that ensure patient privacy should of course be followed. Some other simple guidelines, related to those mentioned by Jeff, are:

  1. Reward data generators when their data is used by others.
  2. Penalize those who do not give proper attribution.
  3. Apply the same critical rigor to critiques of the original analysis as we apply to the original analysis.
  4. Include data sharing ethics in scientific education.

One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?

Again, this was Jeff, but I agree. This is important because we need an incentive other than giving the generators exclusive rights to publications emanating from said data.

You also discussed the need to require statistical/computational co-authors on papers written by experimentalists that currently have none, and vice versa. What role do you see the referee serving? Why is this needed?

I think the same rule should apply to referees. Every paper based on the analysis of complex data needs to have a referee with statistical/computational expertise. I also think biomedical journals publishing data-driven research should start adding these experts to their editorial boards. I should mention that NEJM actually has had such experts on their editorial board for a while now.

Are there certain guidelines you feel would be most critical to include?

To me the most important ones are:

  1. The funding agencies and the community should reward data generators when their data is used by others. Perhaps more than for the papers they produce with these data.

  2. Apply the same critical rigor to critiques of the original analysis as we apply to the original analysis. Bashing published results and talking about the “replication crisis” has become fashionable. Although in some cases this is very well merited (see the work of Baggerly and Coombes, for example), in other circumstances critiques are made without much care, mainly for the attention. If we are not careful about keeping a good balance, we may end up paralyzing scientific progress.

You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?

I can describe my experience. I am trained as a statistician. I analyze data on a daily basis both as a collaborator and method developer. Experience has taught me that if one does not understand the scientific problem at hand, it is hard to make a meaningful contribution through data analysis or method development. Most successful applied statisticians will tell you the same thing.

Most difficult scientific challenges have nuances that only the subject matter expert can effectively describe. Failing to understand these usually leads analysts to chase false leads, interpret results incorrectly, or waste time solving a problem no one cares about. Successful collaborations usually involve a constant back-and-forth between the data analysts and the subject matter experts.

However, in many circumstances the data generator is not necessarily the only one who can provide such guidance. Some data analysts actually become subject matter experts themselves; others download data and seek out collaborators who also understand the details of the scientific challenge and the data generation process.