27 Sep 2016
Democratic elections permit us to vote for whoever we expect to do
best on the issues we care about. Let’s
simplify and assume we can quantify how satisfied we are with an
elected official’s performance. Denote this quantity with X. Because
when we cast our vote we still don’t know for sure how the candidate
will perform, we base our decision on what we expect, denoted here with
E(X). Thus we try to maximize E(X). However, both political theory
and data tell us that in US presidential elections only two parties
have a non-negligible probability of winning. This implies that
E(X) is 0 for some candidates no matter how large X could
potentially be. So what we are really doing is deciding if E(X-Y) is
positive or negative with X representing one candidate and Y the other.
In past elections some progressives have argued that the difference
between candidates is negligible and have therefore supported the Green Party
ticket. The 2000 election is a notable example. The election
was won by George W. Bush by just five electoral votes. In Florida,
which had 25 electoral votes, Bush beat Al
Gore by just 537 votes. Green Party candidate Ralph
Nader obtained 97,488 votes. Many progressive voters were OK with this
outcome because they perceived E(X-Y) to be practically 0.
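To put the Florida numbers in perspective, here is a minimal sketch in Python that uses only the vote counts quoted above. It assumes, purely for illustration, that every other ballot stays the same and that each switched Nader vote goes to Gore. The function name is hypothetical.

```python
BUSH_MARGIN_FL = 537      # Bush's margin over Gore in Florida (from the text)
NADER_VOTES_FL = 97_488   # Nader's Florida vote total (from the text)


def switch_needed(margin, third_party_votes):
    """Return how many third-party votes would have had to go to the
    runner-up to overturn the margin, and that count as a share of the
    third-party total. Assumes all other ballots are unchanged."""
    votes = margin + 1  # one vote more than the margin flips the result
    return votes, votes / third_party_votes


votes, share = switch_needed(BUSH_MARGIN_FL, NADER_VOTES_FL)
print(f"{votes} votes, about {share:.2%} of Nader's total")
```

Under these assumptions, roughly half a percent of Nader's Florida voters would have been enough to change the state's outcome.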
In contrast, in 2016, I suspect few progressives think that
E(X-Y) is anywhere near 0. In the figures below I attempt to
quantify the progressive’s pre-election perception of consequences for
the last five contests. The first
figure shows E(X) and E(Y) and the second shows E(X-Y). Note that
despite E(X) being the lowest of the past five elections,
E(X-Y) is by far the largest. So if these figures accurately depict
your perception and you think
like a statistician, it becomes clear that this is not the election to
vote third party.
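The point that a low E(X) can coexist with a large E(X-Y) can be sketched in a few lines of Python. Everything numeric here is a made-up placeholder on an arbitrary satisfaction scale, not the values behind the figures, and the names `expectation`, `expected_gap`, `election_a` and `election_b` are hypothetical.

```python
def expectation(outcomes):
    """Expected value of a discrete distribution given as
    (value, probability) pairs."""
    total_p = sum(p for _, p in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)


def expected_gap(election):
    """E(X - Y), computed as E(X) - E(Y) by linearity of expectation."""
    return expectation(election["X"]) - expectation(election["Y"])


# Hypothetical election A: both candidates look acceptable to this voter.
election_a = {"X": [(6, 0.5), (8, 0.5)], "Y": [(5, 0.5), (7, 0.5)]}
# Hypothetical election B: X looks worse than in A, but Y looks far worse.
election_b = {"X": [(3, 0.5), (5, 0.5)], "Y": [(-6, 0.5), (-2, 0.5)]}

for name, election in (("A", election_a), ("B", election_b)):
    print(name, expectation(election["X"]), expected_gap(election))
```

In this toy setup election B has the lower E(X) but the much larger E(X-Y), which is the pattern described above for 2016.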
26 Sep 2016
From the Wall Street Journal:
Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.
A classic case of left censoring (in this case, by “accident”).
Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.
What does this information tell us about the actual time spent watching Facebook videos?
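One thing it gives us is a bound on how many views were very short. The sketch below assumes the reported figure was the mean over views longer than three seconds, R = E(T | T > 3), while the true figure is the mean over all views, M = E(T). Writing p for the share of views longer than three seconds, M = pR + (1 - p)E(T | T ≤ 3) ≥ pR because watch times are non-negative, so p can be at most M/R. The function names and the toy watch times are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def truncated_mean(times, threshold=3.0):
    """Mean watch time computed only over views longer than `threshold`,
    mimicking the assumed form of the inflated metric."""
    kept = [t for t in times if t > threshold]
    return mean(kept)


def max_share_above_threshold(overestimate):
    """Upper bound on the share of views longer than the threshold when the
    truncated mean exceeds the true mean by `overestimate` (0.6 means 60%)."""
    return 1.0 / (1.0 + overestimate)


# Toy watch times in seconds (hypothetical), half of them under 3 seconds.
toy_times = [1, 2, 2, 10, 20, 30]
reported = truncated_mean(toy_times)
true_mean = mean(toy_times)
overestimate = reported / true_mean - 1
share_long = sum(t > 3 for t in toy_times) / len(toy_times)

# The bound holds on the toy data.
assert share_long <= max_share_above_threshold(overestimate)

for o in (0.6, 0.8):
    print(f"overestimate {o:.0%}: at most {max_share_above_threshold(o):.1%} "
          f"of views ran past three seconds")
```

Under this assumption, an overestimate of 60% to 80% means at most about 62.5% to 55.6% of views ran past three seconds, so at least roughly 37.5% to 44.4% of views lasted three seconds or less.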
19 Sep 2016
Hilary and I celebrate our one year anniversary doing the podcast together by discussing whether there are cities that are good for data scientists, reproducible research, and professionalizing data science.
Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!
If you have questions you’d like us to answer, you can send them to
nssdeviations @ gmail.com or tweet us at @NSSDeviations.
Subscribe to the podcast on iTunes or Google Play.
Please leave us a review on iTunes!
Support us through our Patreon page.
Download the audio for this episode.
19 Sep 2016
Today I’m happy to announce that we’re launching a new specialization on Coursera titled Mastering Software Development in R. This is a 5-course sequence developed with Sean Kross and Brooke Anderson.
This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing software. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.
The first course, The R Programming Environment, launches today. In the following months, we will launch the remaining courses:
- Advanced R Programming
- Building R Packages
- Building Data Visualization Tools
In addition to the courses, we have a companion textbook that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they are available). We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.
07 Sep 2016
A few months ago Jill Sederstrom from ASH Clinical News interviewed
me for this article on the data sharing editorial published by The New England Journal of Medicine (NEJM) and the debate it generated.
The article presented a nice summary, but I thought the original
comprehensive set of questions was very good too. So, with permission from
ASH Clinical News, I am sharing them here along with my answers.
Before I answer the questions below, I want to make an important remark.
When writing these answers I am reflecting on data sharing in
general. Nuances arise in different contexts that need to be
discussed on an individual basis. For example, there are different
considerations to keep in mind when sharing publicly funded data in
genomics (my field) and sharing privately funded clinical trials data,
just to name two examples.
In your opinion, what do you see as the biggest pros of data sharing?
The biggest pro of data sharing is that it can accelerate and improve
the scientific enterprise. This can happen in a variety of ways. For
example, competing experts may apply an improved statistical analysis
that finds a hidden discovery the original data generators missed.
Furthermore, examination of data by many experts can help correct
errors missed by the analyst of the original project. Finally, sharing
data facilitates the merging of datasets from different sources, which
allows discoveries not possible with just one study.
Note that data sharing is not a radical idea. For example, thanks to
an organization called The MGED Society, most journals require all published
microarray gene expression data to be public in one of two
repositories: GEO or ArrayExpress. This has been an incredible
success, leading to new discoveries, new databases that combine
studies, and the development of widely used statistical methods and
software built with these data as practice examples.
The NEJM editorial expressed concern that a new generation of researchers will emerge, those who had nothing to do with collecting the research but who will use it to their own ends. It referred to these as “research parasites.” Is this a real concern?
Absolutely not. If our goal is to facilitate scientific discoveries that
improve our quality of life, I would be much more concerned about
“data hoarders” than “research parasites”. If an important nugget of
knowledge is hidden in a dataset, don’t you want the best data
analysts competing to find it? Restricting the researchers who can
analyze the data to those directly involved with the generators cuts
out the great majority of experts.
To further illustrate this, let’s consider a very concrete example
with real life consequences. Imagine a loved one has a disease with
high mortality rates. Finding a cure is possible but only after
analyzing a very very complex genomic assay. If some of the best data
analysts in the world want to help, does it make any sense at all to
restrict the pool of analysts to, say, a freshly minted masters level
statistician working for the genomics core that generated the data?
Furthermore, what would be the harm of having someone double check the analysis?
The NEJM editorial also presented several other concerns it had with data sharing including whether researchers would compare data across clinical trials that is not in fact comparable and a failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?
If such mistakes are made, good peer reviewers will catch the error.
If it escapes peer review, we point it out in post-publication
discussions. Science is constantly self-correcting.
Regarding attribution, this is a legitimate, but in my opinion, minor
concern. Those of us who develop open source statistical methods and software
see our methods used without attribution quite often. We survive. But
as I elaborate below, we can do things to alleviate this concern.
Is data stealing a real worry? Have you ever heard of it happening before?
I can’t say I can recall any case of data being stolen. But let’s
remember that most published data is paid for by taxpayers. They are the
actual owners. So there is an argument to be made that the public’s
data is being held hostage.
Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?
I think symbiotic sharing is the most effective approach to the
repurposing of data. But no, I don’t think we need to force it to happen this way.
Competition is one of the key ingredients of the scientific
enterprise. Having many groups competing almost always beats out a
small group of collaborators. And note that the data generators won’t
necessarily have time to collaborate with all the groups interested in the data.
In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place to help guide the data sharing process?
I think you are referring to a post by Jeff Leek but I am happy to
answer. For data to be generated, we need to incentivize the endeavor.
Guidelines that assure patient privacy should of course be followed.
Some other simple guidelines related to those mentioned by Jeff are:
- Reward data generators when their data is used by others.
- Penalize those that do not give proper attribution.
- Apply the same critical rigor to critiques of the original analysis
as we apply to the original analysis.
- Include data sharing ethics in scientific education.
One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?
Again, this was Jeff, but I agree. This is important because we need
an incentive other than giving the generators exclusive rights to
publications emanating from said data.
You also discussed the need for requiring statistical/computational co-authors for papers written by experimentalists with no statistical/computational co-authors and vice versa. What role do you see the referee serving? Why is this needed?
I think the same rule should apply to referees. Every paper based on
the analysis of complex data needs to have a referee with
statistical/computational expertise. I also think biomedical journals
publishing data-driven research should start adding these experts to
their editorial boards. I should mention that NEJM actually has had
such experts on their editorial board for a while now.
Are there certain guidelines you feel would be most critical to include?
To me the most important ones are:
- The funding agencies and the community should reward data
generators when their data is used by others. Perhaps more than for
the papers they produce with these data.
- Apply the same critical rigor to critiques of the original analysis
as we apply to the original analysis. Bashing published results and
talking about the “replication crisis”
has become fashionable. Although in some cases it is very well merited
(see Baggerly and Coombes’s work, for example), in some circumstances critiques are made without much care, mainly for the attention. If we
are not careful about keeping a good balance, we may end up
paralyzing scientific progress.
You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?
I can describe my experience. I am trained as a statistician. I analyze
data on a daily basis both as a collaborator and method developer.
Experience has taught me that if one does not understand the
scientific problem at hand, it is hard to make a meaningful
contribution through data analysis or method development. Most
successful applied statisticians will tell you the same thing.
Most difficult scientific challenges have nuances that only the
subject matter expert can effectively describe. Failing to understand
these usually leads analysts to chase false leads, interpret results
incorrectly or waste time solving a problem no one cares about.
Successful collaborations usually involve a constant back and forth
between the data analysts and the subject matter experts.
However, in many circumstances the data generator is not necessarily
the only one who can provide such guidance. Some data analysts
actually become subject matter experts themselves; others download
data and seek out other collaborators who also understand the details
of the scientific challenge and data generation process.