Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Facebook and left censoring

From the Wall Street Journal:

Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.

A classic case of left censoring (in this case, by “accident”).

Also this:

Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.

What does this information tell us about the actual time spent watching Facebook videos?
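One way to start on that question is a small sketch. It assumes the old metric was an average taken only over views longer than three seconds, and that "overestimated by 60% to 80%" means the reported average equals the true average times 1.6 to 1.8. The watch times below are invented for illustration and the function names are hypothetical; nothing here is Facebook data.

```python
# Toy illustration only: these watch times are made up, not Facebook data.

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def implied_true_mean(reported_mean, overestimate):
    """Back out the true mean when the reported mean is inflated by
    `overestimate` (0.6 means the reported value is 60% too high)."""
    return reported_mean / (1.0 + overestimate)


# Hypothetical watch times in seconds; half of the views last 3 s or less.
watch_times = [1, 1, 2, 2, 2, 4, 5, 8, 10, 15]

true_mean = mean(watch_times)                            # every view counts
reported_mean = mean([t for t in watch_times if t > 3])  # short views dropped
inflation = reported_mean / true_mean - 1.0

print(f"true mean:     {true_mean:.1f} s")
print(f"reported mean: {reported_mean:.1f} s")
print(f"inflation:     {inflation:.0%}")

# Under the assumed reading of the 60% to 80% range, a reported average of
# R seconds implies a true average between R / 1.8 and R / 1.6.
low = implied_true_mean(1.0, 0.8)
high = implied_true_mean(1.0, 0.6)
print(f"true mean is {low:.1%} to {high:.1%} of the reported mean")
```

In this toy sample, dropping the short views moves the average from 5.0 to 8.4 seconds, an inflation of 68%. Under the same assumptions, the quoted range implies that the actual average is somewhere between about 56% and 63% of whatever number was reported. It says nothing about the shape of the distribution below three seconds, which is exactly the part that was left out.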

Not So Standard Deviations Episode 22 - Number 1 Side Project

Hilary and I celebrate our one-year anniversary doing the podcast together by discussing whether there are cities that are good for data scientists, reproducible research, and professionalizing data science.

Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!

If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes or Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Download the audio for this episode.

Mastering Software Development in R

Today I’m happy to announce that we’re launching a new specialization on Coursera titled Mastering Software Development in R. This is a 5-course sequence developed with Sean Kross and Brooke Anderson.

This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing software. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.

The first course, The R Programming Environment, launches today. In the following months, we will launch the remaining courses:

  • Advanced R Programming
  • Building R Packages
  • Building Data Visualization Tools

In addition to the courses, we have a companion textbook that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they become available). We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.

Interview With a Data Sucker

A few months ago Jill Sederstrom from ASH Clinical News interviewed me for this article on the data sharing editorial published by The New England Journal of Medicine (NEJM) and the debate it generated. The article presented a nice summary, but I thought the original comprehensive set of questions was very good too. So, with permission from ASH Clinical News, I am sharing them here along with my answers.

Before I answer the questions below, I want to make an important remark. When writing these answers I am reflecting on data sharing in general. Nuances arise in different contexts that need to be discussed on an individual basis. For example, there are different considerations to keep in mind when sharing publicly funded data in genomics (my field) and sharing privately funded clinical trials data, just to name two examples.

In your opinion, what do you see as the biggest pros of data sharing?

The biggest pro of data sharing is that it can accelerate and improve the scientific enterprise. This can happen in a variety of ways. For example, competing experts may apply an improved statistical analysis that finds a hidden discovery the original data generators missed. Furthermore, examination of data by many experts can help correct errors missed by the analyst of the original project. Finally, sharing data facilitates the merging of datasets from different sources that allow discoveries not possible with just one study.

Note that data sharing is not a radical idea. For example, thanks to an organization called The MGED Society, most journals require all published microarray gene expression data to be public in one of two repositories: GEO or ArrayExpress. This has been an incredible success, leading to new discoveries, new databases that combine studies, and the development of widely used statistical methods and software built with these data as practice examples.

The NEJM editorial expressed concern that a new generation of researchers will emerge, those who had nothing to do with collecting the research but who will use it to their own ends. It referred to these as “research parasites.” Is this a real concern?

Absolutely not. If our goal is to facilitate scientific discoveries that improve our quality of life, I would be much more concerned about “data hoarders” than “research parasites”. If an important nugget of knowledge is hidden in a dataset, don’t you want the best data analysts competing to find it? Restricting the researchers who can analyze the data to those directly involved with the generators cuts out the great majority of experts.

To further illustrate this, let’s consider a very concrete example with real-life consequences. Imagine a loved one has a disease with high mortality rates. Finding a cure is possible, but only after analyzing a very, very complex genomic assay. If some of the best data analysts in the world want to help, does it make any sense at all to restrict the pool of analysts to, say, a freshly minted master’s-level statistician working for the genomics core that generated the data? Furthermore, what would be the harm of having someone double-check that analysis?

The NEJM editorial also presented several other concerns it had with data sharing including whether researchers would compare data across clinical trials that is not in fact comparable and a failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?

If such mistakes are made, good peer reviewers will catch them. If one escapes peer review, we point it out in post-publication discussions. Science is constantly self-correcting.

Regarding attribution, this is a legitimate but, in my opinion, minor concern. Developers of open source statistical methods and software see our methods used without attribution quite often. We survive. But as I elaborate below, we can do things to alleviate this concern.

Is data stealing a real worry? Have you ever heard of it happening before?

I can’t say I can recall any case of data being stolen. But let’s remember that most published data is paid for by taxpayers. They are the actual owners. So there is an argument to be made that the public’s data is being held hostage.

Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?

I think symbiotic sharing is the most effective approach to the repurposing of data. But no, I don’t think we need to force it to happen this way. Competition is one of the key ingredients of the scientific enterprise. Having many groups competing almost always beats out a small group of collaborators. And note that the data generators won’t necessarily have time to collaborate with all the groups interested in the data.

In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place to help guide the data sharing process?

I think you are referring to a post by Jeff Leek but I am happy to answer. For data to be generated, we need to incentivize the endeavor. Guidelines that assure patient privacy should of course be followed. Some other simple guidelines related to those mentioned by Jeff are:

  1. Reward data generators when their data is used by others.
  2. Penalize those that do not give proper attribution.
  3. Apply the same critical rigor on critiques of the original analysis as we apply to the original analysis.
  4. Include data sharing ethics in scientific education.

One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?

Again, this was Jeff, but I agree. This is important because we need an incentive other than giving the generators exclusive rights to publications emanating from said data.

You also discussed the need for requiring statistical/computational co-authors for papers written by experimentalists with no statistical/computational co-authors and vice versa. What role do you see the referee serving? Why is this needed?

I think the same rule should apply to referees. Every paper based on the analysis of complex data needs to have a referee with statistical/computational expertise. I also think biomedical journals publishing data-driven research should start adding these experts to their editorial boards. I should mention that NEJM actually has had such experts on their editorial board for a while now.

Are there certain guidelines you feel would be most critical to include?

To me the most important ones are:

  1. The funding agencies and the community should reward data generators when their data is used by others. Perhaps more than for the papers they produce with these data.

  2. Apply the same critical rigor on critiques of the original analysis as we apply to the original analysis. Bashing published results and talking about the “replication crisis” has become fashionable. Although in some cases it is very well merited (see Baggerly and Coombes’s work, for example), in some circumstances critiques are made without much care, mainly for the attention. If we are not careful about keeping a good balance, we may end up paralyzing scientific progress.

You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?

I can describe my experience. I am trained as a statistician. I analyze data on a daily basis both as a collaborator and method developer. Experience has taught me that if one does not understand the scientific problem at hand, it is hard to make a meaningful contribution through data analysis or method development. Most successful applied statisticians will tell you the same thing.

Most difficult scientific challenges have nuances that only the subject matter expert can effectively describe. Failing to understand these usually leads analysts to chase false leads, interpret results incorrectly, or waste time solving a problem no one cares about. Successful collaborations usually involve a constant back and forth between the data analysts and the subject matter experts.

However, in many circumstances the data generator is not necessarily the only one who can provide such guidance. Some data analysts actually become subject matter experts themselves; others download data and seek out other collaborators who also understand the details of the scientific challenge and data generation process.

A Short Guide for Students Interested in a Statistics PhD Program

This summer I had several conversations with undergraduate students seeking career advice. All were interested in data analysis and were considering graduate school. I also frequently receive requests for advice via email. We have posted on this topic before, for example here and here, but I thought it would be useful to share this short guide I put together based on my recent interactions.

It’s OK to be confused

When I was a college senior I didn’t really understand what Applied Statistics was, nor did I understand what one does as a researcher in academia. Now I love being an academic doing research in applied statistics. But it is hard to understand what being a researcher is like until you do it for a while. Things become clearer as you gain more experience. One important piece of advice is to carefully consider advice from those with more experience than you. It might not make sense at first, but I can tell you today that I knew much less than I thought I did when I was 22.

Should I even go to graduate school?

Yes. An undergraduate degree in mathematics, statistics, engineering, or computer science provides a great background, but some more training greatly increases your career options. You may be able to learn on the job, but note that a master’s can be as short as a year.

A master’s or a PhD?

If you want a career in academia or as a researcher in industry or government you need a PhD. In general, a PhD will give you more career options. If you want to become a data analyst or research assistant, a master’s may be enough. A master’s is also a good way to test whether this career is a good match for you. Many people do a master’s before applying to PhD programs. The rest of this guide focuses on those interested in a PhD.

What discipline?

There are many disciplines that can lead you to a career in data science: Statistics, Biostatistics, Astronomy, Economics, Machine Learning, Computational Biology, and Ecology are examples that come to mind. I did my PhD in Statistics and got a job in a Department of Biostatistics. So this guide focuses on Statistics/Biostatistics.

Note that once you finish your PhD you have a chance to become a postdoctoral fellow and further focus your training. By then you will have a much better idea of what you want to do and will have the opportunity to choose a lab that closely matches your interests.

What is the difference between Statistics and Biostatistics?

Short answer: very little. I treat them as the same in this guide. Long answer: read this.

How should I prepare during my senior year?

Math

Good grades in math and statistics classes are almost a requirement. Good GRE scores help, and you need a near-perfect score in the Quantitative Reasoning part of the GRE. Get yourself a practice book and start preparing. Note that to survive the first two years of a statistics PhD program you need to prove theorems and derive relatively complicated mathematical results. If you can’t easily handle the math part of the GRE, this will be quite challenging.

When choosing classes note that the area of math most related to your stat PhD courses is Real Analysis. The area of math most used in applied work is Linear Algebra, specifically matrix theory including understanding eigenvalues and eigenvectors. You might not make the connection between what you learn in class and what you use in practice until much later. This is totally normal.

If you don’t feel ready, consider doing a master’s first. But also, get a second opinion. You might be being too hard on yourself.

Programming

You will be using a computer to analyze data so knowing some programming is a must these days. At a minimum, take a basic programming class. Other computer science classes will help especially if you go into an area dealing with large datasets. In hindsight, I wish I had taken classes on optimization and algorithm design.

Know that learning to program and learning a computer language are different things. You need to learn to program. The choice of language is up for debate. If you only learn one, learn R. If you learn three, learn R, Python and C++.

Knowing Linux/Unix is an advantage. If you have a Mac, try to use the terminal as much as possible. On Windows, get an emulator.

Writing and Communicating

My biggest educational regret is that, as a college student, I underestimated the importance of writing. To this day I am correcting that mistake.

Your success as a researcher greatly depends on how well you write and communicate. Your thesis, papers, grant proposals and even emails have to be well written. So practice as much as possible. Take classes, read works by good writers, and practice. Consider starting a blog even if you don’t make it public. Also note that in academia, job interviews will involve a 50-minute talk as well as several conversations about your work and future plans. So communication skills are also a big plus.

But wait, why so much math?

The PhD curriculum is indeed math heavy. Faculty often debate the possibility of changing the curriculum. But regardless of differing opinions on what is the right amount, math is the foundation of our discipline. Although it is true that you will not directly use much of what you learn, I don’t regret learning so much abstract math because I believe it positively shaped the way I think and attack problems.

Note that after the first two years you are pretty much done with courses and you start on your research. If you work with an applied statistician you will learn data analysis via the apprenticeship model. You will learn the most, by far, during this stage. So be patient. Watch these two Karate Kid scenes for some inspiration.

What department should I apply to?

The top 20-30 departments are practically interchangeable in my opinion. If you are interested in applied statistics make sure you pick a department with faculty doing applied research. Note that some professors focus their research on the mathematical aspects of statistics. By reading some of their recent papers you will be able to tell. An applied paper usually shows data (not simulated) and motivates a subject area challenge in the abstract or introduction. A theory paper shows no data at all or uses it only as an example.

Can I take a year off?

Absolutely. Especially if it’s to work in a data-related job. In general, maturity and life experiences are an advantage in grad school.

What should I expect when I finish?

You will have many, many options. The demand for your expertise is great and growing. As a result there are many high-paying options. If you want to become an academic I recommend doing a postdoc. Here is why. But there are many other options as we describe here and here.