Simply Statistics


Reproducible Research Course Companion

Screen Shot 2015-01-26 at 4.14.26 PMI'm happy to announce that you can now get a copy of the Reproducible Research Course Companion from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my Reproducible Research course offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.

If you're interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you care currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.

Please note that all of the lectures are still available for free on YouTube via my YouTube channel. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.


Dear Laboratory Scientists: Welcome to My World

Consider the following question: Is there a reproducibility/replication crisis in epidemiology?

I think there are only two possible ways to answer that question:

  1. No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.
  2. Yes, there is a replication crisis in epidemiology, and it started in 1854 when John Snow inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.

If you chose (2), then I don't think you are allowed to call it a "crisis" because I think by definition, a crisis cannot last 160 years. In that case, it's more of a chronic disease.

I had an interesting conversation last week with a prominent environmental epidemiologist over the replication crisis that has been reported about extensively in the scientific and popular press. In his view, he felt this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results because of their general inability to conduct controlled experiments.

Given the observational nature of most environmental epidemiological studies, it's generally accepted in the community that no single study can be considered causal, and that many replications of a finding are need to establish a causal connection. Even the popular press knows now to include the phrase "correlation does not equal causation" when reporting on an observational study. The work of Sir Austin Bradford Hill essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.

So if "correlation does not equal causation", it begs the question, what does equal causation? Many would argue that a controlled experiment, whether it's a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are not controlled.

For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a "broken" randomized trial). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.

The correct statement is not "correlation does not equal causation" but rather "no single study equals causation", regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against an hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.


Unbundling the educational package

I just got back from the World Economic Forum's summer meeting in Tianjin, China and there was much talk of disruption and innovation there. Basically, if you weren't disrupting, you were furniture. Perhaps not surprisingly, one topic area that was universally considered ripe for disruption was Education.

There are many ideas bandied about with respect to "disrupting" education and some are interesting to consider. MOOCs were the darlings of...last year...but they're old news now. Sam Lessin has a nice piece in the The Information (total paywall, sorry, but it's worth it) about building a subscription model for universities. Aswath Damodaran has what I think is a nice framework for thinking about the "education business".

One thing that I latched on to in Damodaran's piece is the idea of education as a "bundled product". Indeed, I think the key aspect of traditional on-site university education is the simultaneous offering of

  1. Subject matter content (i.e. course material)
  2. Mentoring and guidance by faculty
  3. Social and professional networking
  4. Other activities (sports, arts ensembles, etc.)

MOOCs have attacked #1 for many subjects, typically large introductory courses. Endeavors like the Minerva project are attempting to provide lower-cost seminar-style courses (i.e. anti-MOOCs).

I think the extent to which universities will truly be disrupted will hinge on how well we can unbundle the four (or maybe more?) elements described above and provide them separately but at roughly the same level of quality. Is it possible? I don't know.


Introducing people to R: 14 years and counting

I've been introducing people to R for quite a long time now and I've been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998--1999 I think I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using Lisp-Stat (which I loved) and only later converted its courses over to R. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was "It's just like S-PLUS but it's free!" As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when I talk about the history of R but I'm not sure many people nowadays actually have any memories of the product.

When I got to Johns Hopkins in 2003 there wasn't really much of a modern statistical computing class, so Karl Broman, Rafa Irizarry, Brian Caffo, Ingo Ruczinski, and I got together and started what we called the "KRRIB" class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked people who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that a reference frame. No one had ever used a functional-style of programming language like Scheme or Lisp.

Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008--2009, when I'd ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a "real" language that could be used for real tasks.

R has grown a lot in recent years, and is being used in so many places now, that I think its essentially impossible for a person to keep track of everything that is going on. That's fine, but it makes "introducing" people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I've never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to "bring people over" to R. In fact it's quite the opposite--people might start asking questions if I weren't teaching R.

Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I've found there are a few topics that I always  stress to people who are generally newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.

  • R is a functional-style language. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop!
  • R is both an interactive system and a programming language. Yes, it's a floor wax and a dessert topping--get used to it. Most people seem expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don't write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.
  • Visualization/plotting capabilities are state-of-the-art. One of the big selling points back in the "old days" was that from the very beginning R's plotting and graphics capabilities where far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.

I'm looking forward to teaching R to people as long as people will let me, and I'm interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it's been just an amazing experience to see the widespread adoption of R over the past decade. I'm sure the next decade will be just as amazing.


Jan de Leeuw owns the Internet

One of the best things to happen on the Internet recently is that Jan de Leeuw has decided to own the Twitter/Facebook universe. If you do not already, you should be following him. Among his many accomplishments, he founded the Department of Statistics at UCLA (my alma mater), which is currently thriving. On the occasion of the Department's 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he "knew everyone and knew everything". Pretty accurate description, in my opinion.

Jan's been tweeting quite a bit of late, but recently had this gem:

followed by

I'm not sure what Jan's thinking behind the first tweet was, but I think many in statistics would consider it a "good thing" to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?

This is a situation where I think there is a large disconnect between what "should be" and what "is reality". What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don't necessarily have one (primarily, people working not he border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.

The reality though is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as "not statistics". This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is because statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn't fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.

Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won't be if we don't try.


New book on implementing reproducible research

9781466561595I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, Implementing Reproducible Research, has just been published by CRC Press. Although it is technically in their "R Series", the chapters contain information on a wide variety of useful tools, not just R-related tools. 

There is also a supplementary web site hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.


The Real Reason Reproducible Research is Important

Reproducible research has been on my mind a bit these days, partly because it has been in the news with the Piketty stuff, and also perhaps because I just published a book on it and I'm teaching a class on it as we speak (as well as next month and the month after...).

However, as I watch and read many discussions over the role of reproducibility in science, I often feel that many people miss the point. Now, just to be clear, when I use the word "reproducibility" or say that a study is reproducible, I do not mean "independent verification" as in a separate investigator conducted an independent study and came to the same conclusion as the original study (that is what I refer to as "replication"). By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results of the original study. In essence, it is the notion that the data analysis can be successfully repeatedReproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.

Many people seem to conflate the ideas of reproducible and correctness, but they are not the same thing. One must always remember that a study can be reproducible and still be wrong. By "wrong", I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think "sugar causes cancer"), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.

Then why is reproducibility so important? Reproducibility is important because it is the only thing that an investigator can guarantee about a study.

Contrary to what most press releases would have you believe, an investigator cannot guarantee that the claims made in a study are correct (unless they are purely descriptive). This is because in the history of science, no meaningful claim has ever been proven by a single study. (The one exception might be mathematics, whether they are literally proving things in their papers.) So reproducibility is important not because it ensures that the results are correct, but rather because it ensures transparency and gives us confidence in understanding exactly what was done.

These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.


Post-Piketty Lessons

The latest crisis in data analysis comes to us (once again) from the field of Economics. Thomas Piketty, a French economist recently published a book titled Capital in the 21st Century that has been a best-seller. I have not read the book, but based on media reports, it appears to make the claim that inequality has increased in recent years and will likely increase into the future. The book argues that this increase in inequality is driven by capitalism’s tendency to reward capital more than labor. This is my non-economist’s understanding of the book, but the details specific claims of the book are not what I want to discuss here (there is much discussion elsewhere).

An interesting aspect of Piketty’s work, from my perspective, is that he has made all of his data and analysis available on the web. From what I can tell, his analysis was not trivial—data were collected and merged from multiple disparate sources and adjustments were made to different data series to account for various incompatibilities. To me, this sounds like a standard data analysis, in the sense that all meaningful data analyses are complicated. As noted by Nate Silver, data do not arise from a “virgin birth”, and in any example worth discussing, much work has to be done to get the data into a state in which statistical models can be fit, or even more simply, plots can be made.

Chris Giles, a journalist for the Financial Times, recently published a column (unfortunately blocked by paywall) in which he claimed that much of the analysis that Piketty had done was flawed or incorrect. In particular, he claimed that based on his (Giles’) analysis, inequality was not growing as much over time as Piketty claimed. Among other points, Giles claims that numerous errors were made in assembling the data and in Piketty’s original analysis.

This episode smacked of the recent Reinhart-Rogoff kerfuffle in which some fairly basic errors were discovered in those economists' Excel spreadsheets. Some of those errors only made small differences to the results, but a critical methodological component, in which the data were weighted in a special way, appeared to have a significant impact on the results if alternate approaches were taken.

Piketty has since responded forcefully to the FT's column, defending all of the work he has done and addressing the criticisms one by one. To me, the most important result of the FT analysis is that Piketty’s work appears to be largely reproducible. Piketty made his data available, with reasonable documentation (in addition to his book), and Giles was able to come up with the same numbers Piketty came up with. This is a good thing. Piketty’s work was complex, and the only way to communicate the entirety of it was to make the data and code available.

The other aspects of Giles’ analysis are, from an academic standpoint, largely irrelevant to me, particularly because I am not an economist. The reason I find them irrelevant is because the objections are largely over whether he is correct or not. This is an obviously important question, but in any field, no single study or even synthesis can be determined to be "correct" at that instance. Time will tell, and if his work is "correct", his predictions will be borne out by nature. It's not so satisfying to have to wait many years to know if you are correct, but that's how science works.

In the meantime, economists will have a debate over the science and the appropriate methods and data used for analysis. This is also how science works, and it is only (really) possible because Piketty made his work reproducible. Otherwise, the debate would be largely uninformed.


JHU Data Science: More is More

Today Jeff Leek, Brian Caffo, and I are launching 3 new courses on Coursera as part of the Johns Hopkins Data Science Specialization. These courses are

I'm particularly excited about Reproducible Research, not just because I'm teaching it, but because I think it's essentially the first of its kind being offered in a massive open format. Given the rich discussions about reproducibility that have occurred over the past few years, I'm happy to finally be able to offer this course for free to a large audience.

These courses are launching in addition to the first 3 courses in the sequence: The Data Scientist's Toolbox, R Programming, and Getting and Cleaning Data, which are also running this month in case you missed your chance in April.

All told we have 6 of the 9 courses in the Specialization available as of today. We're really looking forward to next month where we will be launching the final 3 courses: Regression Models, Practical Machine Learning, and Developing Data Products. We also have some exciting announcements coming soon regarding the Capstone Projects.

Every course will be available every month, so don't worry about missing a session. You can always come back next month.


This is how an important scientific debate is being used to stop EPA regulation

Environmental regulation in the United States has protected human health for over 40 years. Since the Clean Air Act was enacted in 1970, levels of outdoor air pollution have dropped dramatically, changing the landscape of once heavily-polluted cities like Los Angeles and Pittsburgh. A 2011 cost-benefit analysis conducted by the U.S. Environmental Protection Agency estimated that the 1990 amendments to the CAA prevented 160,000 deaths and 13 million lost work days in the year 2010 alone. They estimated that the monetary benefits of the CAA were 30 times greater than the costs of implementing the regulations.

The benefits of environmental regulations like the CAA significantly outweigh their costs. But there are still costs, and those costs must be borne by someone. The burden is usually put on the polluters, such as the automobile and power generation industries, which have long fought any notion of air pollution regulation as a threat to their existence. Initially, as air pollution and health studies were still emerging, opponents of regulation often challenged the science itself, claiming flaws in the methodology, the measurements, or the interpretation. But when study after study demonstrated a connection between outdoor air pollution and a variety of health problems, it became increasingly difficult for critics to mount a credible challenge. Lawsuits are another tactic used by industry, with one case brought by the American Trucking Association going all the way to the U.S. Supreme Court.

The latest attack comes from the House of Representatives in the form of the Secret Science Reform Act, or H.R. 4102. In summary, the proposed bill requires that every scientific paper cited by the EPA to justify a new rule or regulation needs to be reproducible. What exactly does this mean? To answer that question we need to take a brief diversion into some recent important developments in statistical science.

The idea behind reproducibility is simple. All the data used in a scientific paper and all the computer code used to analyze that data should be made available to other researchers and the public. It may be surprising that much of this data actually isn’t already available. The primary reason most data isn’t available is because, until recently, most people didn’t ask scientists for their data. The data was often small and collected for a specific purpose so other scientists and the general public just weren’t that interested. If a scientist were interested in checking the truth of a claim, she could simply repeat the experiment in her lab to see if the claim could be replicated.

The nature of science has changed quickly over the last three decades. There has been an explosion of data, fueled by the decreasing cost of data collection technologies and computing power. At the same time, increased access to sophisticated computing power has let scientists conduct more sophisticated analyses on their data. The massive growth in data and the increasing sophistication of the analyses has made communicating what was done in a scientific study more complicated.

The traditional medium of journal publications has proven to be inadequate for describing the important details of a data analysis. As a result, it has been said that scientific articles are merely the “advertising” for the research that was conducted. The real research is buried in the data and the computer code actually used to compute the results. Journals have traditionally not required that data or computer code be published along with papers. As a result, many important details may be lost and prevent key studies from being fully reproducible.

The explosion of data has also made completely replicating a large study by an independent scientist much more difficult and costly. A large study is expensive to conduct in the first place; there is usually little appetite or funding to repeat it.  The result is that much of published scientific research cannot be reproduced by other scientists because the necessary data and analytic details are not available to others.

The scientific community is currently engaged in a debate over how to improve reproducibility across all of science. You might be tempted to ask, why not just share the data? Even if we could get everyone to agree with that in principle, it’s not clear how to do it.

Imagine if everyone in the U.S. decided we were all going to share our movie collections, and suppose for the sake of this example that the movie industry did not object. How would it work? Numerous questions immediately arise. Where would all these movies be stored? How would they be transferred from one person to another? How would I know what movies everyone else had? If my movies are all on the old DVD format, do I need to convert them to some other format before I can share? My Internet connection is very slow, how can I download a 3 hour HD movie? My mother doesn’t use computers much, but she has a great movie collection that I think others should have access to. What should she do? And who is going to pay for all of this? While each question may have a reasonable answer, it’s not clear what is the optimal combination and how you might scale it to the entire country.

Some of you may recall that the music industry had a brilliant sharing service that essentially allowed everyone to share their music collections. It was called Napster. Napster solved many of the problems raised above except for one -- they failed to survive. So even when a decent solution is found, there’s no guarantee that it will always be there.

As outlandish as this example may seem, minor variations on these exact questions come up when we discuss how to share scientific data. The volume of data being produced today is enormous and making all of it available to everyone is not an easy task. That’s not to say it is impossible. If smart people get together and work constructively, it is entirely possible that a reasonable approach could be found. But at this point, a credible long-term solution has yet to emerge.

This brings us back to the Secret Science Reform Act. The latest tactic by opponents of air quality regulation is to force the EPA to ensure that all of the studies that it cites to support new regulations are reproducible. A cursory reading of the bill gives the impression that the sponsors are genuinely concerned about making science more transparent to the public. But when one reads the language of the bill in the context of ongoing discussions about reproducibility, it becomes clear that the sponsors of the bill have no such goal in mind. The purpose of H.R. 4102 is to prevent the Environmental Protection Agency from proposing new regulations.

The EPA develops rules and regulations on the basis of scientific evidence. For example, the Clean Air Act requires EPA to periodically review the scientific literature for the latest evidence on the health effects of air pollution. The science the EPA considers needs to be published in peer-reviewed journals. This makes the EPA a key consumer of scientific knowledge and it uses this knowledge to make informed decisions about protecting public health. What the EPA is not is a large funder of scientific studies. The entire budget for the Office of Research and Development at EPA is roughly $550 million (fiscal 2014), or less than 2 percent of the budget for the National Institutes of Health (about $30 billion for fiscal 2014). This means EPA has essentially no influence over the scientists behind many of the studies it cites because it funds very few of those studies. The best the EPA can do is politely ask scientists to make their data available. If a scientist refuses, there’s not much the EPA can use as leverage.

The latest controversy to come up involves the Harvard Six Cities study published in 1993. This landmark study found a large difference in mortality rates comparing cities with high and low air pollution, even after adjusting for smoking and other factors. The House committee has been trying to make the data for this study publicly available so that it can ensure that regulations are “backed by good science”. However, the Committee has either forgotten or never knew that this particular study has been fully reproduced by independent investigators. In 2005, independent investigators found that they were “ to reproduce virtually all of the original numerical results, including the 26 percent increase in all-cause mortality in the most polluted city (Stubenville, OH) as compared to the least polluted city (Portage, WI). The audit and validation of the Harvard Six Cities Study conducted by the reanalysis team generally confirmed the quality of the data and the numerical results reported by the original investigators.”

It would be hard to find an air pollution study that has been subject to more scrutiny than the Six Cities studies. Even if you believed the Six Cities study was totally wrong, its original findings have been replicated numerous times since its publication, with different investigators, in different populations, using different analysis techniques, and in different countries. If you’re looking for an example where the science was either not reproducible or not replicable, sorry, but this is not your case study.

Ultimately, it is clear that the sponsors of this bill are cynically taking advantage of a genuine (but difficult) scientific debate over reproducibility to push a political agenda. Scientists are in agreement that reproducibility is important, but there is no consensus yet on how to make it happen for everyone. By forcing the EPA to ensure reproducibility of the science on which it bases regulation, lawmakers are asking the EPA to solve a problem that the entire scientific community has yet to figure out. The end result of passing a bill like H.R. 4102 is that the EPA will be forced to stop proposing any new regulation, handing a major victory to opponents of air quality standards and dealing a major blow to public health in the U.S.