Simply Statistics

12
Feb

Is Reproducibility as Effective as Disclosure? Let's Hope Not.

Jeff and I just this week published a commentary in the Proceedings of the National Academy of Sciences on our latest thinking on reproducible research and its ability to solve the reproducibility/replication "crisis" in science (there's a version on arXiv too). In a nutshell, we believe reproducibility (making data and code available so that others can recompute your results) is an essential part of science, but it is not going to end the crisis of confidence in science. In fact, I don't think it'll even make a dent. The problem is that reproducibility, as a tool for preventing poor research, comes in at the wrong stage of the research process (the end). While requiring reproducibility may deter people from committing outright fraud (a small group), it won't stop people who just don't know what they're doing with respect to data analysis (a much larger group).

In an eerie coincidence, Jesse Eisinger of the investigative journalism non-profit ProPublica, has just published a piece on the New York Times Dealbook site discussing how requiring disclosure rules in the financial industry has produced meager results. He writes

Over the last century, disclosure and transparency have become our regulatory crutch, the answer to every vexing problem. We require corporations and government to release reams of information on food, medicine, household products, consumer financial tools, campaign finance and crime statistics. We have a booming “report card” industry for a range of services, including hospitals, public schools and restaurants.

The rationale for all this disclosure is that

someone, somewhere reads the fine print in these contracts and keeps corporations honest. It turns out what we laymen intuit is true: No one reads them, according to research by a New York University law professor, Florencia Marotta-Wurgler.

But disclosure is nevertheless popular because how could you be against it?

The disclosure bonanza is easy to explain. Nobody is against it. It’s politically expedient. Companies prefer such rules, especially in lieu of actual regulations that would curtail bad products or behavior. The opacity lobby — the remora fish class of lawyers, lobbyists and consultants in New York and Washington — knows that disclosure requirements are no bar to dodgy practices. You just have to explain what you’re doing in sufficiently incomprehensible language, a task that earns those lawyers a hefty fee.

In the now infamous Duke Saga, Keith Baggerly was able to reproduce the work of Potti et al. after roughly 2,000 hours of work because the data were publicly available (although the code was not). It's not clear how much time would have been saved if the code had been available, but it seems reasonable to assume that it would have taken some amount of time to understand the analysis, if not reproduce it. Once the errors in Potti's work were discovered, it took 5 years for the original Nature Medicine paper to be retracted.

Although you could argue that the process worked in some sense, it came at tremendous cost of time and money. Wouldn't it have been better if the analysis had been done right in the first place?

26
Jan

Reproducible Research Course Companion

Screen Shot 2015-01-26 at 4.14.26 PMI'm happy to announce that you can now get a copy of the Reproducible Research Course Companion from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my Reproducible Research course offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.

If you're interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you care currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.

Please note that all of the lectures are still available for free on YouTube via my YouTube channel. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.

15
Oct

Dear Laboratory Scientists: Welcome to My World

Consider the following question: Is there a reproducibility/replication crisis in epidemiology?

I think there are only two possible ways to answer that question:

  1. No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.
  2. Yes, there is a replication crisis in epidemiology, and it started in 1854 when John Snow inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.

If you chose (2), then I don't think you are allowed to call it a "crisis" because I think by definition, a crisis cannot last 160 years. In that case, it's more of a chronic disease.

I had an interesting conversation last week with a prominent environmental epidemiologist over the replication crisis that has been reported about extensively in the scientific and popular press. In his view, he felt this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results because of their general inability to conduct controlled experiments.

Given the observational nature of most environmental epidemiological studies, it's generally accepted in the community that no single study can be considered causal, and that many replications of a finding are need to establish a causal connection. Even the popular press knows now to include the phrase "correlation does not equal causation" when reporting on an observational study. The work of Sir Austin Bradford Hill essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.

So if "correlation does not equal causation", it begs the question, what does equal causation? Many would argue that a controlled experiment, whether it's a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are not controlled.

For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a "broken" randomized trial). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.

The correct statement is not "correlation does not equal causation" but rather "no single study equals causation", regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against an hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.

22
Sep

Unbundling the educational package

I just got back from the World Economic Forum's summer meeting in Tianjin, China and there was much talk of disruption and innovation there. Basically, if you weren't disrupting, you were furniture. Perhaps not surprisingly, one topic area that was universally considered ripe for disruption was Education.

There are many ideas bandied about with respect to "disrupting" education and some are interesting to consider. MOOCs were the darlings of...last year...but they're old news now. Sam Lessin has a nice piece in the The Information (total paywall, sorry, but it's worth it) about building a subscription model for universities. Aswath Damodaran has what I think is a nice framework for thinking about the "education business".

One thing that I latched on to in Damodaran's piece is the idea of education as a "bundled product". Indeed, I think the key aspect of traditional on-site university education is the simultaneous offering of

  1. Subject matter content (i.e. course material)
  2. Mentoring and guidance by faculty
  3. Social and professional networking
  4. Other activities (sports, arts ensembles, etc.)

MOOCs have attacked #1 for many subjects, typically large introductory courses. Endeavors like the Minerva project are attempting to provide lower-cost seminar-style courses (i.e. anti-MOOCs).

I think the extent to which universities will truly be disrupted will hinge on how well we can unbundle the four (or maybe more?) elements described above and provide them separately but at roughly the same level of quality. Is it possible? I don't know.

29
Jul

Introducing people to R: 14 years and counting

I've been introducing people to R for quite a long time now and I've been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998--1999 I think I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using Lisp-Stat (which I loved) and only later converted its courses over to R. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was "It's just like S-PLUS but it's free!" As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when I talk about the history of R but I'm not sure many people nowadays actually have any memories of the product.

When I got to Johns Hopkins in 2003 there wasn't really much of a modern statistical computing class, so Karl Broman, Rafa Irizarry, Brian Caffo, Ingo Ruczinski, and I got together and started what we called the "KRRIB" class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked people who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that a reference frame. No one had ever used a functional-style of programming language like Scheme or Lisp.

Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008--2009, when I'd ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a "real" language that could be used for real tasks.

R has grown a lot in recent years, and is being used in so many places now, that I think its essentially impossible for a person to keep track of everything that is going on. That's fine, but it makes "introducing" people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I've never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to "bring people over" to R. In fact it's quite the opposite--people might start asking questions if I weren't teaching R.

Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I've found there are a few topics that I always  stress to people who are generally newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.

  • R is a functional-style language. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop!
  • R is both an interactive system and a programming language. Yes, it's a floor wax and a dessert topping--get used to it. Most people seem expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don't write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.
  • Visualization/plotting capabilities are state-of-the-art. One of the big selling points back in the "old days" was that from the very beginning R's plotting and graphics capabilities where far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.

I'm looking forward to teaching R to people as long as people will let me, and I'm interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it's been just an amazing experience to see the widespread adoption of R over the past decade. I'm sure the next decade will be just as amazing.

16
Jul

Jan de Leeuw owns the Internet

One of the best things to happen on the Internet recently is that Jan de Leeuw has decided to own the Twitter/Facebook universe. If you do not already, you should be following him. Among his many accomplishments, he founded the Department of Statistics at UCLA (my alma mater), which is currently thriving. On the occasion of the Department's 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he "knew everyone and knew everything". Pretty accurate description, in my opinion.

Jan's been tweeting quite a bit of late, but recently had this gem:

followed by

I'm not sure what Jan's thinking behind the first tweet was, but I think many in statistics would consider it a "good thing" to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?

This is a situation where I think there is a large disconnect between what "should be" and what "is reality". What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don't necessarily have one (primarily, people working not he border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.

The reality though is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as "not statistics". This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is because statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn't fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.

Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won't be if we don't try.

24
Jun

New book on implementing reproducible research

9781466561595I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, Implementing Reproducible Research, has just been published by CRC Press. Although it is technically in their "R Series", the chapters contain information on a wide variety of useful tools, not just R-related tools. 

There is also a supplementary web site hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.

06
Jun

The Real Reason Reproducible Research is Important

Reproducible research has been on my mind a bit these days, partly because it has been in the news with the Piketty stuff, and also perhaps because I just published a book on it and I'm teaching a class on it as we speak (as well as next month and the month after...).

However, as I watch and read many discussions over the role of reproducibility in science, I often feel that many people miss the point. Now, just to be clear, when I use the word "reproducibility" or say that a study is reproducible, I do not mean "independent verification" as in a separate investigator conducted an independent study and came to the same conclusion as the original study (that is what I refer to as "replication"). By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results of the original study. In essence, it is the notion that the data analysis can be successfully repeatedReproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.

Many people seem to conflate the ideas of reproducible and correctness, but they are not the same thing. One must always remember that a study can be reproducible and still be wrong. By "wrong", I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think "sugar causes cancer"), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.

Then why is reproducibility so important? Reproducibility is important because it is the only thing that an investigator can guarantee about a study.

Contrary to what most press releases would have you believe, an investigator cannot guarantee that the claims made in a study are correct (unless they are purely descriptive). This is because in the history of science, no meaningful claim has ever been proven by a single study. (The one exception might be mathematics, whether they are literally proving things in their papers.) So reproducibility is important not because it ensures that the results are correct, but rather because it ensures transparency and gives us confidence in understanding exactly what was done.

These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.

03
Jun

Post-Piketty Lessons

The latest crisis in data analysis comes to us (once again) from the field of Economics. Thomas Piketty, a French economist recently published a book titled Capital in the 21st Century that has been a best-seller. I have not read the book, but based on media reports, it appears to make the claim that inequality has increased in recent years and will likely increase into the future. The book argues that this increase in inequality is driven by capitalism’s tendency to reward capital more than labor. This is my non-economist’s understanding of the book, but the details specific claims of the book are not what I want to discuss here (there is much discussion elsewhere).

An interesting aspect of Piketty’s work, from my perspective, is that he has made all of his data and analysis available on the web. From what I can tell, his analysis was not trivial—data were collected and merged from multiple disparate sources and adjustments were made to different data series to account for various incompatibilities. To me, this sounds like a standard data analysis, in the sense that all meaningful data analyses are complicated. As noted by Nate Silver, data do not arise from a “virgin birth”, and in any example worth discussing, much work has to be done to get the data into a state in which statistical models can be fit, or even more simply, plots can be made.

Chris Giles, a journalist for the Financial Times, recently published a column (unfortunately blocked by paywall) in which he claimed that much of the analysis that Piketty had done was flawed or incorrect. In particular, he claimed that based on his (Giles’) analysis, inequality was not growing as much over time as Piketty claimed. Among other points, Giles claims that numerous errors were made in assembling the data and in Piketty’s original analysis.

This episode smacked of the recent Reinhart-Rogoff kerfuffle in which some fairly basic errors were discovered in those economists' Excel spreadsheets. Some of those errors only made small differences to the results, but a critical methodological component, in which the data were weighted in a special way, appeared to have a significant impact on the results if alternate approaches were taken.

Piketty has since responded forcefully to the FT's column, defending all of the work he has done and addressing the criticisms one by one. To me, the most important result of the FT analysis is that Piketty’s work appears to be largely reproducible. Piketty made his data available, with reasonable documentation (in addition to his book), and Giles was able to come up with the same numbers Piketty came up with. This is a good thing. Piketty’s work was complex, and the only way to communicate the entirety of it was to make the data and code available.

The other aspects of Giles’ analysis are, from an academic standpoint, largely irrelevant to me, particularly because I am not an economist. The reason I find them irrelevant is because the objections are largely over whether he is correct or not. This is an obviously important question, but in any field, no single study or even synthesis can be determined to be "correct" at that instance. Time will tell, and if his work is "correct", his predictions will be borne out by nature. It's not so satisfying to have to wait many years to know if you are correct, but that's how science works.

In the meantime, economists will have a debate over the science and the appropriate methods and data used for analysis. This is also how science works, and it is only (really) possible because Piketty made his work reproducible. Otherwise, the debate would be largely uninformed.

05
May

JHU Data Science: More is More

Today Jeff Leek, Brian Caffo, and I are launching 3 new courses on Coursera as part of the Johns Hopkins Data Science Specialization. These courses are

I'm particularly excited about Reproducible Research, not just because I'm teaching it, but because I think it's essentially the first of its kind being offered in a massive open format. Given the rich discussions about reproducibility that have occurred over the past few years, I'm happy to finally be able to offer this course for free to a large audience.

These courses are launching in addition to the first 3 courses in the sequence: The Data Scientist's Toolbox, R Programming, and Getting and Cleaning Data, which are also running this month in case you missed your chance in April.

All told we have 6 of the 9 courses in the Specialization available as of today. We're really looking forward to next month where we will be launching the final 3 courses: Regression Models, Practical Machine Learning, and Developing Data Products. We also have some exciting announcements coming soon regarding the Capstone Projects.

Every course will be available every month, so don't worry about missing a session. You can always come back next month.