Issues with reproducibility at scale on Coursera


As you know, we are big fans of reproducible research here at Simply Statistics. The scandal around the lack of reproducibility in the analyses performed by Anil Potti and subsequent fallout drove the importance of this topic home.

So when I started teaching a course on Data Analysis for Coursera, of course I wanted to focus on reproducible research. The students in the class will be performing two data analyses during the course. They will be peer evaluated using a rubric specifically designed for evaluating data analyses at scale. One of the components of the rubric was to evaluate whether the code people submitted with their assignments reproduced all the numbers in the assignment.

Unfortunately, I just had to cancel the reproducibility component of the first data analysis assignment. Here are the things I realized while trying to set up the process that may seem obvious but weren't to me when I was designing the rubric:

  1. Security: I realized (thanks to a very smart subset of the students in the class who posted on the message boards) that there is a major security issue with students exchanging R code and data files with each other. Even if they use only the data downloaded from the official course website, it is possible that people could use the code to try to hack/do nefarious things to each other. The students in the class are great and the probability of this happening is small, but with a class this size, it isn't worth the risk.
  2. Compatibility: I'm requiring that people use R for the course. Even so, people are working on every possible operating system, with many different versions of R. In this scenario, it is entirely conceivable for a person to write totally reproducible code that works on their machine but won't work on a random peer reviewer's machine.
  3. Computing Resources: The range of computing resources used by people in the class is huge, everything from modern clusters to a single old beat-up laptop. Inefficient code on a fast computer is fine, but on a slow computer with little memory it could mean the difference between reproducibility and crashed computers.

Overall, I think the solution is to run some kind of EC2 instance with a standardized set of software. That is the only thing I can think of that would be scalable to a class this size. On the other hand, that would be expensive and a pain to maintain, and it would require everyone to run code on EC2.

Regardless, it is a super interesting question. How do you do reproducibility at scale?

  • Titus Brown

    As I was reading, I thought "EC2!" and "github!" You hit 1 of 2 :). You should ask Amazon for a grant to cover this use; let me know if you need a contact. I have tons of tutorials up on how to start up an instance, etc., as well; just drop me an e-mail.

    • http://www.facebook.com/megansquire Megan Squire

      They have grant cycles twice per year only.

      • Titus Brown

        Not so -- their research grants, yes, but not their student grants. See: http://aws.amazon.com/grants/. Also, I am happy to attempt to provide personal introductions to people at Amazon that I know would be interested in this kind of thing. Drop me a note: titus@idyll.org.

  • anthony damico

    for code sharing, don't you just need the equivalent of "windows safe mode" for R -- some quick way to disable functions like `file.remove()` and `shell()` and `source()` and the like?

    for data sharing, there's enough public data out there that requiring students to use something publicly-downloadable (as opposed to their own data sets) shouldn't be much of a burden for a class like this? we have entered the golden era of public data

    • Fr.

      As one (brilliant) student in Jeff's course has suggested, the "safe mode" feature is in the works: http://hackme.rapporter.net/ -- a sandboxed R environment by the very dedicated team at Rapporter (disclosure: I'm beta testing it and loving it so far).

      Running everything on VMs might be the way to go. I have faced the same scalability issues in classes with R and Stata. I had not given thought to the security issue before, but it is indeed a serious one, especially in a large anonymous class.

      • Fr.

        P.S. You might want to ask the team at RStudio what they think about the issue, because sandboxing would be a really cool feature of R projects (e.g. kill all disk file calls outside of the working directory), if that's even possible.
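
The "safe mode" idea in this thread can be approximated, very roughly, by screening a submitted R script for risky calls before anyone runs it. What follows is a minimal sketch in Python, not a sandbox: the deny-list `RISKY_R_CALLS` and the helper `flag_risky_calls` are hypothetical names made up for illustration. The first three entries come from the comment above; the others are further base R functions that touch files, the shell, or the network.

```python
import re

# Hypothetical deny-list. file.remove, shell and source are named in the
# thread above; the remaining entries are an assumption about what else
# a reviewer might want flagged.
RISKY_R_CALLS = [
    "file.remove", "shell", "source",
    "system", "system2", "unlink", "download.file",
]

# R identifiers may contain letters, digits, '.' and '_', so a name only
# counts as a call if it is not preceded by one of those characters and
# is followed by an opening parenthesis.
_CALL_RE = re.compile(
    r"(?<![A-Za-z0-9._])("
    + "|".join(re.escape(name) for name in RISKY_R_CALLS)
    + r")\s*\("
)


def flag_risky_calls(r_source):
    """Return a list of (line_number, function_name) for risky-looking calls."""
    hits = []
    for lineno, line in enumerate(r_source.splitlines(), start=1):
        # Crude comment stripping: this also cuts at a '#' inside a string.
        code = line.split("#", 1)[0]
        for match in _CALL_RE.finditer(code):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = 'x <- read.csv("data.csv")\nfile.remove("x")\n'
    for lineno, name in flag_risky_calls(sample):
        print(f"line {lineno}: call to {name}()")
```

A text scan like this only catches the obvious cases. R code can build and evaluate calls at run time (for example via `eval(parse(text = ...))`, `get()` or `do.call()`), so a clean report does not mean a script is safe, and a real solution would still need OS-level or VM-level isolation.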

  • Link Tran

    your 1st point is a good issue that I hadn't realized. Being open worldwide, I do see the potential for a malicious hacker to take advantage. Perhaps github might be worth considering, since you can first see the code in html before downloading it. However, it'd still require that your users know enough about coding to discern whether the code is safe or not.

    I feel like the 2nd point is less of an issue, especially if you were to require that (i) all the R versions be the same or higher, and (ii) that only a certain set of packages can be used. I switch back and forth between Windows and Linux all the time, and always set ".pdir" as the parent directory. Then, every time I want to call a directory, I use paste0(.pdir, "etc").

    The 3rd point, I'm unsure about as well. Isn't this an introductory course? If you were to (i) use small data sets, and (ii) request basic analyses, I don't really see how old laptops couldn't run it (even with bad code, unless the bad code really doesn't run on the original machine either). I guess I'd need more information about what it is you're asking of the students.

  • rene

    check with the people running the interactive programming course at Coursera. They had a Python IDE running somewhere, accessible through a browser.

    • Phillip Johnson

      Udacity does the same thing. Maybe something with RServer where it accepts the commands from the built-in IDE?

  • ryanmcgreevy

    I'm in this class and curious to see how the peer reviewing goes. Also, I like the EC2 idea, and as someone mentioned you can get an education grant from them. I've used EC2 grant credits for a workshop to teach GPU computing. Another course on Coursera (parallel computing from UIUC) also built a sandboxed web interface environment with an EC2 backend.

    • http://twitter.com/epigrad EpiGrad

      It should be noted that said parallel computing class has suffered from a fair number of implementation problems.

  • JT

    Also take a look at this R sandboxing demo. It might suit your needs.


  • Q. Iqbal

    How about creating another user on the Linux or Mac systems with ordinary (non-admin) privileges and using that account for running the downloaded code snippets under R? That way, the most that malicious code can wipe out is the files in that user's directory structure. Later, just remove the user's account to wipe out any malicious stuff that the downloaded code might have stored.

  • http://profiles.google.com/mdshw5 Matt Shirley

    I've thought a bit about how to get an entire class on a sandboxed, standardized platform.

    EC2 is a great idea, but it requires a great deal of commitment to setting up a cluster environment where everyone can reliably log in and get their work done, not to mention the costs. Yes, Amazon does give out grants, but they typically want those $100 AWS credits to go to the students, not be used as a pool for the instructor to set up a compute cluster.

    I think the best option for standardizing everyone in a sandboxed platform is to distribute a virtualized machine image of your favorite pre-configured Linux distribution to everyone using something like Virtualbox. Even catering to the lowest hardware common-denominator, most can run a virtualized Linux environment on their laptop. Set the environment to use one CPU thread and 512MB of RAM and you will have enough headroom for most analyses while forcing those that choose to implement compute intensive solutions to work within realistic confines. This solves all three issues:

    1. Any malicious code will be wiped away when the user shuts down the VM.
    2. You can standardize the versions of R, R libraries, and any external dependencies before distributing the VM. You can also include some data sets in the VM for ease of access.
    3. If the VM crashes or hangs, you simply hit "reset" and start over.

  • David Hood

    The virtual machine instance is going to set the bar above the computing resources of some people in the paper (disclosure: I am a student in the paper, though not one challenged by computing resources). One other possible approach is something similar to that used by Andrew Ng in the Machine Learning paper (one of the first papers on Coursera). Using Octave, the procedure was: a) download the zipped directory with the programming assignment; b) write and test function x in script y, returning a value of type z; c) run the submission script in the assignment folder, which possibly draws down sample data and submits an answer somewhere, but mostly runs the function in a workflow with other validating scripts in the folder.

    As I see it, the scalability of the model is that the pulled code comes from a trusted source, the instructor, which validates the student code and provides a token of the validation (even if that token is just "type this number into the question 3 box for assignment 2").
    Possible issues: a) Internet filtering in particular countries (less of a problem if the validations are contained within the folder, as in emergencies the folder could be mirrored); b) the instructor-provided validation (and/or testing) scripts would need to be written to be thoroughly cross-platform.
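
The validation-token workflow described in this comment could look roughly like the following. This is a sketch in Python under stated assumptions, not the mechanism used by the Machine Learning course or by Coursera: the function names `run_validation` and `expected_token`, the assignment id string, and the 8-character SHA-256 token are all invented for illustration.

```python
import hashlib


def expected_token(test_cases, assignment_id):
    """Token the grader expects, derived from the assignment id and answers."""
    payload = assignment_id + "|" + "|".join(
        repr(expected) for _, expected in test_cases
    )
    # Keep the token short enough to type into a quiz box.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


def run_validation(student_fn, test_cases, assignment_id):
    """Run the student's function on the instructor's test cases.

    Returns the token if every case passes, otherwise None. The student's
    code only ever executes on the student's own machine.
    """
    for args, expected in test_cases:
        try:
            result = student_fn(*args)
        except Exception:
            return None  # a crash counts as a failed validation
        if result != expected:
            return None
    return expected_token(test_cases, assignment_id)


if __name__ == "__main__":
    # Hypothetical assignment: implement addition of two numbers.
    cases = [((2, 3), 5), ((0, 0), 0)]
    token = run_validation(lambda a, b: a + b, cases, "assignment2-q3")
    print("Token to enter:", token)
```

The design choice here is that only instructor-written code is downloaded and run, so peers never execute each other's scripts. The token is a convenience rather than a security measure: anyone who reads the validation script can compute it without solving the problem.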

  • Michael

    Why not have the users simply attach a PDF containing the printout of their source for their analysis? At that point, you can see the code and determine if there is any malicious intent. If the grader wishes, they can input the code and run it if they deem it to be safe.

    Also, Jeff, could you let Roger know that there are a couple of questions on the Coursera forums for his course concerning the grading? Some students want clarification about why they received 99% vs. 100%, as well as why the grading description on the certificate page only mentions 2 programming assignments, not 4. I'm sure a response from him would alleviate many of the present frustrations of those students. Thanks :)

  • http://twitter.com/PolSciReplicate PolSci Replication

    This is highly interesting. I've put in a proposal for a replication workshop at Cambridge for 8 weeks of replicating existing articles (for social scientists). I haven't heard back yet but I'm already looking around for examples of 'replication teaching' on how to do this - and what the pitfalls are.

    My cohort of R users consists of beginners who can only do OLS - that's our main challenge, to find 'simple' papers.

    I don't quite understand your point about security. What exactly is the problem with exchanging R code and data files? I might just not get it, please explain a bit. Is this an advanced user problem?

    Compatibility: Shouldn't R code work on every machine? Am I being naive? Again, this might be an advanced R user problem?

    I'd be curious to hear more about the experience!


    • Michael


      The security issue, according to my understanding, is the potential for students to embed malicious code inside their assignment. Of course, in this case, the presumption is students would be naively running the code without first inspecting it for anything suspicious.


      • Rob Searle

        As a student on the course (with approx 30 years programming experience in C, C++, Ada, Java and some security engineering experience)

        Inspecting for anything suspicious - Any R script for analysis has to install the packages and data set to my machine. I can make sure that the data URL looks right for the course but I am definitely not sufficiently experienced with R to identify an odd package/repository

        How far do you expect someone to go with code inspection - The base script, every function call, every function call inside every function call?
        I have to assume that someone willing to use this as a hack vector would be more capable in R than the expected student and willing to expend some effort to hide their activities - It is also clear from the discussion fora on the course that many of the students are learning R via google/Rstudio help/the course as they go.

        Not even sure that sandboxing would be sufficient in the general case because the threat model isn't clear. R is quite a capable language, I suspect that it can do a lot worse than delete a few files on my machine given LAN access (which it has to be allowed in order to reproduce the results).

  • Jeroen Ooms

    WRT security: have a look at OpenCPU and RAppArmor.

  • Jeroen Ooms

    WRT reproducibility between versions of R: have a look at http://arxiv.org/abs/1303.2140