Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Learning about Machine Learning with an Earthquake Example

Editor’s note: This is the fourth chapter of a book I’m working on called Demystifying Artificial Intelligence. I’ve also added a co-author, Divya Narayanan, a master’s student here at Johns Hopkins! The goal of the book is to demystify what modern AI is and does for a general audience, something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. We are developing the book over time, so if you buy the book on Leanpub, know that there are only four chapters in there so far; I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this amazing tweet by Twitter user @notajf. Feedback is welcome and encouraged!

“A learning machine is any device whose actions are influenced by past experience.” - Nils John Nilsson

Machine learning describes exactly what you would think: a machine that learns. As we described in the previous chapter, a machine “learns” much the way humans do, from previous examples. Just as certain experiences give us an understanding of a particular concept, machines can be trained on similar experiences, or at least to mimic them. With very routine tasks, our brains become attuned to the characteristics that define different objects or activities.

Before we can dive into the algorithms most commonly used for artificial intelligence, like neural networks, let’s consider a real example to understand how machine learning works in practice.

The Big One

Earthquakes occur when the surface of the Earth shakes due to displacement of the ground, and they readily occur along fault lines where there have already been massive displacements of rock or ground (Wikipedia 2017a). For people living in places like California, where earthquakes occur relatively frequently, preparedness and safety are major concerns. One famous fault in southern California, the San Andreas Fault, is expected to produce the next big earthquake in the foreseeable future, often referred to as the “Big One”. Naturally, some residents are concerned and would like to know more so they are better prepared.

The following data are pulled from FiveThirtyEight, a data journalism site that covers politics and sports, and describe how worried people are about the “Big One” (Hickey 2015). Here’s an example of the first few observations in this dataset:

     worry_general      worry_bigone       will_occur
1004 Somewhat worried   Somewhat worried   TRUE
1005 Not at all worried Not at all worried FALSE
1006 Not so worried     Not so worried     FALSE
1007 Not at all worried Not at all worried FALSE
1008 Not at all worried Not at all worried FALSE
1009 Not at all worried Not at all worried FALSE
1010 Not so worried     Somewhat worried   FALSE
1011 Not so worried     Extremely worried  FALSE
1012 Not at all worried Not so worried     FALSE
1013 Somewhat worried   Not so worried     FALSE

Just by looking at this subset of the data, we can already get a feel for how many different ways it could be structured. Here, we see that there are 10 observations, which represent 10 individuals. For each individual, we have information on 11 different aspects of earthquake preparedness and experience (only 3 of which are shown here). Data can be stored as text, logical responses (true/false), or numbers. Sometimes, quite often in fact, data are missing, as with observation 1013.
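For readers who want to follow along, here is a minimal sketch in R of how a dataset like this might be loaded and inspected. The file name, and the assumption that the columns carry the short names shown above, are ours for illustration; FiveThirtyEight’s raw survey file may use different names and a different path.

    # Load the survey data (the file name below is an assumption; FiveThirtyEight
    # publishes its raw survey responses in a public data repository)
    quake <- read.csv("earthquake_data.csv", stringsAsFactors = FALSE)

    # Peek at a few columns of the first rows, and at the overall structure
    head(quake[, c("worry_general", "worry_bigone", "will_occur")])
    str(quake)

    # Count how many values are missing in each column
    colSums(is.na(quake))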

So what can we do with these data? For example, we could predict, or classify, whether or not someone has taken any precautions for an upcoming earthquake, like bolting their shelves to the wall or coming up with an evacuation plan. Using this idea, we have now found a question that we’re interested in analyzing: are you prepared for an earthquake or not? Based on this question and the data that we have, you can either be prepared (seen above as “true”) or not (seen above as “false”).

Our question: How well can we predict whether or not someone is prepared for an earthquake?

An Algorithm – what’s that?

With our question in hand, we want to design a way for our machine to determine whether someone is prepared for an earthquake or not. To do this, the machine goes through a flowchart-like set of instructions. At each fork in the flowchart, the different answers take the machine down different paths toward the final answer. If the machine goes through the correct series of questions and answers, it can correctly identify a person as being prepared. Here’s a small portion of the final flowchart for the San Andreas data, which we will proceed to dissect (note: the ellipses on the right-hand side of the flowchart indicate where the remainder of the algorithm lies; this will be expanded later in the chapter):

The steps that we take through the flowchart, or tree, make up the classification algorithm. An algorithm is essentially a set of step-by-step instructions that we follow to organize our data, or in this case, to make a prediction about it. Here, our goal is to classify an individual as prepared or not by working our way through the different branches of the tree. So how did we establish this particular set of questions as our framework for identifying a prepared individual?

CART, or classification and regression trees, is one way to assess which of these characteristics are most important for splitting the data into prepared and unprepared individuals (Wikipedia 2017b; Breiman et al. 1984). There are multiple ways of implementing this method, oftentimes with the earlier branches making larger splits in the data and later branches making smaller splits.
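As a concrete illustration, here is a minimal sketch of how such a tree could be fit in R with the rpart package, which implements CART-style recursive partitioning. The data frame quake and the prepared column are assumptions carried over from the sketch above, not our actual analysis code.

    library(rpart)

    # Fit a classification tree (CART) predicting preparedness from all other
    # features; `quake` and `prepared` are the hypothetical names used in this chapter
    tree_fit <- rpart(prepared ~ ., data = quake, method = "class")

    # Print the splits: each line shows the feature and parameter used at a node
    print(tree_fit)

    # Draw the flowchart-like tree (requires the rpart.plot package)
    # install.packages("rpart.plot")
    rpart.plot::rpart.plot(tree_fit)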

Within an algorithm, there exists another level of organization composed of features and parameters.

In order to tell if someone is prepared for an earthquake or not, there have to be certain characteristics, known as features, that separate those who are prepared from those who are not. Features are basically the things you measured in your dataset, chosen to give you insight into an individual and how best to classify them into groups. Looking at our sample data, some of the possible features are whether or not an individual is worried about earthquakes in general, prior experience with earthquakes, or their gender. As we will soon see, certain features carry more weight in separating individuals into the two groups (prepared vs. unprepared).

If we were looking at how important previously experiencing an earthquake is in classifying someone as prepared, we’d say it plays a pretty big role, since it’s one of the first features we encounter in our flowchart. The feature that makes the biggest split in our data is region, as it appears as the first feature in the algorithm shown above. We would expect people in the Mountain and Pacific regions to have more experience and knowledge about earthquakes, as that part of the country is more prone to them. Someone’s age, however, may not be as important in classifying a prepared individual. Since it doesn’t show up near the top of our flowchart, it wasn’t as crucial for deciding whether a person is prepared, and it didn’t separate the data much.

The second form of organization within an algorithm is the set of questions and cutoffs for moving one direction or another at each node. These are known as the parameters of our algorithm. The parameters tell us how the features we have established define the observation we are trying to identify. Consider the feature region. As we mentioned earlier, we would expect that those in the Mountain and Pacific regions have more experience with earthquakes, which may be reflected in their level of preparedness. Looking back at our abbreviated classification tree, the first node has the parameter “Mountain or Pacific” for the feature region, which splits into “yes” (those living in one of these regions) and “no” (those living elsewhere in the US).
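To make the distinction concrete, the rule at that first node can be written out by hand as a tiny R function; the column name region and the branch labels here are purely illustrative.

    # One decision-tree node written out explicitly:
    # the feature is `region`, the parameter is the set {"Mountain", "Pacific"}
    first_split <- function(region) {
      ifelse(region %in% c("Mountain", "Pacific"),
             "yes branch",   # lives in an earthquake-prone region
             "no branch")    # lives elsewhere in the US
    }

    first_split(c("Pacific", "New England"))
    #> "yes branch" "no branch"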

If we were looking at a continuous variable, say the number of years living in a region, we might set a threshold instead, say greater than 5 years, as opposed to a yes/no distinction. In supervised learning, where we are teaching the machine to identify a prepared individual, we provide the machine with multiple observations of prepared individuals, with different parameter values for the features, to show the ways a person could be prepared. To illustrate this point, let us consider the 10 sample observations below, specifically examining the outcome, preparedness, with respect to the features will_occur, female, and hhold_income.

     prepared will_occur female hhold_income
1004 TRUE     TRUE       FALSE  $50,000 to $74,999
1005 FALSE    FALSE      TRUE   $10,000 to $24,999
1006 TRUE     FALSE      TRUE   $200,000 and up
1007 FALSE    FALSE      FALSE  $75,000 to $99,999
1008 FALSE    FALSE      TRUE   Prefer not to answer
1009 FALSE    FALSE      FALSE  Prefer not to answer
1010 TRUE     FALSE      TRUE   $50,000 to $74,999
1011 FALSE    FALSE      TRUE   Prefer not to answer
1012 FALSE    FALSE      TRUE   $50,000 to $74,999
1013 FALSE    FALSE      NA     NA

Of these ten observations, 7 are not prepared for the next earthquake and 3 are. To make this information more useful, we can look at some of the features to see if there are any similarities the machine can use as a classifier. For example, of the 3 individuals who are prepared, two are female and one is male. But notice we see a similar split among those who are not prepared: 4 females and 2 males, the same 2:1 ratio. From such a small sample, the algorithm may not be able to tell how important gender is in classifying preparedness. But by looking through the remaining features and a larger sample, it can start to classify individuals. This is what we mean when we say a machine learning algorithm learns.
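This kind of counting is just a cross-tabulation, which takes a line of R (again assuming the hypothetical quake data frame with the prepared and female columns shown above).

    # Counts of prepared vs. unprepared, split by gender
    with(quake, table(female, prepared, useNA = "ifany"))

    # The same counts expressed as proportions within each preparedness group
    with(quake, prop.table(table(female, prepared), margin = 2))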

Now, let us take a closer look at observations 1005, 1011, and 1012, and more specifically the household income feature:

     prepared will_occur female hhold_income
1005 FALSE    FALSE      TRUE   $10,000 to $24,999
1011 FALSE    FALSE      TRUE   Prefer not to answer
1012 FALSE    FALSE      TRUE   $50,000 to $74,999

All three of these observations are female and believe that the “Big One” won’t occur in their lifetime. But despite the fact that they are all unprepared, they each report a different household income. Based on just these three observations, we might conclude that household income is not as important in determining preparedness. By showing a machine different examples of which features a prepared (or, as in this case, unprepared) individual has, it can start to recognize patterns and identify the features, or combinations of features, and parameters that are most indicative of preparedness.

In summary, every flowchart will have the following components:

  1. The algorithm - The general workflow or logic that dictates the path the machine travels, based on chosen features and parameter values. In turn, the machine classifies or predicts which group an observation belongs to

  2. Features - The variables or types of information we have about each observation
  3. Parameters - The possible values a particular feature can have

Even with the experience of seeing numerous observations with various feature values, there is no way to show our machine information on every single person that exists in the world. What will it do when it sees a brand new observation that is not identified as prepared or unprepared? Is there a way to improve how well our algorithm performs?

Training and Testing Data

You may have heard of the terms sample and population. In case these terms are unfamiliar, think of the population as the entire group of people we want to get information from, study, and describe. This would be like getting a piece of information, say income, from every single person in the world. Wouldn’t that be a fun exercise!

If we had the resources to do this, we could then take all those incomes and find out the average income of an individual in the world. But since this is not possible, it might be easier to get that information from a smaller number of people, or sample, and use the average income of that smaller pool of people to represent the average income of the world’s population. We could only say that the average income of the sample is representative of the population if the sample of people we picked has the same characteristics as the population.

For example, suppose our population of interest is a company with 1,000 employees, where 200 of those employees make $60,000 each and 800 make $30,000 each. Since we have this information on everyone, we can easily calculate the average income of an employee in the company, which is $36,000. Now, say we randomly picked a group of 100 individuals from the company as our sample. If all 100 of those individuals came from the group of employees making $60,000, we would think that the average income of an employee was much higher than the true average for the population (the whole company). The opposite would be true if all 100 of those employees came from the group making less money: we would mistakenly think the average income of employees is lower. In order for our sample to accurately reflect the distribution of income among company employees, or rather to have a representative sample, we would try to pick 20 individuals from the higher income group and 80 individuals from the lower income group.
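The arithmetic in this example is easy to verify, and the stratified-sample idea takes only a few lines of R; the incomes vector below is simply made up to match the hypothetical company.

    # Population: 200 employees at $60,000 and 800 employees at $30,000
    incomes <- c(rep(60000, 200), rep(30000, 800))
    mean(incomes)                      # population average: $36,000

    # Representative (stratified) sample: 20 high earners, 80 low earners
    strat <- c(sample(incomes[incomes == 60000], 20),
               sample(incomes[incomes == 30000], 80))
    mean(strat)                        # also $36,000

    # A lopsided sample drawn only from the high-income group
    mean(sample(incomes[incomes == 60000], 100))   # $60,000, far from the truth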

Now heading back to our earthquake example, our big picture goal is to be able to feed our algorithm a brand new observation of someone who answered information about themselves and earthquake preparedness, and have the machine be able to correctly identify whether or not they are prepared for a future earthquake.

One definition of the population could be all individuals in the world. However, since we can’t get access to data on all these individuals, we decide to sample 1013 respondents and ask them earthquake-related questions. Remember that in order for our machine to accurately identify an individual as prepared or unprepared, it needs to have seen some observations whose features identify the individual as prepared, as well as some whose features do not. This seems a little counterintuitive, though. If we want our algorithm to identify a prepared individual, why wouldn’t we show it only the observations that are prepared?

By showing our machine observations of respondents that are not prepared, it can better strengthen its idea of what features identify a prepared individual. But we also want to make our algorithm as robust as possible. For an algorithm to be robust, it should be able to take in a wide range of values for each feature, and appropriately go through the algorithm to make a classification. If we only show our machine a narrow set of experiences, say people who have an income of between $10,000 and $25,000, it will be harder for the algorithm to correctly classify an individual who has an income of $50,000.

One way we can give our machine this set of experiences is to take all 1013 observations and randomly split them up into two groups. Note: for simplification, any observations that had missing data (total: 7) for the outcome variable were removed from the original dataset, leaving 1006 observations for our analysis.

  1. Training data - This serves as the wide range of experiences that we want our machine to see to have a better understanding of preparedness

  2. Testing data - This data will allow us to evaluate our algorithm and see how well it was able to pick up on features and parameter values that are specific to prepared individuals and correctly label them as such

So what’s the point of splitting up our data into training and testing sets? We could have easily fed all the data we have into the algorithm and had it detect the most important features and parameters based on the provided observations. But there’s an issue with that, known as overfitting. When an algorithm has overfit the data, it has been fit specifically to the data at hand, and only that data. The algorithm would then struggle with data that does not fit within the bounds of our training data (even though it would perform very well on that sample set), and it would only accurately classify a very narrow set of observations. This nicely illustrates the concept we introduced earlier, robustness, and demonstrates the importance of exposing our algorithm to a wide range of experiences. We want our algorithm to be useful, which means it needs to be able to take in all kinds of data with different distributions and still be able to accurately classify them.

To create training and testing sets, we can adopt the following idea (sketched in code after the list):

  1. Split the 1006 observations in half: roughly 500 for training, and the remainder for testing
  2. Feed the 500 training observations through the algorithm for the machine to understand what features best classify individuals as prepared or unprepared
  3. Once the machine is trained, feed the remaining test observations through the algorithm and see how well it classifies them
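Here is a minimal sketch of those three steps in R, continuing with the hypothetical quake data frame and an rpart tree as in the earlier sketch.

    set.seed(42)   # make the random split reproducible

    # 1. Randomly split the 1006 complete observations roughly in half
    n         <- nrow(quake)
    train_ids <- sample(n, size = floor(n / 2))
    training  <- quake[train_ids, ]
    testing   <- quake[-train_ids, ]

    # 2. Train the classification tree on the training half
    library(rpart)
    tree_fit <- rpart(prepared ~ ., data = training, method = "class")

    # 3. Classify the held-out test observations
    test_pred <- predict(tree_fit, newdata = testing, type = "class")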

Algorithm Accuracy

Now that we’ve built up our algorithm and split our data into training and test sets, let’s take a look at the full classification algorithm:

Recall the question we set out to answer with respect to the earthquake data: How well can we predict whether or not someone is prepared for an earthquake? In a binary (yes/no) case like this, we can set up our results in a 2x2 table, where the rows represent predicted preparedness (based on the features of our algorithm) and the columns represent true preparedness (what their true label is).

This simple 2x2 table carries quite a bit of information. Essentially, we can grade our machine on how well it learned to tell whether individuals are prepared or unprepared. We can see how well our algorithm did at classifying new observations by calculating the predictive accuracy: sum the concordant cells, A and D, and divide by the total number of observations, or more simply, (A + D) / N. Through this calculation, we see that the algorithm from our example correctly classified individuals as prepared or unprepared 77.9% of the time. Not bad!

When we feed the roughly 500 test observations through the algorithm, it is the first time the machine has seen those observations. As a result, there is a chance that, despite going through the algorithm, the machine misclassifies someone as prepared when in fact they were unprepared. To determine how often this happens, we can calculate the test error rate from the 2x2 table above: take the total number of observations that are discordant, or dissimilar between true and predicted status, and divide by the total number of observations assessed. Based on the table above, the test error rate is (B + C) / N, or 22.1%.
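Continuing the sketch from above, the 2x2 table and both rates can be computed directly from the held-out predictions; the 77.9% and 22.1% figures come from the full analysis in this chapter, so your own numbers will depend on the data and the random split.

    # 2x2 table: rows are predicted preparedness, columns are true preparedness
    conf_mat <- table(predicted = test_pred, truth = testing$prepared)
    conf_mat

    # Predictive accuracy: concordant (diagonal) cells over all test observations
    accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

    # Test error rate: discordant (off-diagonal) cells over all test observations
    error_rate <- 1 - accuracy

    c(accuracy = accuracy, error = error_rate)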

There are a number of reasons that a test error rate would be high. Depending on the data set, there might be different methods that are better for developing the algorithm. Additionally, despite randomly splitting our data into training and testing sets, there may be some inherent differences between the two (think back to the employee income example above), making it harder for the algorithm to correctly label an observation.

References

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks.

Hickey, Walt. 2015. “The Rock Isn’t Alone: Lots of People Are Worried About ‘the Big One’.” FiveThirtyEight. https://fivethirtyeight.com/datalab/the-rock-isnt-alone-lots-of-people-are-worried-about-the-big-one/.

Wikipedia. 2017a. “Earthquake — Wikipedia, the Free Encyclopedia.” http://en.wikipedia.org/w/index.php?title=Earthquake&oldid=762614740.

———. 2017b. “Predictive Analytics — Wikipedia, the Free Encyclopedia.” http://en.wikipedia.org/w/index.php?title=Predictive%20analytics&oldid=764577274.

My Podcasting Setup

I’ve gotten a number of inquiries over the last 2 years about my podcasting setup and I’ve been meaning to write about it but….

But here it is! I wanted to write this because I felt there wasn’t a ton of good information about this on the Internet for more “casual” podcasters, as opposed to people who want to do it professionally. So here’s what I’ve got.

Roughly speaking, there are two types of podcasts: the kind you record with everyone in the same room and the kind you record with everyone in different rooms. They require slightly different setups, so I’ll talk about both. For me, Elizabeth Matsui and I record The Effort Report locally because we’re both at Johns Hopkins. But Hilary Parker and I record Not So Standard Deviations remotely because she’s on the other side of the country most of the time.

Recording Equipment

When Hilary and I first started we just used the microphone attached to the headphones you get with your iPhone or whatever. That’s okay but the sound feels very “narrow” to me. That said, it’s a good way to get started and it likely costs you nothing.

The next level up for many people is the Blue Yeti USB Microphone, which is a perfectly fine microphone and not too expensive. Also, it uses USB (as opposed to the more professional XLR), so it connects to any computer, which is nice. However, it typically retails for $120, which isn’t nothing, and there are probably cheaper microphones that are just as good. For example, Jason Snell recommends the Audio-Technica ATR2100, which is only about $70.

If you’re willing to shell out a little more money, I’d highly recommend the Zoom H4n portable recorder. This is actually two things: a microphone and a recorder. It has a nice stereo microphone built into the top along with two XLR inputs on the bottom that allow you to record from external mics. It records to SD cards, so it’s great for a portable setup where you don’t want to carry a computer around with you. It retails for about $200, so it’s not cheap, but in my opinion it is worth every penny. I’ve been using my H4n for years now.

Because we do a lot of recording for our online courses here, we’ve actually got a bit more equipment in the office. So for in-person podcasts I sometimes record using a Sennheiser MKH416-P48US attached to an Auray MS-5230T microphone stand, which is decidedly not cheap but is a great piece of hardware.

By the way, a microphone stand is great to have, if you can get one, so you don’t have to set the microphone on your desk or table. That way if you bump the table by accident or generally like to bang the table, it won’t get picked up on the microphone. It’s not something to get right away, but maybe later when you make the big time.

Recording Software

If you’re recording by yourself, you can just hook up your microphone to your computer and record to any old software that records sound (on the Mac you can use Quicktime). If you have multiple people, you can either

  1. Speak into the same mic and have both your voices recorded on the same audio file
  2. Use separate mics (and separate computers) and record separately onto separate audio files. This requires syncing the audio files in an editor, but that’s not too big a deal if you only have 2-3 people.

For local podcasts, I actually just use the H4n and record directly to the SD card. This creates separate WAV files for each microphone that are already synced so you can just plop them in the editor.

For remote podcasts, you’ll need some communication software. Hilary and I use Zencastr which has its own VoIP system that allows you to talk to anyone by just sending your guests a link. So I create a session in Zencastr, send Hilary the link for the session, she logs in (without needing any credentials) and we just start talking. The web site records the audio directly off of your microphone and then uploads the audio files (one for each guest) to Dropbox. The service is really nice and there are now a few just like it. Zencastr costs $20 a month right now but there is a limited free tier.

The other approach is to use something like Skype and then use something like Ecamm Call Recorder to record the conversation. The downside with this approach is that if you have any network trouble that messes up the audio, then you will also record that. However, Zencastr (and related services) does not work on iOS devices and other devices that use WebKit-based browsers. So if you have someone calling in on a mobile device via Skype or something, then you’ll have to use this approach. Otherwise, I prefer the Zencastr approach and can’t really see any downside except for the cost.

Editing Software

There isn’t a lot of software that’s specifically designed for editing podcasts. I actually started off editing podcasts in Final Cut Pro X (nonlinear video editor) because that’s what I was familiar with. But now I use Logic Pro X, which is not really designed for podcasts, but it’s a real digital audio workstation and has nice features (like strip silence). But I think something like Audacity would be fine for basic editing.

The main thing I need to do with editing is merge the different audio tracks together and cut off any extraneous material at the beginning or the end. I don’t usually do a lot of editing in the middle unless there’s a major mishap like a siren goes by or a cat jumps on the computer. Once the editing is done I bounce to an AAC or MP3 file for uploading.

Hosting

You’ll need a service for hosting your audio files if you don’t have your own server. You can technically host your audio files anywhere, but specific services have niceties like auto-generating the RSS feed. For Not So Standard Deviations I use SoundCloud and for The Effort Report I use Libsyn.

Of the two services, I think I prefer Libsyn, because it’s specifically designed for podcasting and has somewhat better analytics. The web site feels a little bit like it was designed in 2003, but otherwise it works great. Libsyn also has features for things like advertising and subscriptions, but I don’t use any of those. SoundCloud is fine but wasn’t really designed for podcasting and sometimes feels a little unnatural.

Summary

If you’re interested in getting started in podcasting, here’s my bottom line:

  1. Get a partner. It’s more fun that way!
  2. If you and your partner are remote, use Zencastr or something similar.
  3. Splurge for the Zoom H4n if you can, otherwise get a reasonably cheap microphone like the Audio-Technica or the Yeti.
  4. Don’t focus too much on editing. Just clip off the beginning and the end.
  5. Host on Libsyn.

Data Scientists Clashing at Hedge Funds

There’s an interesting article over at Bloomberg about how data scientists have struggled at some hedge funds:

The firms have been loading up on data scientists and coders to deliver on the promise of quantitative investing and lift their ho-hum returns. But they are discovering that the marriage of old-school managers and data-driven quants can be rocky. Managers who have relied on gut calls resist ceding control to scientists and their trading signals. And quants, emboldened by the success of computer-driven funds like Renaissance Technologies, bristle at their second-class status and vie for a bigger voice in investing.

There are some interesting tidbits in the article that I think hold lessons for any collaboration between a data scientist or analyst and a non-data scientist (for lack of a better word).

At Point72, the family office successor to SAC Capital, there were problems at the quant unit (known as Aperio):

The divide between Aperio quants and fundamental money managers was also intellectual. They struggled to communicate about the basics, like how big data could inform investment decisions. [Michael] Recce’s team, which was stacked with data scientists and coders, developed trading signals but didn’t always fully explain the margin of error in the analysis to make them useful to fund managers, the people said.

It’s hard to know the details of what actually happened, but for data scientists collaborating with others, there always needs to be an explanation of “what’s going on”. There’s a general feeling that it’s okay that machine learning techniques build complicated uninterpretable models because they work better. But in my experience that’s not enough. People want to know why they work better, when they work better, and when they don’t work.

On over-theorizing:

Haynes, who joined Stamford, Connecticut-based Point72 in early 2014 after about two decades at McKinsey & Co., and other senior managers grew dissatisfied with Aperio’s progress and impact on returns, the people said. When the group obtained new data sets, it spent too much time developing theories about how to process them rather than quickly producing actionable results.

I don’t necessarily agree with this “criticism”, but I only put it here because the land of hedge funds isn’t generally viewed on the outside as a place where lots of theorizing goes on.

At BlueMountain, another hedge fund:

When quants showed their risk analysis and trading signals to fundamental managers, they sometimes were rejected as nothing new, the people said. Quants at times wondered if managers simply didn’t want to give them credit for their ideas.

I’ve seen this quite a bit. When a data scientist presents results to collaborators, there are often two responses:

  1. “I knew that already” and so you haven’t taught me anything new
  2. “I didn’t know that already” and so you must be wrong

The common link here, of course, is the inability to admit that there are things you don’t know. Whether this is an inherent character flaw or something that can be overcome through teaching is not yet clear to me. But it is common when data is brought to bear on a problem that previously lacked data. One of the key tasks that a data scientist in any industry must prepare for is the task of giving people information that will make them uncomfortable.

Not So Standard Deviations Episode 32 - You Have to Reinvent the Wheel a Few Times

Hilary and I discuss training in PhD programs, estimating the variance vs. the standard deviation, the bias variance tradeoff, and explainable machine learning.

We’re also introducing a new level of support on our Patreon page, where you can get access to some of the outtakes from our episodes. Check out our Patreon page for details.


Reproducible Research Needs Some Limiting Principles

Over the past 10 years of thinking and writing about reproducible research, I’ve come to the conclusion that much of the discussion is incomplete. While I think we as a scientific community have come a long way in changing people’s thinking about data and code and making them available to others, there are some key sticking points that keep coming up and that are preventing further progress in the area.

When I used to write about reproducibility, I felt that the primary challenge/roadblock was a lack of tooling. Much has changed in just the last five years, though, and many new tools have been developed to make life a lot easier. Tools like knitr (for R), Markdown, and IPython notebooks have made writing reproducible data analysis documents a lot easier. Web sites like GitHub and many others have made distributing analyses a lot simpler because now everyone effectively has a free web site (this was NOT true in 2005).

Even still, our basic definition of reproducibility is incomplete. Most people would say that a data analysis is reproducible if the analytic data and metadata are available and the code that did the analysis is available. Furthermore, it would be preferable to have some documentation to go along with both. But there are some key issues that need to be resolved to complete this general definition.

Reproducible for Whom?

In discussions about reproducibility with others, the topic of who should be able to reproduce the analysis only occasionally comes up. There’s a general sense, especially amongst academics, that anyone should be able to reproduce any analysis if they wanted to.

There is an analogy with free software here in the sense that free software can be free for some people and not for others. This made more sense in the days before the Internet when distribution was much more costly. The idea here was that I could write software for a client and give them the source code for that software (as they would surely demand). The software is free for them but not for anyone else. But free software ultimately only matters when it comes to distribution. Once I distribute a piece of software, that’s when all the restrictions come into play. However, if I only distribute it to a few people, I only need to guarantee that those few people have those freedoms.

Richard Stallman once said that something like 90% of software was free software because almost all software being written was custom software for individual clients (I have no idea where he got this number). Even if the number is wrong, the point still stands that if I write software for a single person, it can be free for that person even if no one in the world has access to the software.

Of course, now with the Internet, everything pretty much gets distributed to everyone because there’s nothing stopping someone from taking a piece of free software and posting it on a web site. But the idea still holds: Free software only needs to be free for the people who receive it.

That said, the analogy is not perfect. Software and research are not the same thing. The key difference is that you can’t call something research unless it is generally available and disseminated. If Pfizer comes up with the cure for cancer and never tells anyone about it, it’s not research. If I discover that there’s a 9th planet and only tell my neighbor about it, it’s not research. Many companies might call those activities research (particularly from a tax/accounting point of view), but since society doesn’t get to learn about them, it’s not research.

If research is by definition disseminated to all, then it should therefore be reproducible by all. However, there are at least two circumstances in which we do not even pretend to believe this is possible.

  1. Imbalance of resources: If I conduct a data analysis that requires the world’s largest supercomputer, I can make all the code and data available that I want–few people will be able to actually reproduce it. That’s an extreme case, but even if I were to make use of a dramatically smaller computing cluster it’s unlikely that anyone would be able to recreate those resources. So I can distribute something that’s reproducible in theory but not in reality by most people.
  2. Protected data: Numerous analyses in the biomedical sciences make use of protected health information that cannot easily be disseminated. Privacy is an important issue, in part, because in many cases it allows us to collect the data in the first place. However, most would agree we cannot simply post that data for all to see in the name of reproducibility. First, it is against the law, and second it would likely deter anyone from agreeing to participate in any study in the future.

We can pretend that we can make data analyses reproducible for all, but in reality it’s not possible. So perhaps it would make sense for us to consider whether a limiting principle should be applied. The danger of not considering it is that one may take things to the extreme—if it can’t be made reproducible for all, then why bother trying? A partial solution is needed here.

For How Long?

Another question that needs to be resolved for reproducibility to be a widely implemented and sustainable phenomenon is for how long should something be reproducible? Ultimately, this is a question about time and resources, because ensuring that data and code can be made available and can run on current platforms in perpetuity requires substantial time and money. In the academic community, where projects are often funded off of grants or contracts with finite lifespans, the money is often long gone even though the data and code must be maintained. The question then is who pays for the maintenance and upkeep of the data and code?

I’ve never heard a satisfactory answer to this question. If the answer is that data analyses should be reproducible forever, then we need to consider a different funding model. This position would require a perpetual funds model, essentially an endowment, for each project that is disseminated and claims to be reproducible. The endowment would pay for things like servers for hosting the code and data and perhaps engineers to adapt and adjust the code as the surrounding environment changes. While there are a number of repositories that have developed scalable operating models, it’s not clear to me that the funding model is completely sustainable.

If we look at how scientific publications are sustained, we see that it’s largely private enterprise that shoulders the burden. Journals house most of the publications out there and charge a fee for access (some for profit, some not for profit). Whether the reader pays or the author pays is not the relevant point; the point is that a decision has been made about who pays.

The author-pays model is interesting, though. Here, an author pays a publication charge of ~$2,000, and the reader never pays anything for access (in perpetuity, presumably). The $2,000 payment by the author is like a one-time capital expense for maintaining that one publication forever (a mini-endowment, in a sense). It works for authors because grant/contract-supported research often budgets for one-time publication charges. There’s no need for continued payments after a grant/contract has expired.

The publication system is quite a bit simpler because almost all publications are the same size and require the same resources for access—basically a web site that can serve up PDF files and people to maintain it. For data analyses, one could see things potentially getting out of control. For a large analysis with terabytes of data, what would the one-time up-front fee be to house the data and pay for anyone to access it for free forever?

Using Amazon’s monthly cost estimator, we can get a rough sense of what the pure data storage might cost. Suppose we have a 10GB dataset that we want to store and we anticipate that it might be downloaded 10 times per month. This would cost about $7.65 per month, or $91.80 per year. If we assume Amazon raises its prices about 3% per year and use a discount rate of 5%, then treating the annual cost as a growing perpetuity (annual cost divided by the difference between the discount and growth rates) gives a total storage cost of $4,590. If we tack on 20% for other costs, that brings us to $5,508. This is perhaps not unreasonable, and the scenario would certainly cover most people. For comparison, a 1 TB dataset downloaded once a year, using the same formula, gives a one-time cost of about $40,000. This is real money when it comes to fixed research budgets and would likely require some discussion of trade-offs.
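For anyone who wants to rerun this back-of-the-envelope calculation, here is the growing-perpetuity arithmetic as a small R function; the $7.65/month input comes from the estimator scenario above, and the growth, discount, and overhead parameters are of course assumptions.

    # Present value of a growing perpetuity, plus a 20% overhead allowance:
    # PV = (annual cost) / (discount rate - growth rate)
    pv_storage <- function(monthly_cost, growth = 0.03, discount = 0.05, overhead = 0.20) {
      annual <- monthly_cost * 12
      pv     <- annual / (discount - growth)
      pv * (1 + overhead)
    }

    pv_storage(7.65)   # roughly $5,508 for the 10 GB scenario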

Summary

Reproducibility is a necessity in science, but it’s high time that we start considering the practical implications of actually doing the job. There are still holdouts when it comes to the basic idea of reproducibility, but they are fewer and farther between. If we do not seriously consider the details of how to implement reproducibility, perhaps by introducing some limiting principles, we may never achieve any sort of widespread adoption.