Last week I talked about how we might be able to improve data analyses by moving towards “evidence-based” data analysis, using data analytic techniques that are proven to be useful based on statistical research. My feeling was that this approach attacks the most “upstream” aspect of data analysis, before problems have the chance to filter down into things like publications or, even worse, clinical decision-making.
In this third (and final!) post on this topic I wanted to describe a little how we could implement evidence-based data analytic pipelines. Depending on your favorite software system you could imagine a number of ways to do this. If the pipeline were implemented in R, you could imagine it as an R package. The precise platform is not critical at this point; I would imagine most complex pipelines would involve multiple different software systems tied together.
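To make the idea of a packaged pipeline a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `Pipeline` class, the step functions `drop_missing` and `summarize`, and the toy data are all hypothetical names, not part of any existing package. The point it tries to show is that the steps and their order are fixed when the pipeline is defined, so the analyst running it does not pick and choose among methods.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Tuple

# A step is a (name, function) pair; the function takes data and returns data.
Step = Tuple[str, Callable]


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical pipeline: a named, fixed sequence of analysis steps."""
    name: str
    steps: Tuple[Step, ...]

    def run(self, data):
        """Apply every step in order and record which steps were applied."""
        log = []
        for step_name, fn in self.steps:
            data = fn(data)
            log.append(step_name)
        return data, log


def drop_missing(values):
    # Toy preprocessing step: discard missing observations.
    return [v for v in values if v is not None]


def summarize(values):
    # Toy analysis step: report the sample size and the mean.
    return {"n": len(values), "mean": mean(values)}


# The pipeline is defined once; users run it as-is.
toy_pipeline = Pipeline(
    name="toy-mean",
    steps=(("drop_missing", drop_missing), ("summarize", summarize)),
)

result, log = toy_pipeline.run([1.0, None, 3.0])
print(result)
print(log)
```

Making the class frozen and storing the steps in a tuple is a small design choice meant to discourage swapping steps in and out after the fact; a real pipeline would presumably also carry versioning and documentation of the evidence behind each step.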
Below is a rough diagram of how I think the various pieces of an evidence-based data analysis pipeline would fit together.
There are a few key elements of this diagram that I’d like to stress:
Clearly, one pipeline is not enough. We need many of them for different problems. So what do we do with all of them?
I think we could organize them in a central location (kind of a specialized GitHub) where people could search for, download, create, and contribute to data analysis pipelines. An analogy (but not exactly a model) is the Cochrane Collaboration, which serves as a repository for evidence-based medicine. There are already a number of initiatives along these lines, such as the Galaxy Project for bioinformatics. I don’t know whether it’d be ideal to have everything in one place or to have a number of sub-projects for specialized areas.
Each pipeline would have a leader (or “friendly dictator”) who would organize the contributions and determine which components would go where. This could obviously be contentious, more so in some areas than in others, but I don’t think any more contentious than your average open source project (check the archives of the Linux kernel or Git mailing lists and you’ll see what I mean).
So, to summarize, I think we need to organize lots of evidence-based data analysis pipelines and make them widely available. If I were writing this 5 or 6 years ago, I’d be complaining about a lack of infrastructure out there to support this. But nowadays, I think we have pretty much everything we need in terms of infrastructure. So what are we waiting for?