Statisticians and computer scientists - if there is no code, there is no paper

I think it has been beaten to death that the incentives in academia lean heavily toward producing papers and less toward producing and maintaining software. There are people who are way, way more knowledgeable than I am about building and maintaining software. For example, Titus Brown hit a lot of the key issues in his interview. The open source community is also filled with advocates and researchers who know way more about this than I do.

This post is more about my views on changing how code and software are regarded in the data analysis community. I have often been frustrated with statisticians and computer scientists who write papers in which they develop new methods and seem to demonstrate that those methods blow away all their competitors. But then no software is available to actually test whether that is true. Even worse, sometimes I just want to use their method to solve a problem in our pipeline, but I have to code it from scratch!

I have also had several cases where I emailed the authors for their software and they said it "wasn't fit for distribution", or they "don't have code", or the "code can only be run on our machines". I totally understand the first and last; my code isn't always pretty (I have zero formal training in computer science, so messy code is actually the most likely scenario), but I always say, "I'll take whatever you've got and I'm willing to hack it out to make it work." I am often still turned down.

So I have a new policy when evaluating the CVs of candidates for jobs, or when I'm reading a paper as a referee. If the paper is about a new statistical method or machine learning algorithm and there is no software available for that method - I simply mentally cross it off the CV. If I'm reading a data analysis and there isn't code that reproduces the analysis - I mentally cross it off. In my mind, new methods/analyses without software are just vaporware. Now, you'd definitely have to cross a few papers off my CV based on this principle. I do that. But I'm trying really hard going forward to make sure nothing gets crossed off.

In a future post I'll talk about the new issue I'm struggling with - maintaining all that software I'm creating.

 

  • anthony damico

    amen

  • Pingback: Show Me Code or It Didn’t Happen | Planet3.0

  • Shane Tolmie

    I heartily agree.

  • Alex

    While I absolutely agree that code needs to be published far more often than it is, I think it's a bit extreme to ignore all research that has no code attached. Computer Science is part engineering, part science, and largely math. It's perfectly fine to contribute new math along with proofs without any code.

    Though I can see that in the area of biostatistics, code is likely far more relevant than in some other areas (like logics or pure statistics, or mathematical analysis of algorithm complexity).

    • Rafael Irizarry

      Note that the post is referring to papers that present "new methods and seem to demonstrate that those methods blow away all their competitors", not theoretical papers.

  • Ken Beath

    One option is to publish a paper about your software in the Journal of Statistical Software or the R Journal, thus leveraging a publication out of your efforts.

  • EnlightenedDuck

    I just want to praise Dr. Leek for acting on his statements; I e-mailed him earlier this week about code related to a paper he co-authored that appeared on the arXiv, and got a reply within 2 hours with a link to the relevant Git repository.

    Much better than my usual experience with such requests!

    • jtleek

      I'm more impressed someone read our paper basically the day it came out, thanks!

  • Thomas Lumley

    I think I actually have a reasonable record for making software available, but this month I was asked for the software from a statistical methods paper written before I started my PhD, when I was a lowly minion at the Sydney clinical trials centre. Specifically, I was asked for R or S-PLUS code.

    The program might still have been on one of the 3.5in disks I threw out when moving to NZ two years ago. Even if I still had the disk, I don't have a disk drive any more, though the CompSci department has a retrocomputing archive and could probably find one. And even if the code could be found, I wrote it in SPIDA, a statistical macro system that you haven't heard of and don't have any access to.

    Now we have personal web pages and github and r-forge and so on, but back then there wasn't even CRAN. I suppose I could have sent it to Statlib.

    It's not just maintenance that can be difficult over the long run.

    • jtleek

      Unless those old methods have been written up in software (by you or others), I wouldn't have any way of knowing if they were any good. I understand us young whippersnappers have access to some stuff you didn't back in the day. But if you care about a method, I think you should take the time to code it up.

      That being said, I hope it is clear from my post I'm small potatoes in this community and this is just my personal feeling.

  • Computer Scientist

    I find your approach absurd. Maybe it comes from frustration, but it is still absurd. It is like expecting mathematicians to write formal proofs - almost no one does that. We develop algorithms; if you want software, employ a software developer to implement it.

    • Engineer

      If you have not implemented your "algorithm" to a level where it can run on real data, you have no idea if it works. None whatsoever. Much less do you have an idea if it advances the state of the art.

      • csist

        Of course I do. That's how people have been designing algorithms for decades. Experimenting with an algorithm on a set of test cases is not the only or the best way of knowing whether an algorithm works or advances the state of the art. You want to simplify your own job when you need the code for your research. If the code is available, that's good; if it is not, it doesn't mean there is anything wrong with the paper or the algorithm.

        The worst thing is that bad code can be worse than no code. Developing software and libraries is a serious job. Instead of forcing it on people whose job is not writing programs, pay a professional programmer to do it.

  • markbetnel

    I completely agree that this is a good thing to do, and that it can be used to push the culture in the right direction --- but I think it would need to be applied with sensitivity, at least at first. What about the new PhD applying for a position, who has worked with an advisor who won't allow their code to be released? The student is not in a strong enough position to force it to happen.

  • Janesh

    That's true, and it happened to me when I asked authors for their models.

  • ghattem

    This reminds me of one of the #overlyhonestmethods Twitter posts I saw a couple of weeks ago:

    "You can download our code from the URL supplied. Good luck downloading the only postdoc who can get it to run, though"

    I've definitely seen this happen before. I really hate it when I have to go back and code it all from scratch from what often turns out to be an incomplete description of the algorithm.

  • S.E.L.

    A great idea, and it could be extended to apply to scientific papers that don't make their data publicly available (maybe with some exceptions for clinical data due to confidentiality issues).

  • Pingback: Perversiones del método científico | Enchufa2

  • Holger K. von Jouanne-Diedrich

    I cannot agree more with this article - I would even suggest that renowned journals turn down publications where the authors are not prepared to publish their code too!

    Research has to be reproducible!

  • Jeff Laake

    While I agree with your thoughts, and I maintain a lot of software on GitHub, it is better to provide a carrot than a stick. Methods that are accompanied by software are much more likely to be accepted and used than those that aren't. So if you want to have an impact, provide the code. Simple as that.

  • Computer Scientists

    I think the author's view is an ideal; it would be good, but it is not currently feasible or essential. Genomics is a rapidly changing field. For most of us (grad students) the priorities of research are, in order: 1) devising a new method with a good contribution, 2) writing a clear paper on it, 3) running the experiments, and 4) implementing a software prototype. It makes me uncomfortable to hear that I will be evaluated first on how good my software prototype is or how easily accessible it is. Engineering good software is a full-time job on its own, and focusing on it too much, I believe, will slow down progress. Think of it this way: while one grad student spends too much time perfecting and releasing software, another might come up with a new method that makes the polished software obsolete.

  • thecity2

    There's really no excuse for not publishing open-source software on GitHub these days.

  • William Shipman

    Surprised no one else has raised this. I have always lived in a world where, yes, you can publish your idea/algorithm after the paper has been carefully checked to make sure that no company secrets are being disclosed, but publishing code is out of the question. The argument is that some other competing company could go ahead and use the code to make a profit without investing as much effort in coding it. I suppose you could argue that publishing the idea already lets out enough information that the argument above is stupid, but that doesn't stop companies.

    My suggestion is that erasing those papers from your thoughts is too harsh; some people just are not allowed to publish code. If a corporate author or sponsor is acknowledged, then be kind to those authors and evaluate their work as best you can.

    As for academic work that is not sponsored by a company, I recognise the value in at least providing a binary library or executable if one doesn't want to release the source code for some personal reason.

  • Sidney

    I perfectly agree with your point of view.

    As an Automation & Process Control Engineer, when I have to develop a control strategy for a dynamic system or industrial process, I must run some data analysis on the input(s) and output(s) of the system or process: noise filtering, steady-state detection, outlier removal, cross-correlation, etc. In certain situations, I have had to reproduce strategies I found in papers without having their code available. In such cases, we have to work with only the model equations of the problem and our own data.

    Several years ago, when I was working on my Master's thesis, I had to reproduce a specific computer vision strategy using only the mathematical foundations from its original paper. After several simulations, I discovered that one of the conclusions of the original paper (that "the strategy is insensitive to the focal length of the camera") could not be true as stated. I then wrote (3 times...) to the authors asking them for some justification and for the code they implemented, but there was no reply. That was not a problem for my work, and I argued that I had found a "critical failure" in that strategy. However, I think that if the authors had shared their work a little more, maybe we could have discovered something else interesting. And that is precisely the goal of any science or engineering.

    Best,

    Sidney AA Viana

  • Pingback: Somewhere else, part 31 | Freakonometrics

  • http://twitter.com/PolSciReplicate PolSci Replication

    Emails I received when I asked for data and code (in political economy) stated:

    I will definitely send a .dta file when I can clean it up a bit.

    I will see what I can find in my files for you.

    We don’t have R code available for the imputations.

    I don’t have a ready-made set of do files and datasets, although I would be happy to collect these for you once I have access to the files.

    I’m travelling.

    I only have some of my electronic files with me during my trip.

  • ezracolbert

    A story I heard while working at MIT:
    A postdoc is working on a project by himself [e.g., he is in a large lab, but he is the only one on this particular project], and he gets a letter from one of the BIGGEST NAMES IN THE FIELD requesting some chemical reagents.

    Terrified that the big name will put several people on the project and scoop him, the postdoc digs up old, old papers from the BIG NAME and sends back a request for 20 reagents, thinking he can't possibly have any of them left.

    The next day, a courier arrives from the BIG NAME with a box, with *all 20 reagents*, and a polite note, asking for the postdoc's reagents...

    Those of you with a biochemical bent will also know of the Guillemin/Schally war, and how Guillemin saved all his fractions, so that when someone published a protein purification, Guillemin could go to the freezer and pull out a fraction from, say, an ion exchange run at 0.2 M salt, and there would be the new protein...