Simply Statistics


A menagerie of messed up data analyses and how to avoid them

Update: I realize this may seem like I'm picking on people. I really don't mean to; I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from "I got a big one here" when I made a huge mistake as a first year assistant professor.

In any introductory statistics or data analysis class they might teach you the basics: how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a "menagerie" of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I'm doing the non-comprehensive list :).



Outcome switching

What it is: Outcome switching is where you collect data looking at, say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn't really related to exercise. So you change the outcome and ask if HDL levels are related to exercise, and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis, because you would have just stopped if you had found the original relationship.

An example: In this article they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug.

What you can do: Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid $$ to pre-specify with the OSF's pre-registration challenge.
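To see why switching inflates the error rate, here is a small simulation (invented for illustration, not from the Paxil study): both "outcomes" are pure noise, but we only look at the secondary outcome when the primary one fails.

```r
# Simulated outcome switching under the global null: every outcome is noise,
# so any "significant" result is a false positive.
set.seed(2015)
switched_p <- replicate(2000, {
  primary <- t.test(rnorm(30), rnorm(30))$p.value
  if (primary < 0.05) {
    primary
  } else {
    # the primary outcome "failed", so switch to a secondary outcome
    t.test(rnorm(30), rnorm(30))$p.value
  }
})

# The nominal error rate is 5%, but switching roughly doubles it
mean(switched_p < 0.05)
```

You get two shots at a 5% threshold, so the realized false positive rate is close to 1 - 0.95^2, about 10%.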


Garden of forking paths

What it is: In this case you may or may not have specified your outcome and stuck with it. Let's assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you drop those measurements and do the analysis with the remaining values. This is a totally sensible thing to do, but if you didn't specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result, and you can accidentally pick the one that works best even if you are being "sensible".

An example: This article gives several examples of the forking paths. One is where the authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) that could easily have gone a different direction or were against stated rules.

What you can do: Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn't pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren't dramatically different.
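One way to check whether a fork matters is to actually take both branches and compare. Here is a sketch with simulated blood pressure data (all numbers invented): drop the implausible values versus crudely impute them, and see whether the fitted slope moves.

```r
# Simulated data: blood pressure vs exercise, with a few erroneous readings
set.seed(42)
n <- 200
exercise <- rnorm(n)
bp <- 120 - 2 * exercise + rnorm(n, sd = 10)
bp[sample(n, 5)] <- 300            # a few implausible measurements
dat <- data.frame(exercise, bp)
bad <- dat$bp > 250

# Fork 1: drop the bad measurements
fit_drop <- lm(bp ~ exercise, data = dat[!bad, ])

# Fork 2: a crude single imputation with the median
dat_imp <- dat
dat_imp$bp[bad] <- median(dat$bp[!bad])
fit_imp <- lm(bp ~ exercise, data = dat_imp)

# If these disagree wildly, the conclusion depends on the fork you took
c(drop = coef(fit_drop)["exercise"], impute = coef(fit_imp)["exercise"])
```

Reporting both estimates (or at least noting that you checked) is a cheap way to show your result isn't an artifact of one unpre-specified choice.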



P-hacking

What it is: The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn't correct for multiple testing, or uses any of these other means to cheat and get a result that they like.

An example: This one gets talked about a lot and there is some evidence that it happens. But it is usually pretty hard to ascribe purely evil intentions to people and I'd rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.

What to do: Know how to do an analysis well and don't cheat.

Update: Some people define p-hacking differently, as "when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman & Loken, 2014; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011; Vazire, 2015)." This coincides with the definition of "garden of forking paths". I have been asked to point this out on Twitter. It was never my intention to accuse anyone of fraud. That being said, I still think the connotation many people have in mind when they think "p-hacking" corresponds to my definition above, although I agree with folks that that connotation isn't helpful - which is why I prefer we call the non-nefarious version the garden of forking paths.


Uncorrected multiple testing

What it is: This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.

An example: The most famous example is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P < 0.05 level. The reason is that there is natural variation in the background of these measurements, and if you consider each voxel independently, ignoring that you are looking at a bunch of them, a few will have P < 0.05 just by chance.

What you can do: Correct for multiple testing. When you calculate a large number of p-values make sure you know what their distribution is expected to be and you use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.
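The dead-fish problem is easy to reproduce in miniature. Below, all 1000 tests are run on pure noise, so every "significant" result is a false positive; `p.adjust` then applies the corrections mentioned above.

```r
# 1000 t-tests where the null is true for every single one
set.seed(1)
p <- replicate(1000, t.test(rnorm(20), rnorm(20))$p.value)

sum(p < 0.05)                          # roughly 50 false positives by chance
sum(p.adjust(p, "bonferroni") < 0.05)  # almost certainly none survive
sum(p.adjust(p, "BH") < 0.05)          # Benjamini-Hochberg controls the FDR
```

When there really are signals mixed in, Bonferroni controls the family-wise error rate (very strict) while Benjamini-Hochberg controls the false discovery rate, which is usually the better fit for big genomics or imaging studies.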


I got a big one here

What it is: One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited, so you write it up and submit it to one of the best journals or convince your boss to bet the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.

An example: In a paper the authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were processed in one batch and the European samples in another. This batch effect might explain a large fraction of these differences.

What you can do: Be deeply suspicious of big effects in data analysis. If you find something huge and counterintuitive, especially in a well established research area, spend a lot of time trying to figure out why it could be a mistake. If you don't, others definitely will, and you might be embarrassed.

Double complication

What it is: When faced with a large and complicated data set, beginning analysts often feel compelled to use a big complicated method. Imagine you have collected data on thousands of genes or hundreds of thousands of voxels and you want to use this data to predict some health outcome. There is a severe temptation to use deep learning or blend random forests, boosting, and five other methods to perform the prediction. The problem is that complicated methods fail for complicated reasons, which will be extra hard to diagnose if you have a really big, complicated data set.

An example: There are a large number of examples where people use very small training sets and complicated methods. One example (there were many other problems with this analysis, too) is when people tried to use complicated prediction algorithms to predict which chemotherapy would work best using genomics. Ultimately this paper was retracted for many problems, but the complication of the methods plus the complication of the data made them hard to detect.

What you can do: When faced with a big, messy data set, try simple things first. Use linear regression, make simple scatterplots, and check to see if there are obvious flaws with the data. If you must use a really complicated method, ask yourself whether there is a reason it is outperforming the simple methods, because often with large data sets even simple things work.
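Here is a toy example (entirely simulated) of the point: one real predictor buried among 36 noise predictors, with a small training set. The kitchen-sink model fits the training data beautifully and falls apart on held-out data.

```r
# Simulated data: only column 1 of x carries signal
set.seed(7)
n <- 60
x <- matrix(rnorm(n * 37), n, 37)
y <- 2 * x[, 1] + rnorm(n)
d <- data.frame(y = y, x)           # columns named X1..X37
train <- 1:40
test  <- 41:60

simple  <- lm(y ~ X1, data = d[train, ])  # the obvious baseline
complex <- lm(y ~ .,  data = d[train, ])  # throw everything in

rmse <- function(fit) sqrt(mean((d$y[test] - predict(fit, d[test, ]))^2))
c(simple = rmse(simple), complex = rmse(complex))
# the kitchen-sink model typically loses badly on held-out data
```

If a complicated method can't beat `lm` on held-out data, the complication is buying you nothing but harder-to-diagnose failures.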








On research parasites and internet mobs - let's try to solve the real problem.

A couple of days ago one of the editors of the New England Journal of Medicine posted an editorial showing some moderate level of support for data sharing but also introducing the term "research parasite":

A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:

  1. "The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters." This almost certainly would be the fault of the investigators who published the data. If the authors adhere to good data sharing policies and respond promptly to queries from people using their data, then this should not be a problem at all.
  2. "... but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited." The idea that no one should be able to try to disprove ideas with the authors' data has been covered in other blogs/on Twitter. One thing I do think is worth considering here is the concern about credit. I think the traditional way credit has accrued to authors has been citations. But if you get a major study funded, say for 50 million dollars, run that study carefully, sit on a million conference calls, and end up with a single major paper, that could be frustrating. Which is why I think a better policy would be to have the people who run massive studies get credit in a way that is not papers. They should get some kind of formal administrative credit. But then the data should be immediately and publicly available to anyone to publish on. That allows people who run massive studies to get credit and science to proceed normally.
  3. "The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own."  The story that follows about a group of researchers who collaborated with the NSABP to validate their gene expression signature is very encouraging. But it isn't the only way science should work. Researchers shouldn't be constrained to one model or another. Sometimes collaboration is necessary, sometimes it isn't, but in neither case should we label the researchers "symbiotic" or "parasitic", terms that have extreme connotations.
  4. "How would data sharing work best? We think it should happen symbiotically, not parasitically." I think that it should happen automatically. If you generate a data set with public funds, you should be required to immediately make it available to researchers in the community. But you should get credit for generating the data set and the hypothesis that led to the data set. The problem is that people who generate data will almost never be as fast at analyzing it as people who know how to analyze data. But both deserve credit, whether they are working together or not.
  5. "Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested." The trouble with this framework is that it preferentially accrues credit to data generators and doesn't accurately describe the role of either party. To flip this argument around, you could just as easily say that anyone who uses Steven Salzberg's software for aligning or assembling short reads should make him a co-author. I think Dr. Drazen would agree that not everyone who aligned reads should add Steven as a co-author, despite his contribution being critical for the completion of their work.

After the piece was posted there was predictable internet rage from data parasites, a dedicated hashtag, and half a dozen angry blog posts written about the piece. These inspired a follow up piece from Drazen. I recognize why these folks were upset - the "research parasites" thing was unnecessarily inflammatory. But I also sympathize with data creators who are also subject to a tough environment - particularly when they are junior scientists.

I think the response to the internet outrage also misses the mark and comes off as a defense of people with angry perspectives on data sharing. I would have much rather seen a more pro-active approach from a leading journal of medicine. I'd like to see something that acknowledges different contributions appropriately and doesn't slow down science. Something like:

  1. We will require all data, including data from clinical trials, to be made public immediately on publication as long as it poses minimal risk to the patients involved or the patients have been consented to broad sharing.
  2. When data are not made publicly available they are still required to be deposited with a third party such as the NIH or Figshare to be held available for request from qualified/approved researchers.
  3. We will require that all people who use data give appropriate credit to the original data generators in terms of data citations.
  4. We will require that all people who use software/statistical analysis tools give credit to the original tool developers in terms of software citations.
  5. We will include a new designation for leaders of major data collection or software generation projects that can be included to demonstrate credit for major projects undertaken and completed.
  6. When reviewing papers written by experimentalists with no statistical/computational co-authors we will require no fewer than 2 statistical/computational referees to ensure there has not been a mistake made by inexperienced researchers.
  7. When reviewing papers written by statistical/computational authors with no experimental co-authors we will require no fewer than 2 experimental referees to ensure there has not been a mistake made by inexperienced researchers.



A non-comprehensive list of awesome things other people did in 2015

Editor's Note: This is the third year I'm making a list of awesome things other people did this year. Just like the lists for 2013 and 2014, I am doing this off the top of my head. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people's awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. This year's list is particularly "off the cuff" so I'd appreciate additions if you have 'em. I have surely missed awesome things people have done.

  1. I hear the Tukey conference put on by my former advisor John S. was amazing. Out of it came this really good piece by David Donoho on 50 years of Data Science.
  2. Sherri Rose wrote really accurate and readable guides on academic CVs, academic cover letters, and how to be an effective PhD researcher.
  3. I am not 100% sold on the deep learning hype, but Michael Nielsen wrote this awesome book on deep learning and neural networks. I like how approachable it is and how un-hypey it is. I also thought Andrej Karpathy's blog post on whether you have a good selfie or not was fun.
  4. Thomas Lumley continues to be a must-read regardless of which blog he writes for, with a ton of snarky, fun posts debunking the latest ridiculous health headlines on statschat and more in-depth posts like this one on pre-filtering multiple tests on notstatschat.
  5. David Robinson is making a strong case for top data science blogger with his series of awesome posts on empirical Bayes.
  6. Hadley Wickham doing Hadley Wickham things again. readr is the biggie for me this year.
  7. I've been really enjoying the solid coverage of science/statistics from the (not entirely statistics focused as the name would suggest) STAT.
  8. Ben Goldacre and co. launched OpenTrials for aggregating all the clinical trial data in the world in an open repository.
  9. Christie Aschwanden's piece on why Science Isn't Broken is a must read and one of the least polemic treatments of the reproducibility/replicability issue I've read. The p-hacking graphic is just icing on the cake.
  10. I'm excited about the new R Consortium and the idea of having more organizations that support folks in the R community.
  11. Emma Pierson's blog and writeups in various national level news outlets continue to impress. I thought this one on changing the incentives for sexual assault surveys was particularly interesting/good.
  12. Amanda Cox and co. created this interactive graphic, which is an amazing way to teach people about pre-conceived biases in the way we think about relationships and correlations. I love the crowd-sourcing view on data analysis this suggests.
  13. As usual Philip Guo was producing gold over on his blog. I appreciate this piece on twelve tips for data driven research.
  14. I am really excited about the new field of adaptive data analysis. Basically understanding how we can let people be "real data analysts" and still get reasonable estimates at the end of the day. This paper from Cynthia Dwork and co. was one of the initial salvos that came out this year.
  15. DataCamp incorporated Python into their platform. The idea of interactive education for R/Python/Data Science is a very cool one and has tons of potential.
  16. I was really into the idea of Cross-Study validation that got proposed this year. With the growth of public data in a lot of areas we can really start to get a feel for generalizability.
  17. The Open Science Foundation did this incredible replication of 100 different studies in psychology with attention to detail and care that deserves a ton of attention.
  18. Florian's piece "You are not working for me; I am working with you." should be required reading for all students/postdocs/mentors in academia. This is something I still hadn't fully figured out until I read Florian's piece.
  19. I think Karl Broman's post on why reproducibility is hard is a great introduction to the real issues in making data analyses reproducible.
  20. This was the year of the f1000 post-publication review paper. I thought this one from Yoav and the ensuing fallout was fascinating.
  21. I love pretty much everything out of Di Cook/Heike Hoffman's groups. This year I liked the paper on visual statistical inference in high-dimensional low sample size settings.
  22. This is pretty recent, but Nathan Yau's day in the life graphic is mesmerizing.

This was a year where open source and data people described their pain from others being demanding/mean to them about their contributions. As the year closes I just want to give a big thank you to everyone who did awesome stuff that I used this year and have completely ungraciously failed to acknowledge.



Instead of research on reproducibility, just do reproducible research

Right now reproducibility, replicability, false positive rates, biases in methods, and other problems with science are the hot topic. As I mentioned in a previous post, pointing out a flaw with a scientific study is way easier to do correctly than generating a new scientific study. Some folks have noticed that right now there is a huge market for papers pointing out how science is flawed. The combination of the relative ease of pointing out flaws and the huge payout for writing these papers is helping to generate the hype around the "reproducibility crisis".

I gave a talk a little while ago at an NAS workshop where I stated that all the tools for reproducible research exist (the caveat being really large analyses - although that is changing as well). To make a paper completely reproducible, open, and available for post publication review you can use the following approach with no new tools/frameworks needed.

  1. Use Github for version control.
  2. Use rmarkdown or IPython notebooks for your analysis code.
  3. When your paper is done post it to arxiv or biorxiv.
  4. Post your data to an appropriate repository like SRA or a general purpose site like figshare.
  5. Send any software you develop to a controlled repository like CRAN or Bioconductor.
  6. Participate in the post publication discussion on Twitter and with a blog.

This is also true of open science, open data sharing, reproducibility, replicability, post-publication peer review, and all the other issues forming the "reproducibility crisis". There is a lot of attention and heat focused on the "crisis" and on folks who make a point of taking a stand on reproducibility or open science or post publication review. But in the background, outside of the hype, there is a large group of people quietly executing solid, open, reproducible science.

I wish that this group would get more attention so I decided to point out a few of them. Next time somebody asks me about the research on reproducibility or open science I'll just point them here and tell them to just follow the lead of people doing it.

This list was made completely haphazardly, as all my lists are, but just to indicate that there are a ton of people out there doing this. One thing that is also clear is that grad students and postdocs are adopting the approach I described at a very high rate.

Moreover, there are people who have been doing parts of this for a long time (like the physics or biostatistics communities with preprints, or how people have used Sweave for a long time). I purposely left people off the list, like Titus and Ethan, who have gone all in, even posting their grants online. I did this because they are very loud advocates of open science, but I wanted to highlight quieter contributors and point out that while there is a lot of noise going on over in one corner, many people are quietly doing really good science in another.


A thanksgiving dplyr Rubik's cube puzzle for you

Nick Carchedi is back visiting from DataCamp and for fun we came up with a dplyr Rubik's cube puzzle. Here is how it works. To solve the puzzle you have to make a 4 x 3 data frame that spells Thanksgiving like this:

To solve the puzzle you need to pipe this data frame in 

and pipe out the Thanksgiving data frame using only the dplyr commands arrange, mutate, slice, filter and select. For advanced users you can try our slightly more complicated puzzle:

See if you can do it this fast. Post your solutions in the comments and Happy Thanksgiving!
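The original post embedded images of the input and target data frames, which didn't survive here, so the toy frames below are invented stand-ins. They do show the mechanics, though: pipe a scrambled frame in and chain only the five allowed verbs to get the word out.

```r
library(dplyr)

# A hypothetical scrambled input (the real puzzle frames were images)
scrambled <- data.frame(
  a = c("snow", "thanks", "happy"),
  b = c("man", "giving", "turkey"),
  stringsAsFactors = FALSE
)

result <- scrambled %>%
  filter(a != "snow") %>%          # drop rows you don't need
  mutate(word = paste0(a, b)) %>%  # build new columns from old ones
  arrange(word) %>%                # reorder the rows
  select(word) %>%                 # keep only the columns you want
  slice(2)                         # pick out specific rows

result$word
```

The real puzzle is the same game played on a 4 x 3 frame, where the challenge is finding an ordering of those verbs that spells out Thanksgiving.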


So you are getting crushed on the internet? The new normal for academics.

Roger and I were just talking about all the discussion around the Case and Deaton paper on death rates for middle-aged people. Andrew Gelman, among many others, discussed it. They noticed a potential bias in the analysis and did some re-analysis. Just yesterday an economist blogger wrote a piece about academics versus blogs and how many academics are taken by surprise when they see their paper being discussed so rapidly on the internet. Much of the debate comes down to the speed, tone, and ferocity of internet discussion of academic work - along with the fact that sometimes it isn't fully fleshed out.

I have been seeing this play out not just in the case of this specific paper, but many times that folks have been confronted with blogs or the quick publication process of f1000Research. I think it is pretty scary for folks who aren't used to "internet speed" to see this play out and I thought it would be helpful to make a few points.

  1. Everyone is an internet scientist now. The internet has arrived as part of academics and if you publish a paper that is of interest (or if you are a Nobel prize winner, or if you dispute a claim, etc.) you will see discussion of that paper within a day or two on the blogs. This is now a fact of life.
  2. The internet loves a fight. The internet responds best to personal/angry blog posts or blog posts about controversial topics like p-values, errors, and bias. Almost certainly if someone writes a blog post about your work or an f1000 paper it will be about an error/bias/correction or something personal.
  3. Takedowns are easier than new research and happen faster. It is much, much easier to critique a paper than to design an experiment, collect data, figure out what question to ask, ask it quantitatively, analyze the data, and write it up. This doesn't mean the critique won't be good/right; it just means it will happen much, much faster than it took you to publish the paper, because it is easier to do. All it takes is noticing one little bug in the code or one error in the regression model. So be prepared for speed in the response.

In light of these three things, you have a couple of options about how to react if you write an interesting paper and people are discussing it - which they will certainly do (point 1), in a way that will likely make you uncomfortable (point 2), and faster than you'd expect (point 3). The first thing to keep in mind is that the internet wants you to "fight back" and wants to declare a "winner". Reading about amicable disagreements doesn't build audience. That is why there is reality TV. So there will be pressure for you to score points, be clever, be fast, and refute every point or be declared the loser. I have found from my own experience that that is what I feel like doing too. I think that resisting this urge is both (a) very, very hard and (b) the right thing to do. I find the best solution is to be proud of your work, but be humble, because no paper is perfect and that's ok. If you do the best you can, sensible people will acknowledge that.

I think these are the three ways to respond to rapid internet criticism of your work.

  • Option 1: Respond on internet time. This means if you publish a big paper that you think might be controversial, you should block off a day or two to spend time on the internet responding. You should be ready to do new analysis quickly, be prepared to admit mistakes quickly if they exist, and be prepared to make it clear when there aren't any. You will need social media accounts, and you should probably have a blog so you can post longer form responses. Github/Figshare accounts make it better for quickly sharing quantitative/new analyses. Again your goal is to avoid the personal and stick to facts, so I find that Twitter/Facebook are best for disseminating your more long form responses on blogs/Github/Figshare. If you are going to go this route you should try to respond to as many of the major criticisms as possible, but usually they cluster into one or two specific comments, which you can address all at once.
  • Option 2: Respond in academic time. You might have spent a year writing a paper only to have people respond to it essentially instantaneously. Sometimes they will have good points, but they will rarely have carefully thought out arguments given the internet-speed response (although remember point 3: good critiques can come faster than good papers). One approach is to collect all the feedback, ignore the pressure for an immediate response, and write a careful, scientific response, which you can publish in a journal or in a fast outlet like f1000Research. I think this route can be the most scientific and productive if executed well. But it will be hard, because people will treat it like "you didn't have a good answer so you didn't respond immediately". The internet wants a quick winner/loser, and that is terrible for science. Even if you choose this route, though, you should make sure you have a way of publicizing your well thought out response - through blogs, social media, etc. - once it is done.
  • Option 3: Do not respond. This is what a lot of people do and I'm unsure if it is ok or not. Clearly internet facing commentary can have an impact on you/your work/how it is perceived for better or worse. So if you ignore it, you are ignoring those consequences. This may be ok, but depending on the severity of the criticism may be hard to deal with and it may mean that you have a lot of questions to answer later. Honestly, I think as time goes on if you write a big paper under a lot of scrutiny Option 3 is going to go away.

All of this only applies if you write a paper that a ton of people care about/is controversial. Many technical papers won't have this issue and if you keep your claims small, this also probably won't apply. But I thought it was useful to try to work out how to act under this "new normal".


How I decide when to trust an R package

One thing that I've given a lot of thought to recently is the process I use to decide whether I trust an R package or not. Kasper Hansen took a break from trolling me on Twitter to talk about how he trusts packages on Github less than packages that are on CRAN, and particularly Bioconductor. He makes a couple of points that I think are very relevant. First, that having a package on CRAN/Bioconductor raises trust in that package:

The primary reason is that Bioc/CRAN demonstrate something about the developer's willingness to do the boring but critically important parts of package development, like documentation, vignettes, minimum coding standards, and being sure that their code isn't just a rehash of something else. The other big point Kasper made was the difference between a repository - which is user oriented and should provide certain guarantees - and Github - which is a developer platform that makes things easier/better for developers but doesn't have a user guarantee system in place.

This discussion got me thinking about when/how I depend on R packages and how I make that decision. The scenarios where I depend on R packages are:

  1. Quick and dirty analyses for myself
  2. Shareable data analyses that I hope are reproducible
  3. As dependencies of R packages I maintain

As you move from 1 to 3 it is more and more of a pain if the package I'm depending on breaks. If it is just something I was doing for fun, it's not that big of a deal. But if it means I have to rewrite/recheck/rerelease my R package, then that is a much bigger headache.

So my scale for how stringent I am about relying on packages varies by the type of activity, but what are the criteria I use to measure how trustworthy a package is? For me, the criteria are in this order:

  1. People prior 
  2. Forced competence
  3. Indirect data

I'll explain each criterion in a minute, but the main purpose of using these criteria is (a) to ensure that I'm using a package that works and (b) to ensure that if the package breaks I can trust that it will be fixed, or at least that I can get some help from the developer.

People prior

The first thing I do when I look at a package I might depend on is look at who the developer is. If that person is someone I know has developed widely used, reliable software and who quickly responds to requests/feedback then I immediately trust the package. I have a list of people like Brian, or Hadley, or Jenny, or Rafa, who could post their package just as a link to their website and I would trust it. It turns out almost all of these folks end up putting their packages on CRAN/Bioconductor anyway. But even if they didn't I assume that the reason is either (a) the package is very new or (b) they have a really good reason for not distributing it through the normal channels.

Forced competence

For people who I don't know about or whose software I've never used, then I have very little confidence in the package a priori. This is because there are a ton of people developing R packages now with highly variable levels of commitment to making them work. So as a placeholder for all the variables I don't know about them, I use the repository they choose as a surrogate. My personal prior on the trustworthiness of a package from someone I don't know goes something like:

[Figure: my prior on the trustworthiness of a package by repository - highest for Bioconductor, then CRAN, then Github]

This prior is based on the idea of forced competence. In general, you have to do more to get a package approved on Bioconductor than on CRAN (for example you have to have a good vignette) and you have to do more to get a package on CRAN (pass R CMD CHECK and survive the review process) than to put it on Github.

This prior isn't perfect, but it does tell me something about how much the person cares about their package. If they go to the work of getting it on CRAN/Bioc, then at least they cared enough to document it. They are at least forced to be minimally competent, at least at the time of submission, and enough for the package to still pass checks.

Indirect data

After I've applied my priors I then typically look at the data. For Bioconductor I look at the badges: how often the package is downloaded, whether it passes the checks, and how well it is covered by tests. I'm already inclined to trust it a bit since it is on that platform, but I use the data to adjust my prior. For CRAN I might look at the download stats provided by RStudio. The interesting thing is that, as John Muschelli points out, Github actually has the most indirect data available for a package:

If I'm going to use a package that is on Github from a person who isn't on my prior list of people to trust, then I look at a few things. The number of stars/forks/watchers is a quick and dirty estimate of how widely used a package is. I also look carefully at how many commits the person has made, both to the package in question and to their other repositories, over the last couple of months. If the person isn't actively developing either the package or anything else on Github, that is a bad sign. I also look to see how quickly they have responded to issues/bug reports on the package in the past, if possible. One idea I haven't used, but think is a good one, is to submit an issue for a trivial change to the package and see if I get a response quickly. Finally, I look to see if they have some demonstration that their package works across platforms (say with a Travis badge). If the package is highly starred, frequently maintained, responsive to issues, and passes checks on all platforms, then that data might overwhelm my prior and I'd go ahead and trust the package.
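As a toy summary of this process, here is how the repository prior and the indirect data might combine. To be clear, the function name, weights, and cutoffs below are invented for illustration; this isn't a formal score I actually compute.

```r
# Invented illustration of the informal process: a repository-based prior
# ("forced competence"), overridden by the people prior and nudged by data.
trust_score <- function(repo = c("bioconductor", "cran", "github"),
                        known_developer = FALSE,
                        stars = 0, recent_commits = 0,
                        issues_answered = FALSE, has_ci_badge = FALSE) {
  repo <- match.arg(repo)
  if (known_developer) return(1)  # the people prior dominates everything
  prior <- c(bioconductor = 0.8, cran = 0.6, github = 0.2)[[repo]]
  # indirect data (stars, activity, responsiveness, CI) nudges the prior up
  bump <- 0.1 * (stars > 100) + 0.1 * (recent_commits > 0) +
    0.1 * issues_answered + 0.1 * has_ci_badge
  min(prior + bump, 1)
}

# A well-maintained Github package can overwhelm its low prior
trust_score("github", stars = 500, recent_commits = 12,
            issues_answered = TRUE, has_ci_badge = TRUE)  # 0.6
```

The ordering Bioconductor > CRAN > Github encodes the prior described above, and the data terms encode the Github signals just listed.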


In general, one of the best things about the R ecosystem is being able to rely on other packages so that you don't have to write everything from scratch. But there is a hard balance to strike in keeping the dependency list small. One way I maintain this balance is by using the strategy I've outlined to worry less about whether my dependencies are trustworthy.


Faculty/postdoc job opportunities in genomics across Johns Hopkins

It's pretty exciting to be in genomics at Hopkins right now with three new Bloomberg professors in genomics areas, a ton of stellar junior faculty, and a really fun group of students/postdocs. If you want to get in on the action here is a non-comprehensive list of great opportunities.

Faculty Jobs

Job: Multiple tenure track faculty positions in all areas including in genomics
Department:  Biostatistics
To apply:
Deadline: Review ongoing

Job: Tenure track position in data intensive biology
Department:  Biology
To apply
Deadline: Nov 1st and ongoing

Job: Tenure track positions in bioinformatics, with focus on proteomics or sequencing data analysis
Department:  Oncology Biostatistics
To apply
Deadline: Review ongoing


Postdoc Jobs

Job: Postdoc(s) in statistical methods/software development for RNA-seq
Employer:  Jeff Leek
To apply: email Jeff (
Deadline: Review ongoing

Job: Data scientist for integrative genomics in the human brain (MS/PhD)
Employer:  Andrew Jaffe
To apply: email Andrew (
Deadline: Review ongoing

Job: Research associate for genomic data processing and analysis (BA+)
Employer:  Andrew Jaffe
To apply: email Andrew (
Deadline: Review ongoing

Job: PhD developing scalable software and algorithms for analyzing sequencing data
Employer:  Ben Langmead
To apply:
Deadline: See site

Job: Postdoctoral researcher developing scalable software and algorithms for analyzing sequencing data
Employer:  Ben Langmead
To apply:  email Ben (
Deadline: Review ongoing

Job: Postdoctoral researcher developing algorithms for challenging problems in large-scale genomics: whole-genome assembly, RNA-seq analysis, and microbiome analysis
Employer:  Steven Salzberg
To apply:  email Steven (
Deadline: Review ongoing

Job: Research associate for genomic data processing and analysis (BA+) in cancer
Employer:  Luigi Marchionni (with Don Geman)
To apply:  email Luigi (
Deadline: Review ongoing

Job: Postdoctoral researcher developing algorithms for biomarkers development and precision medicine application in cancer
Employer:  Luigi Marchionni (with Don Geman)
To apply:  email Luigi (
Deadline: Review ongoing

Job: Postdoctoral researcher developing methods in machine learning, genomics, and regulatory variation
Employer:  Alexis Battle
To apply:  email Alexis (
Deadline: Review ongoing

Job: Postdoctoral fellow with interests in biomarker discovery for Alzheimer’s disease
Employer:  Madhav Thambisetty / Ingo Ruczinski
To apply:
Deadline: Review ongoing

Job: Postdoctoral positions for research in the interface of statistical genetics, precision medicine and big data
Employer:  Nilanjan Chatterjee
To apply:
Deadline: Review ongoing

Job: Postdoctoral research developing algorithms and software for time course pattern detection in genomics data
Employer:  Elana Fertig
To apply:  email Elana (
Deadline: Review ongoing

Job: Postdoctoral fellow to develop novel methods for large-scale DNA and RNA sequence analysis related to human and/or plant genetics, such as developing methods for discovering structural variations in cancer or for assembling and analyzing large complex plant genomes.
Employer:  Mike Schatz
To apply:  email Mike (
Deadline: Review ongoing


We are all always on the hunt for good Ph.D. students. At Hopkins students are admitted to specific departments. So if you find a faculty member you want to work with, you can apply to their department. Here are the application details for the various departments admitting students to work on genomics:





The statistics identity crisis: am I really a data scientist?






Tl;dr: We will host a Google Hangout of our popular JSM session October 30th 2-4 PM EST. 


I organized a session at JSM 2015 called "The statistics identity crisis: am I really a data scientist?" The session turned out to be pretty popular:

but it turns out not everyone fit in the room:

Thankfully, Steve Pierson at the ASA had the awesome idea to re-run the session for people who couldn't be there. So we will be hosting a Google Hangout with the following talks:

'Am I a Data Scientist?': The Applied Statistics Student's Identity Crisis - Alyssa Frazee, Stripe
How Industry Views Data Science Education in Statistics Departments - Chris Volinsky, AT&T
Evaluating Data Science Contributions in Teaching and Research - Lance Waller, Emory University
Teach Data Science and They Will Come - Jennifer Bryan, The University of British Columbia

You can watch it on Youtube or Google Plus. Here is the link:

The session will be held October 30th (tomorrow!) from 2-4PM EST. You can watch it live and discuss the talks using the hashtag #JSM2015 or you can watch later as the video will remain on Youtube.


A glass half full interpretation of the replicability of psychological science

tl;dr: 77% of replication effects from the psychology replication study were in (or above) the 95% prediction interval based on the original effect size. This isn't perfect and suggests (a) there is still room for improvement, (b) the scientists who did the replication study are pretty awesome at replicating, (c) we need a better definition of replication that respects uncertainty but (d) the scientific sky isn't falling. We wrote this up in a paper on arxiv; the code is here. 

A week or two ago a paper came out in Science on Estimating the reproducibility of psychological science. The basic idea behind the study was to take a sample of studies that appeared in a particular journal in 2008 and try to replicate each of them. Here I'm using the definition that reproducibility is the ability to recalculate all results given the raw data and code from a study, and replicability is the ability to re-do the study and get a consistent result.

The paper is pretty incredible and the authors did an amazing job of going back to the original sources and trying to be faithful to the original study designs. I have to admit that when I first heard about the study design I was incredibly pessimistic about the results (I suppose grouchy is a natural default state for many statisticians, especially those with sleep deprivation). I mean, 2008 was well before the push toward reproducibility had really taken off (Biostatistics was one of the first journals to adopt a policy on reproducible research, and that didn't happen until 2009). More importantly, the student researchers from those studies had possibly moved on, study populations may have changed, and there could be any number of minor variations in the study design. I thought the chances of getting any effects in the same range were probably pretty low.

So when the results were published I was pleasantly surprised. I wasn’t the only one:

But that was definitely not the prevailing impression that the paper left on social and mass media. A lot of the discussion around the paper focused on the idea that only 36% of the studies had a p-value less than 0.05 in both the original and replication study. But many of the sample sizes were small and the effects were modest. So the first question I asked myself was, "Well, what would we expect to happen if we replicated these studies?" The original paper measured replicability in several ways and tried hard to calibrate expected coverage of confidence intervals for the measured effects.
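To see why the 36% figure is less damning than it sounds: even when an effect is perfectly real, two modestly powered studies will both reach p < 0.05 only about power-squared of the time. The effect size and sample size in this quick simulation are hypothetical, chosen just to illustrate.

```r
# If a true effect of 0.5 SD is studied with n = 20 (power around 55% for a
# one-sample t-test), the original and the replication both reach p < 0.05
# much less than half the time -- even though the effect is real.
set.seed(1)
both_sig <- replicate(2000, {
  p_orig <- t.test(rnorm(20, mean = 0.5))$p.value
  p_rep  <- t.test(rnorm(20, mean = 0.5))$p.value
  p_orig < 0.05 && p_rep < 0.05
})
mean(both_sig)  # roughly power^2, i.e. around 0.3
```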

Together with Roger and Prasad, we tried a slightly different approach. We estimated the 95% prediction interval for the replication effect given the original effect size.



72% of the replication effects were within the 95% prediction interval and 2 were above the interval (they showed a stronger signal in replication than predicted from the original study). This definitely shows that there is still room for improvement in the replication of these studies - we would expect 95% of the effects to fall into the 95% prediction interval. But at least my opinion is that 72% (or 77% if you count the 2 above the P.I.) of studies falling in the prediction interval is (a) not bad and (b) a testament to the authors of the reproducibility paper and their efforts to get the studies right.
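To give a flavor of the calculation, here is a sketch of one standard way to form such an interval for a correlation, using Fisher's z transform. The function name and the numbers in the example are illustrative, and the exact computation in our paper may differ in its details (see the linked code).

```r
# Sketch: 95% prediction interval for a replication correlation given the
# original correlation and both sample sizes, via Fisher's z transform.
pred_interval <- function(r_orig, n_orig, n_rep, level = 0.95) {
  z  <- atanh(r_orig)                              # Fisher z of original effect
  se <- sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))   # uncertainty from both studies
  q  <- qnorm(1 - (1 - level) / 2)
  tanh(z + c(-1, 1) * q * se)                      # back to the correlation scale
}

# Hypothetical example: original r = 0.3 with n = 50, replication n = 80
round(pred_interval(0.3, 50, 80), 2)  # about -0.05 to 0.59
```

An effect "replicates" in this sense if the replication estimate lands inside the interval; with small samples the interval is wide, which is exactly the uncertainty the p < 0.05 tally ignores.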

An important point here is that replication and reproducibility aren't the same thing. When reproducing a study we expect the numbers and figures to be exactly the same. But a replication involves recollection of data and is subject to variation, so we don't expect the answer to be exactly the same in the replication. This is of course made more confusing by regression to the mean, publication bias, and the garden of forking paths. Our use of a prediction interval measures both the variation expected in the original study and in the replication. One thing we noticed when re-analyzing the data is how many of the studies had very low sample sizes.

[Figure: sample sizes in the original and replication studies]


Sample sizes were generally bigger in the replications, but often still very low. This makes it more difficult to disentangle what didn't replicate from what is just expected variation for a small-sample study. The question remains whether those small studies should be trusted in general, but for the purposes of measuring replication it makes the problem more difficult.

One thing I have been thinking about a lot, and that this study drove home, is that if we are measuring replication we need a definition that incorporates uncertainty directly. Suppose that you collect a data set D0 from an original study and D1 from a replication. Then the study replicates if D0 ~ F and D1 ~ F. Informally, if the data are generated from the same distribution in both experiments then the study replicates. To get an estimate you apply a pipeline p() to the data to get an estimate e0 = p(D0). If the study is also reproducible then p() is the same for both studies and p(D0) ~ G and p(D1) ~ G, subject to some conditions on p().

One interesting consequence of this definition is that each complete replication data set represents only a single data point for measuring replication. To measure replication under this definition you either need to make assumptions about the data generating distributions for D0 and D1, or you need to perform a complete replication of a study many times to determine whether it replicates. However, it does mean that we can define replication even for studies with a very small number of replicates, since the data generating distribution may be arbitrarily variable in each case.
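As a toy illustration of the definition, here F is a normal distribution and the pipeline p() is the sample mean; both choices are hypothetical, just to make the notation concrete.

```r
# Toy version of the definition: the study "replicates" because D0 and D1 are
# drawn from the same distribution F, and the same pipeline p() is applied.
set.seed(2)
p  <- function(d) mean(d)       # the analysis pipeline p()
D0 <- rnorm(30, mean = 0.5)     # original data, D0 ~ F
D1 <- rnorm(30, mean = 0.5)     # replication data, D1 ~ F (a fresh draw)
c(e0 = p(D0), e1 = p(D1))       # estimates differ only by sampling variation
```

Note that this entire simulated replication yields a single pair (e0, e1): one data point for judging replication, which is why you need either many complete replications or assumptions about F.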

Regardless of this definition, I was excited that the OSF folks did the study and pulled it off as well as they did, and I was a bit bummed about the most common reaction. There is an easy narrative that "science is broken", which I don't think is a positive thing for a number of reasons. I love the way that {reproducibility/replicability/open science/open publication} are becoming more and more common, but I often think we fall into the same trap of wanting to report these results as clear cut, just as we do when headlines exaggerate or oversimplify scientific discoveries. I'm excited to see how these kinds of studies look in 10 years when Github/open science/pre-prints/etc. are all standard.