Simply Statistics: Sample mix-ups in datasets from large studies are more common than you think

If you have analyzed enough high-throughput data you have seen it before: a male sample that is really a female, a liver that is a kidney, etc. As the datasets I analyze get bigger I see more and more sample mix-ups. When I find a couple of samples for which sex is incorrectly annotated (one can easily see this from examining data from the X and Y chromosomes) I can’t help but wonder if there are more that are undetectable (e.g. swaps between samples of the same sex). Datasets that include two types of measurements, for example genotypes and gene expression, make it possible to detect sample swaps more generally. I recently attended a talk by Karl Broman on this topic (one of the best talks I’ve seen; check out the slides here). Karl reports an example in which it looks as if whoever was pipetting skipped a sample and kept on going, introducing an off-by-one error for over 50 samples. As I sat through the talk, I wondered: how many of the large GWAS studies have mix-ups like this?
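As a rough illustration of the sex check mentioned above, here is a minimal Python sketch. It assumes each sample comes with log-scale expression for one Y-linked gene (such as RPS4Y1) and for XIST, which is expressed from the inactive X in females. The rule "Y-linked gene higher than XIST means male", the function names, and the expression values are all assumptions made up for this example, not anything taken from Karl's talk or from a real dataset.

```python
def predict_sex(y_gene_expr, xist_expr):
    """Guess sex from log-scale expression of a Y-linked gene and XIST.

    Assumed rule of thumb: males show high Y-linked expression and low
    XIST, females the reverse. Real data may need a more careful cutoff.
    """
    return "M" if y_gene_expr > xist_expr else "F"


def flag_sex_mismatches(samples):
    """Return IDs of samples whose annotated sex disagrees with the data.

    `samples` maps sample ID -> (annotated_sex, y_gene_expr, xist_expr).
    """
    return [
        sample_id
        for sample_id, (annotated, y_expr, xist_expr) in samples.items()
        if predict_sex(y_expr, xist_expr) != annotated
    ]


# Hypothetical log2 expression values, for illustration only.
samples = {
    "S1": ("M", 9.8, 2.1),
    "S2": ("F", 2.3, 10.4),
    "S3": ("F", 9.5, 2.0),  # annotated female, but expression looks male
    "S4": ("M", 10.1, 1.8),
}

suspects = flag_sex_mismatches(samples)
print(suspects)
```

A check like this only catches swaps between samples of different sex, which is exactly why a second data type is needed to find the rest.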

A recent paper (gated) published by Lude Franke and colleagues describes MixupMapper: a method for detecting and correcting mix-ups. They examined several public datasets and discovered mix-ups in all of them. The worst-performing study, published in PLoS Genetics, was reported to have 23% of the samples swapped. I was surprised that the MixupMapper paper was not published in a higher-impact journal. It turns out PLoS Genetics rejected the paper. I think this was a big mistake on their part: the paper is clear and well written, reports a problem with a PLoS Genetics paper, and describes a solution to a problem that should have us all quite worried. I think it’s important that everybody learn about this problem so I was happy to see that, eight months later, Nature Genetics published a paper reporting mix-ups (gated)… but they didn’t cite the MixupMapper paper! Sorry Lude, welcome to the reverse scooped club.