Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

A meta list of what to do at JSM 2016

I’m going to be heading out tomorrow for JSM 2016. If you want to catch up I’ll be presenting in the 6-8PM poster session on The Extraordinary Power of Data on Sunday and on data visualization (and other things) in MOOCs at 8:30am on Monday. Here is a little sneak preview, the first slide from my talk:

Was too scared to use GIFs

This year I am so excited that other people have done all the work of going through the program for me and picking out what talks to see. Here is a list of lists.

  • Karl Broman - if you like open source software, data viz, and genomics.
  • Rstudio - if you like Rstudio
  • Mine Cetinkaya Rundel - if you like stat ed, data science, data viz, and data journalism.
  • Julian Wolfson - if you like missing sessions and guilt.
  • Stephanie Hicks - if you like lots of sessions and can’t make up your mind (also stat genomics, open source software, stat computing, stats for social good…)

If you know about more lists, please feel free to tweet at me or send pull requests.

I also saw the materials for this awesome tutorial on webscraping that I’m sorry I’ll miss.

The relativity of raw data

“Raw data” is one of those terms that everyone in statistics and data science uses but no one defines. For example, we all agree that we should be able to recreate results in scientific papers from the raw data and the code for that paper.

But what do we mean when we say raw data?

When working with collaborators or students I often find myself saying - could you just give me the raw data so I can do the normalization or processing myself. To give a concrete example, I work in the analysis of data from high-throughput genomic sequencing experiments.

These experiments produce data by breaking up genomic molecules into short fragements of DNA - then reading off parts of those fragments to generate “reads” - usually 100 to 200 letters long per read. But the reads are just puzzle pieces that need to be fit back together and then quantified to produce measurements on DNA variation or gene expression abundances.

High throughput sequencing

Image from Hector Corrata Bravo’s lecture notes

When I say “raw data” when talking to a collaborator I mean the reads that are reported from the sequencing machine. To me that is the rawest form of the data I will look at. But to generate those reads the sequencing machine first (1) created a set of images for each letter in the sequence of reads, (2) measured the color at the spots on that image to get the quantitative measurement of which letter, and (3) calculated which letter was there with a confidence measure. The raw data I ask for only includes the confidence measure and the sequence of letters itself, but ignores the images and the colors extracted from them (steps 1 and 2).

So to me the “raw data” is the files of reads. But to the people who produce the machine for sequencing the raw data may be the images or the color data. To my collaborator the raw data may be the quantitative measurements I calculate from the reads. When thinking about this I realized an important characteristics of raw data.

Raw data is relative to your reference frame.

In other words the raw data is raw to you if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.

The implication for reproducibility and replicability is that we need a “chain of custody” just like with evidence collected by the police. As long as each person keeps a copy and record of the “raw data” to them you can trace the provencance of the data back to the original source.

Not So Standard Deviations Episode 18 - Divide by n-1, or n-2, or Whatever

Hilary and I talk about statistical software in fMRI analyses, the differences between software testing differences in proportions (a must listen!), and a preview of JSM 2016.

Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The books is available from Leanpub and will be updated as we record more episodes.

If you have questions you’d like us to answer, you can send them to nssdeviations @ or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes.

Subscribe to the podcast on Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Show Notes:

Download the audio for this episode.

Listen here:

Tuesday update

It Might All Be Wrong

Tom Nichols and colleagues have published a paper on the software used to analyze fMRI data:

Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.

Criminal Justice Forecasts

The ongoing discussion over the use of prediction algorithms in the criminal justice system reminds me a bit of the introduction of DNA evidence decades ago. Ultimately, there is a technology that few people truly understand and there are questions as to whether the information they provide is fair or accurate.

Shameless Promotion

I have a new book coming out with Hilary Parker, based on our Not So Standard Deviations podcast. Sign up to be notified of its release (which should be Real Soon Now).

Not So Standard Deviations Episode 18 - Back on Planet Earth

With Hilary fresh from Use R! 2016, Hilary and I discuss some of the highlights from the conference. Also, some followup about a previous Free Advertising and the NSSD drinking game.

If you have questions you’d like us to answer, you can send them to nssdeviations @ or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes.

Subscribe to the podcast on Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Show notes:

Download the audio for this episode.