Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

My Podcast Podroll

As a self-interested podcaster, I’m obviously interested in getting more people to listen to more podcasts. For those who may be interested in getting into podcasts but are wondering where to begin, I thought I’d make a list of what I’m currently listening to (perhaps to be updated as I acquire new ones). In no particular order: Hello Internet - YouTubers Brady Haran and CGP Grey basically just talk, often for an hour to an hour and a half.

Optimizing for User Experience

I’ve previously written about how businesses (which are built by humans) can initially become successful by optimizing the user experience. That great user experience is what defines the value that their product provides. However, over time, companies have to find a way to provide value that is defined outside of the business-user relationship. The difficulty here is that this new definition of value is not within the control of the company; it is defined by the community via laws.

The joy of no more violin plots

I dislike violin plots because they look like Christmas ornaments. It’s a pet peeve, but there is a somewhat practical reason as well. To demonstrate, I created a dataset called dat that contains an outcome value from 25 different groups. One of the first steps I take when analyzing data is to look at the distribution of my data. If there are groups, I like to stratify and look at the distributions.
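As a rough illustration of stratifying by group before plotting anything, here is a minimal Python sketch. The data are simulated and purely hypothetical (the original dat is not shown in this excerpt), and the five-number summary is just one convenient way to look at each group's distribution, not necessarily what the post itself does.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the simulated data are reproducible

# Hypothetical stand-in for `dat`: 25 groups, each with 100 simulated outcomes.
dat = {
    f"group_{g:02d}": [random.gauss(g % 5, 1.0) for _ in range(100)]
    for g in range(1, 26)
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)


# Stratify: one summary per group.
summaries = {name: five_number_summary(vals) for name, vals in dat.items()}

for name, summary in summaries.items():
    print(name, " ".join(f"{x:7.2f}" for x in summary))
```

Each printed row is one group, so differences in location or spread across the 25 groups can be scanned quickly before choosing a plot.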

The Machines Learn But We Don't

I was genuinely amazed at this article by George Nott published in Computerworld quoting Peter Norvig on explainable artificial intelligence. From the article: Speaking at an event at UNSW in Sydney on Thursday, Norvig – who at NASA developed software that flew on Deep Space 1 – said: “You can ask a human, but, you know, what cognitive psychologists have discovered is that when you ask a human you’re not really getting at the decision process.

Lowering the GWAS threshold would save millions of dollars

A recent publication (pay-walled) by Boyle et al. introducing the concept of an omnigenic model has generated much discussion. It reminded me of a question I’ve had for a while about the way genetics data is analyzed. Before getting into this, I’ll briefly summarize the general issue. With the completion of the human genome project, human geneticists saw much promise in the possibility of scanning the entire genome for the genes associated with a trait.

The future of education is plain text

I was recently at a National Academy meeting on Envisioning the Data Science Curriculum. It was a fun meeting and one of the questions that came up was what kind of infrastructure do we need to enable shared curricula, compatibility across schools, and not reinventing the wheel. My answer to this question was that we need lecture notes stored in plain text files (like rmarkdown files) and data stored in csv files with direct links.

papr - rate papers on biorxiv in a single swipe and help science!

Like a lot of modern scientists, I now find papers to read to a large extent based on what I see on social media. It is a great way to find out what my colleagues are reading and keep up with the newest cool research. A few weeks ago I released a really simple prototype for an app called papr that would improve on this experience. It’s like Tinder but for papers :).

Don't use deep learning your data isn't that big

The other day Brian was at a National Academies meeting and he gave one of his usual classic quotes: “Best quote from NAS DS Round Table: ‘I mean, do we need deep learning to analyze 30 subjects?’ - B Caffo @simplystats #datascienceinreallife” (CMU Stats, @CMU_Stats, May 1, 2017). When I saw that quote I was reminded of the blog post Don’t use hadoop - your data isn’t that big.

Toward tidy analysis

Tidy data at its heart is a set of three rules for organizing a data set: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. This is an incredibly useful abstraction for thinking about organizing data sets for analysis. In particular any data set that can be conveniently rectangled to look like this fun diagram from Jenny Bryan
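To make the three rules concrete, here is a small Python sketch that reshapes a wide table into a tidy one. The table, the column names, and the to_tidy helper are all hypothetical and only for illustration; in practice a library such as tidyr or pandas would usually do this reshaping.

```python
# Hypothetical wide table: one row per subject, one column per visit.
# The visit number is a variable hidden in the column names, so it is not tidy.
wide = [
    {"subject": "A", "visit_1": 5.1, "visit_2": 4.8},
    {"subject": "B", "visit_1": 6.3, "visit_2": 6.0},
]


def to_tidy(rows, id_col, var_name, value_name):
    """Reshape wide rows so each output row is a single observation."""
    tidy = []
    for row in rows:
        for col, val in row.items():
            if col == id_col:
                continue  # the identifier is repeated on every output row
            tidy.append({id_col: row[id_col], var_name: col, value_name: val})
    return tidy


tidy = to_tidy(wide, "subject", "visit", "measurement")

for row in tidy:
    print(row)
```

After reshaping, every variable (subject, visit, measurement) has its own column and every row is one observation, which matches the first two rules above.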

The Past and Future of Data Analysis

On May 3rd I gave my Dean’s lecture titled “The Past and Future of Data Analysis”, which was a lot of fun and gave me the opportunity to play lots of different kinds of music on stage! I talked a little bit about it on the latest episode of Not So Standard Deviations. Now the School has posted the full video of the lecture and you can watch it here: