Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Recording Podcasts with a Remote Co-Host

I previously wrote about my editing workflow for podcasts and I thought I’d follow up with some details on how I record both Not So Standard Deviations and The Effort Report. This post is again going to be a bit Mac-specific because, well, that’s what I do. Communication: Both of my podcasts have a co-host who is not in the same physical location as me. Therefore, we need to use some sort of Internet-based communication software (Skype, Google Hangouts, FaceTime, etc.

Editing Podcasts with Logic Pro X

I thought I’d write a brief description of how I edit podcasts using Logic Pro X because when I was first getting into podcasts, I didn’t find a lot of useful stuff out there. A lot of it was YouTube videos of advanced editing or very basic stuff. I don’t consider myself a sound expert in any way, but I wanted a good workflow that would produce decent quality stuff.

Specialization and Communication in Data Science

I have been interested for a while now in how data scientists can better communicate data analysis activities to each other and to people outside the field. I believe that our current methods are inadequate because they have mostly been borrowed from other areas (notably, computer science). Many of those tools are useful, but they were not developed to communicate data analysis concepts specifically and often fall short. I talked about this problem in my Dean’s Lecture earlier this year and how the field of data science could benefit from developing its own theories, to simplify communication as other fields have done.

Moon Shots Cost More Than You Think

In a deeply reported article, Casey Ross and Ike Swetlitz report that IBM’s Watson isn’t living up to its hype when it comes to cancer care: The interviews suggest that IBM, in its rush to bolster flagging revenue, unleashed a product without fully assessing the challenges of deploying it in hospitals globally. While it has emphatically marketed Watson for cancer care, IBM hasn’t published any scientific papers demonstrating how the technology affects physicians and patients.

Deep Dive - Y. Ogata's Residual Analysis for Point Processes

For a long time now—actually ever since we started the blog—I’ve wanted to do a series of deep dives into specific papers that I thought were great. Clearly, it’s taken a bit longer than I expected, but I figure better late than never. Actually, that’s become a bit of a theme for my work these days! One problem I have with much academic writing on the Internet is that I feel like most of it is devoted to (1) promoting one’s own work; or (2) identifying weaknesses in others’ work.

Data Science on a Chromebook

About nine months ago I announced that I was attempting a Chromebook experiment for the second time. At first I thought it was going to be a short-term experiment just to see if it was possible to function with only a Chromebook. But in an interesting twist, I got used to it and have been working exclusively on a Chromebook for the last few months since the experiment started.

Simple Queue Package for R

I recently coded up a simple package that implements a file-based queue abstract data type. This package was needed for a different package that I’m working on involving parallel processing (more on that in the near future). Actually, this other package is a project that I started almost nine years ago but was never able to get off the ground. I tried to implement a queue interface in the filehash package but it never served the purpose that I needed.

Code for my educational gifs

During preparation for class, I sometimes think up animations that will explain the concept I am teaching. I sometimes share the resulting animations on social media via @rafalab. John Storey recently asked if the source code is publicly available. Because I am not that organized, and these ideas come about during last-minute preparations, the code was spread across several unrelated files. John’s request motivated me to include the code in one post.

Announcing the tidypvals package

A few years ago I helped write a paper where we proposed scraping p-values from the medical literature to try to estimate the science-wise false discovery rate. The paper generated a ton of interesting discussion and inspired other groups to start collecting p-values from the literature. As I’ve mentioned before, the p-value is the most popular statistic ever invented, so there are a lot of published p-values out there. The tidypvals package is an effort to find previous collections of published p-values, synthesize them, and tidy them into one analyzable data set.

My unfunded HHMI teaching professors proposal

A little over a year ago I saw a request from the Howard Hughes Medical Institute for proposals focused on undergraduate teaching. I decided to apply for this grant since it combines all of the things I’m interested in: teaching, education research, biology, and data science. So I put together a proposal, got a couple of colleagues to write me letters of support, and sent it off. I was optimistic about the proposal since we have a cool opportunity through our work in scalable education to hit a large student population and we have been spending a lot of time thinking about using this platform to create a “science of data science” platform (more about that soon!