Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

When do we need interpretability?

I just saw a link to an interesting article by Finale Doshi-Velez and Been Kim titled “Towards A Rigorous Science of Interpretable Machine Learning”. From the abstract: Unfortunately, there is little consensus on what interpretability in machine learning is and how to evaluate it for benchmarking. Current interpretability evaluation typically falls into two categories. The first evaluates interpretability in the context of an application: if the system is useful in either a practical application or a simplified version of it, then it must be somehow interpretable.

Model building with time series data

A nice post by Alex Smolyanskaya over the Stitch Fix blog about some of the unique challenges of model building in a time series context: Cross validation is the process of measuring a model’s predictive power by testing it on randomly selected data that was not used for training. However, autocorrelations in time series data mean that data points are not independent from each other across time, so holding out some data points from the training set doesn’t necessarily remove all their associated information.

Reproducibility and replicability is a glossy science now so watch out for the hype

Reproducibility is the ability to take the code and data from a previous publication, rerun the code and get the same results. Replicability is the ability to rerun an experiment and get “consistent” results with the original study using new data. Results that are not reproducible are hard to verify and results that do not replicate in new studies are harder to trust. It is important that we aim for reproducibility and replicability in science.

Learning about Machine Learning with an Earthquake Example

_Editor’s note: This is the fourth chapter of a book I’m working on called Demystifying Artificial Intelligence. I’ve also added a co-author, Divya Narayanan, a masters student here at Johns Hopkins! The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. We are developing the book over time - so if you buy the book on Leanpub know that there are only four chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates.

My Podcasting Setup

I’ve gotten a number of inquiries over the last 2 years about my podcasting setup and I’ve been meaning to write about it but…. But here it is! I actually wanted to write this because I felt like there actually wasn’t a ton of good information about this on the Internet that wasn’t for people who wanted to do it professionally but were rather more “casual” podcasters. So here’s what I’ve got.

Data Scientists Clashing at Hedge Funds

There’s an interesting article over at Bloomberg about how data scientists have struggled at some hedge funds: The firms have been loading up on data scientists and coders to deliver on the promise of quantitative investing and lift their ho-hum returns. But they are discovering that the marriage of old-school managers and data-driven quants can be rocky. Managers who have relied on gut calls resist ceding control to scientists and their trading signals.

Not So Standard Deviations Episode 32 - You Have to Reinvent the Wheel a Few Times

Hilary and I discuss training in PhD programs, estimating the variance vs. the standard deviation, the bias variance tradeoff, and explainable machine learning. We’re also introducing a new level of support on our Patreon page, where you can get access to some of the outtakes from our episodes. Check out our Patreon page for details. Show notes: Explainable AI Stitch Fix Blog NBA Rankings David Robinson’s Empirical Bayes book

Reproducible Research Needs Some Limiting Principles

Over the past 10 years thinking and writing about reproducible research, I’ve come to the conclusion that much of the discussion is incomplete. While I think we as a scientific community have come a long way in changing people’s thinking about data and code and making them available to others, there are some key sticking points that keep coming up that are preventing further progress in the area. When I used to write about reproducibility, I felt that the primary challenge/roadblock was a lack of tooling.

Turning data into numbers

Editor’s note: This is the third chapter of a book I’m working on called Demystifying Artificial Intelligence. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only three chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates.

New class - Data App Prototyping for Public Health and Beyond

Are you interested in building data apps to help save the world, start the next big business, or just to see if you can? We are running a data app prototyping class for people interested in creating these apps. This will be a special topics class at JHU and is open to any undergrad student, grad student, postdoc, or faculty member at the university. We are also seeing if we can make the class available to people outside of JHU so even if you aren’t at JHU but are interested you should let us know below.