Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Distributed Masochism as a Pedagogical Model

Editor’s note: This is a guest post by Sean Kross. Sean is a software developer in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. Sean has contributed to several of our specializations including Data Science, Executive Data Science, and Mastering Software Development in R. He tweets @seankross.

Over the past few months I’ve been helping Jeff develop the Advanced Data Science class he’s teaching at the Johns Hopkins Bloomberg School of Public Health. We’ve been trying to identify technologies that we can teach to students which (we hope) will enable them to rapidly prototype data-based software applications which will serve a purpose in public health. We started with technologies that we’re familiar with (R, Shiny, static websites) but we’re also trying to teach ourselves new technologies (the Amazon Alexa Skills API, iOS and Swift). We’re teaching skills that we know intimately along with skills that we’re learning on the fly which is a style of teaching that we’ve practiced several times.

Jeff and I have come to realize that while building new courses with technologies that are new to us we experience particular pains and frustrations which, when documented, become valuable learning resources for our students. This process of documenting new-tech-induced pain is only a preliminary step. When we actually launch classes either online or in person our students run into new frustrations which we respond to with changes to either documentation or course content. This process of quickly iterating on course material is especially enhanced in online courses where the time span for a course lasts a few weeks compared to a full semester, so kinks in the course are ironed out at a faster rate compared to traditional in-person courses. All of the material in our courses is open-source and available on GitHub, and we teach our students how to use Git and GitHub. We can take advantage of improvements and contributions the students think we should make to our courses through pull requests that we recieve. Student contributions further reduce the overall start-up pain experienced by other students.

With students from all over the world participating in our online courses we’re unable to anticipate every technical need considering different locales, languages, and operating systems. Instead of being anxious about this reality we depend on a system of “distributed masochism” whereby documenting every student’s unique technical learning pains is an important aspect of improving the online learning experience. Since we only have a few months head start using some of these technologies compared to our students it’s likely that as instructors we’ve recently climbed a similar learning curve which makes it easier for us to help our students. We believe that this approach of teaching new technologies by allowing any student to contribute to open course material allows a course to rapidly adapt to students’ needs and to the inevitable changes and upgrades that are made to new technologies.

I’m extremely interested in communicating with anyone else who is using similar techniques, so if you’re interested please contact me via Twitter (@seankross) or send me an email: sean at

Not So Standard Deviations Episode 24 - 50 Minutes of Blathering

Another IRL episode! Hilary and I met at a Jimmy John’s to talk data science, like you do. Topics covered include RStudio Conf, polling, millennials, Karl Broman, and more!

If you have questions you’d like us to answer, you can send them to nssdeviations @ or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes or Google Play. And please leave us a review on iTunes.

Support us through our Patreon page.

Get the Not So Standard Deviations book.

Show notes:

Download the audio for this episode

Listen here:

Should I make a chatbot or a better FAQ?

Roger pointed me to this interesting article (paywalled, sorry!) about Facebook’s chatbot service. I think the article made a couple of interesting points. The first thing I thought was interesting was their explicit acknowledgement of the process I outlined in a previous post for building an AI startup - (1) convince (or in this case pay) some humans to be your training set, and (2) collect the data on the humans and then use it to build your AI.

The other point that is pretty fascinating is that they realized how many data points they would need before they could reasonably replace a human with an AI chatbot. The original estimate was tens of thousands and the ultimate number was millions or more. I have been thinking a lot that the AI “revolution” is just a tradeoff between parameters and data points. If you have a billion parameter prediction algorithm it may work pretty awesome - as long as you have a few hundred billion data points to train it with.

But the theme of the article was that chatbots may have had some mis-steps/may not be ready for prime time. I think the main reason is that at the moment most AI efforts can only report facts, not intuit intention and alter the question for the user or go beyond the facts/state of the world.

One example I’ve run into recently was booking a ticket on an airline. I wanted to know if I could make a certain change to my ticket. The airline didn’t have any information about the change I wanted to make online. After checking thoroughly I clicked on the “Chat with an agent” button and was directed to what was clearly a chatbot. The chatbot asked a question or two and then sent me to the “make changes to a ticket” page of the website.

I eventually had to call and get a person on the phone, because what I wanted to ask about didn’t apply to the public information. They set me straight and I booked the ticket. The chatbot wasn’t helpful because it could only respond with information it had available on the website. It couldn’t identify a new situation, realize it had to ask around, figure out there was an edge case, and then make a ruling/help out.

I would guess that most of the time if a person interacts with a chatbot they are doing it only because they already looked at all the publicly available information on the FAQ, etc. and couldn’t find it. So an alternative solution, which would require a lot less work and a much smaller training set, is to just have a more complete FAQ.

The question to me is does anyone other than Facebook or Google have a big enough training set to make a chatbot worth it?

The Dangers of Weighting Up a Sample

There’s a great story by Nate Cohn over at the New York Times’ Upshot about the dangers of “weighting up” a sample from a survey. In this case, it is in regards to a U.S.C/LA Times poll asking who people will vote for President:

The U.S.C./LAT poll weights for many tiny categories: like 18-to-21-year-old men, which U.S.C./LAT estimates make up around 3.3 percent of the adult citizen population. Weighting simply for 18-to-21-year-olds would be pretty bold for a political survey; 18-to-21-year-old men is really unusual.

The U.S.C./LA Times poll apparently goes even further:

When you start considering the competing demands across multiple categories, it can quickly become necessary to give an astonishing amount of extra weight to particularly underrepresented voters — like 18-to-21-year-old black men. This wouldn’t be a problem with broader categories, like those 18 to 29, and there aren’t very many national polls that are weighting respondents up by more than eight or 10-fold. The extreme weights for the 19-year-old black Trump voter in Illinois are not normal.

It’s worth noting (as a good thing) that the U.S.C./LA Times poll data is completely open, thus allowing the NYT to reproduce this entire analysis.

I haven’t done much in the way of survey analyses, but I’ve done some inverse probability weighting and in my experience it can be a tricky procedure in ways that are not always immediately obvious. The article discusses weight trimming, but also notes the dangers of that procedure. Overall, a good treatment of a complex issue.

Information and VC Investing

Sam Lessin at The Information has a nice post (sorry, paywall, but it’s a great publication) about how increased measurement and analysis is changing the nature of venture capital investing.

This brings me back to what is happening at series A financings. Investors have always, obviously, tried to do diligence at all financing rounds. But series A investments used to be an exercise in a few top-level metrics a company might know, some industry interviews and analysis, and a whole lot of trust. The data that would drive capital market efficiency usually just wasn’t there, so capital was expensive and there were opportunities for financiers. Now, I am seeing more and more that after a seed round to boot up most companies, the backbone of a series A financing is an intense level of detail in reporting and analytics. It can be that way because the companies have the data

I’ve seen this happen in other areas where data comes in to disrupt the way things are done. Good analysis only gives you an advantage if no one else is doing it. Once everyone accepts the idea and everyone has the data (and a good analytics team), there’s no more value left in the market.

Time to search elsewhere.