Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Data scientist on a chromebook take two

My friend Fernando showed me his collection of old Apple dongles that no longer work with the latest generation of Apple devices. This coupled with the announcement of the Macbook pro that promises way more dongles and mostly the same computing, had me freaking out about my computing platform for the future. I’ve been using cloudy tools for more and more of what I do and so it had me wondering if it was time to go back and try my Chromebook experiment again. Basically the question is whether I can do everything I need to do comfortably on a Chromebook.

So to execute the experience I got a brand new ASUS chromebook flip and the connector I need to plug it into hdmi monitors (there is no escaping at least one dongle I guess :(). Here is what that badboy looks like in my home office with Apple superfanboy Roger on the screen.


In terms of software there have been some major improvements since I last tried this experiment out. Some of these I talk about in my book How to be a modern scientist. As of this writing this is my current setup:

That handles the vast majority of my workload so far (its only been a day :)). But I would welcome suggestions and I’ll report back when either I give up or if things are still going strong in a little while….

Not So Standard Deviations Episode 25 - How Exactly Do You Pronounce SQL?

Hilary and I go through the overflowing mailbag to respond to listener questions! Topics include causal inference in trend modeling, regression model selection, using SQL, and data science certification.

If you have questions you’d like us to answer, you can send them to nssdeviations @ or tweet us at @NSSDeviations.

Show notes:

Download the audio for this episode

Listen here:

Are Datasets the New Server Rooms?

Josh Nussbaum has an interesting post over at Medium about whether massive datasets are the new server rooms of tech business.

The analogy comes from the “old days” where in order to start an Internet business, you had to buy racks and servers, rent server space, buy network bandwidth, license expensive server software, backups, and on and on. In order to do all that up front, it required a substantial amount of capital just to get off the ground. As inconvenient as this might have been, it provided an immediate barrier to entry for any other competitors who weren’t able to raise similar capital.

Of course,

…the emergence of open source software and cloud computing completely eviscerated the costs and barriers to starting a company, leading to deflationary economics where one or two people could start their company without the large upfront costs that were historically the hallmark of the VC industry.

So if startups don’t have huge capital costs in the beginning, what costs do they have? Well, for many new companies that rely on machine learning, they need to collect data.

As a startup collects the data necessary to feed their ML algorithms, the value the product/service provides improves, allowing them to access more customers/users that provide more data and so on and so forth.

Collecting huge datasets ultimately costs money. The sooner a startup can raise money to get that data, the sooner they can defend themselves from competitors who may not yet have collected the huge datasets for training their algorithms.

I’m not sure the analogy between datasets and server rooms quite works. Even back when you had to pay a lot of up front costs to setup servers and racks, a lot of that technology was already a commodity, and anyone could have access to it for a price.

I see massive datasets used to train machine learning algorithms as more like the new proprietary software. The startups of yore spent a lot of time writing custom software for what we might now consider mundane tasks. This was a time-consuming activity but the software that was developed had value and was a differentiator for the company. Today, many companies write complex machine learning algorithms, but those algorithms and their implmentations are quickly becoming commodities. So the only thing that separates one company from another is the amount and quality of data that they have to train those algorithms.

Going forward, it will be interesting see what these companies will do with those massive datasets once they no longer need them. Will they “open source” them and make them available to everyone? Could there be an open data movement analogous to the open source movement?

For the most part, I doubt it. While I think many today would perhaps sympathize with the sentiment that software shouldn’t have owners, those same people I think would argue vociferously that data most certainly do have owners. I’m not sure how I’d feel if Facebook made all their data available to anyone. That said, many datasets are made available by various businesses, and as these datasets grow in number and in usefulness, we may see a day where the collection of data is not a key barrier to entry, and that you can train your machine learning algorithm on whatever is out there.

Distributed Masochism as a Pedagogical Model

Editor’s note: This is a guest post by Sean Kross. Sean is a software developer in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. Sean has contributed to several of our specializations including Data Science, Executive Data Science, and Mastering Software Development in R. He tweets @seankross.

Over the past few months I’ve been helping Jeff develop the Advanced Data Science class he’s teaching at the Johns Hopkins Bloomberg School of Public Health. We’ve been trying to identify technologies that we can teach to students which (we hope) will enable them to rapidly prototype data-based software applications which will serve a purpose in public health. We started with technologies that we’re familiar with (R, Shiny, static websites) but we’re also trying to teach ourselves new technologies (the Amazon Alexa Skills API, iOS and Swift). We’re teaching skills that we know intimately along with skills that we’re learning on the fly which is a style of teaching that we’ve practiced several times.

Jeff and I have come to realize that while building new courses with technologies that are new to us we experience particular pains and frustrations which, when documented, become valuable learning resources for our students. This process of documenting new-tech-induced pain is only a preliminary step. When we actually launch classes either online or in person our students run into new frustrations which we respond to with changes to either documentation or course content. This process of quickly iterating on course material is especially enhanced in online courses where the time span for a course lasts a few weeks compared to a full semester, so kinks in the course are ironed out at a faster rate compared to traditional in-person courses. All of the material in our courses is open-source and available on GitHub, and we teach our students how to use Git and GitHub. We can take advantage of improvements and contributions the students think we should make to our courses through pull requests that we recieve. Student contributions further reduce the overall start-up pain experienced by other students.

With students from all over the world participating in our online courses we’re unable to anticipate every technical need considering different locales, languages, and operating systems. Instead of being anxious about this reality we depend on a system of “distributed masochism” whereby documenting every student’s unique technical learning pains is an important aspect of improving the online learning experience. Since we only have a few months head start using some of these technologies compared to our students it’s likely that as instructors we’ve recently climbed a similar learning curve which makes it easier for us to help our students. We believe that this approach of teaching new technologies by allowing any student to contribute to open course material allows a course to rapidly adapt to students’ needs and to the inevitable changes and upgrades that are made to new technologies.

I’m extremely interested in communicating with anyone else who is using similar techniques, so if you’re interested please contact me via Twitter (@seankross) or send me an email: sean at

Not So Standard Deviations Episode 24 - 50 Minutes of Blathering

Another IRL episode! Hilary and I met at a Jimmy John’s to talk data science, like you do. Topics covered include RStudio Conf, polling, millennials, Karl Broman, and more!

If you have questions you’d like us to answer, you can send them to nssdeviations @ or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes or Google Play. And please leave us a review on iTunes.

Support us through our Patreon page.

Get the Not So Standard Deviations book.

Show notes:

Download the audio for this episode

Listen here: