Simply Statistics

12
Feb

Not So Standard Deviations Episode 9 - Spreadsheet Drama

For this episode, special guest Jenny Bryan (@jennybryan) joins us from the University of British Columbia! Jenny, Hilary, and I talk about spreadsheets and why some people love them and some people despise them. We also discuss blogging as part of scientific discourse.

Subscribe to the podcast on iTunes.

Show notes:

Download the audio for this episode.

26
Jan

Exactly how risky is breathing?

This article by George Johnson in the NYT describes a study by Kamen P. Simonov and Daniel S. Himmelstein that examines the hypothesis that people living at higher altitudes experience lower rates of lung cancer than people living at lower altitudes.

All of the usual caveats apply. Studies like this, which compare whole populations, can be used only to suggest possibilities to be explored in future research. But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.

I'm not so much focused on the science itself, which is perhaps intriguing, but rather on the way the article was written. First, George Johnson links to the paper itself, already a major victory. Also, I thought he did a very nice job of laying out the complexity of doing a population-level study like this one--all the potential confounders, selection bias, negative controls, etc.

I remember particulate matter air pollution epidemiology used to have this feel. You'd try to do all these different things to make the effect go away, but for some reason, under every plausible scenario, in almost every setting, there was always some association between air pollution and health outcomes. Eventually you start to believe it....

24
Jan

Not So Standard Deviations Episode 8 - Snow Day

Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal's view on data sharing, Google's "Cohort Analysis", and trying to predict a movie's box office returns based on the movie's script.

Subscribe to the podcast on iTunes.

Follow @NSSDeviations on Twitter!

Show notes:

Apologies for my audio on this episode. I had a bit of a problem calibrating my microphone. I promise to figure it out for the next episode!

Download the audio for this episode.


21
Jan

Parallel BLAS in R

I'm working on a new chapter for my R Programming book and the topic is parallel computation. So, I was happy to see this tweet from David Robinson (@drob) yesterday:

What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times, and stores the results in a 5,000 x 5,000 matrix (`x`). Then it computes the product of `x` and its transpose. The second part is key, because it involves a matrix multiplication.

Matrix multiplication in R is handled, at a very low level, by the library that implements the Basic Linear Algebra Subroutines, or BLAS. The stock R that you download from CRAN comes with what's known as a reference implementation of BLAS. It works, it produces what everyone agrees are the right answers, but it is in no way optimized. Here's what I get when I run this code on my Mac using RStudio and the CRAN version of R for Mac OS X:

system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
   user  system elapsed 
 59.622   0.314  59.927 

Note that the "user" time and the "elapsed" time are roughly the same. Note also that I use the tcrossprod() function instead of the otherwise equivalent expression x %*% t(x). Both crossprod() and tcrossprod() are generally faster than using the %*% operator.
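To get a feel for the difference, here's a minimal sketch (mine, not from David's tweet) that checks the two forms agree and compares rough timings on a smaller matrix; the numbers will of course vary with your machine and with the BLAS your R is linked against:

set.seed(1)
x <- matrix(rnorm(1e6), 1000, 1000)  ## a 1,000 x 1,000 Gaussian matrix

all.equal(x %*% t(x), tcrossprod(x)) ## should be TRUE

system.time(x %*% t(x))     ## forms t(x) explicitly, then multiplies
system.time(tcrossprod(x))  ## same product, without materializing t(x)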

Now, when I run the same code on my built-from-source version of R (version 3.2.3), here's what I get:

system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
   user  system elapsed 
 14.378   0.276   3.344 

Overall, it's much faster when I don't run the code through RStudio (14 seconds vs. 59 seconds of user time). Also, on this version the elapsed time is about 1/4 the user time. Why is that?

The build-from-source version of R is linked to Apple's Accelerate framework, which is a large library that includes an optimized BLAS library for Intel chips. This optimized BLAS, in addition to being optimized with respect to the code itself, is designed to be multi-threaded so that it can split work off into chunks and run them in parallel on multi-core machines. Here, the tcrossprod() function was run in parallel on my machine, and so the elapsed time was about a quarter of the time that was "charged" to the CPU(s).
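As an aside, versions of R newer than the one used here (3.4.0 and later, I believe) report the BLAS and LAPACK libraries that R is linked against directly in sessionInfo(), which makes for a quick sanity check:

si <- sessionInfo()
si$BLAS    ## path to the BLAS library R is linked against
si$LAPACK  ## likewise for LAPACK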

David's tweet indicated that when using Microsoft R Open, which is a custom-built binary of R, the (I assume?) elapsed time is 2.5 seconds. Looking at the attached link, it appears that Microsoft R Open is linked against Intel's Math Kernel Library (MKL), which contains, among other things, an optimized BLAS for Intel chips. I don't know what kind of computer David was running on, but assuming it was similarly high-powered as mine, this would suggest that Intel's MKL sees slightly better performance. Either way, both Accelerate and MKL achieve that speedup through custom coding of the BLAS routines and multi-threading on multi-core systems.

If you're going to be doing any linear algebra in R (and you will), it's important to link to an optimized BLAS. Otherwise, you're just wasting time unnecessarily. Besides Accelerate (Mac) and Intel's MKL, there's AMD's ACML library for AMD chips and the ATLAS library, which is a general-purpose tunable library. Goto's BLAS is also optimized, but it is no longer under active development.

14
Jan

Profile of Hilary Parker

If you've ever wanted to know more about my Not So Standard Deviations co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the great profile of her on the American Statistical Association's This Is Statistics web site.

What advice would you give to high school students thinking about majoring in statistics?

It’s such a great field! Not only is the industry booming, but more importantly, the discipline of statistics teaches you to think analytically, which I find helpful for just about every problem I run into. It’s also a great field to be interested in as a generalist: rather than dedicating yourself to studying one subject, you are deeply learning a set of tools that you can apply to any subject that you find interesting. Just one glance at the topics covered on The Upshot or 538 can give you a sense of that. There’s politics, sports, health, history… the list goes on! It’s a field with endless possibility for growth and exploration, and as I mentioned above, the more I explore the more excited I get about it.

12
Jan

Not So Standard Deviations Episode 7 - Statistical Royalty

The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell.

We also talk about Theranos and the pitfalls of diagnostic testing, Spotify's Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a new segment where we each give some "free advertising" to something interesting that we think other people should know about.

Show Notes:

Download the audio for this episode.

11
Jan

Jeff, Roger and Brian Caffo are doing a Reddit AMA at 3pm EST Today

Jeff Leek, Brian Caffo, and I are doing a Reddit AMA TODAY at 3pm EST. We're happy to answer questions about...anything...including our roles as Co-Directors of the Johns Hopkins Data Science Specialization as well as the Executive Data Science Specialization.

This is one of the few pictures of the three of us together.


18
Dec

Not So Standard Deviations: Episode 6 - Google is the New Fisher

Episode 6 of Not So Standard Deviations is now posted. In this episode Hilary and I talk about the analytics of our own podcast, and analyses that seem easy but are actually hard.

If you haven't already, you can subscribe to the podcast through iTunes.

This will be our last episode for 2015 so see you in 2016!

Notes

Download the audio file for this episode.

03
Dec

Not So Standard Deviations: Episode 5 - IRL Roger is Totally With It

I just posted Episode 5 of Not So Standard Deviations so check your feeds! Sorry for the long delay since the last episode but we got a bit tripped up by the Thanksgiving holiday.

In this episode, Hilary and I open up the mailbag and go through some of the feedback we've gotten on the previous episodes. The rest of the time is spent talking about the importance of reproducibility in data analysis both in academic research and in industry settings.

If you haven't already, you can subscribe to the podcast through iTunes. Or you can use the SoundCloud RSS feed directly.

Notes:

Download the audio file for this episode.


10
Nov

Prediction Markets for Science: What Problem Do They Solve?

I've recently seen a bunch of press on this paper, which describes an experiment with developing a prediction market for scientific results. From FiveThirtyEight:

Although replication is essential for verifying results, the current scientific culture does little to encourage it in most fields. That’s a problem because it means that misleading scientific results, like those from the “shades of gray” study, could be common in the scientific literature. Indeed, a 2005 study claimed that most published research findings are false.

[...]

The researchers began by selecting some studies slated for replication in the Reproducibility Project: Psychology — a project that aimed to reproduce 100 studies published in three high-profile psychology journals in 2008. They then recruited psychology researchers to take part in two prediction markets. These are the same types of markets that people use to bet on who’s going to be president. In this case, though, researchers were betting on whether a study would replicate or not.

There are all kinds of prediction markets these days--for politics, general ideas--so having one for scientific ideas is not too controversial. But I'm not sure I see exactly what problem is solved by having a prediction market for science. In the paper, they claim that the market-based bets were better predictors of replication outcomes than the general survey that was administered to the scientists. I'll admit that's an interesting result, but I'm not yet convinced.

First off, it's worth noting that this work comes out of the massive replication project conducted by the Center for Open Science, where I believe they have a fundamentally flawed definition of replication. So I'm not sure I can really agree with the idea of basing a prediction market on such a definition, but I'll let that go for now.

The purpose of most markets is some general notion of "price discovery". One popular market is the stock market and I think it's instructive to see how that works. Basically, people continuously bid on the shares of certain companies and markets keep track of all the bids/offers and the completed transactions. If you are interested in finding out what people are willing to pay for a share of Apple, Inc., then it's probably best to look at...what people are willing to pay. That's exactly what the stock market gives you. You only run into trouble when there's no liquidity, so no one shows up to bid/offer, but that would be a problem for any market.

Now, suppose you're interested in finding out what the "true fundamental value" of Apple, Inc. is. Some people think the stock market gives you that at every instant, while others think that the stock market can behave irrationally for long periods of time. Perhaps in the very long run, you get a sense of the fundamental value of a company, but that may not be useful information at that point.

What does the market for scientific hypotheses give you? Well, it would be one thing if granting agencies participated in the market. Then, we would never have to write grant applications. The granting agencies could then signal what they'd be willing to pay for different ideas. But that's not what we're talking about.

Here, we're trying to get at whether a given hypothesis is true or not. The only real way to get information about that is to conduct an experiment. How many people betting in the markets will have conducted an experiment? Likely the minority, given that the whole point is to save money by not having people conduct experiments investigating hypotheses that are likely false.

But if market participants aren't contributing real information about an hypothesis, what are they contributing? Well, they're contributing their opinion about an hypothesis. How is that related to science? I'm not sure. Of course, participants could be experts in the field (although not necessarily) and so their opinions will be informed by past results. And ultimately, it's consensus amongst scientists that determines, after repeated experiments, whether an hypothesis is true or not. But at the early stages of investigation, it's not clear how valuable people's opinions are.

In a way, this reminds me of a time a while back when the EPA was soliciting "expert opinion" about the health effects of outdoor air pollution, as if that were a reasonable substitute for collecting actual data on the topic. At least it cost less money--just the price of a conference call.

There's a version of this playing out in the health tech market right now. Companies like Theranos and 23andMe are selling health products that they claim are better than some current benchmark. In particular, Theranos claims its blood tests are accurate when only using a tiny sample of blood. Is this claim true or not? No one outside Theranos knows for sure, but we can look to the financial markets.

Theranos can point to the marketplace and show that people are willing to pay for its products. Indeed, the $9 billion valuation of the private company is another indicator that people...highly value the company. But ultimately, we still don't know if their blood tests are accurate because we don't have any data. If we were to go by the financial markets alone, we would necessarily conclude that their tests are good, because why else would anyone invest so much money in the company?

I think there may be a role to play for prediction markets in science, but I'm not sure discovering the truth about nature is one of them.