Simply Statistics


Selling the Power of Statistics

A few weeks ago we learned that Warren Buffett is a big IBM fan (a $10 billion fan, that is). Having heard that I went over to the IBM web site to see what they’re doing these days. For starters, they’re not selling computers anymore! At least not the kind that I would use. One of the big things they do now is “Business Analytics and Optimization” (i.e. statistics), which is one of the reasons they bought SPSS and then later Algorithmics.

Roaming around the IBM web site, I found this little video on how IBM is involved with tennis matches like the US Open. It’s the usual promo video: a bit cheesy, but pretty interesting too. For example, they provide all the players an automatically generated post-game “match analysis DVD” that has summaries of all the data from their match with corresponding video.

It occurred to me that one of the challenges that a company like IBM faces is selling the “power of analytics” to other companies. They need to make these promo videos because, I guess, some companies are not convinced they need this whole analytics thing (or at least not from IBM). They probably need to do methods and software development too, but getting the deal in the first place is at least as important.

In contrast, here at Johns Hopkins, my experience has been that we don’t really need to sell the “power of statistics” to anyone. For the most part, researchers around here seem to be already “sold”. They understand that they are collecting a ton of data and they’re going to need statisticians to help them understand it. Maybe Hopkins is the exception, but I doubt it.

Good for us, I suppose, for now. But there is a danger that we take this kind of monopoly position for granted. Companies like IBM hire the same people we do (including one grad school classmate) and there’s no reason why they couldn’t become direct competitors. We need to continuously show that we can make sense of data in novel ways. 


Contributions to the R source

One of the nice things about tracking the R subversion repository using git instead of subversion is that you can do

git shortlog -s -n

which gives you

 19855  ripley
  6302  maechler
  5299  hornik
  2263  pd
  1153  murdoch
   813  iacus
   716  luke
   661  jmc
   614  leisch
   472  ihaka
   403  murrell
   286  urbaneks
   284  rgentlem
   269  apache
   253  bates
   249  tlumley
   164  duncan
    92  r
    43  root
    40  paul
    40  falcon
    39  lyndon
    34  thomas
    33  deepayan
    26  martyn
    18  plummer
    15  (no author)
    14  guido
     3  ligges
     1  mike

These data go back to 1997, so for Brian Ripley that’s about 3.6 commits per day for the last 15 years.

I think that number 1 position will be out of reach for a while. 

By the way, I highly recommend to anyone tracking subversion repositories that they use git to do it. You get all of the advantages of git and there are essentially no downsides.
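If you want to try this yourself, here is a rough sketch. Mirroring R takes a while (for R itself you would first do a one-time `git svn clone` of the svn repository), so the demo below just builds a throwaway repository with made-up author names to show what `git shortlog -s -n` produces:

```shell
# Throwaway repository just to demonstrate `git shortlog -s -n`.
# (Author names are invented; for R you would first mirror the svn
# repository with `git svn clone <R svn URL> r-svn` and run this there.)
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=alice -c user.email=a@example.com commit -q --allow-empty -m one
git -c user.name=alice -c user.email=a@example.com commit -q --allow-empty -m two
git -c user.name=bob   -c user.email=b@example.com commit -q --allow-empty -m three
git shortlog -s -n HEAD   # per-author commit counts, most prolific first
```

Once the mirror exists, `git svn rebase` pulls in new svn revisions whenever you want to refresh the counts.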


Reproducible Research and Turkey

Over the recent Thanksgiving break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related.

I sometimes get the sense that many people see reproducibility as essentially binary. A published paper is either reproducible, as in you can compute every single last numerical result to within epsilon precision, or it’s not. My feeling is that there is a spectrum of reproducibility when it comes to published scientific findings. Some papers are more reproducible than others. And that’s where cooking comes in.

I do a bit of cooking and I am a shameless consumer of food blogs/web sites. There seems to be pretty solid agreement (and my own experience essentially confirms it) that the more you can make yourself and not have to rely on other people doing the cooking, the better. For example, for Thanksgiving, you could theoretically buy yourself a pre-roasted turkey that’s ready to eat. My brother tells me this is what homesick Americans do in China because so few people have an oven (I suppose you could steam a turkey?). Or you could buy an un-cooked turkey that is “flavor injected”. Or you could buy a normal turkey and brine/salt it yourself. Or you could get yourself one of those heritage turkeys. Or you could raise your own turkeys…. I think in all of these cases, the turkey would definitely be edible and maybe even tasty. But some would probably be more tasty than others.

And that’s the point. There’s a spectrum when it comes to cooking and some methods result in better food than others. Similarly, when it comes to published research there is a spectrum of what authors can make available to reproduce their work. On the one hand, you have just the paper itself, which reveals quite a bit of information (e.g. the scientific question, the general approach) but usually too few details to actually reproduce (or even replicate) anything. Some authors might release the code, which allows you to study the algorithms and maybe apply them to your own work. Some might release the code and the data so that you can actually reproduce the published findings. Some might make a nice R package/vignette so that you barely have to lift a finger. Each case is better than the previous, but that’s not to say that I would only accept the last/best case. Some reproducibility is better than none.

That said, I don’t think we should shoot low. Ideally, we would have the best case, which would allow for full reproducibility and rapid dissemination of ideas. But while we wait for that best case scenario, it couldn’t hurt to have a few steps in between.


Apple this is ridiculous - you gotta upgrade to upgrade!?

So along with a few folks here around Hopkins we have been kicking around the idea of developing an app for the iPhone/Android. I’ll leave the details out for now (other than to say stay tuned!). 

But to start developing an app for the iPhone, you need a version of Xcode, Apple’s development environment. The latest version of Xcode is version 4, which can only be installed with the latest version of Mac OS X Lion (10.7, I think) and above. So I dutifully went off to download Lion. Except, whoops! You can only download Lion from the Mac App store. 

Now this wouldn’t be a problem, if you didn’t need OS X Snow Leopard (10.6 and above) to access the App store. Turns out I only have version 10.5 (must be OS X Housecat or something). I did a little searching and it looks like the only way I can get Lion is if I buy Snow Leopard first and upgrade to upgrade!

It isn’t the money so much (although it does suck to pay $60 for $30 worth of software), but the time and inconvenience this causes. Apple has done this to me a couple of times in the past, with operating systems needing to be upgraded before I could buy things from iTunes. But this is getting out of hand… maybe I need to consider the alternatives.


An R function to analyze your Google Scholar Citations page

Google Scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself here.

I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.  

So how does it work? Here is the code. You can source the functions like so:


This will install the following packages if you don’t have them: wordcloud, tm, sendmailR, and RColorBrewer. Then you need to find the URL of a Google Scholar Citations page. Here is Rafa Irizarry’s:

You can then call the googleCite function like this:

out = googleCite("", pdfname = "rafa_wordcloud.pdf")

or search by name like this:

out = searchCite("Rafa Irizarry", pdfname = "rafa_wordcloud.pdf")

The function will download all of Rafa’s citation data and put it in the matrix out. It will also make wordclouds of (a) the co-authors on his papers and (b) the titles of his papers and save them in the specified PDF file (there is an option to turn off plotting if you want). Here is what Rafa’s clouds look like:

We have also written a little function to calculate many of the popular citation indices. You can call it on the output like so:


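Out of curiosity about how simple most of these indices really are, here is a sketch of the h-index part (the citation counts below are invented, and the real function of course works on the downloaded citation matrix rather than a hand-typed list): sort the per-paper citation counts in decreasing order, and take the largest rank h at which the h-th paper still has at least h citations.

```shell
# h-index from a list of per-paper citation counts (one per line).
# The counts here are made up for illustration.
printf '%s\n' 10 8 5 4 3 |
  sort -rn |
  awk '$1 >= NR { h = NR } END { print h }'   # prints 4
```

With counts 10, 8, 5, 4, 3 the answer is 4: four papers have at least 4 citations each, but the fifth has only 3.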
When you download citation data, an email with the data table will also be sent to Simply Statistics so we can collect information on who is using the function and perform population-level analyses. 

If you liked this function you might also be interested in our R function to determine if you are a data scientist, or in some of the other stuff going on over at Simply Statistics.



Data Scientist vs. Statistician

There’s an interesting discussion over at reddit on the difference between a data scientist and a statistician. My crude summary of the discussion is that by and large they are the same, but “data scientist” is just the hip new name for statistician that will probably sound stupid 5 years from now.

My question is: why isn’t “statistician” hip? The comments don’t seem to address that much (although a few go in that direction). There are a few interesting comments about computing. For example, from ByteMining:

Statisticians typically don’t care about performance or coding style as long as it gets a result. A loop within a loop within a loop is all the same as an O(1) lookup.
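His lookup point can be made concrete with a toy bash sketch (the keys and counts here are invented for the example): a hash-style associative array answers a lookup in roughly constant time, while the naive alternative scans the whole list.

```shell
# Invented example data: citation counts keyed by paper id.
declare -A citations=( [smith2009]=42 [jones2011]=7 )

# Hash-style O(1) lookup:
echo "${citations[smith2009]}"        # prints 42

# The O(n) scan a naive script might write instead:
for key in "${!citations[@]}"; do
  if [ "$key" = jones2011 ]; then
    echo "${citations[$key]}"         # prints 7
  fi
done
```

With two entries it makes no difference; with a few million, nesting scans like this inside other loops is exactly the slowdown being described.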

Another more down-to-earth comment comes from marshallp:

There is a real distinction between data scientist and statistician

  • the statistician spent years banging his/her head against blackboards full of math notation to get a modestly paid job

  • the data scientist gets s—loads of cash after having learnt a scripting language and an api

More people should be encouraged into data science and not pointless years of stats classes

Not sure I fully agree, but I see where he’s coming from!

[Note: See also our post on how to determine whether you are a data scientist.]


Ozone rules

A recent article in the New York Times describes the backstory behind the decision to not revise the ozone national ambient air quality standard. This article highlights the reality of balancing the need to set air pollution regulation to protect public health and the desire to get re-elected. Not having ever served in politics (does being elected to the faculty senate count?) I can’t comment on the political aspect. But I wanted to highlight some of the scientific evidence that goes into developing these standards. 

A bit of background: the Clean Air Act of 1970 and its subsequent amendments require that national ambient air quality standards be set to protect public health with “an adequate margin of safety”. Ozone (usually referred to as smog in the press) is one of the pollutants for which standards are set, along with particulate matter, nitrogen oxides, sulfur dioxide, carbon monoxide, and airborne lead. Importantly, the Clean Air Act requires the EPA to set standards based on the best available scientific evidence.

The ozone standard was re-evaluated years ago under the (second) Bush administration. At the time, the EPA staff recommended a daily standard of between 60 and 70 ppb as providing an adequate margin of safety. Roughly speaking, if the standard is 70 ppb, this means that states cannot have levels of ozone higher than 70 ppb on any given day (that’s not exactly true but the real standard is a mouthful). Stephen Johnson, EPA administrator at the time, set the standard at 75 ppb, citing in part the lack of evidence showing a link between ozone and health at low levels.

We’ve conducted epidemiological analyses that show that ozone is associated with mortality even at levels far below 60 ppb (see Figure 2). Note, this paper was not published in time to make it into the previous EPA review. The study suggests that if a threshold exists below which ozone has no health effect, it is probably at a level lower than the current standard, possibly nearing natural background levels. Detecting thresholds at very low levels is challenging because you start running out of data quickly. But other studies that have attempted to do this have found results similar to ours.

The bottom line is that pollution levels below the current air quality standards should not be misinterpreted as safe for human health.


Show 'em the data!

In a previous post I argued that students entering college should be shown data on job prospects by major. This week I found out the American Bar Association might make it a requirement for law school accreditation.

Hat tip to Willmai Rivera.


Interview with Héctor Corrada Bravo

Héctor Corrada Bravo

Héctor Corrada Bravo is an assistant professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park. He moved to College Park after finishing his Ph.D. in computer science at the University of Wisconsin and a postdoc in biostatistics at the Johns Hopkins Bloomberg School of Public Health. He has done outstanding work at the intersection of molecular biology, computer science, and statistics. For more info check out his webpage.

Which term applies to you: statistician/data scientist/computer
scientist/machine learner?

I want to understand interesting phenomena (in my case mostly in
biology and medicine) and I believe that our ability to collect a large number of relevant
measurements and infer characteristics of these phenomena can drive
scientific discovery and commercial innovation in the near future.
Perhaps that makes me a data scientist and means that depending on the
task at hand one or more of the other terms apply.

A lot of the distinctions many people make between these terms are
vacuous and unnecessary, but some are nonetheless useful to think
about. For example, both statisticians and machine learners [sic] know
how to create statistical algorithms that compute interesting and informative objects using measurements (perhaps) obtained through some stochastic or partially observed
process. These objects could be genomic tools for cancer screening, or
statistics that better reflect the relative impact of baseball players
on team success.

Both fields also give us ways to evaluate and characterize these objects.
However, there are times when these objects are tools that fulfill an
immediately utilitarian purpose and thinking like an engineer might
(as many people in Machine Learning do) is the right approach.
Other times, these objects are there to help us get insights about our
world and thinking in ways that many statisticians do is the right
approach.  You need both of these ways of thinking to do interesting
science and dogmatically avoiding either of them is a terrible idea.

How did you get into statistics/data science (i.e. your history)?

I got interested in Artificial Intelligence at one point, and found
that my mathematics background was nicely suited to work on this. Once
I got into it, thinking about statistics and how to analyze and
interpret data was natural and necessary. I started working with two
wonderful advisors at Wisconsin, Raghu Ramakrishnan (CS) and Grace Wahba (Statistics)
that helped shape the way I approach problems from different angles
and with different goals. The last piece was discovering that
computational biology is a fantastic setting in which to apply and
devise these methods to answer really interesting questions.

What is the problem currently driving you?

I’ve been working on cancer epigenetics to find specific genomic
measurements for which increased stochasticity appears to be general
across multiple cancer types. Right now, I’m really wondering how far
into the clinic can these discoveries be taken, if at all. For
example, can we build tools that use these genomic measurements to
improve cancer screening?

How do you see CS/statistics merging in the future?

I think that future got here some time ago, but it is about to get much
more interesting.

Here is one example: Computer Science is about creating and analyzing
algorithms and building the systems that can implement them. Some of
what many computer scientists have done looks at problems concerning how to
keep, find and ship around information (Operating Systems, Networks,
Databases, etc.). Many times these have been driven by very specific
needs, e.g., commercial transactions in databases. In some ways,
companies have moved from asking how do I use data to keep track
of my activities to how do I use data to decide which activities to do
and how to do them. Statistical tools should be used to answer these
questions, and systems built by computer scientists have statistical
algorithms at their core.

Beyond R, what are some really useful computational tools for
statisticians to know about?

I think a computational tool that everyone can benefit a lot from
understanding better is algorithm design and analysis. This doesn’t
have to be at a particularly deep level, but just getting a sense of
how long a particular process might take, and how to devise a different
way of doing it that might make it more efficient, is really useful.
I’ve been toying with the idea of creating a CS course called
(something like) “Highlights of continuous
mathematics for computer science” that reminds everyone of the cool
stuff one learns in math, now that we can appreciate its usefulness. Similarly, I think
statistics students can benefit from “Highlights of discrete
mathematics for statisticians”.
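A tiny example of the kind of gain he means (the numbers are arbitrary): computing all the running totals of a list by re-summing each prefix from scratch costs on the order of n²/2 additions, while carrying a running total along does the same job in a single pass of n additions.

```shell
# Running totals in one pass: ~n additions instead of ~n^2/2.
printf '%s\n' 3 1 4 1 5 |
  awk '{ total += $1; print total }'   # prints 3 4 8 9 14, one per line
```

Noticing that the two procedures compute the same thing, at very different cost, is exactly the sense of algorithm analysis being described.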

Now a request for comments below from you and readers: (5a) Beyond R,
what are some really useful statistical tools for computer scientists
to know about?

Review times in statistics journals are long, should statisticians
move to conference papers?

I don’t think so. Long review times (anything more than 3 weeks) are
really not necessary. We tend to publish in journals with fairly quick
review times that produce (for the most part) really useful and
insightful reviews.

I was recently talking to senior members in my field who were telling
me stories about the “old times” when CS was moving from mainly
publishing in journals to now mainly publishing in conferences. But
now, people working in collaborative projects (like computational biology) work in fields
that primarily publish in journals, so the field needs to be able to
properly evaluate their impact and productivity. There is no perfect system.

For instance, review requests in fields where conferences are the main
publication venue come in waves (dictated by conference schedule).
Reviewers have a lot of papers to go over in a relatively short time
which makes their job of providing really helpful and fair reviews not
so easy. So, in that respect, the journal system can be better. The one thing that is universally true is that you don’t need long review times.

Previous Interviews: Daniela Witten, Chris Barr, Victoria Stodden


Google Scholar Pages

If you want to get to know more about what we’re working on, you can check out our Google Scholar pages:

I’ve only been using it for a day but I’m pretty impressed by how much it picked up. My only problem so far is having to merge different versions of the same paper.