Simply Statistics


This graph makes me think Kobe is not that good, he just shoots a lot

I find it surprising that NBA commentators rarely talk about field goal percentage. Everybody knows that the more you shoot, the more you score. But players who score a lot are admired without consideration of their FG%. Of course, having a high FG% is not necessarily admirable, as many players only take easy shots, while top scorers need to take difficult ones. Regardless, missing is undesirable, and players who miss more than usual are not criticized enough. Iverson, for example, had a lowly career FG% of 43, yet he regularly made the All-Star team. But I am not surprised he never won an NBA championship: it’s hard to win when your top scorer misses so often.

Experts consider Kobe to be one of the all-time greats and compare him to Jordan. They never mention that he is consistently among the league leaders in missed shots. So far this year, Kobe has missed a whopping 279 times for a league-leading 13.3 misses per game. In contrast, Lebron has missed 8.8 per game and has scored about the same per game. The plot above (made with this R script) shows career FG% for players who are considered superstars, are top scorers, and have won multiple championships (red lines are 1st and 3rd quartiles). I also include Gasol, Lebron, Wade, and Dominique. Note that Kobe has the worst FG% in this group. So how did he win 5 championships? Well, perhaps Shaq and later Gasol made up for his misses. Note that the first year Kobe played without Shaq, the Lakers did not make the playoffs. Also, during Kobe’s career the Lakers’ record has been similar with and without him. Experts may compare Kobe to Jordan, but perhaps we should be comparing him to Dominique.

Update: Please see Brunsloe87’s comment for a much better analysis than mine. He/she points out that it’s too simplistic to look at FG%. Instead we should look at something closer to points scored per shot taken. This rewards players, like Kobe, who draw many fouls and have a high FT%. There is a weighted statistic called true shooting % (TS%) that tries to summarize this, and below I include a plot of TS% for the same players. Kobe is no Jordan, but he is not as bad as Dominique either. He is somewhere in the middle.

The comment also points out that Magic didn’t shoot as much as other superstars, so it’s unfair to include him. A better plot would show TS% versus shots taken (e.g. FGA + FTA/2), but I’ll let someone with more time make that one. Anyway, this plot explains why the early 80s Lakers (Magic + Kareem) were so good.
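
For concreteness, the conventional TS% formula divides points by twice the number of "true" shooting attempts, with free throw attempts down-weighted by the standard 0.44 factor. A quick R sketch (the season totals below are made up for illustration only):

```r
# True shooting percentage: points per "true" shooting attempt, where
# free throw attempts are down-weighted by the conventional 0.44 factor
true_shooting <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical season totals, purely for illustration
true_shooting(pts = 2000, fga = 1500, fta = 600)
```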


Why in-person education isn't dead yet...but a statistician could finish it off

A growing trend in education is to put lectures online, for free. The Khan Academy, Stanford’s recent AI course, and Gary King’s new quantitative government course at Harvard are three of the more prominent examples. This new pedagogical format is more democratic, free, and helps people learn at their own pace. It has led some, including us here at Simply Statistics, to suggest that the future of graduate education lies in online courses, or to forecast the end of in-class lectures.

All this excitement led John Cook to ask, “What do colleges sell?”. The answers he suggested were: (1) real credentials, like a degree, (2) motivation to ensure you did the work, and (3) feedback to tell you how you are doing. As John suggests, online lectures really only target motivated and self-starting learners. For graduate students, this may work (maybe), but for the vast majority of undergrads or high-school students, self-guided learning won’t work due to a lack of motivation. 

I would suggest that until the feedback, assessment, and credentialing problems have been solved, online lectures are still more edu-tainment than education. 

Of these problems, I think we are closest to solving the feedback problem, with online quizzes and tests to go with online lectures. What we haven’t solved are assessment and credentialing. The reason is that there is no good system for verifying that a person taking a quiz/test online is who they say they are. This issue has two consequences: (1) it is difficult to require that a person do online quizzes/tests like we do with in-class quizzes/tests, and (2) it is difficult to believe credentials given to people who take courses online. 

What does this have to do with statistics? Well, what we need is a Completely Automated Online Test for Student Identity (COATSI). People will notice a similarity between my acronym and the acronym for CAPTCHAs, the simple online Turing tests used to prove that you are a human and not a computer. 

A COATSI needs to:

  1. Be completely automated
  2. Verify the identity of the student being assessed
  3. Be usable throughout an online quiz/test/assessment
  4. Be simple and easy to solve

I can’t think of a deterministic system that can be used for this purpose. My suspicion is that a COATSI will need to be statistical. For example, one idea is to have people sign in with Facebook, then at random intervals while they are solving problems, they have to identify their friends by name. If they do this quickly/consistently enough, they are verified as the person taking the test. 
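
To illustrate the statistical flavor of the idea, here is a toy R sketch of the decision rule (every function name and threshold here is invented): given the correctness and response times of a student's identity challenges, verify only if accuracy is high and answers come quickly enough.

```r
# Toy sketch of a statistical identity check (all thresholds invented).
# At random intervals the student must, say, name a randomly chosen
# friend; we record correctness and response time, then apply a rule.
verify_identity <- function(correct, seconds,
                            min_accuracy = 0.9, max_median_time = 5) {
  mean(correct) >= min_accuracy && median(seconds) <= max_median_time
}

# Simulated responses from a genuine test-taker
set.seed(1)
correct <- rbinom(10, 1, 0.97) == 1
seconds <- rexp(10, rate = 1 / 2)  # fast answers, ~2s on average
verify_identity(correct, seconds)
```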

I don’t have a good solution to this problem yet; I’d love to hear more suggestions. I also think this seems like a potentially hugely important and very challenging problem for a motivated grad student or postdoc….


Sunday data/statistics link roundup (1/29)

  1. A really nice D3 tutorial. I’m 100% on board with D3; if they could figure out a way to export the graphics as PDFs, I think it would be the best visualization tool out there. 
  2. A personalized calculator that tells you what number (of the 7 billion or so) you are based on your birthday. I’m person 4,590,743,884. Makes me feel so special….
  3. An old post of ours, on dongle communism. One of my favorite posts, it came out before we had much traffic but deserves more attention. 
  4. This isn’t statistics/data related but too good to pass up. From the Bones television show, malware fractals shaved into a bone. I love TV science. Thanks to Dr. J for the link.
  5. Stats are popular

This simple bar graph clearly demonstrates that the US can easily increase research funding

Some NIH R01 paylines are down to 10%. This means only 10% of grants are being funded. The plot below highlights that all we need is a tiny little slice from Defense, Medicare, Medicaid, or Social Security to bring that back up to 20%. The plot was taken from Alex Tabarrok’s great article in the Atlantic.

Update: The y-axis unit is billions of US dollars.


When should statistics papers be published in Science and Nature?

Like many statisticians, I was amped to see a statistics paper appear in Science. Given the impact that statistics has on the scientific community, it is a shame that more statistics papers don’t appear in the glossy journals like Science or Nature. As I pointed out in the previous post, if the paper that introduced the p-value was cited every time this statistic was used, the paper would have over 3 million citations!

But a couple of our readers* have pointed to a response to the MIC paper published by Noah Simon and Rob Tibshirani. Simon and Tibshirani show that the MIC statistic is underpowered compared to another statistic for the same purpose, published in 2009 in the Annals of Applied Statistics. A nice summary of the discussion is provided by Florian over at his blog. 

If the AoAS statistic came out first (by 2 years) and is more powerful (according to simulation), should the MIC statistic have appeared in Science? 

The whole discussion reminds me of a recent blog post suggesting that journals need to pick between groundbreaking and definitive. The post points out that groundbreaking and definitive are in many ways in opposition to each other. 

Again, I’d suggest that statistics papers get short shrift in the glossy journals and I would like to see more. And the MIC statistic is certainly groundbreaking, but it isn’t clear that it is definitive. 

As a comparison, a slightly different story played out with another recent high-impact statistical method, the false discovery rate (FDR). The original papers were published in statistics journals. Then when it was clear that the idea was going to be big, a more general-audience-friendly summary was published in PNAS (not Science or Nature but definitely glossy). This might be a better way for the glossy journals to know what is going to be a major development in statistics versus an exciting - but potentially less definitive - method. 

* Florian M. and John S.


The end of in-class lectures is closer than I thought

Our previous post on the future of (statistics) graduate education was motivated by the Stanford online course on Artificial Intelligence. Here is an update on the class, which had 160,000 people enroll. Some highlights:

  1. Sebastian Thrun has given up his tenure at Stanford and has started a new online university called Udacity.
  2. 248 students got a perfect score: they never got a single question wrong over the entire course of the class. All 248 took the course online; not one was enrolled at Stanford.
  3. Students from Afghanistan completed the course. What do you think are the chances these students could afford Stanford’s tuition?
  4. There were more students from Lithuania alone than there are students at Stanford altogether.

The class evaluations were not perfect. Here is a particularly harsh one. They also need to figure out how to evaluate online students, but I am sure there are plenty of people working on that problem. Here is an example. Regardless, this was the first such experiment, and for a first try it seems like a huge success to me. As more professors try this (for example, Harvard’s Gary King is conducting a similar class in quantitative research methodology), it will become clearer that there is no future for in-class lectures as we know them today.

Thanks to Alex and Jeff for all the links. 


A wordcloud comparison of the 2011 and 2012 #SOTU

I wrote a quick (and very dirty) R script for creating a comparison cloud and a commonality cloud for President Obama’s 2011 and 2012 State of the Union speeches*. The cloud on the left shows words that have different frequencies between the two speeches and the cloud on the right shows the words in common between the two speeches. Here is a higher resolution version. 

The focus on jobs hasn’t changed much. But it is interesting how the 2012 speech seems to focus more on practical issues (tax, pay, manufacturing, oil) versus more emotional issues in 2011 (future, schools, laughter, success, dream). 

*The wordcloud R package does all the heavy lifting.
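
For the curious, the gist of such a script is to build a word-by-speech frequency matrix and pass it to the wordcloud package’s comparison.cloud and commonality.cloud functions. A minimal sketch (the counts below are invented placeholders; the real script tabulates words from the speech transcripts):

```r
# Word-by-document frequency matrix: rows are words, columns are the
# two speeches. These counts are made up for illustration.
words  <- c("jobs", "tax", "schools", "energy")
counts <- cbind(`2011` = c(25, 5, 15, 10),
                `2012` = c(30, 20, 5, 12))
rownames(counts) <- words

if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::comparison.cloud(counts)   # words with differing frequencies
  wordcloud::commonality.cloud(counts)  # words shared by both speeches
}
```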


Why statisticians should join and launch startups

The tough economic times we live in, and the potential for big paydays, have made entrepreneurship cool. From the venture capitalist-in-chief to the JavaScript-coding mayor of New York, everyone is on board. No surprise there: successful startups lead to job creation, which can have a major positive impact on the economy. 

The game has been dominated for a long time by the folks over in CS. But the value of many recent startups is either based on, or can be magnified by, good data analysis. Here are a few startups that are based on data/data analysis: 

  1. The Climate Corporation - analyzes climate data to sell farmers weather insurance.
  2. Flightcaster - uses public data to predict flight delays.
  3. Quid - uses data on startups to predict success, among other things.
  4. 100plus - personalized health prediction startup, predicting health based on public data.
  5. Hipmunk - the main advantage of this site for travel is better data visualization and an algorithm to show you which flights have the worst “agony”.

To launch a startup you need just a couple of things: (1) a good, valuable source of data (there are lots of these on the web) and (2) a good idea about how to analyze them to create something useful. The second step is obviously harder than the first, but the companies above prove you can do it. Then, once it is built, you can outsource/partner with developers - web and otherwise - to implement your idea. If you can build it in R, someone can make it an app. 

These are just a few of the startups whose value is entirely derived from data analysis. But companies from LinkedIn, to Bitly, to Amazon, to Walmart are trying to mine the data they are generating to increase value. Data is now being generated at unprecedented scale by computers, cell phones, even thermostats! With this onslaught of data, the need for people with analysis skills is becoming incredibly acute.

Statisticians, like computer scientists before them, are poised to launch, and make major contributions to, the next generation of startups. 


Sunday Data/Statistics Link Roundup (1/21)

  1. Is the microarray dead? Jeremy Leipzig seems to think that statistical methods for microarrays should be. I’m not convinced; the technology has finally matured to the point that we can use it for personalized medicine, and we abandon it for the next hot thing? Nod to Andrew for the link.
  2. Data from 5 billion webpages available from the Common Crawl. Want to build your own search tool - or just find out whats on the web? Get your Hadoop on. Nod to Peter S. for the heads up. 
  3. Simon and Tibshirani criticize the greatly publicized MIC statistic. Nod to John S. for the link.
  4. A public/free statistics class being offered through the IQSS at Harvard. 

Interview With Joe Blitzstein

Joe Blitzstein is Professor of the Practice in Statistics at Harvard University and co-director of the graduate program. He moved to Harvard after obtaining his Ph.D. with Persi Diaconis at Stanford University. Since joining the faculty at Harvard, he has been immortalized in YouTube prank videos, been awarded a “favorite professor” distinction four times, and performed interesting research on the statistical analysis of social networks. Joe was also the first person to discover our blog on Twitter. You can find more information about him on his personal website. Or check out his Stat 110 class, now available from iTunes!
Which term applies to you: data scientist/statistician/analyst?

Statistician, but that should and does include working with data! I
think statistics at its best interweaves modeling, inference,
prediction, computing, exploratory data analysis (including
visualization), and mathematical and scientific thinking. I don’t
think “data science” should be a separate field, and I’m concerned
about people working with data without having studied much statistics
and conversely, statisticians who don’t consider it important ever to
look at real data. I enjoyed the discussions by Drew Conway and on
your blog, and think the relationships between statistics, machine
learning, data science, and analytics need to be clarified.

How did you get into statistics/data science (e.g. your history)?

I always enjoyed math and science, and became a math major as an
undergrad at Caltech, partly because I love logic and probability and
partly because I couldn’t decide which science to specialize in. One
of my favorite things about being a math major was that it felt so
connected to everything else: I could often help my friends who were
doing astronomy, biology, economics, etc. with problems, once they had
explained enough so that I could see the essential pattern/structure
of the problem. At the graduate level, there is a tendency for math to
become more and more disconnected from the rest of science, so I was
very happy to discover that statistics let me regain this, and have
the best of both worlds: you can apply statistical thinking and tools
to almost anything, and there are so many opportunities to do things
that are both beautiful and useful.

Who were really good mentors to you? What were the qualities that really
helped you?

I’ve been extremely lucky that I have had so many inspiring
colleagues, teachers, and students (far too numerous to list), so I
will just mention three. My mother, Steffi, taught me at an early age
to love reading and knowledge, and to ask a lot of “what if?”
questions. My PhD advisor, Persi Diaconis, taught me many beautiful
ideas in probability and combinatorics, about the importance of
starting with a simple nontrivial example, and to ask a lot of “who
cares?” questions. My colleague Carl Morris taught me a lot about how
to think inferentially (Brad Efron called Carl a “natural”
statistician in his interview,
by which I think he meant that valid inferential thinking does not
come naturally to most people), about parametric and hierarchical
modeling, and to ask a lot of “does that assumption make sense in the
real world?” questions.

How do you get students fired up about statistics in your classes?

Statisticians know that their field is both incredibly useful in the
real world and exquisitely beautiful aesthetically. So why isn’t that
always conveyed successfully in courses? Statistics is often
misconstrued as a messy menagerie of formulas and tests, rather than a
coherent approach to scientific reasoning based on a few fundamental
principles. So I emphasize thinking and understanding rather than
memorization, and try to make sure everything is well-motivated and
makes sense both mathematically and intuitively. I talk a lot about
paradoxes and results which at first seem counterintuitive, since
they’re fun to think about and insightful once you figure out what’s
going on.

And I emphasize what I call “stories,” by which I mean an
application/interpretation that does not lose generality. As a simple
example, if X is Binomial(m,p) and Y is Binomial(n,p) independently,
then X+Y is Binomial(m+n,p). A story proof would be to interpret X as
the number of successes in m Bernoulli trials and Y as the number of
successes in n different Bernoulli trials, so X+Y is the number of
successes in the m+n trials. Once you’ve thought of it this way,
you’ll always understand this result and never forget it. A
misconception is that this kind of proof is somehow less rigorous than
an algebraic proof; actually, rigor is determined by the logic of the
argument, not by how many fancy symbols and equations one writes out.
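
The story proof above can also be verified numerically; a quick R check (values of m, n, and p chosen arbitrarily):

```r
# Convolving the Binomial(m, p) and Binomial(n, p) pmfs reproduces the
# Binomial(m + n, p) pmf, confirming that X + Y is Binomial(m + n, p).
m <- 5; n <- 7; p <- 0.3
conv <- sapply(0:(m + n), function(j) {
  sum(dbinom(0:j, m, p) * dbinom(j - (0:j), n, p))
})
all.equal(conv, dbinom(0:(m + n), m + n, p))  # TRUE
```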

My undergraduate probability course, Stat 110, is now viewable
worldwide for free on iTunes U, with 34 lecture videos and about 250
practice problems with solutions.
I hope that will be a useful resource, but in any case looking through
those materials says more about my teaching style than anything I can
write here does.

What are your main research interests these days?

I’m especially interested in the statistics of networks, with
applications to social network analysis and in public health. There is
a tremendous amount of interest in networks these days, coming from so
many different fields of study, which is wonderful but I think there
needs to be much more attention devoted to the statistical issues.
Computationally, most network models are difficult to work with since
the space of all networks is so vast, and so techniques like Markov
chain Monte Carlo and sequential importance sampling become crucial;
but there remains much to do in making these algorithms more efficient
and in figuring out whether one has run them long enough (usually the
answer is “no” to the question of whether one has run them long
enough). Inferentially, I am especially interested in how to make
valid conclusions when, as is typically the case, it is not feasible
to observe the full network. For example, respondent-driven sampling
is a link-tracing scheme being used all over the world these days to
study so-called “hard-to-reach” populations, but much remains to be
done to know how best to analyze such data; I’m working on this with
my student Sergiy Nesterko. With other students and collaborators I’m
working on various other network-related problems. Meanwhile, I’m also
finishing up a graduate probability book with Carl Morris,
“Probability for Statistical Science,” which has quite a few new
proofs and perspectives on the parts of probability theory that are
most useful in statistics.

You have been immortalized in several YouTube videos. Do you think this
helped make your class more “approachable”?

There were a couple of strange and funny pranks that occurred in my
year at Harvard. I’m used to pranks since Caltech has a long history
and culture of pranks, commemorated in several “Legends of Caltech”
volumes (there’s even a movie in development about this), but pranks
are quite rare at Harvard. I try to make the class approachable
through the lectures and by making sure there is plenty of support,
help, and encouragement available from the teaching assistants and
me, not through YouTube, but it’s fun having a few interesting
occasions from the history of the class commemorated there.