List of cities/states with open data - help me find more!

It’s the beginning of 2012 and statistics/data science has never been hotter. Some of the most important data is data collected about civic organizations. If you haven’t seen Bill Gates’s TED Talk about the importance of state budgets, you should watch it now. A major key to solving a lot of our economic problems lies in understanding and using data collected about cities and states. U.S. cities and states are jumping on this idea and our own Baltimore was one of the earliest adopters.


The internet is the greatest source of publicly available data. One of the key skills for obtaining data from the web is “web-scraping”, where you use a piece of software to run through a website and collect information. This technique can be used to collect data from databases or data that is scattered across a website. Here is a very cool little exercise in web-scraping that can be used as an example of the things that are possible.
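To make the idea concrete, here is a minimal sketch of the "run through a page and collect information" step, using only the Python standard library. It is not the exercise linked above; the class name `LinkCollector`, the helper `collect_links`, and the sample HTML are all made up for illustration. It just pulls every link out of a page so you could, say, find the data files a site points to.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> tag on a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own address
                    self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    """Return all link targets found in an HTML string, as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content, standing in for something downloaded from the web
SAMPLE_PAGE = """
<html><body>
  <a href="/budget-2011.csv">Budget</a>
  <a href="http://example.org/crime.csv">Crime</a>
</body></html>
"""

if __name__ == "__main__":
    for link in collect_links(SAMPLE_PAGE, "http://example.com/"):
        print(link)
```

For a real site you would first download the page (for example with `urllib.request.urlopen`) and pass its decoded text to `collect_links`. Check the site's terms of use before scraping it.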

Defining data science

Rebranding of statistics as a field seems to be a popular topic these days and “data science” is one of the potential rebranding options. This article over at Revolutions is a nice summary of where the term comes from and what it means. This quote seems pretty accurate: “My own take is that Data Science is a valuable rebranding of computer science and applied statistics skills.”

Battling Bad Science

Here is a pretty awesome TED talk by epidemiologist Ben Goldacre where he highlights how science can be used to deceive/mislead. It’s sort of like epidemiology 101 in 15 minutes. This seems like a highly topical talk. Over on his blog, Steven Salzberg has pointed out that Dr. Oz has recently been engaging in some of these shady practices on his show. Too bad he didn’t check out the video first.

How do you spend your day?

I’ve seen visualizations of how people spend their time in a couple of places. Here is a good one over at Flowing Data.

Ideas/Data blogs I read

R bloggers - a good aggregator of R blogs
Flowing Data - interesting data visualizations
Marginal Revolution - an econ blog with lots of interesting ideas
Revolutions - another blog with news about R
Steven Salzberg’s blog
Andrew Gelman’s blog
I’m sure there are a ton more good blogs like this out there. Any suggestions of what I should be reading?

Communicating uncertainty visually

From a cool review about communicating risk to people without statistical/probabilistic training: “Despite the burgeoning interest in infographics, there is limited experimental evidence on how different types of visualizations are processed and understood, although the effectiveness of some graphics clearly depends on the relative numeracy of an audience.”

When overconfidence is good

A paper came out in the latest issue of Nature called “The Evolution of Overconfidence”. The authors describe a simple model where two participants are competing for a resource. Either both claim the resource, only one claims it, or neither does. If the ratio of the value of the resource to the cost of competition is high enough, then it makes sense to be overconfident about your ability to obtain it.
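Here is a rough sketch of how the three outcomes could be turned into payoffs. The summary above doesn't spell out the numbers, so everything specific here is my assumption rather than the paper's exact model: if both claim, each pays the cost `c` and the stronger one takes the resource worth `r`; if only one claims, it gets `r` for free; if neither claims, nobody gets anything. The function names are made up for illustration.

```python
def payoff(claims_self, claims_other, self_is_stronger, r, c):
    """Payoff to one participant under the assumed rules (not taken from the paper)."""
    if claims_self and claims_other:
        # Contest: both pay the cost, only the stronger one wins the resource
        return (r if self_is_stronger else 0.0) - c
    if claims_self:
        # Uncontested claim: the resource comes for free
        return r
    # Not claiming yields nothing, whatever the other participant does
    return 0.0


def expected_payoff_of_claiming(p_win, p_other_claims, r, c):
    """Expected payoff of claiming when you win a contest with probability p_win."""
    contested = p_win * r - c
    return p_other_claims * contested + (1 - p_other_claims) * r


if __name__ == "__main__":
    # With r/c = 3, claiming against a sure rival pays off even at a 50% chance of winning
    print(expected_payoff_of_claiming(p_win=0.5, p_other_claims=1.0, r=3.0, c=1.0))
```

Under these assumptions, claiming against a rival who always claims beats staying out whenever `p_win * r - c > 0`, that is, when `r / c > 1 / p_win`. So the larger the ratio of value to cost, the lower the chance of winning at which claiming is still worthwhile, which is the flavor of the result described above.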

The Duke Saga

For those of you who don’t know about the saga involving genomic signatures, I highly recommend reading this very good summary published in The Economist. Baggerly and Coombes are two statisticians who can confidently say they have made an impact on clinical research and actually saved lives. A paper by this pair describing the details was published in the Annals of Applied Statistics, as most of the biology journals refused to publish their letters to the editor.

What is a Statistician?

This column was written by Terry Speed in 2006 and is reprinted with permission from the IMS Bulletin. In the generation of my teachers, say from 1935 to 1960, relatively few statisticians were trained for the profession. The majority seemed to come from mathematics, without any specialized statistical training. There was also a sizeable minority coming from other areas, such as astronomy (I can think of one prominent example), chemistry or chemical engineering (three), economics (several), history (one), medicine (several), physics (two), and psychology (several).

“Any other team wins the World Series, good for them…if we win, with this team … we’ll have changed the game.” Moneyball! Maybe the start of the era of data. Plus it is a feel-good baseball movie where a statistician is the hero. I haven’t been this stoked for a movie in a long time.

Awesome Stat Ed Links

OpenIntro - A free online introductory stats textbook; even the LaTeX is free! One of the authors is Chris Barr, a former postdoc at Hopkins.
The undergraduate guide to R - A free, super-beginner-level intro to R, the most popular (and free) statistical programming language. Written by an undergrad at Princeton.