Data

Sunday data/statistics link roundup (1/20/2013)

This might be short. I have a couple of classes starting on Monday. The first is our class. This is one of my favorite classes to teach; our Ph.D. students are pretty awesome and they always amaze me with what they can do. The other is my Coursera debut in Data Analysis.

Sunday data/statistics link roundup (1/6/2013)

Not really statistics, but this is an interesting article about how rational optimization by individual actors does not always lead to an optimal solution. Related, here is the coolest street sign I think I’ve ever seen, with a heatmap of traffic density to try to influence commuters. An interesting paper that talks about how clustering is only a really hard problem when there aren’t obvious clusters. I was a little disappointed in the paper, because it defines the “obviousness” of clusters only theoretically by a distance metric.

Sunday data/statistics link roundup (12/30/12)

An interesting new app called 100plus, which looks like it uses public data to help determine how little decisions (walking more, one more glass of wine, etc.) lead to more or less health. Here’s a post describing it on the healthdata.gov blog. As far as I can tell, the app is still in beta, so only the folks who have a code can download it. Data on mass shootings from the Mother Jones investigation.

Sunday Data/Statistics Link Roundup (11/4/12)

Brian Caffo headlines the WaPo article about massive online open courses. He is the driving force behind our department’s involvement in offering these massive courses. I think this sums it up: “I can’t use another word than unbelievable,” Caffo said. Then he found some more: “Crazy . . . surreal . . . heartwarming.” A really interesting discussion of why “A Bet is a Tax on B.S.”. It nicely describes why intelligent bettors must be disinterested in the outcome, otherwise they will end up losing money.

A statistician loves the #insurancepoll...now how do we analyze it?

Amanda Palmer broke Twitter yesterday with her insurance poll. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information: quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?

Sunday Data/Statistics Link Roundup (9/2/2012)

Just got back from IBC 2012 in Kobe, Japan. I was in an awesome session (organized by the inimitable Lieven Clement) with great talks by Matt McCall, Djork-Arne Clevert, Adetayo Kasim, and Willem Talloen. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I saw this hanging in the airport.

A deterministic statistical machine

As Roger pointed out the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, where the title, “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.

Sunday data/statistics link roundup (8/26/12)

First off, a quick apology for missing last week, and thanks to Augusto for noticing! On to the links: Unbelievably, the BRCA gene patents were upheld by the lower court despite the Supreme Court coming down pretty unequivocally against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court. A really nice interview with David Spiegelhalter on Statistics and Risk.

Interview with C. Titus Brown - Computational biologist and open access champion

C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. He develops computational software for next generation sequencing and is the author of the blog, “Living in an Ivory Basement”. We talked to Titus about open access (he publishes his unfunded grants online!), improving the reputation of PLoS One, his research in computational software development, and work-life balance in academics.

Statistics/statisticians need better marketing

Statisticians have not always been great self-promoters. I think in part this comes from our tendency to be arbiters rather than being involved in the scientific process. In some ways, I think this is a good thing. Self-promotion can quickly become really annoying. On the other hand, I think our advertising shortcomings are hurting our field in a number of different ways. Here are a few: As Rafa points out even though statisticians are ridiculously employable right now it seems like statistics M.

Why we are teaching massive open online courses (MOOCs) in R/statistics for Coursera

Editor’s Note: This post was written by Roger Peng and Jeff Leek. A couple of weeks ago, we announced that we would be teaching free courses in Computing for Data Analysis and Data Analysis on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera leading to a large number of new offerings. That, coupled with a new round of funding for Coursera, led to press coverage in the New York Times, the Atlantic, and other media outlets.

Sunday Data/Statistics Link Roundup (7/22/12)

This is the paper describing how Uri Simonsohn identified academic misconduct using statistical analyses. This approach has received a huge amount of press in the scientific literature. The basic approach is that he calculates the standard deviations of mean/standard deviation estimates across the groups being compared. Then he simulates from a Normal distribution and shows that under the Normal model, it is unlikely that the means/standard deviations would be so similar.
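To give a feel for the simulation idea described above, here is a minimal sketch in Python. It is not Simonsohn’s actual procedure: the function name, the equal group sizes, the use of a pooled standard deviation, and the example numbers in the usage note are all assumptions made for illustration. It asks how often Normal data with a common standard deviation would produce group standard deviations at least as similar as the reported ones.

```python
import random
import statistics


def similarity_p_value(reported_sds, n_per_group, n_sims=2000, seed=0):
    """Estimate how often Normal data gives group SDs as similar as reported.

    Illustrative assumptions: every group has the same size and all groups
    share one true SD, estimated by pooling the reported SDs.
    """
    rng = random.Random(seed)

    # Observed spread: the standard deviation of the reported group SDs.
    observed_spread = statistics.stdev(reported_sds)

    # Pooled SD for equal group sizes: root of the mean variance.
    pooled_sd = (sum(s * s for s in reported_sds) / len(reported_sds)) ** 0.5

    hits = 0
    for _ in range(n_sims):
        # Simulate one study: draw each group from Normal(0, pooled_sd).
        simulated_sds = [
            statistics.stdev([rng.gauss(0.0, pooled_sd) for _ in range(n_per_group)])
            for _ in reported_sds
        ]
        # Count simulations whose SDs are at least as similar as reported.
        if statistics.stdev(simulated_sds) <= observed_spread:
            hits += 1
    return hits / n_sims
```

For example, `similarity_p_value([2.00, 2.01, 1.99], n_per_group=15)` uses made-up standard deviations that are nearly identical across three groups; a very small returned fraction would mean such agreement is rare under the Normal model. The same idea can be applied to the group means.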

Interview with Lauren Talbot - Quantitative analyst for the NYC Financial Crime Task Force

Lauren Talbot is a quantitative analyst for the New York City Financial Crime Task Force. Before working for NYC she was an analyst at Acumen LLC and got her degree in economics from Stanford University. She is a key player turning spatial data in NYC into new tools for government management. We talked to Lauren about her work, how she is using open data to do things like predict where fires might occur, and how she got started in the Financial Crime Task Force.

Help me find the good JSM talks

I’m about to head out for JSM in a couple of weeks. The sheer magnitude of the conference means it is pretty hard to figure out what talks I should attend. One approach I’ve used in the past is to identify people who I know give good talks and go to their talks. But that isn’t a very good talk-discovery mechanism. So this year I’m trying a crowd-sourcing experiment. First, some background on what kind of talks I like.

Sunday Data/Statistics Link Roundup (7/15/12)

A really nice list of journals’ software/data release policies from Titus’ blog. Interesting that he couldn’t find a data/release policy for the New England Journal of Medicine. I wonder if that is because it publishes mostly clinical studies, where the data are often protected for privacy reasons? It seems like there is going to eventually be a big discussion of the relative importance of privacy and open data in the clinical world.

Motivating statistical projects

It seems like half of the battle in statistics is identifying an important/unsolved problem. In math, this is easy, they have a list. So why is it harder for statistics? Since I have to think up projects to work on for my research group, for classes I teach, and for exams we give, I have spent some time thinking about ways that research problems in statistics arise. I borrowed a page out of Roger’s book and made a little diagram to illustrate my ideas (actually I can’t even claim credit, it was Roger’s idea to make the diagram).

A plot of my citations in Google Scholar vs. Web of Science

There has been some discussion about whether Google Scholar’s numbers or those of one of the proprietary software companies are better for citation counts. I personally think Google Scholar is better for a number of reasons: the numbers are higher, but consistently/adjustably higher; it’s free and the data are openly available; it covers more ground (patents, theses, etc.) to give a better idea of global impact; and it’s easier to use. I haven’t seen a plot yet relating Web of Science citations to Google Scholar citations, so I made one for my papers.

Sunday data/statistics link roundup (1/29)

A really nice D3 tutorial. I’m 100% on board with D3, if they could figure out a way to export the graphics as pdfs, I think this would be the best visualization tool out there. A personalized calculator that tells you what number (of the 7 billion or so) that you are based on your birth day. I’m person 4,590,743,884. Makes me feel so special…. An old post of ours, on dongle communism.

Sunday Data/Statistics Link Roundup

Statistics help for journalists (don’t forget to keep rating stories!) This is the kind of thing that could grow into a statisteracy page. The author also has a really nice plug for public schools.  An interactive graphic to determine if you are in the 1% from the New York Times (I’m not…). Mike Bostock’s d3.js presentation, this is some really impressive visualization software. You have to change the slide numbers manually but it is totally worth it.

In the era of data what is a fact?

The Twitter universe is abuzz about this article in the New York Times. Arthur Brisbane, who responds to readers’ comments, asks: “I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.” He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims.

Help us rate health news reporting with citizen-science powered http://www.healthnewsrater.com

We here at Simply Statistics are big fans of science news reporting. We read newspapers, blogs, and the news sections of scientific journals to keep up with the coolest new research. But health science reporting, although exciting, can also be incredibly frustrating to read. Many articles have sensational titles, like “How using Facebook could raise your risk of cancer”. The articles go on to describe some research and interview a few scientists, then typically make fairly large claims about what the research means.

Sunday Data/Statistics Link Roundup

A few data/statistics related links of interest: Eric Lander profile; the math of lego (should be “The statistics of lego”); where people are looking for homes; Hans Rosling’s TED Talk on the developing world (an oldie but a goodie); Elsevier is trying to make open-access illegal (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more here.

Where do you get your data?

Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?” My contention is that if someone asks you these questions, start looking for the exits. There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality.

List of cities/states with open data - help me find more!

It’s the beginning of 2012 and statistics/data science has never been hotter. Some of the most important data is data collected about civic organizations. If you haven’t seen Bill Gates’s TED Talk about the importance of state budgets, you should watch it now. A major key to solving a lot of our economic problems lies in understanding and using data collected about cities and states. U.S. cities and states are jumping on this idea and our own Baltimore was one of the earliest adopters.

OK Cupid data on Infochimps - anybody got $1k for data?

OK Cupid is an online dating site that has grown its visibility in part through a pretty awesome blog called OK Trends, where they have analyzed their online dating data to, for example, show you what kind of profile picture works best. Now, they have compiled data from their personality survey and made it available online through Infochimps. We have talked about Infochimps before, it is basically a site for distributing/selling data.

Web-scraping

The internet is the greatest source of publicly available data. One of the key skills to being able to obtain data from the web is “web-scraping”, where you use a piece of software to run through a website and collect information. This technique can be used for collecting data from databases or to collect data that is scattered across a website. Here is a very cool little exercise in web-scraping that can be used as an example of the things that are possible.
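As a rough illustration of what “running through a website and collecting information” can look like, here is a small sketch using only Python’s standard library. It is not the exercise linked above; the class and function names are made up for this example, and it works on an HTML string rather than a live site so that it runs anywhere.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> tag we are inside, if any
        self._text_parts = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; anchors without an
            # href leave _current_href as None and are skipped.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def scrape_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

To scrape a real page you would first download the HTML (for example with `urllib.request.urlopen`) and pass the decoded text to `scrape_links`. Check a site’s terms of use before scraping it.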

APIs!

Application programming interfaces (APIs) are tools that are built by companies/governments/organizations to allow software engineers to interact with their websites. One of the main uses of these APIs is to allow software engineers to build apps on top of Facebook/Twitter/etc. Many APIs are really helpful for statisticians/data scientists as well. Using APIs, it is generally very easy to collect large amounts of interesting data. Here are some examples of APIs (you may need to sign up for accounts to get access to some of these).
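To make the idea concrete, here is a minimal sketch of the two steps most API-based data collection boils down to: building a request URL from query parameters and turning the JSON response into rows. The endpoint `https://api.example.org/v1/records`, the parameter names, and the `results`/`id`/`value` fields are all hypothetical; a real API documents its own URLs, parameters, and response format.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint used only for illustration; not a real service.
BASE_URL = "https://api.example.org/v1/records"


def build_request_url(base_url, **params):
    """Attach query parameters (sorted by name) to an API endpoint."""
    return base_url + "?" + urlencode(sorted(params.items()))


def parse_records(payload):
    """Turn a JSON response body into a list of (id, value) rows.

    Assumes a made-up response shape: {"results": [{"id": ..., "value": ...}]}.
    """
    body = json.loads(payload)
    return [(record["id"], record["value"]) for record in body.get("results", [])]
```

In practice you would fetch the URL from `build_request_url` with an HTTP client, often passing an API key you get when you sign up, and hand the response text to `parse_records`.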

OracleWorld Claims and Sensations

Larry Ellison, the CEO of Oracle, like most technology CEOs, has a tendency for the over-the-top sales pitch. But it’s fun to keep track of what these companies are up to just to see what they think the trends are. It seems clear that companies like IBM, Oracle, and HP, which focus substantially on the enterprise (or try to), think the future is data data data. One piece of evidence is the list of companies that they’ve acquired recently.

The Open Data Movement

I’m not sure which of the categories this infographic on open data falls into, but I find it pretty exciting anyway. It shows the rise of APIs and how data are increasingly open. It seems like APIs are all over the place in the web development community, but less so in health statistics. Although, from the comments, John M. posts places to find free government data including some health data:  1) CDC’s National Center for Health Statistics, http://www.

Private health insurers to release data

It looks like four major private health insurance companies will be releasing data for use by academic researchers. They will create a non-profit institute called the Health Care Cost Institute and deposit the data there. Researchers can request the data from the institute by (I’m guessing) writing a short proposal. Health insurance billing claims data might not sound all that exciting, but they are a gold mine of very interesting information about population health.

Data Sources

Here are places you can get data sets to analyze (for class projects, fun and profit!): Data Market, Infochimps, Data.gov, and Factual.com. I’m sure there are a ton more…would love to hear from people.

Google Fusion Tables

Thanks to Hilary Parker for pointing out Google Fusion Tables. The coolest thing here, from my self-centered spatial statistics point of view, is that it automatically geocodes locations for you. So you can upload a spreadsheet of addresses and it will map them for you on Google Maps. Unfortunately, there doesn’t seem to be an easy way to extract the latitude/longitude values, but I’m hoping that’s just a quick hack away….

Data analysis companies getting gobbled up

Companies that specialize in data analysis, or essentially, statistics, are getting gobbled up by larger companies. IBM bought SPSS, then later Algorithmics. MSCI bought RiskMetrics. HP bought Autonomy. Who’s next? SAS?

Build your own pre-cog

Okay, this is not really about pre-cog, but just a pointer to some data that might be of interest to people. A number of cities post their crime data online, ready for scraping and data analysis. For example, the Baltimore Sun has a Google map of homicides in the city of Baltimore. There’s also some data for Oakland. Looking at the map is fun, but not particularly useful from a data analysis standpoint.