
How to use Bioconductor to find empirical evidence in support of π being a normal number

Happy π day everybody! I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits has the property that any given m-digit pattern occurs with limiting frequency 10^-m. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the first billion digits of π (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).
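To see where the 0-to-3 prediction comes from, here is a minimal sketch of the Poisson calculation (the pattern length and digit count are from the post; the rest is base R):

```r
m <- 9                         # pattern length ("123456789")
n <- 1e9                       # decimal digits examined
lambda <- (n - m + 1) / 10^m   # expected number of occurrences, ~1

dpois(0:3, lambda)   # P(exactly 0, 1, 2, 3 occurrences)
ppois(3, lambda)     # P(at most 3) is about 0.98
```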

A Shiny web app to find out how much medical procedures cost in your state.

Today the front page of the Huffington Post featured newly released data that show the cost of many popular procedures broken down by hospital. We here at Simply Statistics think you should be able to explore these data more easily, so we built a Shiny web app that lets you look up what these procedures cost in your state.

Introducing the healthvis R package - one line D3 graphics with R

We have been a little slow on the posting for the last couple of months here at Simply Stats. That’s bad news for the blog, but good news for our research programs! Today I’m announcing the new healthvis R package that is being developed by my student Prasad Patil (who needs a website like yesterday), Hector Corrada Bravo, and myself*. The basic idea is that I have loved D3 interactive graphics for a while.

Statisticians and computer scientists - if there is no code, there is no paper

I think it has been beaten to death that the incentives in academia lean heavily toward producing papers and less toward producing/maintaining software. There are people who are way, way more knowledgeable than I am about building and maintaining software. For example, Titus Brown hit a lot of the key issues in his interview. The open source community is also filled with advocates and researchers who know way more about this than I do.

Review of R Graphics Cookbook by Winston Chang

I just got a copy of Winston Chang’s book R Graphics Cookbook, published by O’Reilly Media. This book joins a growing series of O’Reilly books on R, including the R Cookbook. Winston Chang is a graduate student at Northwestern University but is probably better known to R users as an active member of the ggplot2 mailing list and an active contributor to the ggplot2 source code. The book has a typical cookbook format.

Computing for Data Analysis Returns

I’m happy to announce that my course Computing for Data Analysis will return to Coursera on January 2nd, 2013. While I had previously announced that the course would be presented again right here, it made more sense to do it again on Coursera where it is (still) free and the platform there is much richer. For those of you who missed it the last time around, this is your chance to take it and learn a little R.

I give up, I am embracing pie charts

Most statisticians know that pie charts are a terrible way to plot percentages. You can find explanations here, here, and here as well as the R help file for the pie function which states: Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.
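As a quick illustration of that advice, here are the same made-up percentages drawn both ways:

```r
shares <- c(A = 35, B = 30, C = 20, D = 15)  # made-up percentages
par(mfrow = c(1, 2))
pie(shares, main = "Pie chart")
dotchart(shares, main = "Dot chart", xlab = "Percent")
```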

On weather forecasts, Nate Silver, and the politicization of statistical illiteracy

As you know, we have a thing for statistical literacy here at Simply Stats. So of course this column over at Politico got our attention (via Chris V. and others). The column is an attack on Nate Silver, who has a blog where he tries to predict the outcome of elections in the U.S., you may have heard of it… The argument that Dylan Byers makes in the Politico column is that Nate Silver is likely to be embarrassed by the outcome of the election if Romney wins.

Computing for Data Analysis (Simply Statistics Edition)

As the entire East Coast gets soaked by Hurricane Sandy, I can’t help but think that this is the perfect time to…take a course online! Well, as long as you have electricity, that is. I live in a heavily tree-lined area and so it’s only a matter of time before the lights cut out on me (I’d better type quickly!). I just finished teaching my course Computing for Data Analysis through Coursera.

A statistical project bleg (urgent-ish)

We all know that politicians can play it a little fast and loose with the truth. This is particularly true in debates, where politicians have to think on their feet and respond to questions from the audience or from each other. Usually, we find out about how truthful politicians are in the “post-game show”. The discussion of the veracity of the claims is usually based on independent fact checkers such as PolitiFact.

Why we are teaching massive open online courses (MOOCs) in R/statistics for Coursera

Editor’s Note: This post was written by Roger Peng and Jeff Leek. A couple of weeks ago, we announced that we would be teaching free courses in Computing for Data Analysis and Data Analysis on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera, leading to a large number of new offerings. That, coupled with a new round of funding for Coursera, led to press coverage in the New York Times, the Atlantic, and other media outlets.

Really Big Objects Coming to R

I noticed in the development version of R the following note in the NEWS file: There is a subtle change in behaviour for numeric index values 2^31 and larger. These used never to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.
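A hedged sketch of what the change means in practice; this needs a 64-bit build of R-devel and roughly 16 GB of memory to actually run:

```r
# Indexing at 2^31, which used to give NA, now extends the vector
# on a 64-bit platform (and gives an error on a 32-bit one).
x <- numeric(2^31 - 1)   # the old maximum vector length
x[2^31] <- 1             # legal on 64-bit R-devel: x becomes a long vector
length(x)                # 2147483648
```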

A closer look at data suggests Johns Hopkins is still the #1 US hospital

The US News Best Hospitals 2012-2013 rankings are out. The big news is that Johns Hopkins has lost its throne. For 21 consecutive years Hopkins was ranked #1, but this year Mass General Hospital (MGH) took the top spot, displacing Hopkins to #2. However, Elisabet Pujadas, an MD-PhD student here at Hopkins, took a close look at the data used for the rankings and made this plot (by hand!). The plot shows histograms of the rankings by specialty and shows Hopkins outperforming MGH.

Johns Hopkins Coursera Statistics Courses

Computing for Data Analysis: http://www.youtube.com/watch?v=gk6E57H6mTs
Data Analysis: http://www.youtube.com/watch?v=-lutj1vrPwQ
Mathematical Biostatistics Bootcamp: http://www.youtube.com/watch?v=ekdpaf_WT_8

This graph shows that President Obama's proposed budget treats the NIH even worse than G.W. Bush - Sign the petition to increase NIH funding!

The NIH provides financial support for a large percentage of biological and medical research in the United States. This funding supports a large number of US jobs, creates new knowledge, and improves healthcare for everyone. So I am signing this petition:  NIH funding is essential to our national research enterprise, to our local economies, to the retention and careers of talented and well-educated people, to the survival of our medical educational system, to our rapidly fading worldwide dominance in biomedical research, to job creation and preservation, to national economic viability, and to our national academic infrastructure.

A plot of my citations in Google Scholar vs. Web of Science

There has been some discussion about whether Google Scholar or one of the proprietary software companies’ numbers are better for citation counts. I personally think Google Scholar is better for a number of reasons:

- Higher numbers, but consistently/adjustably higher
- It’s free and the data are openly available
- It covers more ground (patents, theses, etc.) to give a better idea of global impact
- It’s easier to use

I haven’t seen a plot yet relating Web of Science citations to Google Scholar citations, so I made one for my papers.

Statistics project ideas for students

Here are a few ideas that might make for interesting student projects at all levels (from high school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea. Happy data crunching!

Data Collection/Synthesis

- Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, and sampling.

Prediction: the Lasso vs. just using the top 10 predictors

One incredibly popular tool for the analysis of high-dimensional data is the lasso. The lasso is commonly used in cases when you have many more predictors than independent samples (the n << p problem). It is also often used in the context of prediction. Suppose you have an outcome Y and several predictors X1, …, XM; the lasso fits the model:

Y = β0 + β1 X1 + β2 X2 + … + βM XM + ε
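A minimal sketch of the comparison on simulated data, assuming the glmnet package (the data, and the “top 10 by marginal correlation” rule, are illustrative stand-ins rather than the post’s exact setup):

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 10), rep(0, p - 10))   # 10 truly active predictors
y <- drop(X %*% beta + rnorm(n))

# Strategy 1: the lasso, with the penalty chosen by cross-validation
cvfit <- cv.glmnet(X, y)

# Strategy 2: just use the 10 predictors most correlated with y
top10 <- order(abs(cor(X, y)), decreasing = TRUE)[1:10]
fit10 <- lm(y ~ X[, top10])
```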

An R script for estimating future inflation via the Treasury market

One factor that is critical for any financial planning is estimating what future inflation will be. For example, if you’re saving money in an instrument that gains 3% per year, and inflation is estimated to be 4% per year, well then you’re losing money in real terms. There are a variety of ways to estimate the rate of future inflation. You could, for example, use past rates as an estimate of future rates.
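One common market-based approach is the “breakeven” rate: the spread between nominal Treasury yields and TIPS (inflation-protected) yields at the same maturity. A minimal sketch with made-up yields, assuming this is the quantity the script estimates:

```r
# Breakeven inflation ~= nominal yield - real (TIPS) yield at the
# same maturity; the numbers below are illustrative, not market data.
nominal10 <- 0.0265   # 10-year nominal Treasury yield (made up)
tips10    <- 0.0040   # 10-year TIPS yield (made up)
nominal10 - tips10    # ~2.25% implied average annual inflation
```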

Why don't we hear more about Adrian Dantley on ESPN? This graph makes me think he was as good an offensive player as Michael Jordan.

In my last post I complained about efficiency not being discussed enough by NBA announcers and commentators. I pointed out that some of the best scorers have relatively low FG% or TS%. However, via the comments it was pointed out that top scorers need to take more difficult shots and thus are expected to have lower efficiency. The plot below (made with this R script) seems to confirm this (click the image to enlarge).
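For reference, the two efficiency measures mentioned above can be computed from box-score totals; TS% uses the standard 0.44 weighting for free-throw attempts (the numbers here are made up):

```r
fgm <- 10; fga <- 25   # field goals made / attempted
pts <- 28; fta <- 8    # points scored and free throws attempted

fg_pct <- fgm / fga                       # field-goal percentage
ts_pct <- pts / (2 * (fga + 0.44 * fta))  # true shooting percentage
c(FG = fg_pct, TS = ts_pct)               # 0.40 vs ~0.49
```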

This graph makes me think Kobe is not that good, he just shoots a lot

I find it surprising that NBA commentators rarely talk about field goal percentage. Everybody knows that the more you shoot the more you score. But players that score a lot are admired without consideration of their FG%. Of course having a high FG% is not necessarily admirable as many players only take easy shots, while top-scorers need to take difficult ones. Regardless, missing is undesirable and players that miss more than usual are not criticized enough.

A wordcloud comparison of the 2011 and 2012 #SOTU

I wrote a quick (and very dirty) R script for creating a comparison cloud and a commonality cloud for President Obama’s 2011 and 2012 State of the Union speeches. The cloud on the left shows words that have different frequencies between the two speeches and the cloud on the right shows the words in common between the two speeches. Here is a higher resolution version. The focus on jobs hasn’t changed much.
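A rough sketch of how such clouds are built with the tm and wordcloud packages; the file names are placeholders and the cleaning steps are guesses at what the “very dirty” script does:

```r
library(tm)
library(wordcloud)

speeches <- c(readChar("sotu2011.txt", 1e6),   # placeholder files
              readChar("sotu2012.txt", 1e6))
docs <- VCorpus(VectorSource(speeches))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))

tdm <- as.matrix(TermDocumentMatrix(docs))
colnames(tdm) <- c("2011", "2012")

comparison.cloud(tdm, max.words = 100)   # words used differently
commonality.cloud(tdm, max.words = 100)  # words the speeches share
```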

Baltimore gun offenders and where academics don't live

Jeff recently posted links to data from cities and states. He and I wrote R code that plots gun offender locations for Baltimore. Specifically, we plot the locations that appear on this table. I added the locations of the Baltimore neighborhoods where most of our Hopkins colleagues live, as well as the location of the medical institutions where we work. Note the corridor with no points between the West side (Barksdale territory) and East side (Prop Joe territory).

An R function to map your Twitter Followers

I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a previous post. I also found a lot of really useful code over at flowing data here. The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you.
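A hypothetical usage sketch: the file name and the twitterMap() call signature below are guesses based on the post’s description, not the actual script:

```r
# Assumed interface: source the post's script, then map an account.
source("twitterMap.R")      # placeholder path to the post's code
twitterMap("simplystats")   # map who follows this account
```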

Plotting BeijingAir Data

Here’s a bit of R code for scraping the BeijingAir Twitter feed and plotting the hourly PM2.5 values for the past 24 hours. The script defaults to the past 24 hours but you can modify that by simply changing the value of the variable ‘n’. You can just grab the code from this R script. Note that you need to use the latest version of the ‘twitteR’ package because the data structure has changed from previous versions.
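For flavor, a hedged sketch of the idea using the old twitteR interface; the feed’s text format and the parsing regex are guesses, so prefer the post’s actual script:

```r
library(twitteR)

n <- 24                                      # hours of data to plot
tweets <- userTimeline("BeijingAir", n = n)  # latest tweets from the feed
txt <- sapply(tweets, function(s) s$getText())
pm25 <- as.numeric(sub("[^0-9.]*([0-9.]+).*", "\\1", txt))  # first number
plot(rev(pm25), type = "l", xlab = "Hour", ylab = "PM2.5")
```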

Contributions to the R source

One of the nice things about tracking the R subversion repository using git instead of subversion is you can do git shortlog -s -n, which gives you:

    19855  ripley
     6302  maechler
     5299  hornik
     2263  pd
     1153  murdoch
      813  iacus
      716  luke
      661  jmc
      614  leisch
      472  ihaka
      403  murrell
      286  urbaneks
      284  rgentlem
      269  apache
      253  bates
      249  tlumley
      164  duncan
       92  r
       43  root
       40  paul
       40  falcon
       39  lyndon
       34  thomas
       33  deepayan
       26  martyn
       18  plummer
       15  (no author)
       14  guido
        3  ligges
        1  mike

These data are since 1997, so for Brian Ripley, that’s 3.

An R function to analyze your Google Scholar Citations page

Google Scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself here. I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.

Expected Salary by Major

In this recent editorial about the Occupy Wall Street movement, Richard Kim profiles a protester who, despite having a master’s degree, can’t find a job. This particular protester quit his job as a school teacher three years ago and took out a $35K student loan to obtain a master’s degree in puppetry from the University of Connecticut. I wonder if, before taking his money, UConn showed this person data on job prospects for their puppetry graduates.

Computing on the Language Followup

My article on computing on the language was unexpectedly popular and so I wanted to quickly follow up on my own solution. Many of you got the answer, and in fact many got solutions that were quite a bit shorter than mine. Here’s how I did it:

    makeList <- function(...) {
        args <- substitute(list(...))
        nms <- sapply(args[-1], deparse)
        vals <- list(...)
        names(vals) <- nms
        vals
    }

Baptiste pointed out that Frank Harrell has already implemented this function in his Hmisc package as the ‘llist’ function (thanks for the pointer!)
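A quick check of what the function returns, with hypothetical objects:

```r
x <- 1:3
y <- "hello"
makeList(x, y)
# $x
# [1] 1 2 3
#
# $y
# [1] "hello"
```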

Computing on the Language

And now for something a bit more esoteric…. I recently wrote a function to deal with a strange problem. Writing the function ended up being a fun challenge related to computing on the R language itself. Here’s the problem: Write a function that takes any number of R objects as arguments and returns a list whose names are derived from the names of the R objects. Perhaps an example provides a better description.
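Concretely, the annoyance being solved is having to repeat every name by hand; a sketch of the status quo:

```r
alpha <- rnorm(5)
beta  <- letters[1:3]
# The names below just duplicate the object names; the challenge is
# to have a function derive them automatically from its arguments.
list(alpha = alpha, beta = beta)
```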

Colors in R

One of my favorite R packages that I use all the time is the RColorBrewer package. The package has been around for a while now and is written/maintained by Erich Neuwirth. The guts of the package are based on Cynthia Brewer’s very cool work on the use of color in cartography (check out the colorbrewer web site). As a side note, I think the ability to manipulate colors in plots/graphs/maps is one of R’s many great strengths.
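A minimal example of the package in action (the palette choice is arbitrary):

```r
library(RColorBrewer)
display.brewer.all()             # browse all of the Brewer palettes
cols <- brewer.pal(5, "Blues")   # five sequential blues
barplot(rep(1, 5), col = cols, main = "brewer.pal(5, 'Blues')")
```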

An R function to determine if you are a data scientist

“Data scientist” is one of the buzzwords in the running for rebranding applied statistics mixed with some computing. David Champagne, over at Revolution Analytics, described the skills for being a data scientist with a Venn Diagram. Just for fun, I wrote a little R function for determining where you land on the data science Venn Diagram. Here is an example of a plot the function makes using the Simply Statistics bloggers as examples.

R Workshop: Reading in Large Data Frames

One question I get a lot is how to read large data frames into R. There are some useful tricks that can save you both time and memory when reading large data frames, but I find that many people are not aware of them. Of course, your ability to read data is limited by your available memory. I usually do a rough calculation along the lines of rows × columns × 8 bytes / 2^20, which gives the approximate size in megabytes for numeric data.
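Two of the standard tricks, as a sketch (“big.csv” is a placeholder): do the memory arithmetic first, then tell read.table the column classes and row count so it doesn’t have to guess them.

```r
# Rough memory estimate: rows * columns * 8 bytes, converted to MB
1e6 * 20 * 8 / 2^20   # ~153 MB for a 1,000,000 x 20 numeric frame

# Read a small chunk to learn the column classes, then the full file
top <- read.table("big.csv", header = TRUE, sep = ",", nrows = 100)
classes <- sapply(top, class)
big <- read.table("big.csv", header = TRUE, sep = ",",
                  colClasses = classes, nrows = 1e6)
```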

R Workshop

I am going to start a continuing “R Workshop” series of posts with R tips and tricks. If you have questions you’d like answered or were wondering about certain aspects, please leave them in the comments.