

How to use Bioconductor to find empirical evidence in support of π being a normal number

Happy π day everybody!

I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits has the property that any given m-digit pattern appears at a randomly chosen position with probability 10^-m. For example, using the Poisson approximation, we can predict that the pattern "123456789" should show up between 0 and 3 times in the first billion digits of π (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).
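This Poisson prediction is a quick back-of-the-envelope check (mine, not part of the original analysis): a fixed 9-digit pattern has about 10^9 × 10^-9 = 1 expected occurrence, and a Poisson(1) count lands between 0 and 3 about 98% of the time:

```r
## Expected occurrences of a fixed 9-digit pattern in the first 10^9 digits
lambda <- 1e9 * 10^-9   # about 1 expected occurrence
## Poisson probability of seeing it 0 to 3 times
ppois(3, lambda)        # about 0.98
```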

To test our hypothesis, let Y1, ..., Y100 be the number of occurrences of "00", "01", ..., "99" in the first billion digits of π. If π is in fact normal, then the Ys should be approximately IID binomial with N = 1 billion and p = 0.01. In the qq-plot below I show Z-scores, (Y - 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3, 4, 5, 6, and 7 digit patterns (for 5, 6, and 7 I sampled 10,000 patterns). Note that we can also perform a chi-square test for the uniform distribution. For patterns of size 1, 2, 3, and 4 the p-values were 0.84, 0.89, 0.92, and 0.99.
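As a sanity check on this prediction (a simulation of my own, not the π digits themselves), we can generate 100 IID binomial counts with the same N and p and confirm that their z-scores look standard normal in a qq-plot:

```r
## Simulated check: 100 IID binomial counts, same N and p as the 2-digit patterns
set.seed(314)
n <- 1e9; p <- 0.01
y <- rbinom(100, size = n, prob = p)   # stand-ins for the pattern counts
z <- (y - n*p) / sqrt(n*p*(1 - p))     # same z-scores as in the text
qqnorm(z, xlab = "Theoretical quantiles", ylab = "Simulated z-scores")
qqline(z)
```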


Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.


These observed counts should also be independent, and to explore this we can look at autocorrelation plots:
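For comparison (again a simulation, not the π digits), this is what the autocorrelation of genuinely independent binomial counts looks like - apart from lag 0, the spikes should stay inside the confidence band:

```r
## ACF of independent binomial counts, for comparison with the pi segment counts
set.seed(3)
counts <- rbinom(10^4, size = 10^4, prob = 0.1)
acf(counts, main = "ACF of independent binomial counts")
```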


To do this in about an hour and with just a few lines of code (included below), I used the Bioconductor Biostrings package to match strings and the foreach function to parallelize.

library(Biostrings); library(foreach); library(doParallel)
registerDoParallel(cores = 48)
x <- readLines("pi-billion.txt")  ## the first billion digits of pi as one string
x <- substr(x, 3, nchar(x))       ## remove 3.
x <- BString(x)
n <- nchar(x)
for(d in 2:7){ ## 2- through 7-digit patterns
    if(d < 5){
        patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
    } else { ## too many patterns to count them all; sample 10,000
        patterns <- sprintf(paste0("%0", d, "d"), sample(seq(0, 10^d - 1), 10^4))
    }
    p <- 1/(10^d)
    res <- foreach(pat = patterns, .combine = c) %dopar% countPattern(pat, x)
    z <- (res - n*p) / sqrt(n*p*(1-p))
    qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores", main = paste(d, "digits"))
    ##correction: original post had length(res)
    if(d < 5) print(1 - pchisq(sum((res - n*p)^2/(n*p)), length(res) - 1))
}
###Now count in segments
d <- 1
m <- 10^4  ## segment length: 10,000 digits, giving 100,000 segments
patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
res <- foreach(pat = patterns, .combine = cbind) %dopar% {
    tmp <- start(matchPattern(pat, x))  ## start position of each match
    tmp2 <- floor((tmp - 1)/m)          ## segment in which each match falls
    tabulate(tmp2 + 1, nbins = n/m)     ## counts per segment
}
p <- 1/(10^d)
for(i in 1:ncol(res)){
    z <- (res[, i] - m*p) / sqrt(m*p*(1-p))
    qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores", main = paste(i - 1))
}
##ACF plots
for(i in 1:ncol(res)) acf(res[, i])

NB: A normal number has the above-stated property in any base. The examples above are for base 10.


A Shiny web app to find out how much medical procedures cost in your state.

Today the front page of the Huffington Post featured new data available from the CMS showing the cost of many popular procedures broken down by hospital. We here at Simply Statistics think you should be able to explore these data more easily. So we asked John Muschelli to help us build a Shiny app that allows you to interact with these data. You can choose your state and your procedure and see how much the procedure costs at hospitals in your state. It takes a second to load because there is a lot of data.

Here is the link to the app.

Here are some screenshots for intracranial hemorrhage for the US and for Idaho.


The R code is here if you want to tweak/modify.


Introducing the healthvis R package - one line D3 graphics with R

We have been a little slow on the posting for the last couple of months here at Simply Stats. That’s bad news for the blog, but good news for our research programs!

Today I’m announcing the new healthvis R package that is being developed by my student Prasad Patil (who needs a website like yesterday), Hector Corrada Bravo, and myself*. The basic idea is that I have loved D3 interactive graphics for a while. But they are hard to create from scratch, since they require knowledge of both Javascript and the D3 library.

Even with those skills, it can take a while to develop a new graphic. On the other hand, I know a lot about R and am often analyzing biomedical data where interactive graphics could be hugely useful. There are a couple of really useful tools for creating interactive graphics in R, most notably Shiny, which is awesome. But these tools still require a bit of development to get right and are designed for “stand alone” tools.

So we created an R package that builds specific graphs that come up commonly in the analysis of health data like survival curves, heatmaps, and icon arrays. For example, here is how you make an interactive survival plot comparing treated to untreated individuals with healthvis:

# Load libraries
library(survival)   # provides coxph() and the veteran dataset
library(healthvis)

# Run a cox proportional hazards regression

cobj <- coxph(Surv(time, status)~trt+age+celltype+prior, data=veteran)

# Plot using healthvis - one line!

survivalVis(cobj, data=veteran, plot.title="Veteran Survival Data", group="trt", group.names=c("Treatment", "No Treatment"), line.col=c("#E495A5","#39BEB1"))

The “survivalVis” command above produces an interactive graphic like this. Here it is embedded (you may have to scroll to see the dropdowns on the right - we are working on resizing).

The advantage of this approach is that you can make common graphics interactive without a lot of development time. Here are some other unique features:

  • The graphics are hosted on Google App Engine. With one click you can get a permanent link and share it with collaborators.

  • With another click you can get the code to embed the graphics in your website.

  • If you have already created D3 graphics it only takes a few minutes to develop a healthvis version to let R users create their own - email us and we will make it part of the healthvis package!

  • healthvis is totally general - you can develop graphics that don’t have anything to do with health with our framework. Just email us at healthvis@gmail.com if you want to be a developer.

We have started a blog over at healthvis.org where we will be talking about the tricks we learn while developing D3 graphics, updates to the healthvis package, and generally talking about visualization for new technologies like those developed by the CCNE and individualized health. If you are interested in getting involved as a developer, user or tester, drop us a line and let us know. In the meantime, happy visualizing!

* This project is supported by the JHU CCNE (U54CA151838) and the Johns Hopkins inHealth initiative.


Podcast #6: Data Analysis MOOC Post-mortem

Jeff and I talk about Jeff's recently completed MOOC on Data Analysis.


Statisticians and computer scientists - if there is no code, there is no paper

I think it has been beaten to death that the incentives in academia lean heavily toward producing papers and less toward producing/maintaining software. There are people who are way, way more knowledgeable than I am about building and maintaining software. For example, Titus Brown hit a lot of the key issues in his interview. The open source community is also filled with advocates and researchers who know way more about this than I do.

This post is more about my views on changing the perspective of code/software in the data analysis community. I have been frustrated often with statisticians and computer scientists who write papers where they develop new methods and seem to demonstrate that those methods blow away all their competitors. But then no software is available to actually test and see if that is true. Even worse, sometimes I just want to use their method to solve a problem in our pipeline, but I have to code it from scratch!

I have also had several cases where I emailed the authors for their software and they said it "wasn't fit for distribution" or they "don't have code" or the "code can only be run on our machines". I totally understand the first and last, my code isn't always pretty (I have zero formal training in computer science so messy code is actually the most likely scenario) but I always say, "I'll take whatever you got and I'm willing to hack it out to make it work". I often still am turned down.

So I have a new policy when evaluating CV's of candidates for jobs, or when I'm reading a paper as a referee. If the paper is about a new statistical method or machine learning algorithm and there is no software available for that method - I simply mentally cross it off the CV. If I'm reading a data analysis and there isn't code that reproduces their analysis - I mentally cross it off. In my mind, new methods/analyses without software are just vapor ware. Now, you'd definitely have to cross a few papers off my CV, based on this principle. I do that. But I'm trying really hard going forward to make sure nothing gets crossed off.

In a future post I'll talk about the new issue I'm struggling with - maintaining all that software I'm creating.



Review of R Graphics Cookbook by Winston Chang

I just got a copy of Winston Chang's book R Graphics Cookbook, published by O'Reilly Media. This book joins a growing series of O'Reilly books on R, including an R Cookbook. Winston Chang is a graduate student at Northwestern University but is probably better known to R users as an active member of the ggplot2 mailing list and an active contributor to the ggplot2 source code.

The book has a typical cookbook format. After some preliminaries about how to install R packages and how to read data into R (Chapter 1), he quickly launches into exploratory data analysis and graphing. The basic outline of each section is:

  1. Statement of problem ("You want to make a histogram")
  2. Solution: If you can reasonably do it with base R graphics, here's how you do it. Oh, and here's how you do it in ggplot2. Notice how it's better? (He doesn't actually say that. He doesn't have to.)
  3. Discussion: This usually revolves around different options that might be set or alternative approaches.
  4. See also: Other recipes in the book.
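To give a flavor of the recipe format (a toy example of my own, not one lifted from the book), here is the same histogram as a base graphics one-liner and in ggplot2:

```r
# Base R graphics
hist(mtcars$mpg, breaks = 10, main = "MPG (base graphics)")

# ggplot2 equivalent (if the package is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10))
}
```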

Interestingly, nowhere in the book is the lattice package mentioned (except in passing). But I suppose that's because ggplot2 pretty much supersedes anything you might want to do in the lattice package. Recently, I've been wondering what the future of the lattice package is, given that it doesn't seem to me to be under very active development. But I digress....

Overall, the book is great. I learned quite a few things just in my initial read of the book and as I dug in a bit more there were some functions that I was not familiar with. Much of the material is straight up ggplot2 stuff so if you're an expert there you probably won't get a whole lot more. But my guess is that most are not experts and so will be able to get something out of the book.

The meat of the book covers a lot of different plotting techniques, enough to make your toolbox quite full. If you pick up this book and think something is missing, my guess is that you're making some pretty esoteric plots. I enjoyed the few sections on specifying colors as well as the recipes on making maps (one of ggplot2's strong points). I wish there were more map recipes, but hey, that's just me.

Towards the end there's a nice discussion of graphics file formats (PDF, PNG, WMF, etc.) and the advantages and disadvantages of each (Chapter 14: Output for Presentation). I particularly enjoyed the discussion of fonts in R graphics since I find this to be a fairly confusing aspect of R, even for seasoned users.

The book ends with a series of recipes related to data manipulation. It's funny how many recipes there are about modifying factor variables, but I guess this is just a function of how annoying it is to modify factor variables. There's also some highlighting of the plyr and reshape2 packages.

Ultimately, I think this is a nice complement to Hadley Wickham's ggplot2 as most of the recipes focus on implementing plots in ggplot2. I don't think you necessarily need to have a deep understanding of ggplot2 in order to use this book (there are some details in an appendix), but some people might want to grab Hadley's book for more background. In fact, this may be a better book to use to get started with ggplot2 simply because it focuses on specific applications. I kept thinking that if the book had been written using base graphics only, it'd probably have to be 2 or 3 times longer just to fit all the code in, which is a testament to the power and compactness of the ggplot2 approach.

One last note: I got the e-book version of the book, but I would recommend the paper version. With books like these, I like to flip around constantly (since there's no need to read it in a linear fashion) and I find e-readers like iBooks and Kindle Reader to be not so good at this.


R package meme

I just got this from a former student who is working on a project with me:





Computing for Data Analysis Returns

I'm happy to announce that my course Computing for Data Analysis will return to Coursera on January 2nd, 2013. While I had previously announced that the course would be presented again right here, it made more sense to do it again on Coursera where it is (still) free and the platform there is much richer. For those of you who missed it the last time around, this is your chance to take it and learn a little R.

I've gotten a number of emails from people who were interested in watching the videos for the course. If you just want to sit around and watch videos of me talking, I've created a set of four YouTube playlists based on the four weeks of the course:

The content in the YouTube playlists reflects the content from the first iteration of the course and will not reflect any new material I add to the second iteration (at least not for a little while).

I encourage everyone who is interested to enroll in the course on Coursera because there you'll have the benefit of in-video quizzes and other forms of assessment and will be able to interact with all of the great students who are also enrolled in the class. Also, if you're interested in signing up for Jeff Leek's Data Analysis course (starts on January 22, 2013) and are not very familiar with R, I encourage you to check out Computing for Data Analysis first to get yourself up to speed.

I look forward to seeing you there!


I give up, I am embracing pie charts

Most statisticians know that pie charts are a terrible way to plot percentages. You can find explanations here, here, and here as well as the R help file for the pie function which states:

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.
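The help file's point is easy to demonstrate with a toy example (mine, using made-up shares): the same four percentages as a pie chart and as a dot chart, where lengths are much easier to compare than wedge areas:

```r
## Same (hypothetical) percentages as a pie chart and as a dot chart
shares <- c(A = 35, B = 30, C = 20, D = 15)
par(mfrow = c(1, 2))
pie(shares, main = "Pie chart")
dotchart(shares, xlim = c(0, 40), xlab = "Percent", main = "Dot chart")
```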


I have only used the pie R function once and it was to make this plot (R code below):

So why are they ubiquitous? The best explanation I've heard is that they are easy to make in Microsoft Excel. Regardless, after years of training, lay people are probably better at interpreting pie charts than any other graph. So I'm surrendering and embracing the pie chart. Jeff's recent post shows we have bigger fish to fry.

## Sketch of the code: draw Pac-Man frames with pie(), then stitch them into a GIF
N <- 10
for(i in 0:(N-1)){
    png(sprintf("Rplot%02d.png", i))
    a <- 1 + 80*i/(N - 1)  ## mouth angle in degrees
    pie(c(a, 360 - a)/360, col = c("white", "yellow"), init.angle = -a/2, labels = NA, border = NA)
    dev.off()
}
system("convert Rplot*.png pacman.gif")  ## requires ImageMagick
##system("rm *.png") edited to save caffo's pngs (see comments)
system("rm Rplot*.png")