

P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures

Everyone knows now that you have to correct for multiple testing when you calculate many p-values; otherwise this can happen:


One of the most popular ways to correct for multiple testing is to estimate or control the false discovery rate. The false discovery rate attempts to quantify the fraction of discoveries that are false. If we call all p-values less than some threshold t significant, then, borrowing notation from this great introduction to false discovery rates,

F(t) = #{null p-values <= t}    and    S(t) = #{p-values <= t}.

So F(t) is the (unknown) total number of null hypotheses called significant and S(t) is the total number of hypotheses called significant. The FDR is the expected ratio of these two quantities which, under certain assumptions, can be approximated by the ratio of the expectations:

FDR(t) = E[ F(t) / S(t) ] ≈ E[F(t)] / E[S(t)]

To get an estimate of the FDR we just need estimates of E[F(t)] and E[S(t)]. The latter is easy to estimate as just the total number of rejections (the number of p < t). If you assume that the null p-values follow the expected uniform distribution, then E[F(t)] can be approximated by the fraction of null hypotheses times the total number of hypotheses times t. To do this, we need an estimate of \pi_0, the proportion of null hypotheses. There are a large number of ways to estimate this quantity, but it is almost always estimated using the full distribution of computed p-values in an experiment. The most popular estimator compares the fraction of p-values greater than some cutoff to the number you would expect if every single hypothesis were null; this ratio approximates the fraction of null hypotheses.

Combining the above equation with our estimates for E[F(t)] and E[S(t)] we get:

\hat{FDR}(t) = \hat{\pi}_0 * m * t / #{p-values <= t},   where m is the total number of hypotheses.

The q-value is a multiple testing analog of the p-value and is defined as:

q-value(p) = the minimum of \hat{FDR}(t) over all thresholds t >= p

This is of course a very loose version of the real definitions and you can get a more technical description here. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its "testing partners". Here is a quick example. Suppose that we have done a test and have a p-value of 0.8. Not super significant. Suppose we perform this test in conjunction with a number of hypotheses that are null, generating a p-value distribution like this:


Then you get a q-value greater than 0.99 as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values in a distribution that looks like this:



Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. You can do this with any p-value: if you make its testing partners have sufficiently low p-values, then the q-value will be as small as you like.

A couple of things to note:

  • Obviously doing this on purpose to change the significance of a calculated p-value is cheating and shouldn't be done.
  • For correctly calculated p-values on a related set of hypotheses this is actually a sensible property to have - if you have almost all very small p-values and one very large p-value, you are doing a set of tests where almost everything appears to be alternative and you should weight that in some sensible way.
  • This is the reason that sometimes a "multiple testing adjusted" p-value (or q-value) is smaller than the p-value itself.
  • This doesn't affect non-adaptive FDR procedures - but those procedures still depend on the "testing partners" of any p-value through the total number of tests performed. This is why people talk about the so-called "multiple testing burden". But that is a subject for a future post. It is also the reason non-adaptive procedures can be severely underpowered compared to adaptive procedures when the p-values are correct.
  • I've appended the code to generate the histograms and calculate the q-values in this post in the following gist; a minimal sketch of the same calculation also appears below.
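
To make the mechanics concrete, here is a minimal sketch (not the author's gist) that computes the FDR estimate and q-value directly from the formulas above; the helper names fdr_hat and qval_of, the lambda = 0.5 cutoff used to estimate \pi_0, and the Beta(1, 200) draws used as tiny "testing partners" are all illustrative choices of mine.

set.seed(1)

## estimated FDR at threshold t, following the estimator described above
fdr_hat <- function(pvals, t, lambda = 0.5) {
  m <- length(pvals)
  pi0_hat <- mean(pvals > lambda) / (1 - lambda)   # fraction of p-values above lambda, rescaled
  pi0_hat * m * t / max(sum(pvals <= t), 1)        # pi0-hat * m * t / #{p-values <= t}
}

## q-value: the smallest estimated FDR over all thresholds t >= p (capped at 1)
qval_of <- function(p, pvals) {
  thresholds <- sort(pvals[pvals >= p])
  min(1, vapply(thresholds, function(t) fdr_hat(pvals, t), numeric(1)))
}

p_target <- 0.8

## tested alongside 10,000 null hypotheses: the q-value is close to 1
p_null <- runif(1e4)
qval_of(p_target, c(p_target, p_null))

## the exact same p-value tested alongside 10,000 tiny p-values:
## the q-value collapses to roughly 1e-4
p_tiny <- rbeta(1e4, 1, 200)
qval_of(p_target, c(p_target, p_tiny))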



UCLA Statistics 2015 Commencement Address

I was asked to speak at the UCLA Department of Statistics Commencement Ceremony this past June. As one of the first graduates of that department back in 2003, I was tremendously honored to be invited to speak to the graduates. When I arrived I was just shocked at how much the department had grown. When I graduated I think there were no more than 10 of us between the PhD and Master's programs. Now they have ~90 graduates per year with undergrad, Master's and PhD. It was just stunning.

Here's the text of what I said, which I think I mostly stuck to in the actual speech.


UCLA Statistics Graduation: Some thoughts on a career in statistics

When I asked Rick [Schoenberg] what I should talk about, he said to "talk for 95 minutes on asymptotic properties of maximum likelihood estimators under nonstandard conditions". I thought this is a great opportunity! I busted out Tom Ferguson’s book and went through my old notes. Here we go. Let X be a complete normed vector space….

I want to thank the department for inviting me here today. It’s always good to be back. I entered the UCLA stat department in 1999, only the second entering class, and graduated from UCLA Stat in 2003. Things were different then. Jan was the chair and there were not many classes so we could basically do whatever we wanted. Things are different now and that’s a good thing. Since 2003, I’ve been at the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, where I was first a postdoctoral fellow and then joined the faculty. It’s been a wonderful place for me to grow up and I’ve learned a lot there.

It’s just an incredible time to be a statistician. You guys timed it just right. I’ve been lucky enough to witness two periods like this, the first time being when I graduated from college at the height of the dot com boom. Today, it’s not computer programming skills that the world needs, but rather it’s statistical skills. I wish I were in your shoes today, just getting ready to start up. But since I’m not, I figured the best thing I could do is share some of the things I’ve learned and talk about the role that these things have played in my own life.

Know your edge: What’s the one thing that you know that no one else seems to know? You’re not a clone—you have original ideas and skills. You might think they’re not valuable but you’re wrong. Be proud of these ideas and use them to your advantage. As an example, I’ll give you my one thing. Right now, I believe the greatest challenge facing the field of statistics today is getting the entire world to know what we in this room already know. Data are everywhere today and the biggest barrier to progress is our collective inability to process and analyze those data to produce useful information. The need for the things that we know has absolutely exploded and we simply have not caught up. That’s why I created, along with Jeff Leek and Brian Caffo, the Johns Hopkins Data Science Specialization, which is currently the most successful massive open online course program ever. Our goal is to teach the entire world statistics, which we think is an essential skill. We’re not quite there yet, but—assuming you guys don’t steal my idea—I’m hopeful that we’ll get there sometime soon.

At some point the edge you have will no longer work: That sounds like a bad thing, but it’s actually good. If what you’re doing really matters, then at some point everyone will be doing it. So you’ll need to find something else. I’ve been confronted with this problem at least 3 times in my life so far. Before college, I was pretty good at the violin, and it opened a lot of doors for me. It got me into Yale. But when I got to Yale, I quickly realized that there were a lot of really good violinists here. Suddenly, my talent didn’t have so much value. This was when I started to pick up computer programming and in 1998 I learned an obscure little language called R. When I got to UCLA I realized I was one of the only people who knew R. So I started a little brown bag lunch series where I’d talk about some feature of R to whomever would show up (which wasn’t many people usually). Picking up on R early on turned out to be really important because it was a small community back then and it was easy to have a big impact. Also, as more and more people wanted to learn R, they’d usually call on me. It’s always nice to feel needed. Over the years, the R community exploded and R’s popularity got to the point where it was being talked about in the New York Times. But now you see the problem. Saying that you know R doesn’t exactly distinguish you anymore, so it’s time to move on again. These days, I’m realizing that the one useful skill that I have is the ability to make movies. Also, my experience being a performer on the violin many years ago is coming in handy. My ability to quickly record and edit movies was one of the key factors that enabled me to create an entire online data science program in 2 months last year.

Find the right people, and stick with them forever. Being a statistician means working with other people. Choose those people wisely and develop a strong relationship. It doesn’t matter how great the project is or how famous or interesting the other person is, if you can’t get along then bad things will happen. Statistics and data analysis is a highly verbal process that requires constant and very clear communication. If you’re uncomfortable with someone in any way, everything will suffer. Data analysis is unique in this way—our success depends critically on other people. I’ve only had a few collaborators in the past 12 years, but I love them like family. When I work with these people, I don’t necessarily know what will happen, but I know it will be good. In the end, I honestly don’t think I’ll remember the details of the work that I did, but I’ll remember the people I worked with and the relationships I built.

So I hope you weren’t expecting a new asymptotic theorem today, because this is pretty much all I’ve got. As you all go on to the next phase of your life, just be confident in your own ideas, be prepared to change and learn new things, and find the right people to do them with. Thank you.


Correlation is not a measure of reproducibility

Biologists make wide use of correlation as a measure of reproducibility. Specifically, they quantify reproducibility with the correlation between measurements obtained from replicated experiments. For example, the ENCODE data standards document states

A typical R2 (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 to 0.98. Experiments with biological correlations that fall below 0.9 should be either be repeated or explained.

However, for reasons I will explain here, correlation is not necessarily informative with regard to reproducibility. The mathematical results described below are not inconsequential theoretical details, and understanding them will help you assess new technologies, experimental procedures and computational methods.

Suppose you have collected data from an experiment

x1, x2, ..., xn

and want to determine if  a second experiment replicates these findings. For simplicity, we represent data from the second experiment as adding unbiased (averages out to 0) and statistically independent measurement error d to the first:

y1=x1+d1, y2=x2+d2, ... yn=xn+dn.

For us to claim reproducibility we want the differences

d1=y1-x1, d2=y2-x2,... ,dn=yn-xn

to be "small". To give this some context, imagine the x and y are log scale (base 2) gene expression measurements which implies the d represent log fold changes. If these differences have a standard deviation of 1, it implies that fold changes of 2 are typical between replicates. If our replication experiment produces measurements that are typically twice as big or twice as small as the original, I am not going to claim the measurements are reproduced. However, as it turns out, such terrible reproducibility can still result in correlations higher than 0.92.

To someone basing their definition of correlation on the current common language usage this may seem surprising, but to someone basing it on math, it is not. To see this, note that the mathematical definition of correlation tells us that because d and x are independent:

cor(x, y) = cor(x, x + d) = 1 / sqrt( 1 + var(d) / var(x) )

This tells us that correlation summarizes the variability of d relative to the variability of x. Because of the wide range of gene expression values we observe in practice, the standard deviation of x can easily be as large as 3 (variance is 9). This implies we expect to see correlations as high as 1/sqrt(1+1/9) = 0.95, despite the lack of reproducibility when comparing x to y.
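
A quick simulation (my own illustration, with made-up numbers chosen to match the ones above) shows the effect: replicate measurements that typically differ by 2-fold still produce a correlation above 0.92.

set.seed(1)
x <- rnorm(1000, mean = 8, sd = 3)   # log2 expression values, sd about 3 across genes
d <- rnorm(1000, mean = 0, sd = 1)   # independent measurement error; sd of 1 means 2-fold changes are typical
y <- x + d                           # the "replicate" experiment
cor(x, y)                            # about 0.95 despite the poor reproducibility
sd(y - x)                            # about 1, i.e. replicates typically differ by 2-fold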

Note that using Spearman correlation does not fix this problem. A Spearman correlation of 1 tells us that the ranks of x and y are preserved, yet it does not summarize the actual differences. The problem comes down to the fact that we care about the variability of d, and correlation, Pearson or Spearman, does not provide an optimal summary. While correlation relates to the preservation of ranks, a much more appropriate summary of reproducibility is the distance between x and y, which is related to the standard deviation of the differences d. A very simple R command you can use to generate this summary statistic is:

sqrt(mean(d^2))

or the robust version:

median(abs(d)) ##multiply by 1.4826 for unbiased estimate of true sd

The equivalent suggestion for plots is to make an MA-plot instead of a scatterplot.
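
Continuing the simulated example above (my own code, not from the post), the two summaries and the MA-plot would be:

d <- y - x                           # differences between replicates (log fold changes)
sqrt(mean(d^2))                      # RMS difference, in the same units as the data
median(abs(d)) * 1.4826              # robust estimate of the sd of the differences
plot((x + y) / 2, d, xlab = "Average", ylab = "Difference")   # MA-plot
abline(h = 0, lty = 2)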

But aren't correlations and distances directly related? Sort of, and this actually brings up another problem. If the x and y are standardized to have average 0 and standard deviation 1 then, yes, correlation and distance are directly related:

(1/n) * sum_i (x_i - y_i)^2 ≈ 2 * ( 1 - cor(x, y) )

However, if instead x and y have different average values, which would put reproducibility into question, then distance is sensitive to this problem while correlation is not. If the standard deviation is 1, the formula is:

(1/n) * sum_i (x_i - y_i)^2 ≈ ( average(x) - average(y) )^2 + 2 * ( 1 - cor(x, y) )

Once we consider units (standard deviations different from 1) then the relationship becomes even more complicated. Two advantages of distance you should be aware of are:

  1. it is in the same units as the data, while correlation has no units, which makes it hard to interpret and to select thresholds, and
  2. distance accounts for bias (differences in average), while correlation does not.

A final important point relates to the use of correlation with data that is not approximately normal. The useful interpretation of correlation as a summary statistic stems from the bivariate normal approximation: for every standard unit increase in the first variable, the second variable increases r standard units, with r the correlation. A summary of this is here. However, when data is not normal this interpretation no longer holds. Furthermore, heavy-tailed distributions, which are common in genomics, can lead to instability. Here is an example of uncorrelated data with a single point added that leads to correlations close to 1. This is quite common with RNAseq data.
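
Here is a small sketch (my own, with arbitrary numbers) of the instability described above: a single extreme point added to otherwise uncorrelated data drives the Pearson correlation toward 1.

set.seed(1)
x <- rnorm(100)                      # uncorrelated data
y <- rnorm(100)
cor(x, y)                            # near 0
x <- c(x, 20)                        # add one extreme point, as with a single highly expressed gene
y <- c(y, 20)
cor(x, y)                            # now close to 1
cor(x, y, method = "spearman")       # the rank-based correlation is much less affected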




rafalib package now on CRAN

For the last several years I have been collecting functions I routinely use during exploratory data analysis in a private R package. Mike Love and I used some of these in our HarvardX course and now, due to popular demand, I have created man pages and added the rafalib package to CRAN. Mike has made several improvements and added some functions of his own. Here are quick descriptions of the rafalib functions I use most:

mypar - Before making a plot in R I almost always type mypar(). This basically gets around the suboptimal defaults of par. For example, it makes the margins (mar, mgp) smaller and defines RColorBrewer colors as defaults. It is optimized for the RStudio window. Another advantage is that you can type mypar(3,2) instead of par(mfrow=c(3,2)). bigpar() is optimized for R presentations or PowerPoint slides.

as.fumeric - This function turns characters into factors and then into numerics. This is useful, for example, if you want to plot values x,y with colors defined by their corresponding categories saved in a character vector labs: plot(x,y,col=as.fumeric(labs)).

shist (smooth histogram, pronounced shitz) - I wrote this function because I have a hard time interpreting the y-axis of density plots. The height of the curve drawn by shist can be interpreted as the height of a histogram if you used the units shown on the plot. Also, it automatically draws a smooth histogram for each entry in a matrix on the same plot.

splot (subset plot) - The datasets I work with are typically large enough that plot(x,y) involves millions of points, which is a problem. Several solutions are available to avoid overplotting, such as alpha-blending, hexbinning and 2d kernel smoothing. For reasons I won't explain here, I generally prefer subsampling over these solutions. splot automatically subsamples. You can also specify an index that defines the subset.

sboxplot (smart boxplot) - This function draws points, boxplots or outlier-less boxplots depending on sample size. Coming soon is the kaboxplot (Karl Broman box-plots) for when you have too many boxplots.

install_bioc - For Bioconductor users, this function simply does the source("") for you and then uses BiocLite to install.
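
Here is a quick usage sketch (my own, with made-up data) tying a few of these functions together; it assumes rafalib has been installed from CRAN.

# install.packages("rafalib")
library(rafalib)

mypar(1, 2)                             # nicer par() defaults, 1 x 2 layout
x <- rnorm(10000)
y <- x + rnorm(10000)
labs <- sample(c("case", "control"), 10000, replace = TRUE)

plot(x, y, col = as.fumeric(labs))      # color points by their category labels
shist(x)                                # smooth histogram with an interpretable y-axis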



Interested in analyzing images of brains? Get started with open access data.

Editor's note: This is a guest post by Ani Eloyan. She is an Assistant Professor of Biostatistics at Brown University. Dr. Eloyan’s work focuses on semi-parametric likelihood based methods for matrix decompositions, statistical analyses of brain images, and the integration of various types of complex data structures for analyzing health care data. She received her PhD in statistics from North Carolina State University and subsequently completed a postdoctoral fellowship in the Department of Biostatistics at Johns Hopkins University. Dr. Eloyan and her team won the ADHD200 Competition discussed in this article. She tweets @eloyan_ani.
Neuroscience is one of the exciting new fields for biostatisticians interested in real-world applications where they can contribute novel statistical approaches. Most research in brain imaging has historically involved studies run on small numbers of patients. While justified by the costs of data collection, the claims based on analyzing data from such small numbers of subjects often do not hold for our populations of interest. As discussed in this article, there is a huge demand for biostatisticians in the field of quantitative neuroscience; so-called neuroquants or neurostatisticians. However, while more statisticians are interested in the field, we are far from competing with other substantive domains. For instance, a quick search of the abstract keywords “brain imaging” and “neuroscience” in the online program of the upcoming JSM2015 conference results in 15 records, while a search of the words “genomics” and “genetics” generates 76 records.
Assuming you are trained in statistics and an aspiring neuroquant, how would you go about working with brain imaging data? As a graduate student in the Department of Statistics at NCSU several years ago, I was very interested in working on statistical methods that would be directly applicable to solve problems in neuroscience. But I had this same question: “Where do I find the data?” I soon learned that to really approach substantial relevant problems I also needed to learn about the subject matter underlying these complex data structures.
In recent years, several leading groups have uploaded their lab data with the common goal of fostering the collection of high dimensional brain imaging data to build powerful models that can give generalizable results. The Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC), founded in 2006, is a platform for public data sharing that facilitates streamlining data processing pipelines and compiling high dimensional imaging datasets for crowdsourcing the analyses. It includes data for people with neurological diseases and neurotypical children and adults. If you are interested in Alzheimer’s disease, you can check out ADNI. ABIDE provides data for people with Autism Spectrum Disorder and neurotypical peers. ADHD200 was released in 2011 as a part of a competition to motivate building predictive methods for disease diagnoses using functional magnetic resonance imaging (MRI) in addition to demographic information to predict whether a child has attention deficit hyperactivity disorder (ADHD). While the competition ended in 2011, the dataset has been widely utilized afterwards in studies of ADHD. According to Google Scholar, the paper introducing the ABIDE set has been cited 129 times since 2013 while the paper discussing the ADHD200 has been cited 51 times since 2012. These are only a few examples from the list of open access datasets that could be utilized by statisticians.
Anyone can download these datasets (you may need to register and complete some paperwork in some cases); however, there are several data processing and cleaning steps to perform before the final statistical analyses. These preprocessing steps can be daunting for a statistician new to the field, especially as the tools used for preprocessing may not be available in R. This discussion makes the case for why statisticians need to be involved in every step of preprocessing the data, while this R package contains new tools linking R to a commonly used platform, FSL. However, as a newcomer, it can be easier to start with data that are already processed. This excellent overview by Dr. Martin Lindquist provides an introduction to the different types of analyses for brain imaging data from a statistician's point of view, while our paper provides tools in R and example datasets for implementing some of these methods. At least one course on Coursera can help you get started with functional MRI data. Talking to, and reading the papers of, biostatisticians working in quantitative neuroscience and scientists in neuroscience is key.

Statistical Theory is our "Write Once, Run Anywhere"

Having followed the software industry as a casual bystander, I periodically see the tension flare up between the idea of writing "native apps", software that is tuned to a particular platform (Windows, Mac, etc.), and more cross-platform apps, which run on many platforms without too much modification. Over the years it has come up in many different forms, but the fundamentals are the same. Back in the day, there was Java, which was supposed to be the platform that ran on any computing device. Sun Microsystems originated the phrase "Write Once, Run Anywhere" to illustrate the cross-platform strengths of Java. More recently, Steve Jobs famously banned Flash from any iOS device. Apple is also moving away from standards like OpenGL and towards its own Metal platform.

What's the problem with "write once, run anywhere", or of cross-platform development more generally, assuming it's possible? Well, there are a number of issues: often there are performance penalties, it may be difficult to use the native look and feel of a platform, and you may be reduced to using the "lowest common denominator" of feature sets. It seems to me that anytime a new meta-platform comes out that promises to relieve programmers of the burden of having to write for multiple platforms, it eventually gets modified or subsumed by the need to optimize apps for a given platform as much as possible. The need to squeeze as much juice out of an app seems to be too important an opportunity to pass up.

In statistics, theory and theorems are our version of "write once, run anywhere". The basic idea is that theorems provide an abstract layer (a "virtual machine") that allows us to reason across a large number of specific problems. Think of the central limit theorem, probably our most popular theorem. It could be applied to any problem/situation where you have a notion of sample size that could in principle be increasing.

But can it be applied to every situation, or even any situation? This might be more of a philosophical question, given that the CLT is stated asymptotically (maybe we'll find out the answer eventually). In practice, my experience is that many people attempt to apply it to problems where it likely is not appropriate. Think of large-scale studies with a sample size of 10. Many people will use Normal-based confidence intervals in those situations, but they probably have very poor coverage.
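
As a rough illustration of the coverage problem (my own simulation, with made-up numbers rather than anything from the post), here is what happens to the usual Normal-based 95% interval for the mean of a skewed distribution when n = 10:

set.seed(1)
n <- 10
nsim <- 10000
true_mean <- 1                                   # the mean of an Exponential(1) population

covered <- replicate(nsim, {
  x <- rexp(n)                                   # a small sample from a skewed population
  se <- sd(x) / sqrt(n)
  abs(mean(x) - true_mean) <= qnorm(0.975) * se  # does the Normal-based interval cover the truth?
})
mean(covered)                                    # noticeably below the nominal 0.95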

Because the CLT doesn't apply in many situations (small sample, dependent data, etc.), variations of the CLT have been developed, as well as entirely different approaches to achieving the same ends, like confidence intervals, p-values, and standard errors (think bootstrap, jackknife, permutation tests). While the CLT can provide beautiful insight in a large variety of situations, in reality, one must often resort to a custom solution when analyzing a given dataset or problem. This should be a familiar conclusion to anyone who analyzes data. The promise of "write once, run anywhere" is always tantalizing, but the reality never seems to meet that expectation.

Ironically, if you look across history and all programming languages, probably the most "cross-platform" language is C, which was originally considered to be too low-level to be broadly useful. C programs run on basically every existing platform and the language has been completely standardized so that compilers can be written to produce well-defined output. The keys to C's success I think are that it's a very simple/small language which gives enormous (sometimes dangerous) power to the programmer, and that an enormous toolbox (compiler toolchains, IDEs) has been developed over time to help developers write applications on all platforms.

In a sense, we need "compilers" that can help us translate statistical theory for specific data analysis problems. In many cases, I'd imagine the compiler would "fail", meaning the theory was not applicable to that problem. This would be a Good Thing, because right now we have no way of really enforcing the appropriateness of a theorem for specific problems.

More practically (perhaps), we could develop data analysis pipelines that could be applied to broad classes of data analysis problems. Then a "compiler" could be employed to translate the pipeline so that it worked for a given dataset/problem/toolchain.

The key point is to recognize that there is a "translation" process that occurs when we use theory to justify certain data analysis actions, but this translation process is often not well documented or even thought through. Having an explicit "compiler" for this would help us to understand the applicability of certain theorems and may serve to prevent bad data analysis from occurring.


Autonomous killing machines won't look like the Terminator...and that is why they are so scary

Just a few days ago many of the most incredible minds in science and technology urged governments to avoid using artificial intelligence to create autonomous killing machines. One thing that always happens when such a warning is made is that you see the inevitable Terminator picture:




The reality is that robots that walk and talk are getting better but still have a ways to go:



Does this mean that I think all those really smart people are silly for making this plea about AI now though? No, I think they are probably just in time.

The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. They will more likely than not be drones, which are already in widespread use by the military and will soon be flying over our heads delivering Amazon products.




I also think that when people think about "artificial intelligence" they imagine robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, or pass the Turing test. But it turns out that the "artificial intelligence" you would need to create an automated killing system is much, much simpler than that and is mostly some basic data science. The things you would need are:

  1. A drone with the ability to fly on its own
  2. The ability to make decisions about what people to target
  3. The ability to find those people and attack them


The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has used autopilot for at least some of the flight. I won't get into the details on this one because I think it is the least interesting - it has been around a while and we didn't get the dire warnings about autonomous agents.

The second issue, deciding which people to target, already exists as well. We have already seen programs like PRISM and others that collect individual-level metadata and presumably use those to make predictions. While the true and false positive rates are probably messed up by the fact that there are very, very few "true positives", these programs are being developed, and even relatively simple statistical models can be used to build a predictor - even if those predictors don't work.

The third issue is being able to find those people in order to attack them. This is where the real "artificial intelligence" comes into play. But it isn't artificial intelligence like you might think about. It could be just as simple as having the drone fly around and take people's pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has a paper up that demonstrates an algorithm that can identify people with near human-level accuracy. This approach is based on something called deep neural nets, which sounds very intimidating, but is actually just a set of nested nonlinear logistic regression models. These models have gotten very good because (a) we are getting better at fitting them mathematically and computationally but mostly (b) we have much more data to train them with than we ever did before. The speed at which this part of the process is developing is (I think) why there is so much recent concern about potentially negative applications like autonomous killing machines.
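
To make the "nested nonlinear logistic regressions" point concrete, here is a toy sketch (entirely my own, with arbitrary weights, not any real face recognition model) of a one-hidden-layer network written as logistic regressions stacked on top of each other:

sigmoid <- function(z) 1 / (1 + exp(-z))      # the logistic function

## each hidden unit is a logistic regression on the inputs,
## and the output is another logistic regression on the hidden units
forward <- function(x, W1, b1, W2, b2) {
  h <- sigmoid(W1 %*% x + b1)                 # first layer of "logistic regressions"
  sigmoid(W2 %*% h + b2)                      # a logistic regression nested on top
}

set.seed(1)
x  <- rnorm(5)                                # a made-up feature vector (e.g., image features)
W1 <- matrix(rnorm(3 * 5), 3, 5); b1 <- rnorm(3)
W2 <- matrix(rnorm(1 * 3), 1, 3); b2 <- rnorm(1)
forward(x, W1, b1, W2, b2)                    # a number between 0 and 1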

The scary thing is that these technologies could be combined *right now* to create such a system - one not controlled directly by humans, but making automated decisions and flying drones to carry out those decisions. The technology to shrink these types of deep neural net systems that identify people is so good that it can even be made simple enough to run on a phone for things like language translation, and it could easily be embedded in a drone.

So I am with Musk, Hawking, and others who would urge caution by governments in developing these systems. Just because we can make it doesn't mean it will do what we want. Just look at how well Facebook/Amazon/Google make suggestions for "other things you might like" to get an idea about how potentially disastrous automated killing systems could be.



Announcing the JHU Data Science Hackathon 2015

We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever JHU Data Science Hackathon (DaSH) on September 21-23, 2015 at the Baltimore Marriott Waterfront.

This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas. There will also be a number of networking opportunities where attendees can get to know each other. We think this will be a fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.

To get more details and to sign up for the hackathon, you can go to the DaSH web site. We will be posting more information as the event nears.


Organizers:

  • Jeff Leek
  • Brian Caffo
  • Roger Peng
  • Leah Jager


Sponsors:

  • National Institutes of Health
  • Johns Hopkins University



stringsAsFactors: An unauthorized biography

Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like 'read.table()' and 'read.csv()' in R convert columns that are detected to be character/strings into factor variables. This led to the spontaneous outcry from one colleague of

Why does stringsAsFactors not default to FALSE????

The argument 'stringsAsFactors' is an argument to the 'data.frame()' function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in 'read.table()' and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, 'stringsAsFactors' is set to TRUE.
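
A minimal illustration of what the argument controls (my own example, with the behavior stated explicitly rather than relying on the default):

x <- c("a", "b", "a")

df1 <- data.frame(x = x, stringsAsFactors = TRUE)
class(df1$x)    # "factor": the character column is converted to a factor

df2 <- data.frame(x = x, stringsAsFactors = FALSE)
class(df2$x)    # "character": the column is left as plain strings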

This argument dates back to May 20, 2006 when it was originally introduced into R as the 'charToFactor' argument to 'data.frame()'. Soon afterwards, on May 24, 2006, it was changed to 'stringsAsFactors' to be compatible with S-PLUS by request from Bill Dunlap.

Most people I talk to today who use R are completely befuddled by the fact that 'stringsAsFactors' is set to TRUE by default. First of all, it should be noted that before the 'stringsAsFactors' argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn't want this behavior, you had to manually coerce each column to be character.

So here's the story:

In the old days, when R was primarily being used by statisticians and statistical types, setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by 'factor' vectors and so character columns got converted to factor.

Why do we need factor variables to begin with? Because of modeling functions like 'lm()' and 'glm()'. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There's no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting 'stringsAsFactors = TRUE' when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.
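
For example, here is a small sketch (my own toy data) of the expansion that 'lm()' and 'glm()' rely on, via 'model.matrix()':

region <- factor(rep(c("north", "south", "east", "west", "central"), 2))   # a 5-level categorical variable
head(model.matrix(~ region))    # an intercept plus 4 dummy columns; one level serves as the baseline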

There's also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable like "disease" and "non-disease" will be encoded as 1 and 2 in the underlying representation. Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored was the integer levels and the labels. That way you didn't have to repeat the strings "disease" and "non-disease" for as many observations that you had, which would have been wasteful.

Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use 'factors' for the modeling functions.

The difference nowadays is that R is being used by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about 'stringsAsFactors' not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you're upset about 'stringsAsFactors = TRUE', then it's a pretty good indicator that you're either not a statistician by training, or you're doing non-traditional statistical things.

For example, in genomics, you might have the names of the genes in one column of data. It really doesn't make sense to encode these as factors because they won't be used in any modeling function. They're just labels, essentially. And because of CHARSXP hashing, you don't gain anything from an efficiency standpoint by converting them to factors either.

But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about 'stringsAsFactors'.

I fully expect that this blog post will now make all R users happy. If you think I've missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).


The statistics department Moneyball opportunity

Moneyball is a book and a movie about Billy Beane. It makes statisticians look awesome and I loved the movie. I loved it so much I'm putting the movie trailer right here:

The basic idea behind Moneyball was that the Oakland Athletics were able to build a very successful baseball team on a tight budget by valuing skills that many other teams undervalued. In baseball those skills were things like on-base percentage and slugging percentage. By correctly valuing these skills and their impact on a team's winning percentage, the A's were able to build one of the most successful regular season teams on a minimal budget. This graph shows what an outlier they were, from a nice fivethirtyeight analysis.




I think that the data science/data analysis revolution that we have seen over the last decade has created a similar Moneyball opportunity for statistics and biostatistics departments. Traditionally in these departments the highest-value activity has been publishing in a select number of important statistics journals (JASA, JRSS-B, Annals of Statistics, Biometrika, Biometrics and more recently journals like Biostatistics and Annals of Applied Statistics). But there are some hugely valuable ways to contribute to statistics/data science that don't necessarily end with papers in those journals, like:

  1. Creating good, well-documented, and widely used software
  2. Being primarily an excellent collaborator who brings in grant money and is a major contributor to science through statistics
  3. Publishing in top scientific journals rather than statistics journals
  4. Being a good scientific communicator who can attract talent
  5. Being a statistics educator who can build programs

Another thing that is undervalued is not having a Ph.D. in statistics or biostatistics. The fact that these skills are undervalued right now means that up-and-coming departments could identify and recruit talented people that might be missed by other departments and have a huge impact on the world. One tricky thing is that the rankings of departments are based on the votes of people from other departments who may or may not value these same skills. Another tricky thing is that many industry data science positions put incredibly high value on these skills, so you might end up competing with industry for people - a competition that will definitely drive up the market value of these data scientist/statisticians. But for the folks who want to stay in academia, now is a prime opportunity.