Simply Statistics


Rafa's citations above replacement in statistics journals is crazy high.


Editor's note:  I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously. 

I downloaded data from Web of Science on the 400 most cited papers published between 2000 and 2010 in some statistical journals. Here is a boxplot of the average number of citations per year (from publication date to 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.




There are several interesting things about this graph right away. One is that JASA has the highest median number of citations, but has fewer "big hits" (papers with 100+ citations/year) than Annals of Statistics, JMLR, or JRSS-B. Another thing is how much of a lottery developing statistical methods seems to be. Most papers, even among the 400 most cited, have around 3 citations/year on average. But a few lucky winners have 100+ citations per year. One interesting thing for me is the papers that get 10 or more citations per year but aren't huge hits. I suspect these are the papers that solve one problem well but don't solve the most general problem ever.

Something that jumps out from that plot is the outlier for the journal Biostatistics. One of their papers is cited 367.85 times per year, which is 19 standard deviations above the mean! The next nearest competitor is at 67.75. The paper in question is: "Exploration, normalization, and summaries of high density oligonucleotide array probe level data", which is the paper that introduced RMA, one of the most popular methods for pre-processing microarrays ever created. It was written by Rafa and colleagues. It made me think of the statistic "wins above replacement", which quantifies how many extra wins a baseball team gets by playing a specific player in place of a league average replacement.

What about a "citations/year above replacement" (CAR/Y) statistic where you calculate for each journal:

Median number of citations/year to a paper with Author X - Median number of citations/year to a paper in that journal

Then average this number across journals. This attempts to quantify how many extra citations/year a person's papers generate compared to the "average" paper in that journal. For Rafa the numbers look like this:

  • Biostatistics: Rafa = 15.475, Journal = 1.855, CAR/Y =  13.62
  • JASA: Rafa = 74.5, Journal = 5.2, CAR/Y = 69.3
  • Biometrics: Rafa = 4.33, Journal = 3.38, CAR/Y = 0.95

So Rafa's citations above replacement is (13.62 + 69.3 + 0.95)/3 = 27.96! There are a couple of reasons why this isn't a completely accurate picture. One is the low sample size; the second is the fact that I only took the 400 most cited papers in each journal. Rafa has a few papers that didn't make the top 400 for journals like JASA - which would bring down his CAR/Y.
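
For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the CAR/Y calculation. It only uses the three pairs of medians quoted above; the dictionary layout and the function name `car_per_year` are made up for illustration and are not part of the original analysis.

```python
from statistics import mean

# Per-journal medians (citations/year) as quoted in the post.
# The layout of this dictionary is an assumption made for illustration.
medians = {
    "Biostatistics": {"author": 15.475, "journal": 1.855},
    "JASA": {"author": 74.5, "journal": 5.2},
    "Biometrics": {"author": 4.33, "journal": 3.38},
}


def car_per_year(medians):
    """Return per-journal differences and their average (the CAR/Y)."""
    # Author median minus journal median, one value per journal.
    diffs = {j: m["author"] - m["journal"] for j, m in medians.items()}
    # Average the differences across journals.
    return diffs, mean(list(diffs.values()))


per_journal, car_y = car_per_year(medians)
for journal, diff in per_journal.items():
    print(f"{journal}: CAR/Y = {diff:.2f}")
print(f"Average CAR/Y = {car_y:.2f}")
```

The average comes out at roughly 27.96, matching the number above. With only three journals the average is dominated by the JASA difference, which is one more reason not to take the stat seriously.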



Figuring Out Learning Objectives the Hard Way


When building the Genomic Data Science Specialization (which starts in June!) we had to figure out the learning objectives for each course. We initially set our ambitions high, but as you can see in this video below, Steven Salzberg brought us back to Earth.


Data analysis subcultures


Today in Nature, Roger and I responded to the controversy around the journal that banned p-values. A piece like this requires a lot of information packed into very little space, but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:

Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time 'panel data', to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as 'longitudinal data', and often go at it with generalized estimating equations.

I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see methods like randomized trials re-discovered across multiple disciplines.

But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they'd give you back very different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:

  • Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper).
  • In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.
  • This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.
  • Psychology has a history of using parametric statistics, genomics is big into empirical Bayes, and you see a lot of Bayesian statistics in climate studies.
  • You see homoskedasticity tests used a lot in econometrics, but that is hardly ever done through formal hypothesis testing in biostatistics.
  • Training sets and test sets are used in machine learning for prediction, but rarely used for inference.

This is just a partial list I thought of off the top of my head; there are a ton more. These decisions matter a lot in a data analysis. The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we'd like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is "I saw it in a widely-cited paper in journal XX from my field", it is likely that little progress will be made on resolving the statistical problems in science.


Why is there so much university administration? We kind of asked for it.


The latest commentary on the rising cost of college tuition is by Paul F. Campos and is titled The Real Reason College Tuition Costs So Much. There has been much debate about this article and whether Campos is right or wrong...and I don't plan to add to that. However, I wanted to pick up on a major point of the article that I felt got left hanging out there: The rising levels of administrative personnel at universities.

Campos argues that the reason college tuition is on the rise is not that colleges get less and less money from the government (mostly state government for state schools), but rather that there is an increasing number of administrators at universities that need to be paid in dollars and cents. He cites a study that shows that for the California State University system, in a 34 year period, the number of faculty rose by about 3% whereas the number of administrators rose by 221%.

My initial thinking when I saw the 221% number was "only that much?" I've been a faculty member at Johns Hopkins now for about 10 years, and just in that short period I've seen the amount of administrative work I need to do go up what feels like at least 221%. Partially, of course, that is a result of climbing up the ranks. As you get more qualified to do administrative work, you get asked to do it! But even adjusting for that, there are quite a few things that faculty need to do now that they weren't required to do before.  Frankly, I'm grateful for the few administrators that we do have around here to help me out with various things.

Campos seems to imply (but doesn't come out and say) that the bulk of administrators are not necessary, and that if we were to cut these people from the payrolls, we could reduce tuition down to what it was in the old days. Or at least, it would be cheaper. This argument reminds me of debates over the federal budget: Everyone thinks the budget is too big, but no one wants to suggest something to cut.

My point here is that the reason there are so many administrators is that there's actually quite a bit of administration to do. And the amount of administration that needs to be done has increased over the past 30 years.

Just for fun, I decided to go to the Johns Hopkins University Administration web site to see who all these administrators were.  This site shows the President's Cabinet and the Deans of the individual schools, which isn't everybody, but it represents a large chunk. I don't know all of these people, but I have met and worked with a few of them.

For the moment I'm going to skip over individual people because, as much as you might think they are overpaid, no individual's salary is large enough to move the needle on college tuition. So I'll stick with people who actually represent large offices with staff. Here's a sample.

  • University President. Call me crazy, but I think the university needs a President. In the U.S. the university President tends to focus on outward facing activities like raising money from various sources, liaising with the government(s), and pushing university initiatives around the world. This is not something I want to do (but I think it's necessary), so I'd rather have the President take care of it for me.
  • University Provost. At most universities in the U.S. the Provost is the "senior academic officer", which means that he/she runs the university. This is a big job, especially at big universities, and requires coordinating across a variety of constituencies. Also, at JHU, the Provost's office deals with a number of compliance related issues like Title IX, accreditation, Americans with Disabilities Act, and many others. I suppose we could save some money by violating federal law, but that seems short-sighted.

    The people in this office do tough work involving a ton of paper. One example involves online education. Most states in the U.S. say that if you're going to run an education program in their state, it needs to be approved by some regulatory body. Some states have essentially a reciprocal agreement, so if it's okay in your state, then it's okay in their state. But many states require an entire approval process for a program to run in that state. And by "a program" I mean something like an M.S. in Mathematics. If you want to run an M.S. in English that's another approval, etc. So someone has to go to all the 50 states and D.C. and get approval for every online program that JHU runs in order to enroll students into that program from that state. I think Arkansas actually requires that someone come to Arkansas and testify in person about a program asking for approval.

    I support online education programs, and I'm glad the Provost's office is getting all those approvals for us.

  • Corporate Security. This may be a difficult one for some people to understand, but bear in mind that much of Johns Hopkins is located in East Baltimore. If you've ever seen the TV show The Wire, then you know why we need corporate security.
  • Facilities and Real Estate. Johns Hopkins owns and deals with a lot of real estate; it's a big organization. Who is supposed to take care of all that? For example, we just installed a brand new supercomputer jointly with the University of Maryland, called MARCC. I'm really excited to use this supercomputer for research, but systems like this require a bit of space. A lot of space actually. So we needed to get some land to put it on. If you've ever bought a house, you know how much paperwork is involved.
  • Development and Alumni Relations. I have a new appreciation for this office now that I co-direct a program that has enrolled over 1.5 million people in just over a year. It's critically important that we keep track of our students for many reasons: tracking student careers and success, tapping them to mentor current students, developing relationships with organizations that they're connected to are just a few.
  • General Counsel. I'm not the lawbreaking type, so I need lawyers to help me out.
  • Enterprise Development. This office involves, among other things, technology transfer, which I have recently been involved with quite a bit for my role in the Data Science Specialization offered through Coursera. This is just to say that I personally benefit from this office. I've heard people say that universities shouldn't be involved in tech transfer, but Bayh-Dole is what it is and I think Johns Hopkins should play by the same rules as everyone else. I'm not interested in filing patents, trademarks, and copyrights, so it's good to have people doing that for me.

Okay, that's just a few offices, but you get the point. These administrators seem to be doing a real job (imagine that!) and actually helping out the university. Many of these people are actually helping me out. Some of these jobs are essentially required by the existence of federal laws, and so we need people like this.

So, just to recap, I think there are in fact more administrators in universities than there used to be. Is this causing an increase in tuition? It's possible, but it's probably not the only cause. If you believe the CSU study, the number of administrators increased by about 3.5% per year from 1975 to 2008. College tuition during that time period went up around 4% per year (inflation adjusted). But even so, much of this administration needs to be done (because faculty don't want to do it), so this is a difficult path to go down if you're looking for ways to lower tuition.

Even if we've found the smoking gun, the question is what do we do about it?


Genomics Case Studies Online Courses Start in Two Weeks (4/27)


The last month of the HarvardX Data Analysis for Genomics series starts on 4/27. We will cover case studies on RNAseq, Variant calling, ChipSeq and DNA methylation. Faculty includes Shirley Liu, Mike Love, Oliver Hofmann and the HSPH Bioinformatics Core. Although taking the previous courses in the series will help, the four case study courses were developed as standalone courses and you can obtain a certificate for each one without taking any other course.

Each course is presented over two weeks but will remain open until June 13 to give students an opportunity to take them all if they wish. For more information follow the links listed below.

  1. RNA-seq data analysis will be led by Mike Love
  2. Variant Discovery and Genotyping will be taught by Shannan Ho Sui, Oliver Hofmann, Radhika Khetani and Meeta Mistry (from the HSPH Bioinformatics Core)
  3. ChIP-seq data analysis will be led by Shirley Liu
  4. DNA methylation data analysis will be led by Rafael Irizarry

A blessing of dimensionality often observed in high-dimensional data sets


Tidy data sets have one observation per row and one variable per column.  Using this definition, big data sets can be either:

  1. Wide - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.
  2. Tall - a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis.

The curse of dimensionality tells us that estimating some quantities gets harder as the number of dimensions of a data set increases - as the data gets taller or wider. An example of this was nicely illustrated by my student Prasad (although it looks like his quota may be up on Rstudio).

For wide data sets there is also a blessing of dimensionality. The basic reason for the blessing of dimensionality is that:

No matter how many new measurements you take on a small set of observations, the number of observations and all of their characteristics are fixed.

As an example, suppose that we make measurements on 10 people. We start out by making one measurement (blood pressure), then another (height), then another (hair color) and we keep going and going until we have one million measurements on those same 10 people. The blessing occurs because the measurements on those 10 people will all be related to each other. If 5 of the people are women and 5 are men, then any measurement that has a relationship with sex will be highly correlated with any other measurement that has a relationship with sex. So by knowing one small bit of information, you can learn a lot about many of the different measurements.
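
To make the 10-person example concrete, here is a small simulation sketch in Python. Everything numeric in it is an assumption chosen for illustration (the 0/1 coding of sex, the effect size of 10, unit-variance Gaussian noise, and the seed); the point is only that two measurements that both depend on sex end up strongly correlated with each other, while a measurement unrelated to sex does not have to be.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# The same 10 people for every measurement: 5 coded 0 and 5 coded 1.
sex = [0] * 5 + [1] * 5


def measurement(effect):
    """One new measurement on the same 10 people (hypothetical model)."""
    # Each value is a sex effect plus independent Gaussian noise.
    return [effect * s + rng.gauss(0, 1) for s in sex]


sex_linked_a = measurement(effect=10.0)
sex_linked_b = measurement(effect=10.0)
unrelated = measurement(effect=0.0)

r_linked = pearson(sex_linked_a, sex_linked_b)
r_unrelated = pearson(sex_linked_a, unrelated)
print(f"two sex-linked measurements: r = {r_linked:.2f}")
print(f"sex-linked vs unrelated:     r = {r_unrelated:.2f}")
```

You can call `measurement` as many times as you like, but the `sex` vector never changes. That fixed structure is what the methods listed below exploit.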

This blessing of dimensionality is the key idea behind many of the statistical approaches to wide data sets whether it is stated explicitly or not. I thought I'd make a very short list of some of these ideas:

1. Idea: De-convolving mixed observations from high-dimensional data. 

How the blessing plays a role: The measurements for each observation are assumed to be a mixture of values measured from different observation types. The proportion of each observation type is assumed to be fixed across measurements, so you can take advantage of the multiple measurements to estimate the mixing percentage and perform the deconvolution. (Wenyi Wang came and gave an excellent seminar on this idea at JHU a couple of days ago, which inspired this post).

2. Idea: The two groups model for false discovery rates.

How the blessing plays a role:  The models assume that a hypothesis test is performed for each observation and that the probability any observation is drawn from the null, the null distribution, and the alternative distributions are common across observations. If the null is assumed known, then it is possible to use the known null distribution to estimate the common probability that an observation is drawn from the null.


3. Idea: Empirical Bayes variance shrinkage for linear models

How the blessing plays a role:  A linear model is fit for each observation and the means and variances of the log ratios calculated from the model are assumed to follow a common distribution across observations. The method estimates the hyper-parameters of these common distributions and uses them to adjust any individual measurement's estimates.


4. Idea: Surrogate variable analysis

How the blessing plays a role:  Each observation is assumed to be influenced by a single variable of interest (a primary variable) and multiple unmeasured confounders. Since the observations are fixed, the values of the unmeasured confounders are the same for each measurement and a supervised PCA can be used to estimate surrogates for the confounders. (see my JHU job talk for more on the blessing)
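
As a rough illustration of idea 2, the sketch below estimates the common null probability from a set of p-values. It assumes the known null distribution of the p-values is uniform on (0, 1) and uses a simple tail-count estimator in the style of Storey's method; that choice, the cutoff `lam = 0.5`, the function name `estimate_pi0`, and the simulated p-values are all assumptions for illustration, not something taken from a specific paper's code.

```python
import random


def estimate_pi0(p_values, lam=0.5):
    """Tail-count estimate of the proportion of true nulls.

    Under a uniform null, about pi0 * m * (1 - lam) p-values land above
    lam, and alternatives rarely do, so the tail count is rescaled.
    """
    m = len(p_values)
    tail = sum(1 for p in p_values if p > lam)
    return min(1.0, tail / (m * (1.0 - lam)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Simulated data (an assumption): 80% nulls with uniform p-values,
# 20% alternatives with p-values squeezed into (0, 0.01).
m, true_pi0 = 10_000, 0.8
n_null = int(m * true_pi0)
null_p = [rng.random() for _ in range(n_null)]
alt_p = [rng.random() * 0.01 for _ in range(m - n_null)]

pi0_hat = estimate_pi0(null_p + alt_p)
print(f"true pi0 = {true_pi0}, estimated pi0 = {pi0_hat:.3f}")
```

The estimate only works because thousands of tests share the same null probability. With a single test there would be nothing to pool, which is the blessing at work.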


The blessing of dimensionality I'm describing here is related to the idea that Andrew Gelman refers to in this 2004 post. Basically, since an increasingly large number of measurements are made on the same observations, there is an inherent structure to those observations. If you take advantage of that structure, then as the dimensionality of your problem increases you actually get better estimates of the structure in your high-dimensional data - a nice blessing!


How to Get Ahead in Academia


This video on how to make it in academia was produced over 10 years ago by Steven Goodman for the ENAR Junior Researchers Workshop. Now the whole world can benefit from its wisdom.

The movie features current and former JHU Biostatistics faculty, including Francesca Dominici, Giovanni Parmigiani, Scott Zeger, and Tom Louis. You don't want to miss Scott Zeger's secret formula for getting promoted!


Why You Need to Study Statistics


The American Statistical Association is continuing its campaign to get you to study statistics, if you haven't already. I have to agree with them that being a statistician is a pretty good job. Their latest video highlights a wide range of statisticians working in industry, government, and academia. You can check it out here:


Teaser trailer for the Genomic Data Science Specialization on Coursera



We have been hard at work in the studio putting together our next specialization to launch on Coursera. It will be called the "Genomic Data Science Specialization" and includes a spectacular line up of instructors: Steven Salzberg, Ela Pertea, James Taylor, Liliana Florea, Kasper Hansen, and me. The specialization will cover command line tools, statistics, Galaxy, Bioconductor, and Python. There will be a capstone course at the end of the sequence featuring an in-depth genomic analysis. If you are a grad student, postdoc, or principal investigator in a group that does genomics this specialization is for you. If you are a person looking to transition into one of the hottest areas of research with the new precision medicine initiative this is for you. Get pumped and share the teaser-trailer with your friends!


Introduction to Bioconductor HarvardX MOOC starts this Monday March 30


Bioconductor is one of the most widely used open source toolkits for biological high-throughput data. In this four week course, co-taught with Vince Carey and Mike Love, we will introduce you to Bioconductor's general infrastructure and then focus on two specific technologies: next generation sequencing and microarrays. The lectures and assessments will be annotated in case you want to focus only on one of these two technologies, although if you plan to be a bioinformatician we recommend you learn both.

Topics covered include:

  • A short introduction to molecular biology and measurement technology
  • An overview of how to leverage the platform and genome annotation packages and experimental archives
  • GenomicRanges: the infrastructure for storing, manipulating and analyzing next generation sequencing data
  • Parallel computing and cloud concepts
  • Normalization, preprocessing and bias correction
  • Statistical inference in practice: including hierarchical models and gene set enrichment analysis
  • Building statistical analysis pipelines of genome-scale assays including the creation of reproducible reports

Throughout the class we will be using data examples from both next generation sequencing and microarray experiments.

We will assume basic knowledge of Statistics and R.

For more information visit the course website.