Simply Statistics

13
Apr

Why is there so much university administration? We kind of asked for it.

The latest commentary on the rising cost of college tuition is by Paul F. Campos and is titled The Real Reason College Tuition Costs So Much. There has been much debate about this article and whether Campos is right or wrong...and I don't plan to add to that. However, I wanted to pick up on a major point of the article that I felt got left hanging out there: The rising levels of administrative personnel at universities.

Campos argues that the reason college tuition is on the rise is not that colleges get less and less money from the government (mostly state government for state schools), but rather that there is an increasing number of administrators at universities that need to be paid in dollars and cents. He cites a study that shows that for the California State University system, in a 34-year period, the number of faculty rose by about 3% whereas the number of administrators rose by 221%.

My initial thinking when I saw the 221% number was "only that much?" I've been a faculty member at Johns Hopkins now for about 10 years, and just in that short period I've seen the amount of administrative work I need to do go up what feels like at least 221%. Partially, of course, that is a result of climbing up the ranks. As you get more qualified to do administrative work, you get asked to do it! But even adjusting for that, there are quite a few things that faculty need to do now that they weren't required to do before.  Frankly, I'm grateful for the few administrators that we do have around here to help me out with various things.

Campos seems to imply (but doesn't come out and say) that the bulk of administrators are not necessary, and that if we were to cut these people from the payrolls, we could reduce tuition to what it was in the old days. Or at least, it would be cheaper. This argument reminds me of debates over the federal budget: Everyone thinks the budget is too big, but no one wants to suggest something to cut.

My point here is that the reason there are so many administrators is that there's actually quite a bit of administration to do. And the amount of administration that needs to be done has increased over the past 30 years.

Just for fun, I decided to go to the Johns Hopkins University Administration web site to see who all these administrators were.  This site shows the President's Cabinet and the Deans of the individual schools, which isn't everybody, but it represents a large chunk. I don't know all of these people, but I have met and worked with a few of them.

For the moment I'm going to skip over individual people because, as much as you might think they are overpaid, no individual's salary is large enough to move the needle on college tuition. So I'll stick with people who actually represent large offices with staff. Here's a sample.

  • University President. Call me crazy, but I think the university needs a President. In the U.S. the university President tends to focus on outward-facing activities like raising money from various sources, liaising with the government(s), and pushing university initiatives around the world. This is not something I want to do (but I think it's necessary), so I'd rather have the President take care of it for me.
  • University Provost. At most universities in the U.S. the Provost is the "senior academic officer", which means that he/she runs the university. This is a big job, especially at big universities, and requires coordinating across a variety of constituencies. Also, at JHU, the Provost's office deals with a number of compliance-related issues like Title IX, accreditation, the Americans with Disabilities Act, and many others. I suppose we could save some money by violating federal law, but that seems short-sighted.

    The people in this office do tough work involving a ton of paper. One example involves online education. Most states in the U.S. say that if you're going to run an education program in their state, it needs to be approved by some regulatory body. Some states have essentially a reciprocal agreement, so if it's okay in your state, then it's okay in their state. But many states require an entire approval process for a program to run in that state. And by "a program" I mean something like an M.S. in Mathematics. If you want to run an M.S. in English that's another approval, etc. So someone has to go to all the 50 states and D.C. and get approval for every online program that JHU runs in order to enroll students into that program from that state. I think Arkansas actually requires that someone come to Arkansas and testify in person about a program asking for approval.

    I support online education programs, and I'm glad the Provost's office is getting all those approvals for us.

  • Corporate Security. This may be a difficult one for some people to understand, but bear in mind that much of Johns Hopkins is located in East Baltimore. If you've ever seen the TV show The Wire, then you know why we need corporate security.
  • Facilities and Real Estate. Johns Hopkins owns and deals with a lot of real estate; it's a big organization. Who is supposed to take care of all that? For example, we just installed a brand new supercomputer jointly with the University of Maryland, called MARCC. I'm really excited to use this supercomputer for research, but systems like this require a bit of space. A lot of space actually. So we needed to get some land to put it on. If you've ever bought a house, you know how much paperwork is involved.
  • Development and Alumni Relations. I have a new appreciation for this office now that I co-direct a program that has enrolled over 1.5 million people in just over a year. It's critically important that we keep track of our students for many reasons: tracking student careers and success, tapping them to mentor current students, and developing relationships with organizations that they're connected to are just a few.
  • General Counsel. I'm not the lawbreaking type, so I need lawyers to help me out.
  • Enterprise Development. This office involves, among other things, technology transfer, which I have recently been involved with quite a bit for my role in the Data Science Specialization offered through Coursera. This is just to say that I personally benefit from this office. I've heard people say that universities shouldn't be involved in tech transfer, but Bayh-Dole is what it is and I think Johns Hopkins should play by the same rules as everyone else. I'm not interested in filing patents, trademarks, and copyrights, so it's good to have people doing that for me.

Okay, that's just a few offices, but you get the point. These administrators seem to be doing a real job (imagine that!) and actually helping out the university. Many of these people are actually helping me out. Some of these jobs are essentially required by the existence of federal laws, and so we need people like this.

So, just to recap, I think there are in fact more administrators in universities than there used to be. Is this causing an increase in tuition? It's possible, but it's probably not the only cause. If you believe the CSU study, the number of administrators grew by about 3.5% per year from 1975 to 2008. College tuition during that time period went up around 4% per year (inflation adjusted). But even so, much of this administration needs to be done (because faculty don't want to do it), so this is a difficult path to go down if you're looking for ways to lower tuition.
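
For anyone checking the arithmetic, the roughly 3.5% annual figure is what you get by compounding the 221% total increase over the 34-year period. This is only a back-of-the-envelope sketch: it assumes steady year-over-year growth, which the study's totals by themselves don't establish. The same calculation for the 3% rise in faculty is shown for comparison.

```latex
% Annualized growth rate r implied by a total increase over n = 34 years,
% assuming constant compound growth (an assumption, not a result of the study).
\[
  r_{\text{admin}} = (1 + 2.21)^{1/34} - 1 = 3.21^{1/34} - 1 \approx 0.035
  \quad (\text{about } 3.5\% \text{ per year})
\]
\[
  r_{\text{faculty}} = (1 + 0.03)^{1/34} - 1 = 1.03^{1/34} - 1 \approx 0.0009
  \quad (\text{about } 0.09\% \text{ per year})
\]
```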

Even if we've found the smoking gun, the question is what do we do about it?

09
Apr

How to Get Ahead in Academia

This video on how to make it in academia was produced over 10 years ago by Steven Goodman for the ENAR Junior Researchers Workshop. Now the whole world can benefit from its wisdom.

The movie features current and former JHU Biostatistics faculty, including Francesca Dominici, Giovanni Parmigiani, Scott Zeger, and Tom Louis. You don't want to miss Scott Zeger's secret formula for getting promoted!

02
Apr

Why You Need to Study Statistics

The American Statistical Association is continuing its campaign to get you to study statistics, if you haven't already. I have to agree with them that being a statistician is a pretty good job. Their latest video highlights a wide range of statisticians working in industry, government, and academia.

12
Feb

Is Reproducibility as Effective as Disclosure? Let's Hope Not.

Jeff and I just this week published a commentary in the Proceedings of the National Academy of Sciences on our latest thinking on reproducible research and its ability to solve the reproducibility/replication "crisis" in science (there's a version on arXiv too). In a nutshell, we believe reproducibility (making data and code available so that others can recompute your results) is an essential part of science, but it is not going to end the crisis of confidence in science. In fact, I don't think it'll even make a dent. The problem is that reproducibility, as a tool for preventing poor research, comes in at the wrong stage of the research process (the end). While requiring reproducibility may deter people from committing outright fraud (a small group), it won't stop people who just don't know what they're doing with respect to data analysis (a much larger group).

In an eerie coincidence, Jesse Eisinger of the investigative journalism non-profit ProPublica has just published a piece on the New York Times Dealbook site discussing how disclosure rules in the financial industry have produced meager results. He writes:

Over the last century, disclosure and transparency have become our regulatory crutch, the answer to every vexing problem. We require corporations and government to release reams of information on food, medicine, household products, consumer financial tools, campaign finance and crime statistics. We have a booming “report card” industry for a range of services, including hospitals, public schools and restaurants.

The rationale for all this disclosure is that

someone, somewhere reads the fine print in these contracts and keeps corporations honest. It turns out what we laymen intuit is true: No one reads them, according to research by a New York University law professor, Florencia Marotta-Wurgler.

But disclosure is nevertheless popular because how could you be against it?

The disclosure bonanza is easy to explain. Nobody is against it. It’s politically expedient. Companies prefer such rules, especially in lieu of actual regulations that would curtail bad products or behavior. The opacity lobby — the remora fish class of lawyers, lobbyists and consultants in New York and Washington — knows that disclosure requirements are no bar to dodgy practices. You just have to explain what you’re doing in sufficiently incomprehensible language, a task that earns those lawyers a hefty fee.

In the now infamous Duke Saga, Keith Baggerly was able to reproduce the work of Potti et al. after roughly 2,000 hours of work because the data were publicly available (although the code was not). It's not clear how much time would have been saved if the code had been available, but it seems reasonable to assume that it would have taken some amount of time to understand the analysis, if not reproduce it. Once the errors in Potti's work were discovered, it took 5 years for the original Nature Medicine paper to be retracted.

Although you could argue that the process worked in some sense, it came at tremendous cost of time and money. Wouldn't it have been better if the analysis had been done right in the first place?

26
Jan

Reproducible Research Course Companion

I'm happy to announce that you can now get a copy of the Reproducible Research Course Companion from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my Reproducible Research course offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.

If you're interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you are currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.

Please note that all of the lectures are still available for free on YouTube via my YouTube channel. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.

15
Oct

Dear Laboratory Scientists: Welcome to My World

Consider the following question: Is there a reproducibility/replication crisis in epidemiology?

I think there are only two possible ways to answer that question:

  1. No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.
  2. Yes, there is a replication crisis in epidemiology, and it started in 1854 when John Snow inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.

If you chose (2), then I don't think you are allowed to call it a "crisis" because I think by definition, a crisis cannot last 160 years. In that case, it's more of a chronic disease.

I had an interesting conversation last week with a prominent environmental epidemiologist about the replication crisis that has been reported on extensively in the scientific and popular press. In his view, this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results, given their general inability to conduct controlled experiments.

Given the observational nature of most environmental epidemiological studies, it's generally accepted in the community that no single study can be considered causal, and that many replications of a finding are needed to establish a causal connection. Even the popular press knows now to include the phrase "correlation does not equal causation" when reporting on an observational study. The work of Sir Austin Bradford Hill essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.

So if "correlation does not equal causation", it raises the question: what does equal causation? Many would argue that a controlled experiment, whether it's a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are not controlled.

For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a "broken" randomized trial). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.

The correct statement is not "correlation does not equal causation" but rather "no single study equals causation", regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against a hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.

22
Sep

Unbundling the educational package

I just got back from the World Economic Forum's summer meeting in Tianjin, China and there was much talk of disruption and innovation there. Basically, if you weren't disrupting, you were furniture. Perhaps not surprisingly, one topic area that was universally considered ripe for disruption was Education.

There are many ideas bandied about with respect to "disrupting" education and some are interesting to consider. MOOCs were the darlings of...last year...but they're old news now. Sam Lessin has a nice piece in The Information (total paywall, sorry, but it's worth it) about building a subscription model for universities. Aswath Damodaran has what I think is a nice framework for thinking about the "education business".

One thing that I latched on to in Damodaran's piece is the idea of education as a "bundled product". Indeed, I think the key aspect of traditional on-site university education is the simultaneous offering of

  1. Subject matter content (i.e. course material)
  2. Mentoring and guidance by faculty
  3. Social and professional networking
  4. Other activities (sports, arts ensembles, etc.)

MOOCs have attacked #1 for many subjects, typically large introductory courses. Endeavors like the Minerva project are attempting to provide lower-cost seminar-style courses (i.e. anti-MOOCs).

I think the extent to which universities will truly be disrupted will hinge on how well we can unbundle the four (or maybe more?) elements described above and provide them separately but at roughly the same level of quality. Is it possible? I don't know.

29
Jul

Introducing people to R: 14 years and counting

I've been introducing people to R for quite a long time now and I've been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998--1999. I think I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using Lisp-Stat (which I loved) and only later converted its courses over to R. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was "It's just like S-PLUS but it's free!" As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when I talk about the history of R but I'm not sure many people nowadays actually have any memories of the product.

When I got to Johns Hopkins in 2003 there wasn't really much of a modern statistical computing class, so Karl Broman, Rafa Irizarry, Brian Caffo, Ingo Ruczinski, and I got together and started what we called the "KRRIB" class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that as a reference frame. No one had ever used a functional-style programming language like Scheme or Lisp.

Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008--2009, when I'd ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a "real" language that could be used for real tasks.

R has grown a lot in recent years, and is being used in so many places now, that I think it's essentially impossible for a person to keep track of everything that is going on. That's fine, but it makes "introducing" people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I've never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to "bring people over" to R. In fact it's quite the opposite--people might start asking questions if I weren't teaching R.

Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I've found there are a few topics that I always stress to people who are newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.

  • R is a functional-style language. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop!
  • R is both an interactive system and a programming language. Yes, it's a floor wax and a dessert topping--get used to it. Most people seem to expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don't write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.
  • Visualization/plotting capabilities are state-of-the-art. One of the big selling points back in the "old days" was that from the very beginning R's plotting and graphics capabilities were far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.
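
To make the first point above a bit more concrete, here is a minimal sketch of the functional-style habits that tend to trip newcomers up: passing a function as an argument, swapping a for loop for lapply(), a map-reduce style computation, and the fact that a function works on its own copy of its argument. The names apply_twice, add_one, and set_first_to_zero are made up for this illustration; only lapply(), sapply(), Map(), and Reduce() are base R.

```r
# Functions are ordinary values, so one can be passed as an argument.
# apply_twice and add_one are hypothetical names used only for illustration.
apply_twice <- function(f, x) f(f(x))
add_one <- function(x) x + 1
apply_twice(add_one, 1)   # 3

# The for loop most newcomers reach for first ...
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

# ... and the equivalent functional calls.
squares_list <- lapply(1:5, function(i) i^2)   # returns a list
squares_vec  <- sapply(1:5, function(i) i^2)   # simplifies to a numeric vector

# Map-reduce style: square each element, then add the results together.
sum_of_squares <- Reduce(`+`, Map(function(i) i^2, 1:5))   # 55

# Arguments are not modified in place: the function changes its own copy
# and the caller's object is left untouched.
set_first_to_zero <- function(v) {
  v[1] <- 0
  v
}
original <- c(1, 2, 3)
modified <- set_first_to_zero(original)
# original is still c(1, 2, 3); modified is c(0, 2, 3)
```

The loop and the sapply() call produce the same vector; which one reads better is partly a matter of taste, but the functional version is the one people coming from C usually have to be talked into.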

I'm looking forward to teaching R to people as long as people will let me, and I'm interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it's been just an amazing experience to see the widespread adoption of R over the past decade. I'm sure the next decade will be just as amazing.

16
Jul

Jan de Leeuw owns the Internet

One of the best things to happen on the Internet recently is that Jan de Leeuw has decided to own the Twitter/Facebook universe. If you do not already, you should be following him. Among his many accomplishments, he founded the Department of Statistics at UCLA (my alma mater), which is currently thriving. On the occasion of the Department's 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he "knew everyone and knew everything". Pretty accurate description, in my opinion.

Jan's been tweeting quite a bit of late, but recently had this gem:

followed by

I'm not sure what Jan's thinking behind the first tweet was, but I think many in statistics would consider it a "good thing" to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?

This is a situation where I think there is a large disconnect between what "should be" and what "is reality". What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don't necessarily have one (primarily, people working on the border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.

The reality though is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as "not statistics". This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is that statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn't fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.

Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won't be if we don't try.

24
Jun

New book on implementing reproducible research

I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, Implementing Reproducible Research, has just been published by CRC Press. Although it is technically in their "R Series", the chapters contain information on a wide variety of useful tools, not just R-related tools.

There is also a supplementary web site hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.