Simply Statistics


Grad students in (bio)statistics - do a postdoc!

Up until about 20 years ago, postdocs were scarce in Statistics. In contrast, during the same time period, it was rare for a Biology PhD to go straight into a tenure track position.

Driven mostly by the availability of research funding for those working in applied areas, postdocs are becoming much more common in our field and I think this is great. It is great for PhD students to expand their horizons during two years in which they don’t have to worry about teaching, committee meetings, or grant writing. It is also great for those of us fortunate enough to work with well-trained, independent, energetic, bright, and motivated fresh PhDs. Many of our best graduates are electing to postpone their entry into tenure track jobs in favor of postdocs. Also, students from other fields, computer science and engineering in particular, are taking postdocs with statisticians. I think these are both good trends. If they continue, the result will be that, as a field, we will become more well-rounded and productive.

This trend has been particularly beneficial for me. Most of the postdocs I have hired have come to me with a CV worthy of a tenure track job. They have been independent and worked more as collaborators than advisees. So why pass on more $ and prestige? A PhD in Statistics/Computer Science/Engineering can be on a very specific topic and students may not gain any collaborative experience whatsoever. A postdoc at Hopkins Biostat provides a new experience in a highly collaborative environment, with access to world leaders in the biomedical sciences, and where we focus on development of applied tools. The experience can also improve a student’s visibility and job prospects, while delaying the tenure clock until they have more publications under their belts.

An important thing you should be aware of is that in many departments you can negotiate the start of a tenure track position. So seriously consider taking 1-2 years of almost 100% research time before commencing the grind of a tenure track job. 

I’m not the only one who thinks postdocs are a good thing for our field and for biostatistics students. The column below was written by Terry Speed in November 2003 and is reprinted with permission from the IMS Bulletin.

In Praise of Postdocs

I don’t know what proportion of IMS members have PhDs (or an equivalent) in probability or statistics, but I’d guess it’s fairly high. I don’t know what proportion of those that do have PhDs would also have formal post-doctoral research experience, but here I’d guess it’s rather low.

Why? One possible reason is that for much of the last 40 years, anyone completing a PhD in prob or stat and wanting a research career could go straight into one. Prospective employers of people with PhDs in our field (be they universities, research institutes, national labs or companies) don’t require their novices to have completed a postdoc, and most graduating PhDs are only too happy to go straight into their first job.

This is in sharp contrast with the biological and physical sciences, where it is rare to appoint someone to a tenure-track faculty or research scientist position without their having completed one or more postdocs.

The number of people doing postdocs in probability or statistics has been growing over the last 15 years. This is in part due to the arrival on the scene of institutes such as the MSRI, IMA, IPAM, NISS, NCAR, and recently the MBI and SAMSI in the US, the Newton Institute in the UK, the Fields Institute in Canada, the Institut Henri Poincaré in France, and others elsewhere around the world. In such institutes, short-term postdoc positions go with their current research programs, and there are usually a smaller number continuing for longer periods.

It is also the case that an increasing number of senior researchers are being awarded research funds to support postdocs in prob or stat, often in the newer, applied areas such as computational biology.

And finally, it has long been the case that many countries (Germany, Sweden, Switzerland, and the US, to name a few) have national grants supporting postdoctoral research in their own or, even better, another country. I think all of this is great, and would like to see this trend continue and strengthen.

Why do I think postdocs are a good thing? And why do I think young probabilists and statisticians should do one, even when they can get a good job without having done so?

For most of us, doing a PhD means getting totally absorbed in some relatively narrow research area for 2–3 years, treating that as the most important part of science for that time, and trying to produce some of the best work in that area. This is fine, and we get a PhD for our efforts, but is it good training for a lifelong research career? While it is obviously good preparation for doing more of the same, I don’t think it is adequate for research in general. I regard the successful completion of a PhD as (at least) evidence that the person in question can do research, but it doesn’t follow that they can go on and successfully do research in a new area, or in a different environment, or without close supervision.

Postdocs give you the chance to broaden, to learn new technical skills, to become acquainted with new areas, and to absorb the culture of a new institution, all at a time when your professional responsibilities are far fewer than they would have been had you taken that first “real” job. The postdoc period can be a wonderful time in your scientific life, one which sees you blossom, building on the confidence you gained by having completed your PhD, in what is still essentially a learning environment, but one where you can follow your own interests, explore new areas, and still make mistakes. At the worst, you have delayed your entry into the workforce two or three years, and you can still keep on working in your PhD area if you wish. The number of openings for researchers in prob or stat doesn’t fluctuate so much on this time scale, so you are unlikely to be worse off than the earnings foregone. At best, you will move into a completely new area of research, one much better suited to your personal interests and skills, perhaps also better suited to market demand, but either way, one chosen with your PhD experience behind you. This can greatly enhance your long-term career prospects and more than compensate for your delayed entry into the workforce.

Students: the time to think about this is now [November], not just as you are about to file your dissertation. And the choice is not necessarily one between immediate security and career development: you might be able to have both. You shouldn’t shy from applying for tenure-track jobs and postdocs at the same time, and if offered the job you want, requesting (say) two years’ leave of absence to do the postdoc you want. Employers who care about your career development are unlikely to react badly to such a request.


An R function to map your Twitter Followers

I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a previous post. I also found a lot of really useful code over at FlowingData.

The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you. The code also requires you to have a working internet connection. 

One word of warning is that if you have a large number of followers or people you follow, you may be rate limited by Twitter and unable to make the plot.

To make your personalized twitter map, first source the function:

> source("")

The function has the following form: 

twitterMap <- function(userName, userLocation=NULL, fileName="twitterMap.pdf", nMax=1000, plotType=c("followers","both","following"))

with arguments:

  • userName - the twitter username you want to plot
  • userLocation - an optional argument giving the location of the user, necessary when the location information you have provided Twitter isn’t sufficient for us to find latitude/longitude data
  • fileName - the file where you want the plot to appear
  • nMax - the maximum number of followers/following to get from Twitter; this is implemented to avoid rate limiting for people with large numbers of followers
  • plotType - if “both” both followers/following are plotted, etc. 

Then you can create a plot with both followers/following like so: 

> twitterMap("simplystats", plotType="both")

Here is what the resulting plot looks like for our Twitter Account:

If your location can’t be found or its latitude/longitude can’t be calculated, you may have to choose a bigger city near you. The list of cities used by twitterMap can be searched like so:



> data(world.cities, package="maps")
> grep("Baltimore", world.cities[,1])

If your city is in the database, this will return the row number of the world.cities data frame corresponding to your city. 
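As a rough sketch of what that lookup gives you, the snippet below pulls the coordinates for an exact city-name match. To stay self-contained it uses a small stand-in data frame with a few of the column names found in world.cities (name, country.etc, lat, long); the rows, the approximate coordinates, and the helper name findCity are illustrative assumptions, not part of twitterMap.

```r
# Stand-in for maps::world.cities: same column names, illustrative rows only.
# Coordinates are approximate.
cities <- data.frame(
  name        = c("Baltimore", "Beijing", "Athens"),
  country.etc = c("USA", "China", "Greece"),
  lat         = c(39.3, 39.9, 38.0),
  long        = c(-76.6, 116.4, 23.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows whose name matches exactly.
# grep() would also match substrings, so == is used here instead.
findCity <- function(cityName, db = cities) {
  idx <- which(db$name == cityName)
  db[idx, c("name", "country.etc", "lat", "long")]
}

hit <- findCity("Baltimore")
print(hit)

# A city that is not in the table returns zero rows
nrow(findCity("Nowhereville"))
```

With the real data you would pass world.cities as db; a name shared by several cities returns several rows, which is one reason picking a larger, unambiguous city can help.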

If you like this function you may also like our function to determine if you are a data scientist or to analyze your Google Scholar citations page.

Update: The bulk of the heavy lifting done by these functions is performed by Jeff Gentry’s very nice twitteR package and code put together by Nathan Yau over at FlowingData. This is really an example of standing on the shoulders of giants.

On Hard and Soft Money

As the academic job hunting season gets under way, many will be applying to a variety of different types of departments. In statistics, there is a pretty big separation between statistics departments, which tend to be in arts & sciences colleges, and biostatistics departments, which tend to be in medical or public health institutions. A key difference between these two types of departments is the funding model.

Statistics department faculty tend to be on 9- or 10-month salaries with funding primarily coming from teaching classes (research funding can be obtained for the summer months). Biostatistics department faculty tend to have 12-month salaries with a large chunk of funding coming from research grants. Statistics departments are sometimes called “hard money” departments (i.e. tuition money is “hard”) while biostatistics departments are “soft money”. Grant money is considered “soft” because it has a tendency to go away a bit more easily. As long as students want to attend a university, there will always be tuition.

The biostatistics department at Johns Hopkins is a soft money department. We tend to get the bulk of our salaries from research project grants. Statisticians can play two roles on research grants: as a co-investigator/collaborator and as a principal investigator (PI). I guess that’s true of anyone, but statisticians are very commonly part of research projects as co-investigators because pretty much every research project these days will need statistical advice or methodological development. Researchers often have trouble getting their grants funded if they don’t have a statistician on board. So there’s often plenty of funding to go around for statisticians. But the real problem is getting enough time to do the research you want to do. If you’re spending all your time doing other people’s work, then sure you’re getting paid, but you’re not getting things done that will advance your career.

In a soft money department, I can think of two ways to go. The first is to write your own grants with you as the PI. That way you can guarantee funding for yourself to do the things you find interesting (assuming your grant is funded!). The other approach is to collaborate on a project where the work you need to do is work you would have done anyway. That can be a happy coincidence because then you don’t have to deal with the administrative burden of running a research project. But this approach relies a bit on luck and on the research environment at your institution.

Many job candidates tell me that they are worried about working in a soft money department because if they can’t get their grants funded then they will be in some sort of trouble. In hard money departments, at least the majority of their salary is guaranteed by the teaching they do. This is true to some extent, but I contend that they are worrying about the wrong thing, namely money.

What job candidates should really be worried about is whether the department will support them in their career. Candidates should be looking for departments that mentor their junior faculty and create an environment in which it will be easy to succeed. If you’re in a department that routinely hangs its junior faculty out to dry, you can have all the hard money you want and you’ll still be unhappy. A soft money department that supports its junior faculty will make sure the right structure is in place for faculty to succeed.

Here are some things to look out for in any department, but perhaps more so in a soft money department:

  • Is there administrative support staff to help with writing grants, i.e. for drafting budgets, assembling biosketches, and other paperwork?
  • Are there senior faculty around who have successfully written grants and would be willing to read your grants and give you feedback?
  • Is the environment there sufficient for you to do the things you want to do? For example, are there excellent collaborators for you to work with? Powerful computing support? All these things will help you get an edge over people who don’t have easy access to these resources.

Besides having a good idea, the environment can play a key role in writing a good grant. For starters, if all your collaborators are in the same building as you, it makes it a lot easier to coordinate meetings to discuss ideas and to do the preparation. If you’re trying to work with 4 different people in 4 different institutions (maybe in different timezones), things just get a little harder and maybe you don’t get the feedback you need.

Similarly, if you have a strong computing infrastructure in place, then you can test it out beforehand and see what its capabilities are. If you need to purchase the same infrastructure for yourself as part of a grant, then you won’t know what it can do until you get it and set it up. In our department, we are constantly buying new systems for our computing center and there are always glitches in the beginning with new equipment and new software. If you can avoid having to do this, it makes the grant a lot easier to write.

Lastly, I’ll just say that if you’re in the position of applying for tenure-track academic jobs, you’re probably not lazy. So you’re going to do your work no matter where you go. You just need to find a place where you can get things done. 


In Greece, a statistician faces life in prison for doing his job: calculating and reporting a statistic

In a recent post I described the importance of government statisticians. Well, apparently in Greece it is a dangerous job, as Andreas Georgiou, the person in charge of the Greek statistics office, found out.

So far, though, his efforts have been met with resistance, strikes and a criminal investigation that could lead to life in prison for Georgiou.

What are his efforts?

His first priority after he was appointed was to figure out how big Greece’s deficit really was back in 2009, when the crisis began. He looked through all the data and concluded that Greece’s deficit that year was 15.8 percent of GDP, higher than what had previously been reported.

Eurostat, the central authority in Brussels, praised Georgiou’s methodology and blessed the number as true. The hundreds of Greek people who work beneath Georgiou — the old guard — did not.

So in response, the “old guard” decided to vote on the summary statistic:

Skordas sits on a governing board for the statistics office. His board wanted to debate and vote on the deficit number before anyone in Brussels was allowed to see it. Georgiou, the technocrat, saw that as a threat to his independence. He refused. The number is the number, he said. It’s not something to be put up for a vote.

Did they perform a Bayesian analysis based on the vote?


Interview with Nathan Yau of FlowingData

Nathan Yau

Nathan Yau is a graduate student in statistics at UCLA and the author of the extremely popular data visualization blog FlowingData. He recently published a book, Visualize This - a really nice guide to modern data visualization using R, Illustrator and JavaScript - which should be on the bookshelf of any statistician working on data visualization.

Do you consider yourself a statistician/data scientist/or something else?

Statistician. I feel like statisticians can call themselves data scientists, but not the other way around. Although with data scientists there’s an implied knowledge of programming, which statisticians need to get better at.

Who have been good mentors to you and what qualities have been most helpful for you?

I’m visualization-focused, and I really got into the area during a summer internship at The New York Times. Before that, I mostly made graphs in R for reports. I learned a lot about telling stories with data and presenting data to a general audience, and that has stuck with me ever since.

Similarly, my adviser Mark Hansen has shown me how data is more free-flowing and intertwined with everything. It’s hard to describe. I mean coming into graduate school, I thought in terms of datasets and databases, but now I see it as something more organic. I think that helps me see what the data is about more clearly.

How did you get into statistics/data visualization?

In undergrad, an introduction to statistics (for engineering) actually pulled me in. The professor taught with so much energy, and the material sort of clicked with me. My friends who were also taking the course complained and had trouble with it, but I wanted more for some reason. I eventually switched from electrical engineering to statistics.

I got into visualization during my first year in grad school. My adviser gave a presentation on visualization, but from a media arts perspective rather than a charts-and-graphs-in-R-Tufte point of view. I went home after that class, googled visualization and that was that.

Why do you think there has been an explosion of interest in data visualization?

The Web is a really visual place, so it’s easy for good visualization to spread. It’s also easier for a general audience to read a graph than it is to understand statistical concepts. And from a more analytical point of view, there’s just a growing amount of data and visualization is a good way to poke around.

Other than R, what tools should students learn to improve their data visualizations?

For static graphics, I use Illustrator all the time to bring storytelling into the mix or to just provide some polish. For interactive graphics on the Web, it’s all about JavaScript nowadays. D3, Raphael.js, and Processing.js are all good libraries to get started.

Do you think the rise of infographics has led to a “watering down” of data visualization?

So I actually just wrote a post along these lines. It’s true that there are a lot of low-quality infographics, but I don’t think that takes away from visualization at all. It makes good work more obvious. I think the flood of infographics is a good indicator of people’s eagerness to read data.

How did you decide to write your book “Visualize This”?

Pretty simple. I get emails and comments all the time when I post graphics on FlowingData that ask how something was done. There aren’t many resources that show people how to do that. There are books that describe what makes good graphics but don’t say anything about how to actually go about doing it, and there are programming books for, say, R, but they are too technical for most and aren’t visualization-centric. I wanted to write a book that I wish I had in the early days.

Any final thoughts on statistics, data and visualization?

Keep an open mind. Oftentimes, statisticians seem to box themselves into positions of analysis and reports. Statistics is an applied field though, and now more than ever, there are opportunities to work anywhere there is data, which is practically everywhere.

Dear editors/associate editors/referees, Please reject my papers quickly

The review times for most journals in our field are ridiculous. Check out Figure 1 here. A careful review takes time, but not six months. Let’s be honest, those papers are sitting on desks for the great majority of those six months. But here is what really kills me: waiting six months for a review basically saying the paper is not of sufficient interest to the readership of the journal. That decision you can come to in half a day. If you don’t have time, don’t accept the responsibility to review a paper.

I like sharing my work with my statistician colleagues, but the Biology journals never do this to me. When my paper is not of sufficient interest, these journals reject me in days, not months. I sometimes work on topics that are fast-paced, and many of my competitors are not statisticians. If I have to wait six months for each rejection, I can’t compete. By the time the top three applied statistics journals reject the paper, more than a year has gone by and the paper is no longer novel. Meanwhile I can go through Nature Methods, Genome Research, and Bioinformatics in less than 3 months.

Nick Jewell once shared an idea that I really liked. It goes something like this. Journals in our field will accept every paper that is correct. The editorial board, with the help of referees, assigns each paper to one of five categories A, B, C, D, E based on novelty, importance, etc… If you don’t like the category you are assigned, you can try your luck elsewhere. But before you go, note that the paper’s category can improve after publication based on readership feedback. While we wait for this idea to get implemented, I ask that if you get one of my papers and you don’t like it, please reject it quickly. You can write this review: “This paper rubbed me the wrong way and I heard you like being rejected fast so that’s all I am going to say.” Your comments and critiques are valuable, but not worth the six month wait.

ps -  I have to admit that the newer journals have not been bad to me in this regard. Unfortunately, for the sake of my students/postdocs going into the job market and my untenured jr colleagues, I feel I have to try the established top journals first as they still impress more on a CV.


Smoking is a choice, breathing is not.

Over the last week or so I’ve been posting about the air pollution levels in Beijing, China. The twitter feed from the US Embassy there makes it easy to track the hourly levels of fine particulate matter (PM2.5) and you can use this R code to make a graph of the data.

One problem with talking about particulate matter levels is that the units are a bit abstract. We usually talk in terms of micrograms per cubic meter (mcg/m^3), which is a certain mass of particles per volume of air. The 24-hour national ambient air quality standard for fine PM in the US is 35 mcg/m^3. But what does that mean in reality?

C. Arden Pope III and colleagues recently wrote an interesting paper in Environmental Health Perspectives on the dose-response relationship between particles and lung cancer and cardiovascular disease. They combined data from air pollution studies and smoking studies to estimate the dose-response curve for a very large range of PM levels. Ambient air pollution, not surprisingly, is on the low-end of PM exposure, followed by second hand smoke, followed by active smoking. One challenge they faced is putting everything on the same scale in terms of PM exposure so that the different studies could be compared.

Here are the important details: On average, actively smoking a cigarette generates a dose of about 12 milligrams (mg) of particulate matter. Daily inhalation rates obviously depend on your size, age, physical activity, health, and other factors, but in adults they range from about 13 to 23 cubic meters of air per day. For convenience, I’ll just take the midpoint of that range, which is 18 cubic meters per day.

If your city’s fine PM levels were compliant with the US national standard of 35 mcg/m^3, then in the worst case scenario you’d be breathing in about 630 micrograms of particles per day, which is about 0.05 cigarettes (1 cigarette every 20 days). Sounds like it’s not too bad, but keep in mind that most of the increase in risk from smoking is seen in the low range of the dose-response curve (although this is obviously very low).

If we move now to Beijing, where 24-hour average levels can easily reach up to 300 mcg/m^3 (and indoor levels can reach 200 mcg/m^3), then we’re talking about a daily dose of almost half a cigarette. Now, half a cigarette might still seem like not that much, but keep in mind that pretty much everyone is exposed: old and young, sick and healthy. Not everyone gets the same dose because of variation in inhalation rates, but even the low end of the range gives you about 0.3 cigarettes. 
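The arithmetic behind these cigarette equivalents is short enough to write down. Here is a minimal R sketch using only the figures quoted above (12 mg of particles per cigarette, 13 to 23 cubic meters of air inhaled per day); the function and variable names are made up for illustration and this is not the code used for the plots.

```r
# Figures quoted in the post
pmPerCigaretteMcg <- 12 * 1000                        # 12 mg per cigarette, in micrograms
inhalationRange   <- c(low = 13, mid = 18, high = 23) # cubic meters of air per day

# Hypothetical helper: daily PM dose expressed in cigarettes per day.
# pmLevel is a 24-hour average concentration in mcg/m^3.
cigarettesPerDay <- function(pmLevel, inhalation = 18) {
  pmLevel * inhalation / pmPerCigaretteMcg
}

usStandard <- cigarettesPerDay(35)                    # about 0.05 cigarettes per day
beijing    <- cigarettesPerDay(300)                   # almost half a cigarette per day
beijingAll <- cigarettesPerDay(300, inhalationRange)  # low, mid and high inhalation rates

print(usStandard)
print(beijing)
print(beijingAll)
```

Because the function is vectorized, passing a whole series of hourly or daily PM2.5 readings as pmLevel converts the series to cigarettes per day in one call.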

Beijing is hardly alone here, as a number of studies in Asian cities show comparable levels of fine PM. I’ve redone my previous plot of PM2.5 in Beijing in terms of number of cigarettes per day. Here’s the last 2 months in Beijing (for an average adult).


The Supreme Court's interpretation of statistical correlation may determine the future of personalized medicine


The Supreme Court heard oral arguments last week in the case Mayo Collaborative Services vs. Prometheus Laboratories (No 10-1150). At issue is a patent Prometheus Laboratories holds for making decisions about the treatment of disease on the basis of a measurement of a specific, naturally occurring molecule and a corresponding calculation. The specific language at issue is a little technical, but the key claim from the patent under dispute is:

1. A method of optimizing therapeutic efficacy for treatment of an immune-mediated gastrointestinal disorder, comprising: 

(a) administering a drug providing 6-thioguanine to a subject having said immune-mediated gastrointestinal disorder; and 

(b) determining the level of 6-thioguanine in said subject having said immune-mediated gastrointestinal disorder,  

wherein the level of 6-thioguanine less than about 230 pmol per 8x10^8 red blood cells indicates a need to increase the amount of said drug subsequently administered to said subject and  

wherein the level of 6-thioguanine greater than about 400 pmol per 8x10^8 red blood cells indicates a need to decrease the amount of said drug subsequently administered to said subject.

So basically the patent is on a decision made about treatment on the basis of a statistical correlation. When the levels of a specific molecule (6-thioguanine) are too high, the dose of a drug (thiopurine) should be decreased; if they are too low, the dose of the drug should be increased. Here (and throughout the post) correlation is interpreted loosely as a relationship between two variables, rather than under the strict definition as the linear relationship between two quantitative variables.
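Reduced to its logic, the claim is a two-threshold rule applied to a single measurement. The R sketch below only illustrates that structure: the thresholds are the ones quoted in the claim (in pmol per 8x10^8 red blood cells), while the function name doseSignal and the label for the middle range are my own assumptions, since the claim says nothing about levels between the two thresholds.

```r
# Hypothetical helper mirroring the two "wherein" clauses of claim 1.
# level: measured 6-thioguanine, in pmol per 8x10^8 red blood cells.
doseSignal <- function(level, lower = 230, upper = 400) {
  ifelse(level < lower, "increase",
         ifelse(level > upper, "decrease", "no change indicated"))
}

signals <- doseSignal(c(150, 300, 450))
print(signals)
```

From a statistical point of view, anything more elaborate (a multi-gene score, say) differs only in how the number fed to a rule like this is computed.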

This correlation between levels of 6-thioguanine and patient response was first reported by a group of academics in a paper in 1996. Prometheus developed a diagnostic test based on this correlation. Doctors (including those at the Mayo Clinic) would draw blood and send it to Prometheus, who would calculate the levels of 6-thioguanine and report them back.

According to Mayo’s brief, some doctors at Mayo who used this test decided it was possible to improve on it. So they developed their own diagnostic test, based on a different measurement of 6-thioguanine (6-TGN), and reported different information, including:

  • A blood reading greater than 235 picomoles of 6-TGN is a “target therapeutic range,” and a reading greater than 250 picomoles of 6-TGN is associated with remission in adult patients; and
  • A blood reading greater than 450 picomoles of 6-TGN indicates possible adverse health effects, but in some instances levels over 700 are associated with remission without significant toxicity, while a “clearly defined toxic level” has not been established; and
  • A blood reading greater than 5700 picomoles of 6-MMP is possibly toxic to the liver.

They subsequently created their own proprietary test and started to market it, at which point Prometheus sued the Mayo Clinic for infringement. The most recent decision on the case was made by a federal circuit court, which upheld Prometheus’ claim. A useful summary is here.

The arguments for the two sides are summarized in the briefs for each side; for Mayo:

Whether 35 U.S.C. § 101 is satisfied by a patent claim that covers observed correlations between blood test results and patient health, so that the patent effectively preempts use of the naturally occurring correlations, simply because well-known methods used to administer prescription drugs and test blood may involve “transformations” of body chemistry.

and for Prometheus:

Whether the Federal Circuit correctly held that concrete methods for improving the treatment of patients suffering from autoimmune diseases by using  individualized metabolite measurements to inform the calibration of the patient’s dosages of synthetic thiopurines are patentable processes under 35 U.S.C. §101. 

Basically, Prometheus claims that the patent covers cases where doctors observe a specific data point and make a decision about a specific drug on the basis of that data point and a known correlation with patient outcomes. Mayo, on the other hand, says that since the correlation between the data and the outcome is a naturally occurring process, it cannot be patented.

In the oral arguments, the attorney for Mayo makes the claim that the test is only patentable if Prometheus specifies a specific level for 6-thioguanine and a specific treatment associated with that level (see pages 21-24 of the transcript). He then goes on to suggest that Mayo would then be free to pick another level and another treatment option for their diagnostic test. Justice Breyer disagrees even with this specific option (see page 38 of the transcript and his fertilizer example). He has made this view known before in his dissent to the dismissal of the Labcorp writ of certiorari (a very similar case focusing on whether a correlation can be patented).

Brief summary: Prometheus is trying to patent a correlation between a molecule’s level and treatment decisions. Mayo is claiming this is a natural process and can’t be patented.  

Implications for Personalized Medicine (a statistician’s perspective)

I believe this case has major potential consequences for the entire field of personalized medicine. The fundamental idea of personalized medicine is that treatment decisions for individual patients will be tailored on the basis of data collected about them and statistical calculations made on the basis of that data (i.e. correlations, or more complicated statistical functions).

According to my interpretation, if the Supreme Court rules in favor of Mayo in a broad sense, then this suggests that decisions about treatment made on the basis of data and correlation are not broadly patentable. In both the Labcorp dissent and the oral arguments for the Prometheus case, Justice Breyer argues that the process described by the patents:

…instructs the user to (1) obtain test results and (2) think about them. 

He suggests that these are natural correlations and hence cannot be patented, just the way a formula like E = mc^2 cannot be patented. The distinction seems to be subtle: whereas E = mc^2 is a formula that exactly describes a property of nature, the observed correlation is an empirical estimate of a parameter calculated on the basis of noisy data.

From a statistical perspective, there is little difference between calculating a correlation and calculating something more complicated, like the Oncotype DX signature. Both return a score that can be used to determine treatment or other health care decisions. In some sense, they are both “natural phenomena” - one is just more complicated to calculate than the other. So it is not surprising that Genomic Health, the developers of Oncotype, have filed an amicus brief in favor of Prometheus.

Once a score is calculated, regardless of the level of complication in calculating that score, the personalized decision still comes down to a decision made by a doctor on the basis of a number. So if the court broadly decides in favor of Mayo, from a statistical perspective, this would seemingly pre-empt patenting any personalized medicine decision made on the basis of observing data and making a calculation. 
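To make the point concrete, here is a minimal sketch in Python of two scoring rules, one trivial and one more involved, that both end in the same step: a number compared against a cutoff. Everything in it is hypothetical. The function names, gene labels, weights, and thresholds are made up for illustration and are not taken from the Prometheus patents or from the Oncotype DX algorithm.

```python
# Illustrative only: all names, weights, and cutoffs below are hypothetical.

def single_marker_score(level):
    """Simplest case: the score is just one measured level."""
    return level


def signature_score(expression, weights):
    """More complicated case: a weighted sum over several measurements."""
    return sum(weights[gene] * expression[gene] for gene in weights)


def decide(score, threshold):
    """Whatever produced the score, the final step is the same:
    compare one number against a cutoff."""
    if score > threshold:
        return "adjust treatment"
    return "keep current treatment"


# A made-up single measurement and a made-up cutoff.
simple = single_marker_score(450.0)
simple_decision = decide(simple, threshold=400.0)

# A made-up three-gene signature with made-up weights and cutoff.
expression = {"g1": 2.0, "g2": 1.0, "g3": 0.5}
weights = {"g1": 0.5, "g2": -1.0, "g3": 2.0}
complex_score = signature_score(expression, weights)
complex_decision = decide(complex_score, threshold=1.5)

print(simple_decision)
print(complex_decision)
```

The two scoring functions differ only in how much arithmetic happens before `decide` is called, which is the sense in which the two kinds of test look alike from a statistical perspective.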

Unlike traditional medical procedures like surgery, or treatment with a drug, these procedures are based on data and statistics. But in the same way, a very specific set of operations and decisions is taken with the goal of improving patient health. If these procedures are broadly ruled as simply “natural phenomena”, it suggests that the development of personalized decision making strategies is not, itself, patentable. This decision would also have implications for other companies that use data and statistics to make money, like software giant SAP, which has also filed an amicus brief in support of the federal circuit court opinion (and hence Prometheus).

A large share of medical treatment decisions in the future will likely be made on the basis of data and statistical calculations on those data - that is the goal of personalized medicine. So the Supreme Court’s decision about the patentability of correlation has seemingly huge implications for any decision made on the basis of data and statistical calculations. Regardless of the outcome, this case lends even further weight to the idea that statistical literacy is critical, including for Supreme Court justices.

Simply Statistics will be following this case closely; look for more in-depth analysis in future blog posts.


Interview w/ Mario Marazzi, Puerto Rico Institute of Statistics Director, on the importance of Government Statisticians

[Desplace hacia abajo para traducción al español]

In my opinion, the importance of government statisticians is underappreciated. In the US, agencies such as the CDC, the Census Bureau, and the Bureau of Labor Statistics employ statisticians to help collect and analyze data that contribute to important policy decisions. How many students will enroll in public schools this year? Is there a type II diabetes epidemic? Is unemployment rising? How many homeless people are in Los Angeles? The answers to these questions can guide policy and spending decisions and they can’t be answered without the help of the government statisticians that collect and analyze relevant data.

Until recently the Puerto Rican government had no formal mechanisms for collecting data. Puerto Rico, an unincorporated territory of the United States, has many serious economic and social problems. With a very high murder rate, less than 50% of the working-age population in the labor force, an economy that continues to worsen after 5 years of recession, and a substantial traffic problem, Puerto Rico can certainly benefit from sound government statistics to better guide policy-making. Better measurement, information, and knowledge can only improve the situation.

In 2007, the Puerto Rico Institute of Statistics was founded. Mario Marazzi, who obtained his PhD in Economics from Cornell University, left a prestigious job at the Federal Reserve to become the first Executive Director of the Institute.  Given the complicated political landscape in Puerto Rico, Mario made an admirable sacrifice for his home country. He was kind enough to answer some questions for Simply Statistics:

What is the biggest success story of the Institute?

I would say that our biggest success story has been to revive the idea that high-quality statistics are critical for the success of any organization in Puerto Rico.  For too long, statistics were neglected and even abused in Puerto Rico.  There is now a palpable sense in Puerto Rico that it is important to devote resources and time to ensure that data are produced with care.

We have also undertaken a number of critical statistical projects since our inauguration in 2007.  For instance, the Institute completed the revision to Puerto Rico’s Consumer Price Index, after identifying that official inflation had been overestimated by more than double for 15 years.  The Institute revised Puerto Rico’s Mortality Statistics, after detecting the use of an inconsistent selection methodology for the cause of death, as well as discovering thousands of deaths that had not been previously included in the official data.  We also undertook Puerto Rico’s first-ever Science and Technology Survey that allowed us to measure the economic impact of Research and Development activities in Puerto Rico.

What discovery, made from collecting data in Puerto Rico, has most surprised you?

We performed a study on migration patterns during the last decade.  From anecdotal evidence, it was fairly clear that in the last five years there had been an elevated level of migration out of Puerto Rico.  Nevertheless, the data revealed a few stunning conclusions.  For five consecutive years, about 1 percent of Puerto Rico’s population simply left Puerto Rico every year, even after taking into account the people who migrated to Puerto Rico.  The demographic consequences were significant: migration had been accelerating the aging of Puerto Rico’s population, and people who left Puerto Rico had a greater level of educational achievement than those who arrived.  In fact, for the first time ever in recorded history, Puerto Rico’s population actually declined between the 2000 and 2010 Censuses.  Despite declining fertility rates, it is now clear migration was the cause of the overall population decrease.

Are government agencies usually willing to cooperate with the Institute? If not, what resources does the Institute have available to make them comply?

Frequently, statistical functions are not very high on policymakers’ lists of priorities.  As a result, government statisticians are usually content to collaborate with the Institute, since we can bring resources to help solve the common problems they face.

At times, some agencies can be reluctant to undertake the changes needed to produce high-quality statistics.  In these instances, the Institute is endowed with the authority by law to move the process along, through statistical policy mandates approved by the Board of Directors of the Institute. 

If there is a particular agency that excels at collecting and sharing data, can others learn from them?

Definitely. We encourage agencies to share their best practices with one another.  To facilitate this process, the Institute has the responsibility of organizing the Puerto Rico Statistical Coordination Committee, where representatives from each agency can share practical experiences and enhance interagency coordination.

Do you think Puerto Rico needs more statisticians?

Absolutely.  Some of our brightest minds in statistics work outside of Puerto Rico, both in Universities and in the Federal Government.  Puerto Rico needs an injection of human resources to bring its statistical system up to global standards.

What can academic statisticians do to help institutes such as yours?

Academic statisticians are instrumental to furthering the mission of the Institute.  Governments produce statistics in a wide array of disciplines.  Each area can have very specific and unique methodologies.  It is impossible for one to be an expert in every methodology. 

As a result, the Institute depends on the collaboration of academic statisticians that can bring to bear their expertise in specific fields.  For example, academic biostatisticians can help identify needed improvements to existing methodologies in health statistics.  Index theorists can train government statisticians in the latest index methodologies.  Computational statisticians can analyze large data sets to help us explain the otherwise unexplained behavior of the data. 

We also host several Puerto Rico datasets on the Institute’s website, which were provided by professors from a number of different fields.  

Entrevista con Mario Marazzi (versión en español)

En mi opinión, la importancia de los estadísticos que trabajan para el gobierno se subestima. En los EEUU, agencias como el Center for Disease Control, el Census Bureau y el Bureau of Labor Statistics emplean estadísticos para ayudar a recopilar y analizar datos que contribuyen a importantes decisiones de política pública. Por ejemplo, ¿cuántos estudiantes se matricularán en las escuelas públicas este año? ¿Hay una epidemia de diabetes tipo II?  ¿El desempleo está aumentando? ¿Cuántos deambulantes viven en Los Ángeles?  Las respuestas a estas preguntas ayudan a determinar las decisiones presupuestarias y de política pública y no se pueden contestar sin la ayuda de los estadísticos del gobierno que recogen y analizan los datos pertinentes.

Hasta hace poco el gobierno de Puerto Rico no tenía mecanismos formales de recolección de datos. Puerto Rico, un territorio no incorporado de Estados Unidos, tiene muchos problemas socioeconómicos. Con una tasa de asesinatos muy alta, menos de 50% de la población con edad de trabajar en la fuerza laboral, una economía que sigue empeorando después de 5 años de recesión y problemas serios de tráfico, Puerto Rico se beneficiaría de estadísticas gubernamentales de alta calidad para guiar mejor la formulación de política pública. Mejores medidas, información y conocimientos sólo pueden mejorar la situación.

En 2007, se inauguró el Instituto de Estadísticas de Puerto Rico. Mario Marazzi, quien obtuvo su doctorado en Economía de la Universidad de Cornell, dejó un trabajo prestigioso en el Federal Reserve para convertirse en el primer Director Ejecutivo del Instituto.

Tomando en cuenta el complicado panorama político en Puerto Rico, Mario hizo un sacrificio admirable por su país y cordialmente aceptó contestar unas preguntas para nuestro blog:

¿Cuál ha sido el mayor éxito del Instituto?

Yo diría que nuestro mayor éxito ha sido revivir la idea de que las estadísticas de alta calidad son cruciales para el éxito de cualquier organización en Puerto Rico.  Por mucho tiempo, las estadísticas fueron descuidadas e incluso abusadas en Puerto Rico. En la actualidad existe una sensación palpable en Puerto Rico que es importante dedicar recursos y tiempo para asegurarse de que los datos se produzcan con cuidado.

También, desde nuestra inauguración en 2007, hemos realizado una serie de proyectos críticos de estadística.  Por ejemplo, el Instituto concluyó la revisión del Índice de Precios al Consumidor de Puerto Rico, después de identificar que la inflación oficial había sido sobreestimada por más del doble durante 15 años. El Instituto revisó las Estadísticas de Mortalidad de Puerto Rico, después de detectar el uso de una metodología de selección inconsistente para determinar la causa de muerte y tras descubrir miles de muertes que no habían sido incluidas en los datos oficiales.  Además, realizamos la primera Encuesta de Ciencia y Tecnología de Puerto Rico, que nos permitió medir el impacto económico de las actividades de investigación y desarrollo en Puerto Rico.

¿Cuál descubrimiento, realizado a partir de la recopilación de datos en Puerto Rico, más te ha sorprendido?

Nosotros realizamos un estudio sobre los patrones de migración durante la última década. A partir de la evidencia anecdótica, era bastante claro que durante los últimos cinco años ha habido un nivel elevado de emigración de Puerto Rico. Sin embargo, los datos revelaron algunas conclusiones sorprendentes. Durante cinco años consecutivos, 1 por ciento de la población de Puerto Rico se ha ido de Puerto Rico todos los años, incluso después de tomar en cuenta la gente que emigró a Puerto Rico. Las consecuencias demográficas eran importantes: la migración ha acelerado el envejecimiento de la población de Puerto Rico y las personas que se fueron de Puerto Rico tienen un mayor nivel de preparación escolar que los que llegaron. De hecho, por primera vez en la historia, la población de Puerto Rico disminuyó entre el Censo de 2000 y el del 2010.  A pesar de tasas de fecundidad que disminuyen, ahora está claro que la migración es la causa principal de la reducción de población.

¿Por lo general, las agencias gubernamentales están dispuestas a cooperar con el Instituto?  Si no, ¿qué recursos tiene disponibles el Instituto para obligarlas?

Frecuentemente, las estadísticas no aparecen muy altas en las listas de prioridades de los políticos. Como resultado, los estadísticos del gobierno por lo general están contentos de colaborar con el Instituto, ya que nosotros podemos aportar recursos para ayudar a resolver los problemas comunes a que se enfrentan.

A veces, algunas agencias pueden mostrarse reacias a emprender los cambios necesarios para producir estadísticas de alta calidad. En estos casos, el Instituto posee la autoridad legal de acelerar el proceso, a través de mandatos aprobados por el Consejo de Administración del Instituto.

Si hay un organismo en particular que se destaca en la recopilación y el intercambio de datos, ¿otros pueden aprender de ellos?

Definitivamente.  Nosotros animamos a las agencias a compartir sus mejores prácticas con otros. Para facilitar este proceso, el Instituto tiene la responsabilidad de organizar el Comité de Coordinación Estadística de Puerto Rico, donde representantes de cada agencia pueden compartir experiencias prácticas y mejorar la coordinación interinstitucional.

¿Cree usted que Puerto Rico necesita más estadísticos?

Por supuesto. Algunas de nuestras mentes más brillantes en estadísticas trabajan fuera de Puerto Rico, tanto en las universidades como en el Gobierno Federal. Puerto Rico necesita una inyección de recursos humanos para que su sistema estadístico llegue a los estándares mundiales.

¿Qué pueden hacer los estadísticos académicos para ayudar a instituciones como la suya?

Los estadísticos académicos son fundamentales para promover la misión del Instituto. Los gobiernos generan las estadísticas en una amplia gama de disciplinas. Cada área puede tener metodologías muy específicas y únicas. Es imposible que uno sea un experto en cada metodología.

Como resultado, el Instituto cuenta con la colaboración de estadísticos académicos que pueden ejercer sus conocimientos en campos específicos. Por ejemplo, los bioestadísticos académicos pueden ayudar a identificar las mejoras necesarias a las metodologías existentes en el contexto de la salud pública.  Los “Index theorists” pueden entrenar a los estadísticos del gobierno en las últimas metodologías de índice. Los estadísticos computacionales pueden analizar grandes “datasets” que nos ayudan a explicar comportamientos de los datos que de otra manera quedarían sin explicar.

También organizamos varios datasets de Puerto Rico en la página web del Instituto, que fueron proporcionados por profesores en varios campos diferentes.