Simply Statistics


Pro Tips for Grad Students in Statistics/Biostatistics (Part 2)

This is the second in my series on pro tips for graduate students in statistics/biostatistics. For more tips, see part 1.

  1. Meet with seminar speakers. When you go on the job market, face recognition is priceless. I met Scott Zeger at UW when I was a student. When I came for an interview I already knew him (and Ingo, and Rafa, and ….). An even better idea…ask a question during the seminar.
  2. Be a finisher. The key to getting a Ph.D. (other than passing your quals) is the ability to sit down and just power through and get it done. This means sometimes you will have to work late or on a weekend. The people who are the most successful in grad school are the people who just find a way to get it done. If it was easy…anyone would do it.
  3. Work on problems you genuinely enjoy thinking about/are passionate about. A lot of statistics (and science) is long periods of concentrated effort with no guarantee of success at the end. To be a really good statistician requires a lot of patience and effort. It is a lot easier to work hard on something you like or feel strongly about.
More to come soon.

Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)

I just finished teaching a Ph.D. level applied statistical methods course here at Hopkins. As part of the course, I gave one “pro-tip” a day: something I wish I had learned in graduate school that has helped me in becoming a practicing applied statistician. Here are the first three; more to come soon.
  1. A major component of being a researcher is knowing what’s going on in the research community. Set up an RSS feed with journal articles. Google Reader is a good one, but there are others. Here are some good applied stat journals: Biostatistics, Biometrics, Annals of Applied Statistics…
  2. Reproducible research is a hot topic, in part because of a couple of high-profile papers that were disastrously non-reproducible (see “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”). When you write code for statistical analysis, try to make sure that: (a) it is neat and well-commented (liberal and specific comments are your friend), and (b) it can be run by someone other than you to produce the same results that you report.
  3. In data analysis - particularly for complex high-dimensional data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of overfitting/over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically “simple” analyses (Note: being sensible and simple does not always lead to higher profile results).
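
Tips 2 and 3 can be illustrated together with a small simulation. What follows is a minimal sketch in Python (standard library only), not code from the course; the function names, the sample size, and the seed are all arbitrary assumptions. It fixes the random seed so that anyone can rerun it and get the same numbers, then compares a "simple" model (predict the training mean) with a deliberately flexible one (a 1-nearest-neighbor rule that memorizes the training data) on data where the outcome is pure noise.

```python
import random


def simulate(n, rng):
    """Draw n (x, y) pairs where y is pure noise, unrelated to x (an assumed toy setup)."""
    return [(rng.random(), rng.gauss(0.0, 1.0)) for _ in range(n)]


def mean_model(train):
    """Simple model: always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def nearest_neighbor_model(train):
    """Flexible model: predict the y of the closest training x (memorizes the data)."""
    def predict(x):
        _, nearest_y = min(train, key=lambda point: abs(point[0] - x))
        return nearest_y
    return predict


def mse(model, data):
    """Mean squared prediction error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def run_analysis(seed=2012, n=500):
    """Run the whole comparison from a fixed seed so the results can be reproduced."""
    rng = random.Random(seed)  # local, seeded generator: no hidden global state
    train, test = simulate(n, rng), simulate(n, rng)
    results = {}
    for name, fit in [("mean", mean_model), ("nearest_neighbor", nearest_neighbor_model)]:
        model = fit(train)
        results[name] = {"train_mse": mse(model, train), "test_mse": mse(model, test)}
    return results


if __name__ == "__main__":
    for name, errors in run_analysis().items():
        print(name, errors)
```

The flexible model reproduces the training data exactly, so its training error is zero, yet on fresh data it typically does worse than the plain mean. That is the over-interpretation danger in miniature, and because the seed is fixed, someone else can check the claim by running the same script.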

Sunday data/statistics link roundup (6/17)

Happy Father’s Day!

  1. A really interesting read on randomized controlled trials (RCTs) and public policy. The examples in the boxes are fantastic. This seems to be one of the cases where the public policy folks are borrowing ideas from Biostatistics, which has been involved in randomized controlled trials for a long time. It’s a cool example of adapting good ideas in one discipline to the specific challenges of another. 
  2. Roger points to this link in the NY Times about the “Consumer Genome”, which basically is a collection of information about your purchases and consumer history. On Twitter, Leonid K. asks: ‘Since when has “genome” become a generic term for “a bunch of information”?’. I completely understand the reaction against the “genome of x”, which is an over-used analogy. I actually think the analogy isn’t that unreasonable; like a genome, the information contained in your purchase/consumer history says something about you, but doesn’t tell the whole story. I wonder how this information could be used for public health, since it is already being used for advertising….
  3. This PeerJ journal looks like it has the potential to be good. They even encourage open peer review, which has some benefits. Not sure if it is sustainable; see, for example, this breakdown of the costs. I still think we can do better.
  4. Elon Musk is one of my favorite entrepreneurs. He tackles what I consider to be some of the most awe-inspiring and important problems around. This article about the Tesla S got me all fired up about how a person with vision can literally change the fuel we run on. Nothing to do with statistics, other than I think now is a similarly revolutionary time for our discipline. 
  5. There was some interesting discussion on Twitter of the usefulness of the Yelp dataset I posted for academic research. Not sure if this ever got resolved, but I think more and more as data sets from companies/startups become available, the terms of use for these data will be critical. 
  6. I’m still working on Roger’s puzzle from earlier this week. 

Statisticians, ASA, and Big Data

Today I got my copy of Amstat News and eagerly opened it before I realized it was not the issue with the salary survey….

But the President’s Corner section had the following column on big data by ASA president Robert Rodriguez.

Big Data is big news. It is the focus of stories in The New York Times and the subject of technology blogs, business forums, and economic studies. This column describes how statisticians can prepare for opportunities in Big Data and explains the distinctive value our profession can provide.

Here’s a homework assignment for you all: Please read the column and explain what’s wrong with it. I’ll post the answer in a (near) future post.


Poison gas or...air pollution?

From our Beijing bureau, we have the following message from the U.S. embassy that was recently issued to U.S. citizens in China:

The Embassy has received reports from U.S. citizens living and traveling in Wuhan that the air quality in the city has been particularly poor since yesterday morning.  On June 11 at 16:20, the Wuhan Environmental Protection Administrative Bureau posted information about this on its website.  Below is a translation of that information:

“Beginning on June 11, 2012 around 08:00 AM, the air quality inside Wuhan appeared to worsen, with low visibility and burning smells. According to city air data, starting at 07:00 AM this morning, the density of the respiratory particulate matter increased in the air downtown; it increased quickly after 08:00 AM.  The density at 14:00 approached 0.574mg/m3, a level that is deemed “serious” by national standards.  An analysis of the air indicates the pollution is caused from burning of plant material northeast of Wuhan.

It’s not immediately clear which pollutant they’re talking about, but it’s probably PM10 (particulate matter less than 10 microns in aerodynamic diameter). If so, that level is quite high—U.S. 24-hour average standards are at 0.15 mg/m3 (note that the reported level was an hourly level). 
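
To put that comparison in numbers, here is a small sketch in Python. It assumes, as above, that the reported figure is PM10. The constant and function names are made up for illustration, and dividing an hourly reading by a 24-hour standard is only a rough guide, not a compliance calculation.

```python
# Values quoted in the post; treating the reading as PM10 is an assumption.
REPORTED_HOURLY_MG_M3 = 0.574   # Wuhan reading at 14:00, in mg/m3
US_24H_STANDARD_MG_M3 = 0.15    # U.S. 24-hour average standard, in mg/m3


def mg_to_ug(mg_per_m3):
    """Convert mg/m3 to micrograms/m3, a common unit for particulate matter."""
    return mg_per_m3 * 1000.0


def ratio_to_standard(reading, standard):
    """How many times the standard the reading is (both in the same units)."""
    return reading / standard


if __name__ == "__main__":
    ratio = ratio_to_standard(REPORTED_HOURLY_MG_M3, US_24H_STANDARD_MG_M3)
    print(f"{mg_to_ug(REPORTED_HOURLY_MG_M3):.0f} ug/m3 reported vs "
          f"{mg_to_ug(US_24H_STANDARD_MG_M3):.0f} ug/m3 standard "
          f"({ratio:.1f} times the standard)")
```

In other words, the hourly reading is roughly 3.8 times the 24-hour standard.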

Our investigation of downtown’s districts, and based on reports from all of Wuhan’s large industrial enterprises, have determined that there has not been any explosion, sewage release, leakage of any poisoning gas, or any other type of urgent environmental accident from large industrial enterprises.  Nor is there burning of crops in the new city area.  News spread online of a chlorine leak from Qingshan or a boiler explosion at Wuhan Iron and Steel Plant are rumors.

So, this is not some terrible incident, it’s just the usual smell. Good to know.

According to our investigation, the abnormal air quality in our city is mainly caused by the burning of the crops northeast of Wuhan towards Hubei province.  Similar air quality is occurring in Jiangsu, Henan and Anhui provinces, as well as in Xiaogan, Jingzhou, Jingmen and Xiantao, cities nearby Wuhan.

The weather forecast authority of the city has advised that recent weather conditions have not been good for the dispersion of pollutants.”

The embassy goes on to warn:

U.S. citizens are reminded that air pollution is a significant problem in many cities and regions in China.  Health effects are likely to be more severe for sensitive populations, including children and older adults.  While the quality of air can differ greatly between cities or between urban and rural areas, U.S. citizens living in or traveling to China may wish to consult their doctor when living in or prior to traveling to areas with significant air pollution.


Getting a grant...or a startup

Y Combinator is a company that invests in startups and brings them to the San Francisco area to get them ready for prime time. One of the co-founders is Paul Graham, whose essays we’ve featured on this blog.

The Y Combinator web site itself is quite interesting and in particular, the section on how to apply to Y Combinator caught my eye. Now, I don’t know the first thing about starting a startup (nor do I have any current interest in doing so), but I do know a little bit about applying for NIH grants and it struck me that the advice for the startups seemed very useful for writing grants. It surprised me because I always thought that the process of “marketing” a startup to someone would be quite different from applying for a grant: startups are supposed to be cool and innovative and futuristic while grants are more about doing the usual thing. Just shows you how much I know about the startup world.

I thought I’d pluck out a few good parts from Graham’s long list of advice that I found useful. The full essay is definitely worth reading.

Here’s one that struck me immediately:

If we get 1000 applications and have 10 days to read them, we have to read about 100 a day. That means a YC partner who reads your application will on average have already read 50 that day and have 50 more to go. Yours has to stand out. So you have to be exceptionally clear and concise. Whatever you have to say, give it to us right in the first sentence, in the simplest possible terms.

In the past, I always thought that grant reviewers had all the time in the world to read my grant and probably dedicated a week of their life to reading it. Hah! Having served on study sections now, I realize there’s precious little time to dedicate to the tall pile of grants that need to be read. Grants that are well written are a pleasure to read. Ones that are poorly written (or take forever to get to the point) just make me angry.

It’s a mistake to use marketing-speak to make your idea sound more exciting. We’re immune to marketing-speak; to us it’s just noise. So don’t begin…with something like

We are going to transform the relationship between individuals and information.

That sounds impressive, but it conveys nothing. It could be a description of any technology company. Are you going to build a search engine? Database software? A router? I have no idea.

One test of whether you’re explaining your idea effectively is to ask how close the reader is to reproducing it. After reading that sentence I’m no closer than I was before, so its content is effectively zero.

I usually tell people if at any stage of writing a grant you have a choice between being more general and more specific, always be more specific. That way people can judge you based on the facts, not based on their imagination of the facts. This doesn’t always lead to success, of course, but it can remove an element of chance. If a reviewer has to fill in the details of your idea, who knows what they’ll think of?

One reason [company] founders resist giving matter-of-fact descriptions [of their company] is that they seem to constrain your potential. “But [my product] is so much more than a database with a wiki UI!” The problem is, the less constraining your description, the less you’re saying. So it’s better to err on the side of matter-of-factness.

Of course, there are some applications that specifically ask you to “think big” and there the rules may be a bit different. But still, I think it’s better to avoid broad and sweeping generalities. These days, given the relatively tight page limits, you need to convey the maximum amount of information possible.

One good trick for describing a project concisely is to explain it as a variant of something the audience already knows. It’s like Wikipedia, but within an organization. It’s like an answering service, but for email. It’s eBay for jobs. This form of description is wonderfully efficient. Don’t worry that it will make your idea seem “derivative.” Some of the best ideas in history began by sticking together two existing ideas no one realized could be combined.

Not sure this is so relevant to writing grants, but I thought it was interesting. My instinct was to think that this would make your idea seem derivative also, but maybe not.

…if we can see obstacles to your idea that you don’t seem to have considered, that’s a bad sign. This is your idea. You’ve had days, at least, to think about it, and we’ve only had a couple minutes. We shouldn’t be able to come up with objections you haven’t thought of.

Paradoxically, it is for this reason better to disclose all the flaws in your idea than to try to conceal them. If we think of a problem you don’t mention, we’ll assume it’s because you haven’t thought of it. 

This one is definitely true: better to reveal limitations/weaknesses than to look like you haven’t thought of them. Because if a reviewer finds one, then it’s all they’ll talk about. Oftentimes, a big problem is lack of space to fit this in, but if you can do it I think it’s always a good idea to include it.


You don’t have to sell us on you. We’ll sell ourselves, if we can just understand you. But every unnecessary word in your application subtracts from the effect of the necessary ones. So before submitting your application, print it out and take a red pen and cross out every word you don’t need. And in what’s left be as specific and as matter-of-fact as you can.

I think there are quite a few differences between scientists reviewing grants and startup investors and we probably shouldn’t take the parallels too seriously. In particular, investors I think are going to be more optimistic because, as Graham says, “they get equity”. Scientists are trained to be skeptical and so will be looking at applications with a slightly different eye.

However, I think the general advice to be specific and concise about what you’re doing is good. If anything, it may help you realize that you have no idea what you’re doing.