13
Sep

## An experimental foundation for statistics

I recently had a conversation with Brian (of abstraction fame) about the relationship between mathematics and statistics. Statistics, for historical reasons, has been treated as a mathematical sub-discipline (this is the NSF’s view).

One reason statistics is viewed as a sub-discipline of math is because the foundations of statistics are built on the basis of deductive reasoning, where you start with a few general propositions or foundations that you assume to be true and then systematically prove more specific results. A similar approach is taken in most mathematical disciplines.

In contrast, scientific disciplines like biology are largely built on the basis of inductive reasoning and the scientific method. Specific individual discoveries are described and used as a framework for building up more general theories and principles.

So the question Brian and I had was: what if you started over and built statistics from the ground up on the basis of inductive reasoning and experimentation? Instead of making mathematical assumptions and then proving statistical results, you would use experiments to identify core principles. This actually isn’t without precedent in the statistics community. Bill Cleveland and Robert McGill studied how people perceive graphical information and produced some general recommendations about the use of area/linear contrasts, common axes, etc. There has also been a lot of experimental work on how humans understand uncertainty.

So what if we put statistics on an experimental, rather than a mathematical, foundation? We could perform experiments to see what kinds of regression models people are able to interpret most clearly, what the best ways to evaluate confounding/outliers are, or what measure of statistical significance people understand best. Basically, what if the “quality” of a statistical method rested not on the mathematics behind the method, but on experimental results demonstrating how people used the method? So, instead of justifying lowess mathematically, we would justify it on the basis of its practical usefulness through specific, controlled experiments. Some of this is already happening when people do surveys of the most successful methods in Kaggle contests or with the MAQC.

I wonder what methods would survive the change in paradigm?

14
Mar

## A proposal for a really fast statistics journal

I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by Paul Krugman and others) and the seriously misguided Research Works Act, which aimed to make it illegal to deposit published papers funded by the government in PubMed Central or other open access databases.

I also realized that I spend a huge amount of time/effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work - just adding citations to referees’ papers or doing additional simulations, and (3) resubmitting rejected papers to new journals - this is a huge time suck since I have to reformat, etc. Furthermore, if I want my papers to be published open-access, I have to pay at minimum \$1,000 per paper.

So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. I have found that my most interesting ideas are the hardest ones to get published.  This journal would:

• Be open-access and free to publish your papers there. You own the copyright on your work.
• The criteria for publication would be: (1) it has to do with statistics, computation, or data analysis, and (2) the work is technically correct.
• We would accept manuals, reports of new statistical software, and full length research articles.
• There would be no page limits/figure limits.
• The journal would be published exclusively online.
• We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied.
• Papers would receive a star rating from the editor - 0-5 stars. There would also be a place for readers to review articles.
• All articles would be published with a tweet/like button so they can be easily distributed.

To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
• Review of: Jeff’s Paper
• Technically Correct: Yes
• About statistics/computation/data analysis: Yes
• Number of Stars: 3 stars

• 3 Strengths of Paper (1 required):
• This paper revolutionizes statistics

• 3 Weaknesses of Paper (1 required):
• The proof that this paper revolutionizes statistics is pretty weak because he only includes one example.
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published.
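As a rough sketch of that form and the acceptance rule, here is one way it could be written down in Python. The `Review` class and its field names are hypothetical, made up for illustration; only the form fields and the "both answers yes" rule come from the proposal above.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    """Hypothetical container for the AE review form sketched above."""
    paper: str
    technically_correct: bool
    on_topic: bool  # about statistics/computation/data analysis
    stars: int      # editor rating, 0-5
    strengths: list = field(default_factory=list)
    weaknesses: list = field(default_factory=list)

    def is_complete(self):
        # The form asks for up to 3 strengths and weaknesses, 1 required each.
        return (0 <= self.stars <= 5
                and 1 <= len(self.strengths) <= 3
                and 1 <= len(self.weaknesses) <= 3)

    def accept(self):
        # Publication depends only on the first two answers, not on the stars.
        return self.technically_correct and self.on_topic


review = Review(
    paper="Jeff's Paper",
    technically_correct=True,
    on_topic=True,
    stars=3,
    strengths=["This paper revolutionizes statistics"],
    weaknesses=["Only one example is included"],
)
print(review.is_complete(), review.accept())
```

Note that the star rating never enters `accept()`: a 0-star paper that is correct and on topic would still be published.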
So now here are my questions:
1. Would you ever consider submitting a paper to such a journal?
2. Would you be willing to be one of the AEs for such a journal?
3. Is there anything you would change?
29
Oct

## The 5 Most Critical Statistical Concepts

It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than it’s ever been, with exciting work in a range of areas.

With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from very mathematical to very applied. An obvious question is: what are the most critical skills needed by statisticians?

So just for fun, I made up my list of the top 5 most critical skills for a statistician by my own definition. They are by necessity very general (I only gave myself 5).

1. The ability to manipulate/organize/work with data on computers - whether it is with Excel, R, SAS, or Stata, to be a statistician you have to be able to work with data.
2. A knowledge of exploratory data analysis - how to make plots, how to discover patterns with visualizations, how to explore assumptions.
3. Scientific/contextual knowledge - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians.
4. Skills to distinguish true from false patterns - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means.
5. The ability to communicate results to people without math skills - a key component of being a statistician is knowing how to explain math/plots/analyses.

What are your top 5? What order would you rank them in? Even though these are so general, I almost threw regression in there because of how often it pops up in various forms.

Related Posts: Rafa on graduate education and What is a Statistician? Roger on “Do we really need applied statistics journals?”

28
Sep

## The future of graduate education

Stanford is offering a free online course and more than 100,000 students have registered. This got the blogosphere talking about the future of universities. Matt Yglesias thinks that “colleges are the next newspaper and are destined for some very uncomfortable adjustments”. Tyler Cowen reminded us that since 2003 he has been saying that professors are becoming obsolete. His main point is that thanks to the internet, the need for lecturers will greatly diminish. He goes on to predict that

the market was moving towards superstar teachers, who teach hundreds at a time or even thousands online. Today, we have the Khan Academy, a huge increase in online education, electronic textbooks and peer grading systems and highly successful superstar teachers with Michael Sandel and his popular course Justice, serving as example number one.

I think this is particularly true for stat and biostat graduate programs, especially in hard money environments.

A typical Statistics department will admit five to ten PhD students. In most departments we teach probability theory, statistical theory, and applied statistics. Highly paid professors teach these three courses for these five to ten students, which means that the university ends up spending hundreds of thousands of dollars on them. Where does this money come from? From those that teach hundreds at a time. The stat 101 courses are full of tuition-paying students. These students are subsidizing the teaching of our graduate courses. In hard money institutions, they are also subsidizing some of the research conducted by the professors that teach the small graduate courses. Note that 75% of their salaries are covered by the University, yet they are expected to spend much less than 75% of their time preparing and teaching these relatively tiny classes. The leftover time they spend on research for which they have no external funding. This isn’t a bad thing, as a lot of good theoretical and basic knowledge has been created this way. However, outside pressure to lower tuition costs has University administrators looking for ways to save, and graduate education might be a target. “If you want to teach a class, fill it up with 50 students. If you want to do research, get a grant,” the administrator might say.

Note that, for example, the stat theory class is pretty much the same every year and across universities. So we can pick a couple of superstar stat theory teachers and have them lead an online course for all the stat and biostat graduate students in the world. Then each department hires an energetic instructor, paying him/her 1/4 what they pay a tenured professor, to sit in a room discussing the online lectures with the five to ten PhD students in the program. Currently there are no incentives for the tenured professor to teach well, but the instructor would be rewarded solely on their teaching performance. Not only does this scheme cut costs, but it can also increase revenue, as faculty will have more time to write grant proposals, etc.

So, with teaching out of the equation, why even have departments? Well, for now the internet can’t substitute for the one-on-one interactions needed during PhD thesis supervision. As long as NIH and NSF are around, research faculty will be around. The apprenticeship system that has worked for centuries will survive the uncomfortable adjustments that are coming. Special topic seminars will also survive, as faculty will use them as part of their research agenda. Rotations, similar to those implemented in Biology programs, can serve as matchmakers between professors and students. But classroom teaching is due for some “uncomfortable adjustments”.

I agree with Tyler Cowen and Matt Yglesias: the number of cushy professor jobs per department will drop dramatically in the future, especially in hard money institutions. So let’s get ready. Maybe Biostat departments should start planning for the future now. Harvard, Seattle, Michigan, Emory, etc., want to teach stat theory with us?

PS -  I suspect this all applies to liberal arts and hard science graduate programs.

28
Sep

## The p>0.05 journal

I want to start a journal called “P>0.05”. This journal will publish all the negative results in science. These would also be stored in a database. Think of all the great things we could do with this. We could, for example, plot p-value histograms for different disciplines. I bet most would have a flat distribution. We could also do it by specific association. A paper comes out saying chocolate is linked to weaker bones? Check the histogram and keep eating chocolate. Any publishers interested?
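To make the "flat distribution" idea concrete, here is a minimal simulation sketch in Python. It is not real journal data: it assumes every study tests an effect that is truly zero, with normally distributed data and known variance (a z-test), and all function names are made up for illustration. Under those assumptions the p-values are uniform on [0, 1], so the histogram should come out roughly flat.

```python
import math
import random


def two_sided_p_value(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = 0, assuming known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_p_values(n_studies=10000, n_per_study=30, seed=0):
    """Simulate studies where the true effect is exactly zero."""
    rng = random.Random(seed)
    return [
        two_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(n_per_study)])
        for _ in range(n_studies)
    ]


def histogram(p_values, n_bins=10):
    """Count p-values in equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


p_values = simulate_null_p_values()
counts = histogram(p_values)
for i, count in enumerate(counts):
    # Crude text histogram: one '#' per 50 simulated studies.
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * (count // 50)}")
```

A spike near zero in a histogram built from a real database would suggest that some of the tested associations are real; a flat one would suggest there is little there.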

26
Sep

## "Unoriginal genius"

“The world is full of texts, more or less interesting; I do not wish to add any more”

This quote is from an article in the Chronicle Review. I highly recommend reading the article; in particular, check out the section on the author’s “Uncreative writing” class at UPenn. The article is about how there is a trend in literature toward combining/using other people’s words to create new content.

The prominent literary critic Marjorie Perloff has recently begun using the term “unoriginal genius” to describe this tendency emerging in literature. Her idea is that, because of changes brought on by technology and the Internet, our notion of the genius—a romantic, isolated figure—is outdated. An updated notion of genius would have to center around one’s mastery of information and its dissemination. Perloff has coined another term, “moving information,” to signify both the act of pushing language around as well as the act of being emotionally moved by that process. She posits that today’s writer resembles more a programmer than a tortured genius, brilliantly conceptualizing, constructing, executing, and maintaining a writing machine.

It is fascinating to see this happening in the world of literature; a similar trend seems to be happening in statistics. A ton of exciting and interesting work is done by people combining known ideas and tools and applying them to new problems. I wonder if we need a new definition of “creative”?

26
Sep

## 25 minute seminars

Most Statistics and Biostatistics departments have weekly seminars. We usually invite outside speakers to share their knowledge via a 50 minute PowerPoint (or Beamer) presentation. This gives us the opportunity to meet colleagues from other Universities and pick their brains in small group meetings. This is all great. But giving a good one hour seminar is hard. Really hard. Few people can pull it off. I propose to the statistical community that we cut the seminars to 25 minutes with 35 minutes for questions and further discussion. We can make exceptions of course. But in general, I think we would all benefit from shorter seminars.

23
Sep

## Getting email responses from busy people

I’ve had the good fortune of working with some really smart and successful people during my career. As a young person, one problem with working with really successful people is that they get a ton of email. Some only see the subject lines on their phone before deleting them.

I’ve picked up a few tricks for getting email responses from important/successful people:

The SI Rules

1. Try to send no more than one email a day.
2. Emails should be 3 sentences or less. Better if you can get the whole email in the subject line.
3. If you need information, ask yes or no questions whenever possible. Never ask a question that requires a full sentence response.
4. When something is time sensitive, state the action you will take if you don’t get a response by a time you specify.
5. Be as specific as you can while conforming to the length requirements.
6. Bonus: include obvious keywords people can use to search for your email.

Anecdotally, SI emails have a 10-fold higher response probability. The rules are designed around the fact that busy people who get lots of email love checking things off their list. SI emails are easy to check off! That will make them happy and get you a response.

It takes more work on your end when writing an SI email. You often need to think more carefully about what to ask, how to phrase it succinctly, and how to minimize the number of emails you write. A surprising side effect of applying SI principles is that I often figure out answers to my questions on my own. I have to decide which questions to include in my SI emails and they have to be yes/no answers, so I end up taking care of simple questions on my own.

Here are examples of SI emails just to get you started:

Example 1

Subject: Is my response to reviewer 2 ok with you?

Body: I’ve attached the paper/responses to referees.

Example 2

Subject: Can you send my letter of recommendation to john.doe@someplace.com?

Body:

Keywords = recommendation, Jeff, John Doe.

Example 3

Subject: I revised the draft to include your suggestions about simulations and language

Revisions attached. Let me know if you have any problems, otherwise I’ll submit Monday at 2pm.

23
Sep

## Dongle communism

If you have a Mac and give talks or teach, chances are you have embarrassed yourself by forgetting your dongle. Our lab meetings and classes were constantly delayed due to missing dongles. Communism solved this problem. We bought 10 dongles, sprinkled them around the department, and declared all dongles public property. All dongles, not just the 10. No longer do we have to ask to borrow dongles because they have no owner. Please join the revolution. PS - I think this should apply to pens too!

22
Sep

## The Killer App for Peer Review

A little while ago, over at Genomes Unzipped, Joe Pickrell asked, “Why publish science in peer reviewed journals?” He points out the flaws with the current peer review system and suggests how we can do better. What he suggests is missing is the killer app for peer review.

Well, PLoS has now developed an API, where you can easily access tons of data on the papers published in those journals, including downloads, citations, number of social bookmarks, and mentions in major science blogs. Along with Mendeley, a free reference manager, they have launched a competition to build cool apps with their free data.

Seems like with the right statistical analysis/cool features, a recommender system for, say, PLoS One could have most of the features suggested by Joe in his article. One idea would be an RSS feed based on an idea like the Pandora music service. You input a couple of papers you like from the journal, then it creates an RSS feed with papers similar to those papers.
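Here is a minimal sketch of the "papers similar to the ones you like" step, in Python. It does not call the PLoS API or build an actual feed; the paper IDs and abstracts are invented, and the similarity measure (cosine similarity on raw word counts) is just one simple choice assumed for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_ids, abstracts, top_k=2):
    """Rank papers the reader has not liked by similarity to the liked ones."""
    profile = Counter()
    for paper_id in liked_ids:
        profile.update(term_counts(abstracts[paper_id]))
    scored = [
        (cosine_similarity(profile, term_counts(text)), paper_id)
        for paper_id, text in abstracts.items()
        if paper_id not in liked_ids
    ]
    # Highest similarity first; ties broken by paper ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [paper_id for _, paper_id in scored[:top_k]]


# Made-up abstracts standing in for data that would come from the journal.
abstracts = {
    "paper_a": "gene expression microarray normalization batch effects",
    "paper_b": "batch effects in gene expression studies",
    "paper_c": "bird migration patterns across coastal wetlands",
    "paper_d": "microarray normalization for gene expression batch",
}
print(recommend(["paper_a"], abstracts, top_k=2))
```

A real version would presumably also weight by the usage data mentioned above (downloads, citations, bookmarks) rather than by text alone.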