Tag: statistics


Podcast #6: Data Analysis MOOC Post-mortem

Jeff and I talk about Jeff's recently completed MOOC on Data Analysis.


Sunday data/statistics link roundup (1/20/2013)

  1. This might be short. I have a couple of classes starting on Monday. The first is our Johns Hopkins Advanced Methods class. This is one of my favorite classes to teach; our Ph.D. students are pretty awesome and they always amaze me with what they can do. The other is my Coursera debut in Data Analysis. We are at about 88,000 enrolled. Tell your friends, maybe we can make it an even 100k! In related news, some California schools are experimenting with offering credit for online courses. (via Sherri R.)
  2. Some interesting numbers on why there aren't as many "gunners" in the NBA - players who score a huge number of points.  I love the talk about hustling, rotating team defense. I have always enjoyed watching good defense more than good offense. It might not be the most popular thing to watch, but seeing the Spurs rotate perfectly to cover the open man is a thing of athletic beauty. My Aggies aren't too bad at it either...(via Rafa).
  3. A really interesting article suggesting that nonsense math can make arguments seem more convincing to non-technical audiences. This is tangentially related to a previous study which showed that more equations led to fewer citations in biology articles. Overall, my take-home message is that we don't necessarily need fewer equations; we need to elevate statistical/quantitative literacy to the importance of reading literacy. (via David S.)
  4. This has been posted elsewhere, but a reminder to send in your statistical stories for the 365 stories of statistics.
  5. Automatically generate a postmodernism essay. Hit refresh a few times. It's pretty hilarious. It reminds me a lot of this article about statisticians. Here is the technical paper describing how they simulate the essays. (via Rafa)

Should the Cox Proportional Hazards model get the Nobel Prize in Medicine?

I'm not the first one to suggest that Biostatistics has been undervalued in the scientific community, and some of the shortcomings of epidemiology and biostatistics have been noted elsewhere. But this previous work focuses primarily on the contributions of statistics/biostatistics at the purely scientific level.

The Cox Proportional Hazards model is one of the most widely used statistical models in the analysis of data from clinical trials and other medical studies. The corresponding paper has been cited over 32,000 times; this dramatically underestimates the number of times the model has been used. It is one of "those methods" that doesn't even require a reference to the original methods paper anymore.

Many of the most influential medical studies, including major studies like the Women's Health Initiative, have used these methods to answer some of our most pressing medical questions. Despite the incredible impact of this statistical technique on the world of medicine and public health, it has not received the Nobel Prize. This isn't an aberration; statistical methods are not traditionally considered for Nobel Prizes in Medicine. The prizes primarily recognize biochemical, genetic, or public health discoveries.

In contrast, many economics Nobel Prizes have been awarded primarily for the discovery of a new statistical or mathematical concept. One example is the ARCH model. The Nobel Prize in Economics in 2003 was awarded to Robert Engle, the person who proposed the original ARCH model. The model has gone on to have a major impact on financial analysis, much like the Cox model has had a major impact on medicine.

So why aren't Nobel Prizes in medicine awarded to statisticians more often? Other methods such as ANOVA, P-values, etc. have also had an incredibly large impact on the way we measure and evaluate medical procedures. Maybe as medicine becomes increasingly driven by data, we will start to see more statisticians recognized for their incredible discoveries and the huge contributions they make to medical research and practice.



Podcast #5: Coursera Debrief

Jeff and I talk with Brian Caffo about teaching MOOCs on Coursera.


How important is abstract thinking for graduate students in statistics?

A recent lunchtime discussion here at Hopkins brought up the somewhat-controversial topic of abstract thinking in our graduate program. We, like a lot of other biostatistics/statistics programs, require our students to take measure theoretic probability as part of the curriculum. The discussion started as a conversation about whether we should require measure theoretic probability for our students. It evolved into a discussion of the value of abstract thinking (and whether measure theoretic probability was a good tool to measure abstract thinking).

Brian Caffo and I decided an interesting idea would be a point-counterpoint with the prompt, “How important is abstract thinking for the education of statistics graduate students?” Next week Brian and I will provide a point-counterpoint response based on our discussion.

In the meantime we’d love to hear your opinions!


Statistics is not math...

Statistics depends on math, like a lot of other disciplines (physics, engineering, chemistry, computer science). But just like those other disciplines, statistics is not math; math is just a tool used to solve statistical problems. Unlike those other disciplines, statistics gets lumped in with math in headlines. Whenever people use statistical analysis to solve an interesting problem, the headline reads:

“Math can be used to solve amazing problem X”

or

“The Math of Y”

Here are some examples:

The Mathematics of Lego - Using data on legos to estimate a distribution

The Mathematics of War - Using data on conflicts to estimate a distribution

Usain Bolt can run faster with maths (Tweet) - Turns out they analyzed data on start times to come to the conclusion

The Mathematics of Beauty - Analysis of data relating dating profile responses and photo attractiveness

These are just a few off the top of my head, but I regularly see headlines like this. I think there are a few reasons for math being grouped with statistics: (1) many of the founders of statistics were mathematicians first (but not all of them), (2) many statisticians still identify themselves as mathematicians, and (3) in some cases statistics and statisticians define themselves pretty narrowly.

With respect to (3), consider the following list of disciplines:

  1. Biostatistics
  2. Data science
  3. Machine learning
  4. Natural language processing
  5. Signal processing
  6. Business analytics
  7. Econometrics
  8. Text mining
  9. Social science statistics
  10. Process control

All of these disciplines could easily be classified as “applied statistics”. But how many folks in each of those disciplines would classify themselves as statisticians? More importantly, how many would be claimed by statisticians? 


What is a major revision?

I posted a little while ago on a proposal for a fast statistics journal. It generated a bunch of comments and even a really nice follow up post with some great ideas. Since then I’ve gotten reviews back on a couple of papers and I think I realized one of the key issues that is driving me nuts about the current publishing model. It boils down to one simple question: 

What is a major revision? 

I often get reviews back that suggest “major revisions” in one or many of the following categories:

  1. More/different simulations
  2. New simulations
  3. Re-organization of content
  4. Re-writing language
  5. Asking for more references
  6. Asking me to include a new method
  7. Asking me to implement someone else’s method for comparison

I don’t consider any of these major revisions. Personally, I have stopped asking for them as major revisions. In my opinion, major revisions should be reserved for issues with the manuscript that suggest that it may be reporting incorrect results. Examples include:

  1. No simulations
  2. No real data
  3. The math/computations look incorrect
  4. The software didn’t work when I tried it
  5. The methods/algorithms are unreadable and can’t be followed

The first list is actually a list of minor/non-essential revisions in my opinion. They may improve my paper, but they won’t confirm whether it is correct. I find that they are often subjective and are up to the whims of referees. In my own personal refereeing I am making an effort to remove subjective major revisions and only include issues that are critical to evaluating the correctness of a manuscript. I also try to divorce the issue of whether an idea is interesting from whether it is correct.

I’d be curious to know how other people define major and minor revisions.


Why statisticians should join and launch startups

The tough economic times we live in, and the potential for big paydays, have made entrepreneurship cool. From the venture capitalist-in-chief to the JavaScript-coding mayor of New York, everyone is on board. No surprise there: successful startups lead to job creation, which can have a major positive impact on the economy.

The game has been dominated for a long time by the folks over in CS. But the value of many recent startups is either based on, or can be magnified by, good data analysis. Here are a few startups that are based on data/data analysis: 

  1. The Climate Corporation - analyzes climate data to sell farmers weather insurance.
  2. Flightcaster - uses public data to predict flight delays.
  3. Quid - uses data on startups to predict success, among other things.
  4. 100plus - personalized health prediction startup, predicting health based on public data
  5. Hipmunk - The main advantage of this site for travel is better data visualization and an algorithm to show you which flights have the worst “agony”.

To launch a startup you need just a couple of things: (1) a good, valuable source of data (there are lots of these on the web) and (2) a good idea about how to analyze them to create something useful. The second step is obviously harder than the first, but the companies above prove you can do it. Then, once it is built, you can outsource/partner with developers - web and otherwise - to implement your idea. If you can build it in R, someone can make it an app. 

These are just a few of the startups whose value is entirely derived from data analysis. But companies from LinkedIn to Bitly to Amazon to Walmart are trying to mine the data they are generating to increase value. Data is now being generated at unprecedented scale by computers, cell phones, even thermostats! With this onslaught of data, the need for people with analysis skills is becoming incredibly acute.

Statisticians, like computer scientists before them, are poised to launch, and make major contributions to, the next generation of startups. 


Help us rate health news reporting with citizen-science powered http://www.healthnewsrater.com

We here at Simply Statistics are big fans of science news reporting. We read newspapers, blogs, and the news sections of scientific journals to keep up with the coolest new research. 

But health science reporting, although exciting, can also be incredibly frustrating to read. Many articles have sensational titles, like “How using Facebook could raise your risk of cancer”. The articles go on to describe some research and interview a few scientists, then typically make fairly large claims about what the research means. This isn’t surprising - eye catching headlines are important in this era of short attention spans and information overload. 

If just a few extra pieces of information were reported in news stories about science, it would be much easier to evaluate whether the cancer risk was serious enough to shut down our Facebook accounts. In particular we thought any news story should report:

  1. A link back to the original research article where the study (or studies) being described was published. Not just a link to another news story. 
  2. A description of the study design (was it a randomized clinical trial? a cohort study? 3 mice in a lab experiment?)
  3. Who funded the study - if a study involving cancer risk was sponsored by a tobacco company, that might say something about the results.
  4. Potential financial incentives of the authors - if the study is reporting a new drug and the authors work for a drug company, that might say something about the study too. 
  5. The sample size - many health studies are based on a very small sample size, only 10 or 20 people in a lab. Results from these studies are much weaker than results obtained from a large study of thousands of people. 
  6. The organism - Many health science news reports are based on studies performed in lab animals and may not translate to human health. For example, here is a report with the headline “Alzheimers may be transmissible, study suggests”. But if you read the story, scientists injected Alzheimer’s afflicted brain tissue from humans into mice. 

So we created a citizen-science website for evaluating health news reporting called HealthNewsRater. It was built by Andrew Jaffe and Jeff Leek, with Andrew doing the bulk of the heavy lifting.  We would like you to help us collect data on the quality of health news reporting. When you read a health news story on the Nature website, at nytimes.com, or on a blog, we’d like you to take a second to report on the news. Just determine whether the 6 pieces of information above are reported and input the data at HealthNewsRater.

We calculate a score for each story based on the formula:

HNR-Score = (5 points for a link to the original article + 1 point each for the other criteria)/2

The score weights the link to the original article very heavily, since this is the best source of information about the actual science underlying the story. 
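As a rough sketch of how the formula plays out, assuming each of the six criteria above is recorded as simply reported or not reported, the score could be computed like this. The function name and criterion labels are made up for illustration; this is not the code running behind HealthNewsRater.

```python
# Hypothetical labels for the six criteria listed above; the first one
# (the link to the original article) is the heavily weighted one.
CRITERIA = (
    "link_to_original",
    "study_design",
    "funding_source",
    "financial_incentives",
    "sample_size",
    "organism",
)


def hnr_score(reported):
    """Return the HNR-Score for a story.

    `reported` is a collection of criterion labels the story satisfies.
    Score = (5 points for the link + 1 point per other criterion) / 2.
    """
    reported = set(reported)
    unknown = reported - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")

    # 5 points if the story links back to the original research article.
    points = 5 if "link_to_original" in reported else 0
    # 1 point for each of the remaining five criteria that is reported.
    points += sum(1 for name in CRITERIA[1:] if name in reported)
    return points / 2


if __name__ == "__main__":
    print(hnr_score({"link_to_original"}))   # link only
    print(hnr_score(set(CRITERIA[1:])))      # everything except the link
    print(hnr_score(CRITERIA))               # all six criteria
```

Under this reading the score runs from 0 to 5, and a story that only links to the original article scores the same as a story that reports everything except the link.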

In a future post we will analyze the data we have collected, make it publicly available, and let you know which news sources are doing the best job of reporting health science. 

Update: If you are a web developer with an interest in health news, contact us to help make HealthNewsRater better!


Sunday Data/Statistics Link Roundup

A few data/statistics related links of interest:

  1. Eric Lander Profile
  2. The math of lego (should be “The statistics of lego”)
  3. Where people are looking for homes.
  4. Hans Rosling’s TED Talk on the developing world (an oldie but a goodie)
  5. Elsevier is trying to make open-access illegal (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more here