Simply Statistics


Fundamentals of Engineering Review Question Oops

The Fundamentals of Engineering Exam is the first licensing exam for engineers. You have to pass it on your way to becoming a professional engineer (PE). I was recently shown a problem from a review manual: 

When it is operating properly, a chemical plant has a daily production rate that is normally distributed with a mean of 880 tons/day and a standard deviation of 21 tons/day. During an analysis period, the output is measured with random sampling on 50 consecutive days, and the mean output is found to be 871 tons/day. With a 95 percent confidence level, determine if the plant is operating properly. 

  1. There is at least a 5 percent probability that the plant is operating properly. 
  2. There is at least a 95 percent probability that the plant is operating properly. 
  3. There is at least a 5 percent probability that the plant is not operating properly. 
  4. There is at least a 95 percent probability that the plant is not operating properly. 

Whoops…seems to be a problem there. I’m glad that engineers are expected to know some statistics; hopefully the engineering students taking the exam can spot the problem…but then how do they answer? 
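
For reference, here is a minimal sketch of the calculation the review manual presumably wants: a z-test with a known standard deviation. The two-sided test and the variable names are assumptions made here, not something stated in the manual.

```python
from math import sqrt
from statistics import NormalDist

# Numbers taken from the review question.
mu0 = 880.0    # mean output when operating properly (tons/day)
sigma = 21.0   # known standard deviation (tons/day)
n = 50         # number of days sampled
xbar = 871.0   # observed mean output (tons/day)

# Standard error of the sample mean and the z statistic.
se = sigma / sqrt(n)
z = (xbar - mu0) / se

# Two-sided p-value (assumption: deviations in either direction count).
p_value = 2 * NormalDist().cdf(-abs(z))

print(f"standard error = {se:.2f}")
print(f"z = {z:.2f}")
print(f"two-sided p-value = {p_value:.4f}")
```

The z statistic comes out at roughly -3, so the test rejects at the 5 percent level. But the p-value is the probability of a sample mean this far from 880 if the plant is operating properly. It is not the probability that the plant is operating properly, which is what all four answer choices describe.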


figshare and don't trust celebrities stating facts

A couple of links:

  1. figshare is a site where scientists can share data sets/figures/code. One of the goals is to encourage researchers to share negative results as well. I think this is a great idea - I often find negative results and this could be a place to put them. It also uses a tagging system, like Flickr. I think this is a great idea for scientific research discovery. They give you unlimited public space and 1GB of private space. This could be big, a place to help make reproducible research efforts user-friendly. Via TechCrunch
  2. Don’t trust celebrities stating facts because they usually don’t know what they are talking about. I completely agree with this. Particularly because I have serious doubts about the statisteracy of most celebrities. Nod to Alex for the link (our most active link finder!).  


A tribute to one of the most popular methods in statistics.


Sunday Data/Statistics Link Roundup

  1. Statistics help for journalists (don’t forget to keep rating stories!) This is the kind of thing that could grow into a statisteracy page. The author also has a really nice plug for public schools.
  2. An interactive graphic to determine if you are in the 1% from the New York Times (I’m not…).
  3. Mike Bostock’s d3.js presentation. This is some really impressive visualization software. You have to change the slide numbers manually but it is totally worth it. Check out slide 10 and slide 14. This is the future of data visualization. Here is a beginner’s tutorial to d3.js by Mike Dewar.
  4. An online diagnosis prediction start-up (Symcat) based on data analysis from two Hopkins Med students.

Finally, a bit of a bleg. I’m going to try to make this link roundup a regular post. If you have ideas for links I should include, tweet us @simplystats or send them to Jeff’s email. 


In the era of data what is a fact?

The Twitter universe is abuzz about this article in the New York Times. Arthur Brisbane, who responds to readers’ comments, asks:

I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.

He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims. As many people pointed out in the comments, this is completely absurd. Of course reporters should check facts and report when the facts in their stories, or stated by candidates, are not correct. That is the purpose of news reporting. 

But I think the question is a little more subtle when it comes to quantitative facts and statistics. Depending on what subsets of data you look at, what summary statistics you pick, and the way you present information - you can say a lot of different things with the same data. As long as you report what you calculated, you are technically reporting a fact - but it may be deceptive. The classic example is calculating median vs. mean home prices. If Bill Gates is in your neighborhood, no matter what the other houses cost, the mean price is going to be pretty high! 
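
A quick sketch of the Bill Gates effect, using made-up home prices (the numbers below are purely hypothetical and chosen only for illustration):

```python
from statistics import mean, median

# Hypothetical neighborhood: four ordinary houses and one mansion.
prices = [200_000, 250_000, 300_000, 350_000, 100_000_000]

mean_price = mean(prices)      # pulled up sharply by the single outlier
median_price = median(prices)  # the middle house, unaffected by the outlier

print(f"mean price:   ${mean_price:,.0f}")
print(f"median price: ${median_price:,.0f}")
```

Both numbers are "facts" about the same five houses, but the mean is in the tens of millions while the median is the price of an ordinary house.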

Two concrete things can be done to deal with the malleability of facts in the data age.

First, we need to require that our reporters, policy makers, politicians, and decision makers report the context of numbers they state. It is tempting to use statistics as blunt instruments, punctuating claims. Instead, we should demand that people using statistics to make a point embed them in the broader context. For example, in the case of housing prices, if a politician reports the mean home price in a neighborhood, they should be required to state that potential outliers may be driving that number up. How do we make this demand? By not believing any isolated statistics - statistics will only be believed when the source is quoted and the statistic is described.  

But this isn’t enough, since the context and statistics will be meaningless without raising overall statisteracy (statistical literacy, not to be confused with numeracy). In the U.S., literacy campaigns have been promoted by library systems. Statisteracy is becoming just as critical; the same level of social pressure and assistance should be applied to individuals who don’t know basic statistics as to those who don’t have basic reading skills. Statistical organizations, academic departments, and companies interested in analytics/data science/statistics all have a vested interest in raising the population’s statisteracy. Maybe a website dedicated to understanding the consequences of basic statistical concepts, rather than the concepts themselves?

And don’t forget to keep rating health news stories!


Academics are partly to blame for supporting the closed and expensive access system of publishing

Michael Eisen recently published a New York Times op-ed arguing that a bill meant to protect publishers, introduced in the House of Representatives, will result in taxpayers paying twice for scientific research. According to Eisen:

If the bill passes, to read the results of federally funded research, most Americans would have to buy access to individual articles at a cost of $15 or $30 apiece. In other words, taxpayers who already paid for the research would have to pay again to read the results.

We agree and encourage our readers to write Congress opposing the “Research Works Act”. However, whereas many are vilifying the publishers that are lobbying for this act, I think we academics are the main culprits keeping open access from succeeding.

If this bill makes it into law, I do not think that the main issue will be US taxpayers paying twice for research, but rather that access for the general scientific community will be even more restricted. Interested parties outside the US, and in developing countries in particular, should have unrestricted access to scientific knowledge. Congresswoman Carolyn Maloney gets it wrong by not realizing that giving China (and other countries) access to scientific knowledge is beneficial to science in general and consequently to everyone. However, to maintain the high quality of research publications we currently enjoy, someone needs to pay for competent editors, copy editors, support staff, and computer servers. Open access journals shift the costs from the readers to authors that have plenty of funds (grants, startups, etc.) to cover the charges. By charging the authors, papers can be made available online for free. Free to everyone. Open access. PLoS has demonstrated that the open access model is viable, but a paper in PLoS Biology will run you $2,900 (see Jeff’s table). Several non-profit societies and for-profit publishers, such as Nature Publishing Group, offer open access for about the same price.

So given all the open access options, why do gated journals survive? I think the main reason is that we, the scientific community, through appointments and promotions committees, study sections, award committees, etc., use journal prestige to evaluate publication records, disregarding open access as a criterion (see Eisen’s related post on decoupling publication and assessment). Therefore, those who decide to publish only in open access journals may hinder not only their own careers, but also the careers of their students and postdocs. The other reason is that for authors, publishing gated papers is typically cheaper than publishing open access papers, and we don’t always make the more honorable decision.

Another important consideration is that a substantial proportion of publication costs comes from printing paper copies. My department continues to buy print copies of several stat journals as well as some of the general science magazines. The Hopkins library, on behalf of the faculty, buys print versions of hundreds of journals. As long as we continue to create a market for paper copies, the journals will continue to allocate resources to producing them. Somebody has to pay for this, yet with online versions already being produced the print versions are superfluous.

Apart from opposing the Research Works Act as Eisen proposes, there are two more things I intend to do in 2012: 1) lobby my department to stop buying print versions and 2) lobby my study section to give special consideration to open access publications when evaluating a biosketch or a progress report.


Help us rate health news reporting with citizen-science powered

We here at Simply Statistics are big fans of science news reporting. We read newspapers, blogs, and the news sections of scientific journals to keep up with the coolest new research. 

But health science reporting, although exciting, can also be incredibly frustrating to read. Many articles have sensational titles, like “How using Facebook could raise your risk of cancer”. The articles go on to describe some research and interview a few scientists, then typically make fairly large claims about what the research means. This isn’t surprising: eye-catching headlines are important in this era of short attention spans and information overload.

If just a few extra pieces of information were reported in news stories about science, it would be much easier to evaluate whether the cancer risk was serious enough to shut down our Facebook accounts. In particular, we thought any news story should report:

  1. A link back to the original research article where the study (or studies) being described was published. Not just a link to another news story. 
  2. A description of the study design (was it a randomized clinical trial? a cohort study? 3 mice in a lab experiment?)
  3. Who funded the study - if a study involving cancer risk was sponsored by a tobacco company, that might say something about the results.
  4. Potential financial incentives of the authors - if the study is reporting a new drug and the authors work for a drug company, that might say something about the study too. 
  5. The sample size - many health studies are based on a very small sample size, only 10 or 20 people in a lab. Results from these studies are much weaker than results obtained from a large study of thousands of people. 
  6. The organism - Many health science news reports are based on studies performed in lab animals and may not translate to human health. For example, here is a report with the headline “Alzheimers may be transmissible, study suggests”. But if you read the story, scientists injected Alzheimer’s afflicted brain tissue from humans into mice. 

So we created a citizen-science website for evaluating health news reporting called HealthNewsRater. It was built by Andrew Jaffe and Jeff Leek, with Andrew doing the bulk of the heavy lifting.  We would like you to help us collect data on the quality of health news reporting. When you read a health news story on the Nature website, at, or on a blog, we’d like you to take a second to report on the news. Just determine whether the 6 pieces of information above are reported and input the data at HealthNewsRater.

We calculate a score for each story based on the formula:

HNR-Score = (5 points for a link to the original article + 1 point each for the other criteria)/2

The score weights the link to the original article very heavily, since this is the best source of information about the actual science underlying the story. 
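
The formula above can be sketched in a few lines. This is only an illustration of the arithmetic; the function name and arguments are hypothetical and not taken from the HealthNewsRater code.

```python
def hnr_score(has_original_link: bool, other_criteria_met: int) -> float:
    """Sketch of the HNR-Score formula.

    has_original_link: whether the story links to the original article.
    other_criteria_met: how many of the other five criteria are reported
        (study design, funding, financial incentives, sample size, organism).
    """
    if not 0 <= other_criteria_met <= 5:
        raise ValueError("other_criteria_met must be between 0 and 5")
    link_points = 5 if has_original_link else 0
    return (link_points + other_criteria_met) / 2


# A story that reports everything gets the maximum score.
print(hnr_score(True, 5))
# A story that reports everything except the link loses half of it.
print(hnr_score(False, 5))
```

Under this reading of the formula, scores range from 0 to 5, and the link alone is worth as much as the other five criteria combined.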

In a future post we will analyze the data we have collected, make it publicly available, and let you know which news sources are doing the best job of reporting health science. 

Update: If you are a web-developer with an interest in health news contact us to help make HealthNewsRater better! 


Statistical Crime Fighter

Dick Berk is using his statistical superpowers to fight crime. Seriously. Here is my favorite paragraph.

Drawing from criminal databases dating to the 1960s, Berk initially modeled the Philadelphia algorithm on more than 100,000 old cases, relying on three dozen predictors, including the perpetrator’s age, gender, neighborhood, and number of prior crimes. To develop an algorithm that forecasts a particular outcome—someone committing murder, for example—Berk applied a subset of the data to “train” the computer on which qualities are associated with that outcome. “If I could use sun spots or shoe size or the size of the wristband on their wrist, I would,” Berk said. “If I give the algorithm enough predictors to get it started, it finds things that you wouldn’t anticipate.” Philadelphia’s parole officers were surprised to learn, for example, that the crime for which an offender was sentenced—whether it was murder or simple drug possession—does not predict whether he or she will commit a violent crime in the future. Far more predictive is the age at which he (yes, gender matters) committed his first crime, and the amount of time between other offenses and the latest one—the earlier the first crime and the more recent the last, the greater the chance for another offense.

Hat tip to Alex Nones.


Do you own or rent?

When it comes to computing, history has gone back and forth between what I would call the “owner model” and the “renter model”. The question is what’s the best approach and how do you determine that?

Back in the day when people like John von Neumann were busy inventing the computer to work out H-bomb calculations, there was more or less a renter model in place. Computers were obviously quite expensive and so not everyone could have one. If you wanted to do your calculation, you’d walk down to the computer room, give them your punch cards with your program written out, and they’d run it for you. Sometime later you’d get some print out with the results of your program. 

A little later, with time-sharing types of machines, you could have dumb terminals log in to a central server and run your calculations that way. I guess that saved you the walk to the computer room (and all the punch cards). I still remember some of these green-screen dumb terminals from my grad school days (yes, UCLA still had these monstrosities in 1999).

With personal computers in the 80s, you could own your own computer, so there was no need to depend on some central computer (and a connection to it) to do the work for you. As computing components got cheaper, these personal computers got more and more powerful and rivaled the servers of yore. It was difficult for me to imagine ever needing things like mainframes again except for some esoteric applications. Especially, with the development of Linux, you could have all the power of a Unix mainframe on your desk or lap (or now your palm). 

But here we are, with Jeff buying a Chromebook. Have we just taken a step back in time? Are cloud computing and the renter model the way to go? I have to say that I was a big fan of “cloud computing” back in the day. But once Linux came around, I really didn’t think there was a need for the thin client/fat server model.

But it seems we are going back that way, and the reason seems to be mobile devices. Mobile devices are now just small computers, so many people own at least two computers (a “real” computer and a phone). With multiple computers, it’s a pain to have to synchronize both the data and the applications on them. If they’re made by different manufacturers then you can’t even have the same operating system/applications on the devices. Also, no one cares about the operating system anymore, so why should it have to be managed? The cloud helps solve some of these problems, as does owning devices from the same company (as I do, Apple fanboy that I am).

I think the all-renter model of the Chromebook is attractive, but I don’t think it’s ready for prime time just yet. Two reasons I can think of are (1) Microsoft Office and (2) slow network connections. If you want to make Jeff very unhappy, you can either (1) send him a Word document that needs to be edited in Track Changes; or (2) invite him to an international conference on some remote island. The need for a strong network connection is problematic because I’ve yet to encounter a hotel that had a fast enough connection for me to work remotely on our computing cluster. For that reason I’m sticking with my current laptop.