Tag: caffo


Sunday Data/Statistics Link Roundup (11/4/12)

  1. Brian Caffo headlines the WaPo article about massive open online courses. He is the driving force behind our department’s involvement in offering these massive courses. I think this sums it up: “I can’t use another word than unbelievable,” Caffo said. Then he found some more: “Crazy . . . surreal . . . heartwarming.”
  2. A really interesting discussion of why “A Bet is a Tax on B.S.”. It nicely describes why intelligent bettors must be disinterested in the outcome; otherwise they will end up losing money. The Nate Silver controversy just doesn’t seem to be going away; good news for his readership numbers, I bet. (via Rafa)
  3. An interesting article on how scientists are not claiming global warming is the sole cause of the extreme weather events we are seeing, but that it does contribute to them being more extreme. The key quote: “We can’t say that steroids caused any one home run by Barry Bonds, but steroids sure helped him hit more and hit them farther. Now we have weather on steroids.” —Eric Pooley. (via Roger)
  4. The NIGMS is looking for a Biomedical Technology, Bioinformatics, and Computational Biology Director. I hope that it is someone who understands statistics! (via Karl B.)
  5. Here is another article that appears to misunderstand statistical prediction.  This one is about the Italian scientists who were jailed for failing to predict an earthquake. No joke. 
  6. We talk a lot about how much the data revolution will change industries from social media to healthcare. But here is an important reality check. Patients are not showing an interest in accessing their health care data. I wonder if part of the reason is that we haven’t come up with the right ways to explain, understand, and utilize what is inherently stochastic and uncertain information. 
  7. The BMJ will now require all data from clinical trials published in the journal to be made public. This is a brilliant, forward-thinking move. I hope other journals will follow suit. (via Karen B.R.)
  8. An interesting article about the impact of retractions on citation rates, suggesting that papers in fields close to that of a retracted paper may see their citation rates suffer. I haven’t looked it over carefully, but how they control for confounding seems incredibly important in this case. (via Alex N.)

Sunday Data/Statistics Link Roundup (9/23/12)

  1. Harvard Business School is getting in on the fun, calling the data scientist the sexy profession for the 21st century. Although I am a little worried that by the time it gets into a Harvard Business document, the hype may be outstripping the real promise of the discipline. Still, good news for statisticians! (via Rafa via Francesca D.’s Facebook feed). 
  2. The counterpoint is this article which suggests that data scientists might be able to be replaced by tools/software. I think this is also a bit too much hype for my tastes. Certain things will definitely be automated and we may even end up with a deterministic statistical machine or two. But there will continually be new problems to solve which require the expertise of people with data analysis skills and good intuition (link via Samara K.)
  3. A bunch of websites are popping up where you can sign up and have people take your online courses for you. I’m not going to give them the benefit of a link, but they aren’t hard to find these days. The thing I don’t understand is, if it is a free online course, why have someone else take it for you? It’s free, it’s in your spare time, and the bar for passing is pretty low (links via Sherri R. redacted)….
  4. Maybe mostly useful for me, but for other people with Tumblr blogs, here is a way to insert Latex.
  5. Brian Caffo shares his impressions of the SAMSI massive data workshop.  He raises an important issue which definitely deserves more discussion: should we be focusing on specific or general problems? Worth a read. 
  6. For the people into self-tracking, Chris V. points to an app created by Indiana University that lets people track their sexual activity. The most interesting thing about the app is how it highlights a key and, I suppose, often overlooked issue with analyzing self-tracking data: despite the size of these data sets, they are still definitely biased samples. It’s only a brave few who will tell Indiana University all about their sex life. 

On the relative importance of mathematical abstraction in graduate statistical education

Editor’s Note: This is the counterpoint in our series of posts on the value of abstraction in graduate education. See Brian’s defense of abstraction on Monday and the comments on his post, as well as the comments on our original teaser post for more. See below for a full description of the T-bone inside joke*.

Brian did a good job of defining abstraction. In a cagey debater’s move, he provided an incredibly broad definition of abstraction that includes the reason we call a :-) a smiley face, the reason why we can apply least squares to a variety of data types, and the reason we write functions when programming. At this very broad level, it is clear that abstract thinking is necessary for graduate students or any other data professional.

But our debate was inspired by a discussion of whether measure-theoretic probability was a key component of our graduate program. There was some agreement that for many biostatistics Ph.D. students, this exact topic may not be necessary for their research or careers. Brian suggested that measure-theoretic probability was a surrogate marker for something more important: abstract thinking and the ability to generalize ideas. This is a very specific form of generalization and abstraction that is used most commonly by statisticians: the ability that permits one to prove theorems and develop statistical models that can be applied to a variety of data types. I will therefore refocus the debate on the original topic. I have three main points:

  1. There is an overemphasis in statistical graduate programs on abstraction, defined as the ability to prove mathematical theorems and develop general statistical methods.
  2. It is possible to create incredible statistical value without developing generalizable statistical methods.
  3. While abstraction in the general sense is good, overemphasis on this specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.

There is an overemphasis in statistical graduate programs on abstraction, defined as the ability to prove mathematical theorems and develop general statistical methods.

At a top program, you can expect to take courses in theoretical statistics, measure-theoretic probability, and an applied (or methods) sequence. The first two are exclusively mathematical. The third (at the programs I have visited, graduated from, or taught in), despite its name, is mostly focused on the mathematical details underlying statistical methods. The result is that most Ph.D. students are heavily trained in the mathematical theory behind statistics.

At the same time, there is a long list of skills necessary to develop a successful Ph.D. statistician. These include creativity in applications, statistical programming skills, grit to power through the boring/hard parts of research, interpretation of statistical results on real data, the ability to identify the most important scientific problems, and a deep understanding of the scientific problems you are working on. Abstraction is on that list, but it is just one of many. Graduate education is a zero-sum game over a finite period of time. Our strong focus on mathematical abstraction means there is less time for everything else.

Any hard quantitative course will measure the ability of a student to abstract in the general sense Brian defined. One of these courses would be very useful for our students. But it is not clear that we should focus on mathematical abstraction to the exclusion of other important characteristics of graduate students.

It is possible to create incredible statistical value without developing generalizable statistical methods

A major standard for success in academia is the ability to generate solutions to problems that are widely read, cited, and used. A graduate student who produces these types of solutions is likely to have a high-impact and well-respected career. In general, it is not necessary to be able to prove theorems, understand measure theory, or develop generalizable statistical models to have this type of success.

One example is one of the co-authors of our blog, best known for his work in genomics. In this field, data is noisy and full of systematic errors, and for several technologies, he invented methods to correct them. For example, he developed the most popular method for making measurements from different experiments comparable, for removing the dependence of measurements on the letters in a gene, and for reducing variability due to operators who run the machine or the ozone levels. Each of these discoveries involved: (1) deep understanding of the specific technology used, (2) a good intuition of what signals were due to biology and which were due to technology, (3) application/development of specific, somewhat ad-hoc, statistical procedures to correct the mistakes, and (4) the development and distribution of good software. His work has been hugely influential on genomics, has been cited thousands of times, and has substantially improved the quality of both biological and statistical results.

But the work did not result in knowledge that generalizes to other areas of application; it deals with problems that are highly specialized to genomics. If these were his only contributions (they are not), he’d be a hugely successful Ph.D. statistician. But had he focused on general solutions, he would never have solved the problems at hand, since they were highly specific to a single application. And this is just one example I know well because I work in the area. There are a ton more just like it.

While abstraction in the general sense is good, overemphasis on a specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.

One could argue that the choice of statistical techniques during data analysis is abstraction, or that one needs to abstract to develop efficient software. But the ability to abstract needed for these tasks can be measured by a wide range of classes, not just measure-theoretic probability. Some of those classes might teach practically applicable skills, like writing fast and efficient algorithms. Many results of high statistical value do not require mathematical proofs, abstract inductive reasoning, or asymptotic theory. It is a good idea to have some people who can abstract away the science behind statistical methods to their core mathematical philosophy. But our current curriculum is too heavily weighted in this direction. In some cases, statisticians are even being left behind because they do not have sufficient time in their curriculum to develop the computational skills and amass the subject-matter knowledge needed to compete with the increasingly diverse set of engineers, computer scientists, data scientists, and computational biologists tackling the same scientific problems.

We need to reserve a larger portion of graduate education for diving deeply into specific scientific problems, even if it means students spend less time developing generalizable/abstract statistical ideas.

* Inside joke explanation: Two years ago at JSM I ran a footrace with this guy for the rights to the name “Jeff” in the department of Biostatistics at Hopkins for the rest of 2011. Unfortunately, we did not pro-rate for age and he nipped me by about a half-yard. True to my word, I went by Tullis (my middle name) for a few months, including on the title slide of my JSM talk. This was, of course, immediately subjected to all sorts of nicknaming and B-Caffo loves to use “T-bone”. I apologize on behalf of those that brought it up.


In which Brian debates abstraction with T-Bone

Editor’s Note: This is the first in a set of point-counterpoint posts related to the value of abstract thinking in graduate education that we teased a few days ago. Brian Caffo, recently installed Graduate Program Director at the best Biostat department in the country, has kindly agreed to lead off with the case for abstraction. We’ll follow up later in the week with my counterpoint. In the meantime, there have already been a number of really interesting and insightful comments inspired by our teaser post that are well worth reading. See the comments here.

The impetus for writing this blog post came out of a particularly heady lunchroom discussion on the role of measure-theoretic probability in our curriculum. We have a very mathematically rigorous program at Hopkins Biostatistics that includes a full academic year of measure-theoretic probability. As elsewhere, many faculty dispute the necessity of this course. I am in favor of it, my principal reason being that I believe it is useful for building up and evaluating a student’s abilities in abstraction and generalization.

In our discussion, abstraction was the real point of contention. Emphasizing abstraction versus more immediately practical tools is an age-old argument between ivory-tower stereotypes (the philosopher archetype) and equally stereotyped scientific pragmatists (the engineering archetype).

So, let’s begin picking this scab. For your sake and mine, I’ll try to be brief.

My definitions:

Abstraction: reducing a technique, idea, or concept to its essence or core.

Generalization: extending a technique, idea, or concept to areas for which it was not originally intended.

PhD: a post-baccalaureate degree that requires substantial new contributions to knowledge.

The term “substantial new contributions” in my definition of a PhD is admittedly fuzzy. To tie it down, examples that I think do create new knowledge in the field of statistics include:

  1. applying existing techniques to data where they have not been used before (generalization of the application of the techniques),
  2. developing statistical software (abstraction of statistical and mathematical thoughts into code),
  3. developing new statistical methods from existing ones (generalization),
  4. proving new theory (both abstraction and generalization) and
  5. creating new data analysis pipelines (both abstraction and generalization).

In every one of these examples, generalization or abstraction is what differentiates it from a purely technical accomplishment.

To give a contrary activity, consider statistical technical specialization: the application of an existing method to data where the method is already known to be effective and no new statistical thought is required. Regardless of how necessary, difficult, or important applying that method is, such activity does not constitute the creation of new statistical knowledge, even if it is a necessary schlep in the creation of new knowledge of another sort.

Though many graduate-level statistical activities require substantial technical specialization, for work to qualify as doctoral statistical research under my definition, generalization and abstraction are necessary components.

I further contend that abstraction is a key tool for obtaining meaningful generalization. A method, theory, analysis, and so on cannot be retooled to an unintended use without stripping away some of its specialization and abstracting it to its core utility.

Abstraction is constantly necessary when applying statistical methods. For example, whenever a statistician says “Method A really was designed for a different kind of data than mine. But at its core it’s really useful for finding out B, which I need to know. So I’ll use it anyway until (if ever) I come up with something better.”  

As examples:

  1. A = the CLT, B = distributions of normalized means;
  2. A = principal components, B = directions of variation;
  3. A = the bootstrap, B = sampling distributions;
  4. A = linear models, B = mean relationships with covariates.
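One of these pairings can be made concrete with a minimal sketch. The following standard-library Python (with made-up data, purely for illustration) uses the bootstrap (A) to approximate the sampling distribution (B) of the sample median, a statistic for which no closed-form theory is assumed:

```python
import random
import statistics

# Made-up "observed" sample; any data would do.
random.seed(42)
data = [random.gauss(10, 2) for _ in range(100)]

# The abstract core of the bootstrap: resample with replacement,
# recompute the statistic, and treat the results as draws from its
# sampling distribution.
boot_medians = []
for _ in range(2000):
    resample = [random.choice(data) for _ in data]
    boot_medians.append(statistics.median(resample))

se = statistics.stdev(boot_medians)          # approximate standard error
boot_medians.sort()
ci = (boot_medians[50], boot_medians[1949])  # rough 95% percentile interval
print(round(se, 2), round(ci[0], 2), round(ci[1], 2))
```

The portability is the point: nothing in the resample-recompute-summarize loop depends on the median, on normality, or on any particular data type.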

Abstraction and generalization facilitate learning new areas. Knowledge of the abstract core of a discipline makes that knowledge much more portable. This is seen across every discipline: musicians who know music theory can use their knowledge for any instrument; computer scientists who understand data structures and algorithms can switch languages easily; electrical engineers who understand signal processing can switch between technologies easily. Abstraction is what allows them to see past the concrete (instrument, syntax, technology) to the essence (music, algorithm, signal).

And statisticians learn statistical and probability theory. However, in statistics, abstraction is not represented only by mathematics and theory. As pointed out by the absolutely unimpeachable source, Simply Statistics, software is exactly an abstraction.

I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that software represents an important form of abstraction, if not the most important form …

(A QED is in order, I believe.)


Guest Post: SMART thoughts on the ADHD 200 Data Analysis Competition

Note: This is a guest post by our colleagues Brian Caffo, Ani Eloyan, Fang Han, Han Liu, John Muschelli, Mary Beth Nebel, Tuo Zhao and Ciprian Crainiceanu. They won the ADHD 200 imaging data analysis competition. There has been some controversy around the results because one team obtained a higher score without using any of the imaging data. Our colleagues have put together a very clear discussion of the issues raised by the competition, so we are publishing it here to contribute to the discussion. Questions about this post should be directed to the Hopkins team leader, Brian Caffo.


Below we share some thoughts about the ADHD 200 competition, a landmark competition using functional and structural brain imaging data to predict ADHD status.


Note, we’re calling these “SMART thoughts” to draw attention to our working group, “Statistical Methods and Applications for Research in Technology” (www.smart-stats.org), though hopefully the acronym applies in the non-intended sense as well.

Our team was declared the official winner of the competition. A team from the University of Alberta scored a higher number of competition points but was disqualified for not having used imaging data. We have been in email contact with a representative of that team and have enjoyed the discussion. We found those team members to be gracious and to embody an energy and scientific spirit that are refreshing to encounter.
We expressed our sympathy to them, in that the process seemed unfair, especially given the vagueness of what qualifies as use of the imaging data. More on this thought below.
This brings us to the point of this note: concern over the narrative surrounding the competition, based on our reading of web pages, social media, and water cooler discussions.
We are foremost concerned with the unwarranted conclusion that, because the team with the highest competition point total did not use imaging data, the overall scientific validity of using (f)MRI data to study ADHD is now in greater doubt.
We stipulate that, like many others, we are skeptical of the utility of MRI data for tasks such as ADHD diagnoses. We are not arguing against such skepticism.
Instead we are arguing against using the competition results as if they were strong evidence for such skepticism.
We raise four points to argue against overreacting to the competition outcome with respect to the use of structural and functional MRI in the study of ADHD.

Point 1. The competition points are not an accurate measure of performance and scientific value.

Because the majority of subjects in the training set, and hence presumably the test set, were typically developing (TD), the competition points favored specificity.
In addition, a correct label of TD yielded 1 point, while a correct ADHD diagnosis with an incorrect subtype yielded 0.5 points.

These facts suggest, as a starting point, a classifier that declares everyone TD. For example, if 60% of the 197 test subjects are controls, this algorithm would yield 118 competition points, better than all but a few entrants. In fact, if 64.5% or more of the test set is TD, this algorithm wins over Alberta (and hence everyone else).
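As a sanity check on the arithmetic above, here is a minimal sketch assuming the scoring just described (1 point per correctly labeled TD subject; the all-TD baseline earns nothing for ADHD subjects):

```python
N_TEST = 197  # test-set size in the competition

def all_td_points(prop_td, n=N_TEST):
    """Competition points for a classifier that labels every subject TD."""
    return int(prop_td * n)  # one point per typically developing subject

print(all_td_points(0.60))   # 118
print(all_td_points(0.645))  # 127
```

At 64.5% TD the baseline reaches 127 points, the threshold the text compares against the top score.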

In addition, competition points are random variables. It is human nature to interpret the anecdotal rankings of competitions as definitive evidence of superiority. This works fine as long as rankings are reasonably deterministic, but it is riddled with logical flaws when rankings are stochastic. Variability in rankings has a huge effect on the result of competitions, especially when highly tuned prediction methods from expert teams are compared. Indeed, in such cases the confidence intervals of the AUCs (or other competition criteria) overlap. The 5th or 10th place team may actually have had the most scientifically informative algorithm.
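The stochastic-rankings point can be illustrated with a small simulation (the accuracies are hypothetical, not competition data): two classifiers with true per-subject accuracies of 70% and 68% are each scored on a 197-subject test set, and we count how often the truly weaker one comes out ahead.

```python
import random

random.seed(1)
n, trials = 197, 10_000
upsets = 0
for _ in range(trials):
    better = sum(random.random() < 0.70 for _ in range(n))  # correct labels
    worse = sum(random.random() < 0.68 for _ in range(n))
    upsets += worse > better

print(upsets / trials)  # the weaker classifier "wins" a sizable fraction of runs
```

Even a genuinely better method loses a large share of single-shot competitions of this size, which is why one ranking is weak evidence of superiority.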

Point 2. Biologically valueless predictors were important.

Most importantly, contributing location (aka site) was a key determinant of prediction performance. Site is a proxy for many things: the demographics of the ADHD population in the site PI’s studies, the policies by which a PI chose to include data, scanner type, IQ measure, missing data patterns, data quality, and so on.

In addition to site, the presence of missing data and data quality also held potentially important information for prediction, despite being biologically unrelated to ADHD. The likely causality, if any exists, would point in the reverse direction (i.e., presence of ADHD would result in a greater propensity for missing data and lower data quality, perhaps due to movement in the scanner).

This is a general fact regarding prediction algorithms, which do not intrinsically account for causal directions or biological significance.

Point 3. The majority of the imaging data is not prognostic.

Likely every entrant, and the competition organizers, were aware that the majority of the imaging data is not useful for predicting ADHD. (Here we use the term “imaging data” loosely, meaning raw and/or processed data.)   In addition, the imaging data are noisy. Therefore, use of these data introduced tens of billions of unnecessary numbers to predict 197 diagnoses.

As such, even if extremely important variables are embedded in the imaging data, (non-trivial) use of all of the imaging data could degrade performance, regardless of the ultimate value of the data.

To put this in other words, suppose all entrants were offered an additional 10 billion numbers, say genomic data, known to be noisy and, in aggregate, not predictive of disease. However, suppose that some unknown function of a small collection of variables was very meaningful for prediction, as is presumably the case with genomic data. If the competition did not require its use, a reasonable strategy would be to avoid using these data altogether.

Thus, in a scientific sense, we are sympathetic to the organizers’ choice to eliminate the Alberta team, since a primary motivation of the competition was to encourage a large set of eyes to sift through a large collection of very noisy imaging data.

Of course, as stated above, we believe that what constitutes a sufficient use of the imaging data is too vague to be an adequate rule to eliminate a team in a competition.

Thus our scientifically motivated support of the organizers conflicts with our procedural dispute of the decision made to eliminate the Alberta team.

Point 4. Accurate prediction of a response is neither necessary nor sufficient for a covariate to be biologically meaningful.

Accurate prediction of a response is an extremely high bar for a variable of interest. Consider drug development for ADHD. A drug does not have to demonstrate that its application to a collection of symptomatic individuals would predict with high accuracy a later abatement of symptoms. Instead, a successful drug would have to demonstrate a mild average improvement over a placebo or standard therapy when randomized.

As an example, consider randomly administering such a drug to 50 of 100 subjects who have ADHD at baseline. Suppose data are collected at 6 and 12 months. Further suppose that 8 of the 50 subjects receiving the drug had no ADHD symptoms at 12 months, while 1 of the 50 receiving placebo had no ADHD symptoms at 12 months. (The Fisher’s exact test P-value is 0.03, by the way.)
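For the curious, the quoted P-value can be reproduced from scratch. This is a standard-library sketch of the two-sided Fisher's exact test in its conditional (hypergeometric) formulation, not the authors' own computation:

```python
from math import comb

# 9 symptom-free subjects among 100; 50 were randomized to drug.
# Under the null, the count landing in the drug arm is hypergeometric.
def pmf(k, N=100, K=9, n=50):
    """P(exactly k of the symptom-free subjects are in the drug arm)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

observed = pmf(8)  # 8 symptom-free subjects in the drug arm were observed
# Two-sided p-value: total probability of all tables no more likely than
# the observed one (a tiny tolerance guards against floating-point ties).
p = sum(pmf(k) for k in range(10) if pmf(k) <= observed * (1 + 1e-9))
print(round(p, 2))  # 0.03
```

The same table fed to a library routine such as a Fisher exact test implementation should agree to rounding.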

The statistical evidence points to the drug being effective. Knowledge of drug status, however, would do little to improve prediction accuracy. That is, given a new data set of subjects with ADHD at baseline and knowledge of drug status, the most accurate classification for every subject would be to guess that they will continue to have ADHD symptoms at 12 months.  Of course, our confidence in that prediction would be slightly lower for those having received the drug.

However, consider using ADHD status at 6 months as a predictor. This would be enormously effective at locating those subjects who have an abatement of symptoms whether they received the drug or not. In this thought experiment, one predictor (symptoms at 6 months) is highly predictive, but not meaningful (it simply suggests that Y is a good predictor of Y), while the other (presence of drug at baseline) is only mildly predictive, but is statistically and biologically significant.

As another example, consider the ADHD200 data set. Suppose that a small structural region is highly impacted in an unknown subclass of ADHD. Some kind of investigation of morphometry or volumetrics might detect an association with disease status. The association would likely be weak, given the absence of a priori knowledge of this region or the subclass. This weak association would not be useful in a prediction algorithm. However, digging into this association could potentially inform the biological basis of the disease and further refine the ADHD phenotype.

Thus, we argue that it is important to differentiate the goal of obtaining high prediction accuracy from that of biological discovery of complex mechanisms in the presence of high-dimensional data.


We urge caution in over-interpreting the scientific impact of the University of Alberta team’s strong performance in the competition.

Ultimately, what Alberta’s having the highest point total established is that they are fantastic people to talk to if you want to achieve high prediction accuracy. (Looking over their work, this appears to have already been established prior to the competition :-).

It was not established that brain structure or resting state function, as measured by MRI, is a blind alley in the scientific exploration of ADHD.

Related Posts: Roger on “Caffo + Ninjas = Awesome”, Rafa on the “Self Assessment Trap”, Roger on “Private health insurers to release data”


Caffo's Theorem

Brian Caffo from the comments:

Personal theorem: the application of statistics in any new field will be labeled “Technical sounding word” + ics. Examples: Sabermetrics, analytics, econometrics, neuroinformatics, bioinformatics, informatics, chemometrics.

It’s like how adding mayonnaise to anything turns it into salad (e.g., egg salad, tuna salad, ham salad, pasta salad, …)

I’d like to be the first to propose the statistical study of turning things into salad: so-called mayonnaisics.

Related Posts: Caffo + Ninjas = Awesome