Tag: data


Sunday data/statistics link roundup (1/20/2013)

  1. This might be short. I have a couple of classes starting on Monday. The first is our Johns Hopkins Advanced Methods class. This is one of my favorite classes to teach, our Ph.D. students are pretty awesome and they always amaze me with what they can do. The other is my Coursera debut in Data Analysis. We are at about 88,000 enrolled. Tell your friends, maybe we can make it an even 100k! In related news, some California schools are experimenting with offering credit for online courses. (via Sherri R.)
  2. Some interesting numbers on why there aren't as many "gunners" in the NBA - players who score a huge number of points.  I love the talk about hustling, rotating team defense. I have always enjoyed watching good defense more than good offense. It might not be the most popular thing to watch, but seeing the Spurs rotate perfectly to cover the open man is a thing of athletic beauty. My Aggies aren't too bad at it either...(via Rafa).
  3. A really interesting article suggesting that nonsense math can make arguments seem more convincing to non-technical audiences. This is tangentially related to a previous study which showed that more equations led to fewer citations in biology articles. Overall, my take-home message is that we don't necessarily need fewer equations; we need to elevate statistical/quantitative literacy to the importance of reading literacy. (via David S.)
  4. This has been posted elsewhere, but a reminder to send in your statistical stories for the 365 stories of statistics.
  5. Automatically generate a postmodernism essay. Hit refresh a few times. It's pretty hilarious. It reminds me a lot of this article about statisticians. Here is the technical paper describing how they simulate the essays. (via Rafa)

Sunday data/statistics link roundup (1/6/2013)

  1. Not really statistics, but this is an interesting article about how rational optimization by individual actors does not always lead to an optimal solution. Related, here is the coolest street sign I think I've ever seen, with a heatmap of traffic density to try to influence commuters.
  2. An interesting paper that talks about how clustering is only a really hard problem when there aren't obvious clusters. I was a little disappointed in the paper, because it defines the "obviousness" of clusters only theoretically by a distance metric. There is very little discussion of the practical distance/visual distance metrics people use when looking at clustering dendrograms, etc.
  3. A post about the two cultures of statistical learning and a related post on how data-driven science is a failure of imagination. I think in both cases, it is worth pointing out that the only good data science is good science - i.e. it seeks to answer a real, specific question through the scientific method. However, I think for many modern scientific problems it is pretty naive to think we will be able to come to a full, mechanistic understanding complete with tidy theorems that describe all the properties of the system. I think the real failure of imagination is to think that science/statistics/mathematics won't change to tackle the realistic challenges posed in solving modern scientific problems.
  4. A graph that shows the incredibly strong correlation ( > 0.99!) between the growth of autism diagnoses and organic food sales. Another example where even really strong correlation does not imply causation.
  5. The Buffalo Bills are going to start an advanced analytics department (via Rafa and Chris V.). Maybe they can take advantage of all this free play-by-play data from years of NFL games.
  6. A prescient interview with Isaac Asimov on learning, predicting the Khan Academy, MOOCs and other developments in online learning (via Rafa and Marginal Revolution).
  7. The statistical software signal - what your choice of software says about you. Just another reason we need a deterministic statistical machine.



Sunday data/statistics link roundup (12/30/12)

  1. An interesting new app called 100plus, which looks like it uses public data to help determine how little decisions (walking more, one more glass of wine, etc.) lead to more or less health. Here's a post describing it on the healthdata.gov blog. As far as I can tell, the app is still in beta, so only the folks who have a code can download it.
  2. Data on mass shootings from the Mother Jones investigation.
  3. A post by Hilary M. on "Getting Started with Data Science". I really like the suggestion of just picking a project and doing something, getting it out there. One thing I'd add to the list is that I would spend a little time learning about an area you are interested in. With all the free data out there, it is easy to just "do something", without putting in the requisite work to know why what you are doing is good/bad. So when you are doing something, make sure you take the time to "know something".
  4. An analysis of various measures of citation impact (also via Hilary M.). I'm not sure I follow the reasoning behind all of the analyses performed (seems a little like throwing everything at the problem and hoping something sticks) but one interesting point is how citation/usage are far apart from each other on the PCA plot. This is likely just because the measures cluster into two big categories, but it makes me wonder. Is it better to have a lot of people read your paper (broad impact?) or cite your paper (deep impact?).
  5. An interesting conversation on Twitter about how big data does not mean you can ignore the scientific method. We have talked a little bit about this before, in terms of how one should motivate statistical projects.

Sunday Data/Statistics Link Roundup (11/4/12)

  1. Brian Caffo headlines the WaPo article about massive open online courses. He is the driving force behind our department’s involvement in offering these massive courses. I think this sums it up: “I can’t use another word than unbelievable,” Caffo said. Then he found some more: “Crazy . . . surreal . . . heartwarming.”
  2. A really interesting discussion of why “A Bet is a Tax on B.S.”. It nicely describes why intelligent betters must be disinterested in the outcome, otherwise they will end up losing money. The Nate Silver controversy just doesn’t seem to be going away, good news for his readership numbers, I bet. (via Rafa)
  3. An interesting article on how scientists are not claiming global warming is the sole cause of the extreme weather events we are seeing, but that it does contribute to them being more extreme. The key quote: “We can’t say that steroids caused any one home run by Barry Bonds, but steroids sure helped him hit more and hit them farther. Now we have weather on steroids.” —Eric Pooley. (via Roger)
  4. The NIGMS is looking for a Biomedical technology, Bioinformatics, and Computational Biology Director. I hope that it is someone who understands statistics! (via Karl B.)
  5. Here is another article that appears to misunderstand statistical prediction.  This one is about the Italian scientists who were jailed for failing to predict an earthquake. No joke. 
  6. We talk a lot about how much the data revolution will change industries from social media to healthcare. But here is an important reality check. Patients are not showing an interest in accessing their health care data. I wonder if part of the reason is that we haven’t come up with the right ways to explain, understand, and utilize what is inherently stochastic and uncertain information. 
  7. The BMJ is now going to require all data from clinical trials published in their journal to be public.  This is a brilliant, forward thinking move. I hope other journals will follow suit. (via Karen B.R.)
  8. An interesting article about the impact of retractions on citation rates, suggesting that papers in fields close to those of the retracted paper may show negative impact on their citation rates. I haven’t looked it over carefully, but how they control for confounding seems incredibly important in this case. (via Alex N.). 

A statistician loves the #insurancepoll...now how do we analyze it?

Amanda Palmer broke Twitter yesterday with her insurance poll. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information:

quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?

This quick little poll struck a nerve with people and her Twitter feed blew up. Long story short, tons of interesting information was gathered from folks. This information is frequently kept semi-obscured, particularly the cost of health insurance for folks in different places. This isn’t the sort of info that insurance companies necessarily publicize widely and isn’t the sort of thing people talk about.

The results were really fascinating and it’s worth reading the above blog post or checking out the hashtag: #insurancepoll. But the most fascinating thing for me as a statistician was thinking about how to analyze these data. @aubreyjaubrey is apparently collecting the data someplace; hopefully she’ll make it public.

At least two key issues spring to mind:

  1. This is a massive convenience sample. 
  2. It is being collected through a social network.

Although I’m sure there are more. If a student is looking for an amazingly interesting/rich data set and some seriously hard stats problems, they should get in touch with Aubrey and see if they can make something of it!


Sunday Data/Statistics Link Roundup (9/2/2012)

  1. Just got back from IBC 2012 in Kobe, Japan. I was in an awesome session (organized by the inimitable Lieven Clement) with great talks by Matt McCall, Djork-Arne Clevert, Adetayo Kasim, and Willem Talloen. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I saw this hanging in the airport. A fitting welcome back to the states. Although, as we talked about in our first podcast, I wonder how long the Big Data hype will last…
  2. Simina B. sent this link along for a master’s program in analytics at NC State. Interesting because it looks a lot like a master’s in statistics program, but with a heavier emphasis on data collection/data management. I wonder what role the stat department down there is playing in this program and whether we will see more like it pop up, or whether programs like this with more data management will be run by stats departments in other places. Maybe our friends down in Raleigh have some thoughts for us.
  3. If one set of weekly links isn’t enough to fill your procrastination quota, go check out NextGenSeek’s weekly stories. A bit genomics focused, but lots of cool data/statistics links in there too. Love the “extreme Venn diagrams”. 
  4. This seems almost like the fast statistics journal I proposed earlier. Can’t seem to access the first issue/editorial board either. Doesn’t look like it is open access, so it’s still not perfect. But I love the sentiment of fast/single round review. We can do better though. I think Yihui X. has some really interesting ideas on how.
  5. My wife taught for a year at Grinnell in Iowa and loved it there. They just released this cool data set with a bunch of information about the college. If all colleges did this, we could really dig in and learn a lot about the American secondary education system (link via Hilary M.). 
  6. From the way-back machine, a rant from Rafa about meetings. Stay tuned this week for some Simply Statistics data about our first year on the series of tubes.

A deterministic statistical machine

As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the title “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.

StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent false positives that plague science and cost business big $$. 

So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data: say, regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.

The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around. 
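To make the idea a bit more concrete, here is a minimal sketch in Python of what the core of such a machine might look like. Everything in it is an assumption for illustration only: the names `METHODS`, `run_checks`, `analyze`, and `TRAIL` are hypothetical, simple least-squares regression stands in for the fixed method, the outlier rule (more than 5 median absolute deviations from the median) is arbitrary, and the in-memory log is only a stand-in for publishing every report to figshare, which is not attempted here.

```python
import hashlib
import json
import statistics


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {"intercept": my - slope * mx, "slope": slope}


# One fixed method per question type; the user cannot swap methods.
METHODS = {"association": fit_line}

# Every run is logged here; a real service would publish this instead.
TRAIL = []


def run_checks(x, y):
    """Pre-specified checks: drop incomplete pairs, flag extreme y values."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    ys = [b for _, b in pairs]
    med = statistics.median(ys)
    mad = statistics.median([abs(b - med) for b in ys])
    outliers = [b for b in ys if mad > 0 and abs(b - med) > 5 * mad]
    return pairs, {"missing": len(x) - len(pairs), "outliers": len(outliers)}


def analyze(question, x, y):
    """Run the fixed pipeline and return a markdown report."""
    pairs, checks = run_checks(x, y)
    result = METHODS[question]([a for a, _ in pairs], [b for _, b in pairs])
    # Fingerprint of the exact input, so repeated tweaks of the data are visible.
    digest = hashlib.sha256(json.dumps([question, x, y]).encode()).hexdigest()
    TRAIL.append({"question": question, "data_sha256": digest})
    lines = [f"# Report: {question}", f"- data sha256: {digest[:12]}"]
    lines += [f"- {k}: {v}" for k, v in {**checks, **result}.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(analyze("association", [1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0]))
    print(len(TRAIL), "analysis logged")
```

The point of the design is that the user only chooses the question, never the method or the checks, and every call leaves an entry in the log. Re-running the same data gives the same report, while each fiddled version of the data shows up as a new fingerprint in the trail.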

The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better? 


Sunday data/statistics link roundup (8/26/12)

First off, a quick apology for missing last week, and thanks to Augusto for noticing! On to the links:

  1. Unbelievably, the BRCA gene patents were upheld by the lower court despite the Supreme Court coming down pretty unequivocally against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court.
  2. A really nice interview with David Spiegelhalter on Statistics and Risk. David runs the Understanding Uncertainty blog and published a recent paper on visualizing uncertainty. My favorite line from the interview might be: “There is a nice quote from Joel Best that “all statistics are social products, the results of people’s efforts”. He says you should always ask, “Why was this statistic created?” Certainly statistics are constructed from things that people have chosen to measure and define, and the numbers that come out of those studies often take on a life of their own.”
  3. For those of you who use Tumblr like we do, here is a cool post on how to put technical content into your blog. My favorite thing I learned about is the Github Gist that can be used to embed syntax-highlighted code.
  4. A few interesting and relatively simple stats for projecting the success of NFL teams.  One thing I love about sports statistics is that they are totally willing to be super ad-hoc and to be super simple. Sometimes this is all you need to be highly predictive (see for example, the results of Football’s Pythagorean Theorem). I’m sure there are tons of more sophisticated analyses out there, but if it ain’t broke… (via Rafa). 
  5. My student Hilary has a new blog that’s worth checking out. Here is a nice review of ProjectTemplate she did. I think the idea of having an organizing principle behind your code is a great one. Hilary likes ProjectTemplate, I think there are a few others out there that might be useful. If you know about them, you should leave a comment on her blog!
  6. This is ridiculously cool. Man City has opened up their data/statistics to the data analytics community. After registering, you have access to many of the statistics the club uses to analyze their players. This is yet another example of open data taking over the world. It’s clear that data generators can create way more value for themselves by releasing cool data, rather than holding it all in house. 
  7. The Portland Public Library has created a website called Book Psychic, basically a recommender system for books. I love this idea. It would be great to have a recommender system for scientific papers.

Interview with C. Titus Brown - Computational biologist and open access champion


C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. He develops computational software for next generation sequencing and is the author of the blog “Living in an Ivory Basement”. We talked to Titus about open access (he publishes his unfunded grants online!), improving the reputation of PLoS One, his research in computational software development, and work-life balance in academics.

Do you consider yourself a statistician, data scientist, computer scientist, or something else?

Good question.  Short answer: apparently somewhere along the way I
became a biologist, but with a heavy dose of “computational scientist”
in there.

The longer answer?  Well, it’s a really long answer…

My first research was on Avida, a bottom-up model for evolution that
Chris Adami, Charles Ofria and I wrote together at Caltech in 1993:
http://en.wikipedia.org/wiki/Avida.  (Fun fact: Chris, Charles and I
are now all faculty at Michigan State!  Chris and I have offices one
door apart, and Charles has an office one floor down.)  Avida got me
very interested in biology, but not in the undergrad “memorize stuff”
biology — more in research.  This was computational science: using
simple models to study biological phenomena.

While continuing evolution research, I did my undergrad in pure math at Reed
College, which was pretty intense; I worked in the Software Development
lab there, which connected me to a bunch of reasonably well known hackers
including Keith Packard, Mark Galassi, and Nelson Minar.

I also took a year off and worked on Earthshine,
and then rebooted the project as an RA in 1997, the summer after
graduation.  This was mostly data analysis, although it included a
fair amount of hanging off of telescopes adjusting things as the
freezing winter wind howled through the Big Bear Solar Observatory’s
observing room, aka “data acquisition”.

After Reed, I applied to a bunch of grad schools, including Princeton
and Caltech in bio, UW in Math, and UT Austin and Ohio State in
physics.  I ended up at Caltech, where I switched over to
developmental biology and (eventually) regulatory genomics and genome
biology in Eric Davidson’s lab.  My work there included quite a bit
of wet bench biology, which is not something many people associate with me,
but was nonetheless something I did!

Genomics was really starting to hit the fan in the early 2000s, and I
was appalled by how biologists were handling the data — as one
example, we had about $500k worth of sequences sitting on a shared
Windows server, with no metadata or anything — just the filenames.
As another example, I watched a postdoc manually BLAST a few thousand
ESTs against the NCBI nr database; he sat there and did them three by
three, having figured out that he could concatenate three sequences
together and then manually deconvolve the results.  As probably the
most computationally experienced person in the lab, I quickly got
involved in data analysis and Web site stuff, and ended up writing
some comparative sequence analysis software that was mildly popular
for a while.

As part of the sequence analysis Web site I wrote, I became aware that
maintaining software was a *really* hard problem.  So, towards the end
of my 9 year stint in grad school, I spent a few years getting into
testing, both Web testing and more generally automated software
testing.  This led to perhaps my most used piece of software, twill, a
scripting language for Web testing.  It also ended up being one of the
things that got me elected into the Python Software Foundation,
because I was doing everything in Python (which is a really great
language, incidentally).

I also did some microbial genome analysis, which led to my first
completely reproducible paper (Brown and Callan, 2004;
http://www.ncbi.nlm.nih.gov/pubmed/14983022), and collaborated with the
Orphan lab on some metagenomics:
http://www.ncbi.nlm.nih.gov/pubmed?term=18467493.  This led to a
fascination with the biological “dark matter” in nature that is the
subject of some of my current work on metagenomics.

I landed my faculty position at MSU right out of grad school, because
bioinformatics is sexy and CS departments are OK with hiring grad
students as faculty.  However, I deferred for two years to do a
postdoc in Marianne Bronner-Fraser’s lab because I wanted to switch to
the chick as a model organism, and so I ended up arriving at MSU in
2009.  I had planned to focus a lot on developmental gene regulatory
networks, but 2009 was when Illumina sequencing hit, and as one of the
few people around who wasn’t visibly frightened by the term “gigabyte”
I got inextricably involved in a lot of different sequence analysis
projects.  These all converged on assembly, and, well, that seems to
be what I work on now :).

The two strongest threads that run through my research are these:

1. “better science through superior software” — so much of science
depends on computational inference these days, and so little of the
underlying software is “good”.  Scientists *really* suck at software
development (for both good and bad reasons) and I worry that a lot of
our current science is on a really shaky foundation.  This is one
reason I’m invested in Software Carpentry
(http://software-carpentry.org), a training program that Greg Wilson
has been developing — he and I agree that science is our best hope
for a positive future, and good software skills are going to be
essential for a lot of that science.  More generally I hope to turn
good software development into a competitive advantage for my lab
and my students.

2. “better hypothesis generation is needed” — biologists, in
particular, tend to leap towards the first testable hypothesis they
find.  This is a cultural thing stemming (I think) from a lot of
really bad interactions with theory: the way physicists and
mathematicians think about the world simply doesn’t fit with the Rube
Goldberg-esque features of biology.

So getting back to the question, uh, yeah, I think I’m a computational
scientist who is working on biology?  And if I need to write a little
(or a lot) of software to solve my problems, I’ll do that, and I’ll
try to do it with some attention to good software development
practice — not just out of ethical concern for correctness, but
because it makes our research move faster.

One thing I’m definitely *not* is a statistician.  I have friends who
are statisticians, though, and they seem like perfectly nice people.

You have a pretty radical approach to open access, can you tell us a little bit about that?

Ever since Mark Galassi introduced me to open source, I thought it
made sense.  So I’ve been an open source-nik since … 1988?

From there it’s just a short step to thinking that open science makes
a lot of sense, too.  When you’re a grad student or a postdoc, you
don’t get to make those decisions, though; it took until I was a PI
for me to start thinking about how to do it.  I’m still conflicted
about *how* open to be, but I’ve come to the conclusion that posting
preprints is obvious.

The “radical” aspect that you’re referring to is probably my posting
of grants (http://ivory.idyll.org/blog/grants-posted.html).  There are
two reasons I ended up posting all of my single-PI grants.  Both have
their genesis in this past summer, when I spent about 5 months writing
6 different grants — 4 of which were written entirely by me.  Ugh.

First, I was really miserable one day and joked on Twitter that “all
this grant writing is really cutting into my blogging” — a mocking
reference to the fact that grant writing (to get $$) is considered
academically worthwhile, while blogging (which communicates with the
public and is objectively quite valuable) counts for naught with my
employer.  Jonathan Eisen responded by suggesting that I post all of
the grants and I thought, what a great idea!

Second, I’m sure it’s escaped most people (hah!), but grant funding
rates are in the toilet — I spent all summer writing grants while
expecting most of them to be rejected.  That’s just flat-out
depressing!  So it behooves me to figure out how to make them serve
multiple duties.  One way to do that is to attract collaborators;
another is to serve as google bait for my lab; a third is to provide
my grad students with well-laid-out PhD projects.  A fourth duty they
serve (and I swear this was unintentional) is to point out to people
that this is MY turf and I’m already solving these problems, so maybe
they should go play in less occupied territory.  I know, very passive-aggressive.

So I posted the grants, and unknowingly joined a really awesome cadre
of folk who had already done the same.
Most feedback I’ve gotten has been from grad students and undergrads
who really appreciate the chance to look at grants; some people told
me that they’d been refused the chance to look at grants from their
own PIs!

At the end of the day, I’d be lucky to be relevant enough that people
want to steal my grants or my software (which, by the way, is under a
BSD license — free for the taking, no “theft” required…).  My
observation over the years is that most people will do just about
anything to avoid using other people’s software.

In theoretical statistics, there is a tradition of publishing pre-prints while papers are submitted. Why do you think biology is lagging behind?

I wish I knew!  There’s clearly a tradition of secrecy in biology;
just look at the Cold Spring Harbor rules re tweeting and blogging
(http://meetings.cshl.edu/report.html) - this is a conference, for
chrissakes, where you go to present and communicate!  I think it’s
self-destructive and leads to an insider culture where only those who
attend meetings and chat informally get to be members of the club,
which frankly slows down research. Given the societal and medical
challenges we face, this seems like a really bad way to continue doing

One of the things I’m proudest of is our effort on the cephalopod
genome consortium’s white paper,
http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html, where a
group of bioinformaticians at the meeting pushed really hard to walk
the line between secrecy and openness.  I came away from that effort
thinking two things: first, that biologists were erring on the side of
risk aversity; and second, that genome database folk were smoking
crack when they pushed for complete openness of data.  (I have a blog
post on that last statement coming up at some point.)

The bottom line is that the incentives in academic biology are aligned
against openness.  In particular, you are often rewarded for the first
observation, not for the most useful one; if your data is used to do
cool stuff, you don’t get much if any credit; and it’s all about
first/last authorship and who is PI on the grants.  All too often this
means that people sit on their data endlessly.

This is getting particularly bad with next-gen data sets, because
anyone can generate them but most people have no idea how to analyze
their data, and so they just sit on it forever…

Do you think the ArXiv model will catch on in biology or just within the bioinformatics community?

One of my favorite quotes is: “Making predictions is hard, especially
when they’re about the future.” I attribute it to Niels Bohr.

It’ll take a bunch of big, important scientists to lead the way. We
need key members of each subcommunity of biology to decide to do it on
a regular basis. (At this point I will take the obligatory cheap shot
and point out that Jonathan Eisen, noted open access fan, doesn’t post
his stuff to preprint servers very often.  What’s up with that?)  It’s
going to be a long road.

What is the reaction you most commonly get when you tell people you have posted your un-funded grants online?

“Ohmigod what if someone steals them?”

Nobody has come up with a really convincing model for why posting
grants is a bad thing.  They’re just worried that it *might* be.  I
get the vague concerns about theft, but I have a hard time figuring
out exactly how it would work out well for the thief — reputation is
a big deal in science, and gossip would inevitably happen.  And at
least in bioinformatics I’m aiming to be well enough known that
straight up ripping me off would be suicidal.  Plus, if reviewers
do/did google searches on key concepts then my grants would pop up,
right?  I just don’t see it being a path to fame and glory for anyone.

Revisiting the passive-aggressive nature of my grant posting, I’d like
to point out that most of my grants depend on preliminary results from
our own algorithms.  So even if they want to compete on my turf, it’ll
be on a foundation I laid.  I’m fine with that — more citations for
me, either way :).

More optimistically, I really hope that people read my grants and then
find new (and better!) ways of solving the problems posed in them.  My
goal is to enable better science, not to hunker down in a tenured job
and engage in irrelevant science; if someone else can use my grants as
a positive or negative signpost to make progress, then broadly
speaking, my job is done.

Or, to look at it another way: I don’t have a good model for either
the possible risks OR the possible rewards of posting the grants, and
my inclinations are towards openness, so I thought I’d see what happens.

How can junior researchers correct misunderstandings about open access/journals like PLoS One that separate correctness from impact? Do you have any concrete ideas for changing minds of senior folks who aren’t convinced?

Render them irrelevant by becoming senior researchers who supplant them
when they retire.  It’s the academic tradition, after all!  And it’s
really the only way within the current academic system, which — for
better or for worse — isn’t going anywhere.

Honestly, we need fewer people yammering on about open access and more
people simply doing awesome science and submitting it to OA journals.
Conveniently, many of the high impact journals are shooting themselves
in the foot and encouraging this by rejecting good science that then
ends up in an OA journal; that wonderful ecology op-ed on PLoS One
citation rates shows this well.

Do you have any advice on what computing skills/courses statistics students interested in next generation sequencing should take?

For courses, no — in my opinion 80% of what any good researcher
learns is self-motivated and often self-taught, and so it’s almost
silly to pretend that any particular course or set of skills is
sufficient or even useful enough to warrant a whole course.  I’m not a
big fan of our current undergrad educational system :)

For skills?  You need critical thinking coupled with an awareness that
a lot of smart people have worked in science, and odds are that there
are useful tricks and approaches that you can use.  So talk to other
people, a lot!  My lab has a mix of biologists, computer scientists,
graph theorists, bioinformaticians, and physicists; more labs should
be like that.

Good programming skills are going to serve you well no matter what, of
course.  But I know plenty of good programmers who aren’t very
knowledgeable about biology, and who run into problems doing actual
science.  So it’s not a panacea.

How does replicable or reproducible research fit into your interests?

I’ve wasted *so much time* reproducing other people’s work that when
the opportunity came up to put down a marker, I took it.


The digital normalization paper shouldn’t have been particularly
radical; that it is tells you all you need to know about replication
in computational biology.

This is actually something I first did a long time ago, with what was
perhaps my favorite pre-faculty-job paper: if you look at the methods
for Brown & Callan (2004) you’ll find a downloadable package that
contains all of the source code for the paper itself and the analysis
scripts.  But back then I didn’t blog :).

Lack of reproducibility and openness in methods has serious
consequences — how much of cancer research has been useless, for
example?  See this horrific report.
Again, the incentives are all wrong: you get grant money for
publishing, not for being useful.  The two are not necessarily the
same.

Do you have a family, and how do you balance work life and home life?

Why, thank you for asking!  I do have a family — my wife, Tracy Teal,
is a bioinformatician and microbial ecologist, and we have two
wonderful daughters, Amarie (4) and Jessie (1).  It’s not easy being a
junior professor and a parent at the same time, and I keep on trying
to figure out how to balance the needs of travel with the need to be a
parent (hint: I’m not good at it).  I’m increasingly leaning towards
blogging as being a good way to have an impact while being around
more; we’ll see how that goes.


Statistics/statisticians need better marketing

Statisticians have not always been great self-promoters. I think in part this comes from our tendency to be arbiters rather than being involved in the scientific process. In some ways, I think this is a good thing. Self-promotion can quickly become really annoying. On the other hand, I think our advertising shortcomings are hurting our field in a number of different ways. 

Here are a few:

  1. As Rafa points out, even though statisticians are ridiculously employable right now, it seems like statistics M.S. and Ph.D. programs are flying under the radar in all the hype about data/data science (here is an awesome one if you are looking). Computer Science and Engineering, even the social sciences, are cornering the market on “big data”. This potentially huge and influential source of students may pass us by if we don’t advertise better. 
  2. A corollary to this is lack of funding. When the Big Data event happened at the White House with all the major funders in attendance to announce $200 million in new funding for big data, none of the invited panelists were statisticians. 
  3. Our top awards don’t get the press that top awards in other fields do. The Nobel Prize announcements are an international event. There is always speculation/intense interest in who will win. There is similar interest around the Fields Medal in mathematics. But the top award in statistics, the COPSS award, doesn’t get nearly the attention it should. Part of the reason is lack of funding (the Fields is $15k, the COPSS is $1k). But part of the reason is that we, as statisticians, don’t announce it, share it, speculate about it, tell our friends about it, etc. The prestige of these awards can have a big impact on the visibility of a field. 
  4. A major component of visibility of a scientific discipline, for better or worse, is the popular press. The most recent article in a long list of articles at the New York Times about the data revolution does not mention statistics/statisticians. Neither do the other articles. We need to cultivate relationships with the media. 

We are all busy solving real/hard scientific and statistical problems, so we don’t have a lot of time to devote to publicity. But here are a few easy ways we could rapidly increase the visibility of our field, ordered roughly by the degree of time commitment. 

  1. All statisticians should have Twitter accounts and we should share/discuss our work and ideas online. The more we help each other share, the more visibility our ideas will get. 
  2. We should make sure we let the ASA know about cool things that are happening with data/statistics in our organizations and they should spread the word through their Twitter account and other social media. 
  3. We should start a conversation about who we think will win the next COPSS award in advance of the next JSM and try to get local media outlets to pick up our ideas and talk about the award. 
  4. We should be more “big tent” about statistics. ASA President Robert Rodriguez nailed this in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don’t necessarily agree with. But the big tent approach is what is allowing CS and other disciplines to overtake us in the data era. 
  5. We should consider setting up a place for statisticians to donate money to build up the award fund for the COPSS/other statistics prizes. 
  6. We should try to forge relationships with start-up companies and encourage our students to pursue industry/start-up opportunities if they have interest. The less we are insular within the academic community, the more high-profile we will be. 
  7. It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data/the census/weather reports/anything touching data. 

Those are just a few of my ideas, but I have a ton more. I’m sure other people do too and I’d love to hear them. Let’s raise the tide and lift all of our boats!