
17 Aug

Interview with C. Titus Brown - Computational biologist and open access champion

C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. He develops computational software for next-generation sequencing and is the author of the blog “Living in an Ivory Basement”. We talked to Titus about open access (he publishes his unfunded grants online!), improving the reputation of PLoS One, his research in computational software development, and work-life balance in academia. 

Do you consider yourself a statistician, data scientist, computer scientist, or something else?

Good question.  Short answer: apparently somewhere along the way I
became a biologist, but with a heavy dose of “computational scientist”
in there.

The longer answer?  Well, it’s a really long answer…

My first research was on Avida, a bottom-up model for evolution that
Chris Adami, Charles Ofria and I wrote together at Caltech in 1993:
http://en.wikipedia.org/wiki/Avida.  (Fun fact: Chris, Charles and I
are now all faculty at Michigan State!  Chris and I have offices one
door apart, and Charles has an office one floor down.)  Avida got me
very interested in biology, but not in the undergrad “memorize stuff”
biology — more in research.  This was computational science: using
simple models to study biological phenomena.

While continuing evolution research, I did my undergrad in pure math at Reed
College, which was pretty intense; I worked in the Software Development
lab there, which connected me to a bunch of reasonably well known hackers
including Keith Packard, Mark Galassi, and Nelson Minar.

I also took a year off and worked on Earthshine:

http://en.wikipedia.org/wiki/Planetshine#Earthshine

and then rebooted the project as an RA in 1997, the summer after
graduation.  This was mostly data analysis, although it included a
fair amount of hanging off of telescopes adjusting things as the
freezing winter wind howled through the Big Bear Solar Observatory’s
observing room, aka “data acquisition”.

After Reed, I applied to a bunch of grad schools, including Princeton
and Caltech in bio, UW in Math, and UT Austin and Ohio State in
physics.  I ended up at Caltech, where I switched over to
developmental biology and (eventually) regulatory genomics and genome
biology in Eric Davidson’s lab.  My work there included quite a bit
of wet bench biology, which is not something many people associate with me,
but was nonetheless something I did!

Genomics was really starting to hit the fan in the early 2000s, and I
was appalled by how biologists were handling the data — as one
example, we had about $500k worth of sequences sitting on a shared
Windows server, with no metadata or anything — just the filenames.
As another example, I watched a postdoc manually BLAST a few thousand
ESTs against the NCBI nr database; he sat there and did them three by
three, having figured out that he could concatenate three sequences
together and then manually deconvolve the results.  As probably the
most computationally experienced person in the lab, I quickly got
involved in data analysis and Web site stuff, and ended up writing
some comparative sequence analysis software that was mildly popular
for a while.

As part of the sequence analysis Web site I wrote, I became aware that
maintaining software was a *really* hard problem.  So, towards the end
of my 9 year stint in grad school, I spent a few years getting into
testing, both Web testing and more generally automated software
testing.  This led to perhaps my most used piece of software, twill, a
scripting language for Web testing.  It also ended up being one of the
things that got me elected into the Python Software Foundation,
because I was doing everything in Python (which is a really great
language, incidentally).

I also did some microbial genome analysis, which led to my first
completely reproducible paper (Brown and Callan, 2004;
http://www.ncbi.nlm.nih.gov/pubmed/14983022), and collaborated with the
Orphan lab on some metagenomics:
http://www.ncbi.nlm.nih.gov/pubmed?term=18467493.  This led to a
fascination with the biological “dark matter” in nature that is the
subject of some of my current work on metagenomics.

I landed my faculty position at MSU right out of grad school, because
bioinformatics is sexy and CS departments are OK with hiring grad
students as faculty.  However, I deferred for two years to do a
postdoc in Marianne Bronner-Fraser’s lab because I wanted to switch to
the chick as a model organism, and so I ended up arriving at MSU in
2009.  I had planned to focus a lot on developmental gene regulatory
networks, but 2009 was when Illumina sequencing hit, and as one of the
few people around who wasn’t visibly frightened by the term “gigabyte”
I got inextricably involved in a lot of different sequence analysis
projects.  These all converged on assembly, and, well, that seems to
be what I work on now :).

The two strongest threads that run through my research are these:

1. “better science through superior software” — so much of science
depends on computational inference these days, and so little of the
underlying software is “good”.  Scientists *really* suck at software
development (for both good and bad reasons) and I worry that a lot of
our current science is on a really shaky foundation.  This is one
reason I’m invested in Software Carpentry
(http://software-carpentry.org), a training program that Greg Wilson
has been developing — he and I agree that science is our best hope
for a positive future, and good software skills are going to be
essential for a lot of that science.  More generally I hope to turn
good software development into a competitive advantage for my lab
and my students.

2. “better hypothesis generation is needed” — biologists, in
particular, tend to leap towards the first testable hypothesis they
find.  This is a cultural thing stemming (I think) from a lot of
really bad interactions with theory: the way physicists and
mathematicians think about the world simply doesn’t fit with the Rube
Goldberg-esque features of biology (see
http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html).

So getting back to the question, uh, yeah, I think I’m a computational
scientist who is working on biology?  And if I need to write a little
(or a lot) of software to solve my problems, I’ll do that, and I’ll
try to do it with some attention to good software development
practice — not just out of ethical concern for correctness, but
because it makes our research move faster.

One thing I’m definitely *not* is a statistician.  I have friends who
are statisticians, though, and they seem like perfectly nice people.

You have a pretty radical approach to open access. Can you tell us a little bit about that?

Ever since Mark Galassi introduced me to open source, I thought it
made sense.  So I’ve been an open source-nik since … 1988?

From there it’s just a short step to thinking that open science makes
a lot of sense, too.  When you’re a grad student or a postdoc, you
don’t get to make those decisions, though; it took until I was a PI
for me to start thinking about how to do it.  I’m still conflicted
about *how* open to be, but I’ve come to the conclusion that posting
preprints is obvious
(http://ivory.idyll.org/blog/blog-practicing-open-science.html).

The “radical” aspect that you’re referring to is probably my posting
of grants (http://ivory.idyll.org/blog/grants-posted.html).  There are
two reasons I ended up posting all of my single-PI grants.  Both have
their genesis in this past summer, when I spent about 5 months writing
6 different grants — 4 of which were written entirely by me.  Ugh.

First, I was really miserable one day and joked on Twitter that “all
this grant writing is really cutting into my blogging” — a mocking
reference to the fact that grant writing (to get $$) is considered
academically worthwhile, while blogging (which communicates with the
public and is objectively quite valuable) counts for naught with my
employer.  Jonathan Eisen responded by suggesting that I post all of
the grants and I thought, what a great idea!

Second, I’m sure it’s escaped most people (hah!), but grant funding
rates are in the toilet — I spent all summer writing grants while
expecting most of them to be rejected.  That’s just flat-out
depressing!  So it behooves me to figure out how to make them serve
multiple duties.  One way to do that is to attract collaborators;
another is to serve as google bait for my lab; a third is to provide
my grad students with well-laid-out PhD projects.  A fourth duty they
serve (and I swear this was unintentional) is to point out to people
that this is MY turf and I’m already solving these problems, so maybe
they should go play in less occupied territory.  I know, very passive
aggressive…

So I posted the grants, and unknowingly joined a really awesome cadre
of folk who had already done the same
(http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/).
Most feedback I’ve gotten has been from grad students and undergrads
who really appreciate the chance to look at grants; some people told
me that they’d been refused the chance to look at grants from their
own PIs!

At the end of the day, I’d be lucky to be relevant enough that people
want to steal my grants or my software (which, by the way, is under a
BSD license — free for the taking, no “theft” required…).  My
observation over the years is that most people will do just about
anything to avoid using other people’s software.

In theoretical statistics, there is a tradition of publishing pre-prints while papers are submitted. Why do you think biology is lagging behind?

I wish I knew!  There’s clearly a tradition of secrecy in biology;
just look at the Cold Spring Harbor rules re tweeting and blogging
(http://meetings.cshl.edu/report.html) - this is a conference, for
chrissakes, where you go to present and communicate!  I think it’s
self-destructive and leads to an insider culture where only those who
attend meetings and chat informally get to be members of the club,
which frankly slows down research. Given the societal and medical
challenges we face, this seems like a really bad way to continue doing
research.

One of the things I’m proudest of is our effort on the cephalopod
genome consortium’s white paper,
http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html, where a
group of bioinformaticians at the meeting pushed really hard to walk
the line between secrecy and openness.  I came away from that effort
thinking two things: first, that biologists were erring on the side of
risk aversion; and second, that genome database folk were smoking
crack when they pushed for complete openness of data.  (I have a blog
post on that last statement coming up at some point.)

The bottom line is that the incentives in academic biology are aligned
against openness.  In particular, you are often rewarded for the first
observation, not for the most useful one; if your data is used to do
cool stuff, you don’t get much if any credit; and it’s all about
first/last authorship and who is PI on the grants.  All too often this
means that people sit on their data endlessly.

This is getting particularly bad with next-gen data sets, because
anyone can generate them but most people have no idea how to analyze
their data, and so they just sit on it forever…

Do you think the ArXiv model will catch on in biology or just within the bioinformatics community?

One of my favorite quotes is: “Making predictions is hard, especially
when they’re about the future.” I attribute it to Niels Bohr.

It’ll take a bunch of big, important scientists to lead the way. We
need key members of each subcommunity of biology to decide to do it on
a regular basis. (At this point I will take the obligatory cheap shot
and point out that Jonathan Eisen, noted open access fan, doesn’t post
his stuff to preprint servers very often.  What’s up with that?)  It’s
going to be a long road.

What is the reaction you most commonly get when you tell people you have posted your unfunded grants online?

“Ohmigod what if someone steals them?”

Nobody has come up with a really convincing model for why posting
grants is a bad thing.  They’re just worried that it *might* be.  I
get the vague concerns about theft, but I have a hard time figuring
out exactly how it would work out well for the thief — reputation is
a big deal in science, and gossip would inevitably happen.  And at
least in bioinformatics I’m aiming to be well enough known that
straight up ripping me off would be suicidal.  Plus, if reviewers
do/did google searches on key concepts then my grants would pop up,
right?  I just don’t see it being a path to fame and glory for anyone.

Revisiting the passive-aggressive nature of my grant posting, I’d like
to point out that most of my grants depend on preliminary results from
our own algorithms.  So even if they want to compete on my turf, it’ll
be on a foundation I laid.  I’m fine with that — more citations for
me, either way :).

More optimistically, I really hope that people read my grants and then
find new (and better!) ways of solving the problems posed in them.  My
goal is to enable better science, not to hunker down in a tenured job
and engage in irrelevant science; if someone else can use my grants as
a positive or negative signpost to make progress, then broadly
speaking, my job is done.

Or, to look at it another way: I don’t have a good model for either
the possible risks OR the possible rewards of posting the grants, and
my inclinations are towards openness, so I thought I’d see what
happens.

How can junior researchers correct misunderstandings about open access/journals like PLoS One that separate correctness from impact? Do you have any concrete ideas for changing minds of senior folks who aren’t convinced?

Render them irrelevant by becoming senior researchers who supplant them
when they retire.  It’s the academic tradition, after all!  And it’s
really the only way within the current academic system, which — for
better or for worse — isn’t going anywhere.

Honestly, we need fewer people yammering on about open access and more
people simply doing awesome science and submitting it to OA journals.
Conveniently, many of the high impact journals are shooting themselves
in the foot and encouraging this by rejecting good science that then
ends up in an OA journal; that wonderful ecology op-ed on PLoS One
citation rates shows this well
(http://library.queensu.ca/ojs/index.php/IEE/article/view/4351).

Do you have any advice on what computing skills/courses statistics students interested in next generation sequencing should take?

For courses, no — in my opinion 80% of what any good researcher
learns is self-motivated and often self-taught, and so it’s almost
silly to pretend that any particular course or set of skills is
sufficient or even useful enough to warrant a whole course.  I’m not a
big fan of our current undergrad educational system :)

For skills?  You need critical thinking coupled with an awareness that
a lot of smart people have worked in science, and odds are that there
are useful tricks and approaches that you can use.  So talk to other
people, a lot!  My lab has a mix of biologists, computer scientists,
graph theorists, bioinformaticians, and physicists; more labs should
be like that.

Good programming skills are going to serve you well no matter what, of
course.  But I know plenty of good programmers who aren’t very
knowledgeable about biology, and who run into problems doing actual
science.  So it’s not a panacea.

How does replicable or reproducible research fit into your interests?

I’ve wasted *so much time* reproducing other people’s work that when
the opportunity came up to put down a marker, I took it.

http://ivory.idyll.org/blog/replication-i.html

The digital normalization paper shouldn’t have been particularly
radical; that it is tells you all you need to know about replication
in computational biology.

This is actually something I first did a long time ago, with what was
perhaps my favorite pre-faculty-job paper: if you look at the methods
for Brown & Callan (2004) you’ll find a downloadable package that
contains all of the source code for the paper itself and the analysis
scripts.  But back then I didn’t blog :).

Lack of reproducibility and openness in methods has serious
consequences — how much of cancer research has been useless, for
example?  See this horrific report:
http://online.wsj.com/article/SB10001424052970203764804577059841672541590.html.
Again, the incentives are all wrong: you get grant money for
publishing, not for being useful.  The two are not necessarily the
same…

Do you have a family, and how do you balance work life and home life?

Why, thank you for asking!  I do have a family — my wife, Tracy Teal,
is a bioinformatician and microbial ecologist, and we have two
wonderful daughters, Amarie (4) and Jessie (1).  It’s not easy being a
junior professor and a parent at the same time, and I keep on trying
to figure out how to balance the needs of travel with the need to be a
parent (hint: I’m not good at it).  I’m increasingly leaning towards
blogging as being a good way to have an impact while being around
more; we’ll see how that goes.

22 Apr

Sunday data/statistics link roundup (4/22)

  1. Now we know who is to blame for the pie chart. I had no idea it had been around, straining our ability to compare relative areas, since 1801. However, the same guy (William Playfair) apparently also invented the bar chart. So he wouldn’t be totally shunned by statisticians. (via Leonid K.)
  2. A nice article in the Guardian about the current group of scientists that are boycotting Elsevier. I have to agree with the quote that leads the article, “All professions are conspiracies against the laity.” On the other hand, I agree with Rafa that academics are partially to blame for buying into the closed access hegemony. I think more than a boycott of a single publisher is needed; we need a change in culture. (first link also via Leonid K)
  3. A blog post on how to add a transparent image layer to a plot. For some reason, I have wanted to do this several times over the last couple of weeks, so the serendipity of seeing it on R Bloggers merited a mention. 
  4. I agree the Earth Institute needs a better graphics advisor. (via Andrew G.)
  5. A great article on why multiple choice tests are used - they are an easy way to collect data on education. But that doesn’t mean they are the right data. This reminds me of the Tukey quote: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data”. It seems to me if you wanted to have a major positive impact on education right now, the best way would be to develop a new experimental design that collects the kind of data that really demonstrates mastery of reading/math/critical thinking. 
  6. Finally, a bit of a bleg…what is the best way to do the SVD of a huge (think 1e6 x 1e6), sparse matrix in R? Preferably without loading the whole thing into memory…
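The sparse-SVD bleg above has a standard answer: compute only a *truncated* SVD with an iterative (Lanczos-type) solver, so the matrix is touched solely through sparse matrix-vector products and is never loaded as a dense array. In R, packages such as irlba take this approach; here is a minimal Python sketch of the same idea using scipy (the matrix size and density are made up for illustration, and much larger matrices work the same way):

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A large sparse matrix, built without ever forming the dense array.
# (Size and density here are illustrative; scale up as needed.)
A = sparse_random(10_000, 10_000, density=1e-4, format="csr", random_state=0)

# Truncated SVD: only the k largest singular triplets are computed, and A
# is only ever accessed through sparse matrix-vector products.
k = 5
U, s, Vt = svds(A, k=k)

print(U.shape, s.shape, Vt.shape)  # (10000, 5) (5,) (5, 10000)
```

The key design point is that for a 1e6 x 1e6 matrix you almost never want all one million singular values anyway; a few dozen dominant ones usually suffice, and those are exactly what iterative methods deliver cheaply.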
08 Jan

Sunday Data/Statistics Link Roundup

A few data/statistics related links of interest:

  1. Eric Lander Profile
  2. The math of lego (should be “The statistics of lego”)
  3. Where people are looking for homes.
  4. Hans Rosling’s Ted Talk on the Developing world (an oldie but a goodie)
  5. Elsevier is trying to make open-access illegal (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more here
03 Nov

Free access publishing is awesome...but expensive. How do we pay for it?

I am a huge fan of open access journals. I think open access is good both for moral reasons (science should be freely available) and for more selfish ones (I want people to be able to read my work). If given the choice, I would publish all of my work in journals that distribute results freely. 

But it turns out that for most open/free access systems, the publishing charges are paid by the scientists publishing in the journals. I did a quick scan and compiled this little table of how much it costs to publish a paper in different journals (here is a bigger table): 

  • PLoS One: $1,350.00
  • PLoS Biology: $2,900.00
  • BMJ Open: $1,937.28
  • Bioinformatics (Open Access Option): $3,000.00
  • Genome Biology (Open Access Option): $2,500.00
  • Biostatistics (Open Access Option): $3,000.00

The first thing I noticed is that it costs a minimum of about $1,500 to get a paper published open access. That may not seem like a lot of money, and most journals offer discounts to people who can’t pay. But it still adds up: this past year my group published 7 papers. If I paid for all of them to be published open access, that would be at minimum $10,500! That is half the salary of a graduate student researcher for an entire year. For a senior scientist that may be no problem, but for early career scientists, or scientists with limited access to resources, it is a big challenge.
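The back-of-the-envelope math here is simple enough to script; a quick sketch, with the rough minimum fee and the paper count taken from the paragraph above:

```python
# Rough minimum open-access charge per paper, USD (from the fee table above).
min_fee_per_paper = 1_500
papers_this_year = 7  # papers the group published this past year

total = min_fee_per_paper * papers_this_year
print(f"minimum annual outlay: ${total:,}")  # minimum annual outlay: $10,500
```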

Publishers who are solely dedicated to open access (PLoS, BMJ Open, etc.) seem to have, on average, lower publication charges than journals that only offer open access as an option. I think part of this is that the journals that aren’t open access in general have to make up some of the profits they lose by making the articles free. I certainly don’t begrudge the journals the costs. They have to maintain the websites, format the articles, and run the peer review process. That all costs money. 

A modest proposal

What I wonder is whether there is a better place for that money to come from. Here is one proposal (hat tip to Rafa): academic and other libraries pay a ton of money for subscriptions to journals like Nature and Science. They also are required to pay for journals in a large range of disciplines. What if, instead of investing this money in subscriptions for their university, academic libraries pitched in and subsidized the publication costs of open/free access?

If all university libraries pitched in, the cost for any individual library would be relatively small. It would probably be less than paying for subscriptions to hundreds of journals. At the same time, it would be an investment that would benefit not only the researchers at their school, but also the broader scientific community by keeping research open. Then neither the people publishing the work, nor the people reading it would be on the hook for the bill. 

This approach is the route taken by ArXiv, a free database of unpublished papers. These papers haven’t been peer reviewed, so they don’t always carry the same weight as papers published in peer-reviewed journals. But there are a lot of really good and important papers in the database - it is an almost universally accepted pre-print server.

The other nice thing about ArXiv is that you don’t pay for article processing; the papers are published as is. The papers don’t look quite as pretty as they do in Nature/Science or even PLoS, but it is also much cheaper. The only costs associated with making this a full fledged peer-reviewed journal would be refereeing (which scientists do for free anyway) and editorial responsibilities (again mostly volunteer by scientists). 

Related Posts:  Jeff on “Submitting scientific papers is too time consuming”