2011-11-18

Héctor Corrada Bravo is an assistant professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park. He moved to College Park after finishing his Ph.D. in computer science at the University of Wisconsin and a postdoc in biostatistics at the Johns Hopkins Bloomberg School of Public Health. He has done outstanding work at the intersection of molecular biology, computer science, and statistics. For more info check out his webpage.

Which term applies to you: statistician/data scientist/computer
scientist/machine learner?

I want to understand interesting phenomena (in my case mostly in
biology and medicine) and I believe that our ability to collect a large number of relevant
measurements and infer characteristics of these phenomena can drive
scientific discovery and commercial innovation in the near future.
Perhaps that makes me a data scientist and means that depending on the
task at hand one or more of the other terms apply.

A lot of the distinctions many people make between these terms are
vacuous and unnecessary, but some are nonetheless useful to think
about. For example, both statisticians and machine learners [sic] know
how to create statistical algorithms that compute interesting and informative objects using measurements (perhaps) obtained through some stochastic or partially observed
process. These objects could be genomic tools for cancer screening, or
statistics that better reflect the relative impact of baseball players
on team success.

Both fields also give us ways to evaluate and characterize these objects.
However, there are times when these objects are tools that fulfill an
immediately utilitarian purpose, and thinking the way an engineer might
(as many people in machine learning do) is the right approach.
Other times, these objects are there to help us get insights about our
world and thinking in ways that many statisticians do is the right
approach.  You need both of these ways of thinking to do interesting
science and dogmatically avoiding either of them is a terrible idea.

How did you get into statistics/data science (i.e. your history)?

I got interested in Artificial Intelligence at one point, and found
that my mathematics background was nicely suited to work on this. Once
I got into it, thinking about statistics and how to analyze and
interpret data was natural and necessary. I started working with two
wonderful advisors at Wisconsin, Raghu Ramakrishnan (CS) and Grace Wahba (Statistics),
who helped shape the way I approach problems from different angles
and with different goals. The last piece was discovering that
computational biology is a fantastic setting in which to apply and
devise these methods to answer really interesting questions.

What is the problem currently driving you?

I’ve been working on cancer epigenetics to find specific genomic
measurements for which increased stochasticity appears to be general
across multiple cancer types. Right now, I’m really wondering how far
into the clinic these discoveries can be taken, if at all. For
example, can we build tools that use these genomic measurements to
improve cancer screening?

How do you see CS/statistics merging in the future?

I think that future got here some time ago, but is about to get much
more interesting.

Here is one example: Computer Science is about creating and analyzing
algorithms and building the systems that can implement them. Some of
what many computer scientists have done looks at problems concerning how to
keep, find and ship around information (Operating Systems, Networks,
Databases, etc.). Many times these have been driven by very specific
needs, e.g., commercial transactions in databases. In some ways,
companies have moved from asking “how do I use data to keep track
of my activities?” to “how do I use data to decide which activities to do
and how to do them?” Statistical tools should be used to answer these
questions, and systems built by computer scientists have statistical
algorithms at their core.

Beyond R, what are some really useful computational tools for statisticians?

I think a computational tool that everyone can benefit a lot from
understanding better is algorithm design and analysis. This doesn’t
have to be at a particularly deep level, but just getting a sense of
how long a particular process might take, and how to devise a different way of doing it that might be more efficient, is really useful. I’ve been toying with the idea of creating a CS course called (something like) “Highlights of continuous
mathematics for computer science” that reminds everyone of the cool
stuff that one learns in math, now that we can appreciate its usefulness. Similarly, I think
statistics students can benefit from “Highlights of discrete
mathematics for statisticians”.
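
As a small illustration of the kind of reasoning described above (this sketch is not from the interview; the function names and the data are hypothetical), consider computing the running mean of a sequence. Recomputing every prefix sum from scratch takes on the order of n² additions, while carrying the sum forward takes on the order of n:

```python
from itertools import accumulate


def running_means_quadratic(xs):
    """Recompute each prefix sum from scratch: roughly n^2 / 2 additions."""
    return [sum(xs[: i + 1]) / (i + 1) for i in range(len(xs))]


def running_means_linear(xs):
    """Carry the prefix sum forward with accumulate: n additions."""
    return [total / (i + 1) for i, total in enumerate(accumulate(xs))]


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
data = [2.0, 4.0, 6.0, 8.0]
print(running_means_quadratic(data))
print(running_means_linear(data))
```

Both functions return the same values; only the amount of work differs, and that difference becomes noticeable once the input has many thousands of elements.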

Now a request for comments below from you and readers: (5a) Beyond R,
what are some really useful statistical tools for computer scientists?

Review times in statistics journals are long, should statisticians
move to conference papers?

I don’t think so. Long review times (anything more than 3 weeks) are
really not necessary. We tend to publish in journals with fairly quick
review times that produce (for the most part) really useful and
insightful reviews.

I was recently talking to senior members in my field who were telling
me stories about the “old times” when CS was moving from mainly
publishing in journals to now mainly publishing in conferences. But
now, people working in collaborative projects (like computational biology) work in fields
that primarily publish in journals, so the field needs to be able to
properly evaluate their impact and productivity. There is no perfect
system.

For instance, review requests in fields where conferences are the main
publication venue come in waves (dictated by conference schedule).
Reviewers have a lot of papers to go over in a relatively short time
which makes their job of providing really helpful and fair reviews not
so easy. So, in that respect, the journal system can be better. The one thing that is universally true is that you don’t need long review times.

Previous Interviews: Daniela Witten, Chris Barr, Victoria Stodden