Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Science really is non-partisan: facts and skepticism annoy everybody

This is a short open letter to those who believe scientists have a “liberal bias” and question their objectivity. I suspect that for many conservatives, this Saturday’s March for Science served as confirmation of this belief. In this post I will try to convince you that this is not the case, specifically by pointing out how scientists often annoy the left as much as the right.

First, let me emphasize that scientists are highly appreciative of the members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF. Although the current administration did propose a 20% cut to NIH, we are aware that, generally speaking, support for scientific research has traditionally been bipartisan.

It is true that the typical data-driven scientist will disagree, sometimes strongly, with many stances that are considered conservative. For example, most scientists will argue that:

  1. Climate change is real and is driven largely by increased carbon dioxide and other human-made emissions into the atmosphere.
  2. Evolution needs to be part of children’s education and creationism has no place in Science class.
  3. Homosexuality is not a choice.
  4. Science must be publicly funded because the free market is not enough to make science thrive.

But scientists will also hold positions that are often criticized heavily by some of those who identify as politically left wing:

  1. Current vaccination programs are safe and need to be enforced: without herd immunity thousands of children would die.
  2. Genetically modified organisms (GMOs) are safe and are indispensable to fight world hunger. There is no need for warning labels.
  3. Using nuclear energy to power our electrical grid is much less harmful than using natural gas, oil and coal and, currently, more viable than renewable energy.
  4. Alternative medicine, such as homeopathy, naturopathy, faith healing, reiki, and acupuncture, is pseudo-scientific quackery.

The timing of the announcement of the March for Science, along with the organizers’ focus on environmental issues and diversity, may have made it seem like a partisan or left-leaning event, but please also note that many scientists criticized the organizers for this very reason and there was much debate in general. Most scientists I know who went to the march did so not necessarily because they are against Republican administrations, but because they are legitimately concerned about some of the choices of this particular administration and the future of our country if we stop funding and trusting science.

If you haven’t already seen this Neil deGrasse Tyson video on the importance of Science to everyone, I highly recommend it.

Enrollment, the cost per credit, and strikes at the UPR

The University of Puerto Rico (UPR) receives approximately 800 million dollars from the state each year. This investment allows it to offer higher salaries, which attracts the best professors, to have the best facilities for research and teaching, and to keep the price per credit lower than that of private universities. Thanks to these great advantages, the UPR tends to be the first choice of Puerto Rican students, in particular its two largest campuses, Río Piedras (UPRRP) and Mayagüez. A student who makes good use of their time at the UPR, in addition to developing as a citizen, can successfully enter the workforce or continue their studies at the best graduate schools. The modest price per credit, in combination with federal Pell grants, has helped thousands of economically disadvantaged students complete their studies without going into debt.

Over the past decade a worrying reality has emerged: while demand for university education has grown, as shown by the growth in enrollment at private universities, the number of students enrolled at the UPR has fallen.

Why has enrollment at the UPR fallen? A popular explanation is that “the drop in enrollment is caused by the increase in the cost of tuition”. The theory that a rise in costs lowers enrollment is commonly accepted because it makes economic sense: when the price goes up, sales go down. But then why has enrollment at private universities grown? Nor is it explained by growth in the number of wealthy students since, in 2012, the median family income of young people enrolled at a UPR campus was $32,379; in contrast, the median income of those enrolled at a private university was $25,979. Another problem with this theory is that, once we adjust for inflation, the cost per credit has remained more or less stable at both the UPR and the private universities.

Now, if we look closely at the enrollment data, we notice that the largest drops occurred precisely in the strike years (2005, 2010, 2011). In 2005 a positive trend begins in enrollment at Sagrado, with the highest growth in 2010 and 2011.

Currently, several campuses, including Río Piedras, are closed indefinitely. In a national assembly attended by 10% of the system’s more than 50,000 students, an indefinite strike was approved by a vote of 4,522 to 1,154. To resume work, the students demand that “no sanctions be imposed on students who participate in the strike, that a university reform plan drawn up by the university community be presented, that the public debt be audited, and that the members of the public audit evaluation commission and its budget be reinstated”. This comes in response to the proposal by the Fiscal Oversight Board (Junta de Supervisión Fiscal, JSF) and the governor to cut the UPR’s budget as part of their attempts to resolve a serious fiscal crisis.

During the closure, the striking students block the rest of the university community from entering the campus, including those who do not consider the strike an effective form of protest. Those who oppose it and want to continue studying are accused of being selfish or of being allies of those who want to destroy the UPR. So far these students have not received explicit support from professors and administrators either. It should not surprise us if those who want to continue studying resort to paying more at a private university.

Although it is possible that the strike will exert enough political pressure to get a response to the demands set in the assembly, there are other possibilities that are less favorable to the students:

  • The lack of academic activity results in the exodus of thousands of students to private universities.
  • The JSF uses the closure to justify even more cuts: an institution does not need millions of dollars a day if it is closed.
  • The closed campuses lose their accreditation, since a university where no classes are taught cannot meet the required standards.
  • Pell grants are revoked for students whose classes are in recess.

There is a great deal of empirical evidence demonstrating the importance of accessible university education. The same is not true of strikes as a strategy for defending that education. And it is possible that the indefinite strike will have the opposite effect and greatly harm the students, in particular those who are forced to enroll at a private university.

Notes:

  1. Data provided by the Puerto Rico Council on Education (Consejo de Educación de Puerto Rico, CEPR).

  2. The 2011 cost per credit does not include the fee.

The Importance of Interactive Data Analysis for Data-Driven Discovery

Data analysis workflows and recipes are commonly used in science. They are actually indispensable since reinventing the wheel for each project would result in a colossal waste of time. On the other hand, mindlessly applying a workflow can result in totally wrong conclusions if the required assumptions don’t hold. This is why successful data analysts rely heavily on interactive data analysis (IDA). I write today because I am somewhat concerned that the importance of IDA is not fully appreciated by many of the policy makers and thought leaders who will influence how we access and work with data in the future.

I start by constructing a very simple example to illustrate the importance of IDA. Suppose that as part of a demographic study you are asked to summarize male heights across several counties. Since sample sizes are large and heights are known to be well approximated by a normal distribution, you feel comfortable using a tried and tested recipe: report the average and standard deviation as a summary. You are surprised to find a county with an average height of 6.1 feet and a standard deviation (SD) of 7.8 feet. Do you start writing a paper and a press release to describe this very interesting finding? Here, interactive data analysis saves us from naively reporting this. First, we note that the standard deviation is impossibly big if the data are in fact normally distributed: more than 15% of heights would be negative. Given this nonsensical result, the next obvious step for an experienced data analyst is to explore the data, say with a boxplot (see below). This immediately reveals a problem: it appears one value was reported in centimeters, as 180 centimeters rather than in feet. After fixing this, the summary changes to an average height of 5.75 feet with a 3-inch SD.

[Figure: European Outlier]
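
The height check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis behind the example: the data are simulated, the sample size of 500 and the per-person SD of 0.25 feet (3 inches) are assumptions chosen so that the summaries come out close to the numbers quoted, and the helper names (`summarize`, `implied_negative_fraction`, `far_outliers`) are hypothetical. The outlier rule is a stricter variant of the usual boxplot fence, using 3 interquartile ranges instead of 1.5.

```python
import random
import statistics
from statistics import NormalDist

CM_PER_FOOT = 30.48


def summarize(heights):
    """Return the recipe summary: (mean, sample SD)."""
    return statistics.mean(heights), statistics.stdev(heights)


def implied_negative_fraction(mean, sd):
    """Share of a normal(mean, sd) population that would lie below zero."""
    return NormalDist(mean, sd).cdf(0.0)


def far_outliers(heights, k=3.0):
    """Values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(heights, n=4)
    iqr = q3 - q1
    return [h for h in heights if h < q1 - k * iqr or h > q3 + k * iqr]


# Simulated county (assumption): 499 heights in feet plus one entered in cm.
random.seed(0)
heights = [random.gauss(5.75, 0.25) for _ in range(499)]
heights.append(180.0)

# Step 1: apply the recipe blindly and sanity-check what it implies.
raw_mean, raw_sd = summarize(heights)
negative_share = implied_negative_fraction(raw_mean, raw_sd)

# Step 2: look at the data; here a boxplot-style rule stands in for the plot.
suspects = far_outliers(heights)

# Step 3: treat the suspect values as centimeters, convert, and summarize again.
fixed = [h / CM_PER_FOOT if h in suspects else h for h in heights]
fixed_mean, fixed_sd = summarize(fixed)

print(f"raw:   mean={raw_mean:.2f} ft, SD={raw_sd:.2f} ft")
print(f"normal model implies {negative_share:.0%} negative heights")
print(f"suspect values: {suspects}")
print(f"fixed: mean={fixed_mean:.2f} ft, SD={fixed_sd:.2f} ft")
```

The point of the sketch is the order of operations: the automatic summary is computed first, but it is the sanity check and the look at individual values that expose the unit error. In real data the decision to convert a suspect value would of course need to be confirmed against the source, not applied automatically.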

Years of data analysis experience will show you that examples like this are common. Unfortunately, as data and analyses get more complex, workflow failures are harder to detect and often go unnoticed. An important principle many of us teach our trainees is to carefully check for hidden problems when data analysis leads you to unexpected results, especially when the unexpected result, if it holds up, would benefit us professionally, for example by leading to a publication.

Interactive data analysis is also indispensable for the development of new methodology. For example, in my field of research, exploring the data has led to the discovery of the need for new methods and motivated new approaches that handle specific cases that existing workflows can’t handle.

So why am I concerned? As public datasets become larger and more numerous, many funding agencies, policy makers and industry leaders are advocating for using cloud computing to bring computing to the data. If done correctly, this would provide a great improvement over the current redundant and unsystematic approach of everybody downloading data and working with it locally. However, after looking into the details of some of these plans, I have become a bit concerned that perhaps the importance of IDA is not fully appreciated by decision makers.

As an example, consider the NIH efforts to promote data-driven discovery that center around plans for the Data Commons. The linked page describes an ecosystem with four components, one of which is “Software”. According to the description, the software component of The Commons should provide “[a]ccess to and deployment of scientific analysis tools and pipeline workflows”. There is no mention of a strategy that will grant access to the raw data. Without this, carefully checking the workflow output and developing the analysis tools and pipeline workflows of the future will be difficult.

I note that data analysis workflows are very popular in fields in which data analysis is indispensable, as is the case in biomedical research, my focus area. In this field, data generators, who typically lead the scientific enterprise, are not always trained data analysts. But the literature is overflowing with proposed workflows. You can gauge the popularity of these by the vast number published in the Nature journals, as demonstrated by this Google search:

[Figure: Nature workflows]

In a field in which data generators are not data analysis experts, the workflow has the added allure that it removes the need to think deeply about data analysis and instead shifts the responsibility to pre-approved software. Note that these workflows are not always described with the mathematical language or computer code needed to truly understand them, but rather with a series of PowerPoint shapes. The gist of the typical data analysis workflow can be simplified into the following:

[Figure: workflows]

This simplification of the data analysis process makes it particularly worrisome that the intricacies of IDA are not fully appreciated.

As mentioned above, data analysis workflows are a necessary component of the scientific enterprise. Without them the process would slow to a halt. However, workflows should only be implemented once consensus is reached regarding their optimality. And even then, IDA is needed to ensure that the process is performing as expected. The careers of many of my colleagues have been dedicated mostly to the development of such analysis tools. We have learned that rushing to implement workflows before they are mature enough can have widespread negative consequences. And, at least in my experience, developing rigorous tools is impossible without interactive data analysis. So I hope that this post helps make a case for the importance of interactive data analysis and that it continues to be a part of the scientific enterprise.

The levels of data science class

In a recent post, Nathan Yau points to a comment by Jake Porway about data science hackathons. They both say that for data science/visualization projects to be successful you have to start with an important question, not with a pile of data. This is the problem-forward, not solution-backward, approach to data science and big data. It is also the approach advocated in the really nice piece on teaching data science by Stephanie and Rafa.

I have adopted a similar approach in the data science class here at Hopkins, largely inspired by Dan Meyer’s patient problem solving for middle school math class. So instead of giving students a full problem description I give them project suggestions like:

  • Option 1: Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.
  • Option 2: Analyze the traffic fatality data to identify any geographic, time varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.
  • Option 3: Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).
  • Option 4: Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile. Develop a model for the Gail score prediction from Facebook and its uncertainty. You should include estimates of uncertainty in the predicted score due to your inferred variables.
  • Option 5: Potentially fun but super hard project. Develop an algorithm for a self-driving car using the training data: http://research.comma.ai/. Build a model for predicting at every moment what direction the car should be going, whether it should be signalling, and what speed it should be going. You might consider starting with a small subsample of the (big) training set.

Each of these projects shares the characteristic that there is an interesting question - but the data may or may not be available. If it is available it may or may not have to be processed/cleaned/organized. Moreover, with the data in hand you may need to think about how it was collected or go out and collect some more data. This kind of problem is inspired by this quote from Dan’s talk - he was talking about math but it could easily have been data science:

Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have insufficient information and have to go find some?

I realize, though, that this is advanced data science. So I was thinking about the levels of data science courses and how you would build up a curriculum. I came up with the following courses/levels and would be interested in what others think.

  • Level 0: Background: Basic computing, some calculus with a focus on optimization, basic linear algebra.
  • Level 1: Data science thinking: How to define a question, how to turn a question into a statement about data, how to identify data sets that may be applicable, experimental design, critical thinking about data sets.
  • Level 2: Data science communication: Teaching students how to write about data science, how to express models qualitatively and in mathematical notation, explaining how to interpret results of algorithms/models. Explaining how to make figures.
  • Level 3: Data science tools: Learning the basic tools of R, loading data of various types, reading data, plotting data.
  • Level 4: Real data: Manipulating different file formats, working with “messy” data, trying to organize multiple data sets into one data set.
  • Level 5: Worked examples: Use real data examples, but work them through from start to finish as case studies, don’t make them easy clean data sets, but have a clear path from the beginning of the problem to the end.
  • Level 6: Just the question: Give students a question where you have done a little research to know that it is possible to get at least some data, but aren’t 100% sure it is the right data or that the problem can be perfectly solved. Part of the learning process here is knowing how to define success or failure and when to keep going or when to quit.
  • Level 7: The student is the scientist: Have the students come up with their own questions and answer them using data.

I think that a lot of the thought right now in biostatistics has been on level 3 and 4 courses. These are courses where we have students work with real data sets and learn about tools. To be self-sufficient as a data scientist it is clear you need to be able to work with real world data. But what Jake/Nathan are referring to is level 5 or level 6 - cases where you have a question but the data needs a ton of work and may not even be good enough without collecting new information. Jake and Nathan have perfectly identified the ability to translate murky questions into data answers as the most valuable data skill. If I had to predict the future of data courses I would see them trending in that direction.