Simply Statistics: a statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Enrollment, the cost per credit, and strikes at the UPR

The University of Puerto Rico (UPR) receives approximately 800 million dollars from the state each year. This investment allows it to offer higher salaries, which attracts the best professors; to maintain the best facilities for research and teaching; and to keep the price per credit lower than at the private universities. Thanks to these great advantages, the UPR is usually the first choice of Puerto Rican students, in particular its two largest campuses, Río Piedras (UPRRP) and Mayagüez. A student who makes good use of their time at the UPR, besides developing as a citizen, can enter the workforce successfully or go on to the best graduate schools. The modest price per credit, combined with federal Pell grants, has helped thousands of economically disadvantaged students complete their studies without going into debt.

Over the past decade a worrying reality has emerged: while demand for a university education has grown, as shown by the growth in enrollment at the private universities, the number of students enrolled at the UPR has declined.

Why has enrollment at the UPR dropped? One popular explanation is that "the drop in enrollment is caused by the increase in the cost of tuition." The theory that a rise in cost reduces enrollment is commonly accepted because it makes economic sense: when the price goes up, sales go down. But then why has enrollment at the private universities grown? Nor is it explained by growth in the number of wealthy students: in 2012, the median family income of young people enrolled at a UPR campus was $32,379, while the median family income of those enrolled at a private university was $25,979. Another problem with this theory is that, once we adjust for inflation, the cost per credit has remained more or less stable at both the UPR and the private universities.
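As an aside, "adjusting for inflation" here simply means deflating each year's nominal cost per credit by a price index. Here is a minimal sketch in R; the cost and CPI values are made up purely to illustrate the calculation and are not the actual figures:

```r
# Illustrative only: hypothetical nominal cost per credit and CPI values,
# not the real UPR or CEPR numbers.
year         <- 2005:2012
nominal_cost <- c(40, 42, 45, 47, 49, 51, 53, 55)          # dollars per credit (hypothetical)
cpi          <- c(100, 103, 107, 111, 112, 114, 117, 119)  # price index (hypothetical)

# Express every year in 2012 dollars: deflate by the CPI relative to the base year
real_cost <- nominal_cost * cpi[year == 2012] / cpi
data.frame(year, nominal_cost, real_cost)
```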

Now, if we look carefully at the enrollment data, we notice that the largest drops occurred precisely in the strike years (2005, 2010, 2011). In 2005 a positive trend begins in Sagrado's enrollment, with the largest growth in 2010 and 2011.
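One way to see this pattern is to plot enrollment against year and mark the strike years. A minimal sketch in R, using hypothetical enrollment numbers in place of the CEPR data (note 1), which are not reproduced here:

```r
# Hypothetical enrollment figures, for illustration only.
year    <- 2003:2013
upr     <- c(66, 65, 62, 63, 63, 62, 61, 57, 54, 55, 54) * 1000
sagrado <- c(5.0, 5.1, 5.3, 5.4, 5.5, 5.6, 5.8, 6.4, 7.0, 7.1, 7.2) * 1000
strikes <- c(2005, 2010, 2011)

plot(year, upr, type = "b", ylim = range(c(upr, sagrado)),
     xlab = "Year", ylab = "Enrollment", main = "UPR vs. Sagrado (illustrative data)")
lines(year, sagrado, type = "b", lty = 2)
abline(v = strikes, col = "gray", lty = 3)   # mark the strike years
legend("topright", legend = c("UPR", "Sagrado"), lty = c(1, 2))
```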

Currently, several campuses, including Río Piedras, are closed indefinitely. At a national assembly attended by 10% of the system's more than 50,000 students, an indefinite strike was approved by a vote of 4,522 to 1,154. To resume operations, the students demand that "no sanctions be imposed on students who participate in the strike, that a plan for university reform developed by the university community be presented, that the public debt be audited, and that the members of the commission evaluating the public audit, along with its budget, be reinstated." This comes in response to the proposal by the Fiscal Oversight Board (Junta de Supervisión Fiscal, JSF) and the governor to cut the UPR's budget as part of their attempts to resolve a serious fiscal crisis.

During the shutdown, the striking students block the rest of the university community from entering the campus, including those who do not consider the strike an effective form of protest. Those who oppose it and want to keep studying are accused of being selfish or of being allies of those who want to destroy the UPR. So far these students have not received explicit support from professors or administrators either. We should not be surprised if those who want to keep studying end up paying more at a private university.

[Image: portones2]

Although it is possible that the strike will exert enough political pressure to get the demands set out at the assembly met, there are other possibilities that are less favorable for the students:

  • The lack of academic activity drives thousands of students into exile at the private universities.
  • The JSF uses the shutdown to justify even more cuts: an institution does not need millions of dollars a day if it is closed.
  • The closed campuses lose their accreditation, since a university where no classes are taught cannot meet the required standards.
  • The Pell grants of students on hiatus are revoked.

There is ample empirical evidence demonstrating the importance of accessible university education. The same is not true of strikes as a strategy for defending that education. And it is quite possible that the indefinite strike will have the opposite effect and seriously hurt the students, in particular those who end up forced to enroll in a private university.

Notes:

  1. Data provided by the Consejo de Educación de Puerto Rico (CEPR).

  2. The 2011 cost per credit does not include the fee (la cuota).

The Importance of Interactive Data Analysis for Data-Driven Discovery

Data analysis workflows and recipes are commonly used in science. They are actually indispensable since reinventing the wheel for each project would result in a colossal waste of time. On the other hand, mindlessly applying a workflow can result in totally wrong conclusions if the required assumptions don’t hold. This is why successful data analysts rely heavily on interactive data analysis (IDA). I write today because I am somewhat concerned that the importance of IDA is not fully appreciated by many of the policy makers and thought leaders that will influence how we access and work with data in the future.

I start by constructing a very simple example to illustrate the importance of IDA. Suppose that as part of a demographic study you are asked to summarize male heights across several counties. Since sample sizes are large and heights are known to be well approximated by a normal distribution, you feel comfortable using a tried and tested recipe: report the average and standard deviation as a summary. You are surprised to find a county with an average height of 6.1 feet and a standard deviation (SD) of 7.8 feet. Do you start writing a paper and a press release to describe this very interesting finding? Here, interactive data analysis saves us from naively reporting this. First, we note that the standard deviation is impossibly big if the data are in fact normally distributed: more than 15% of heights would be negative. Given this nonsensical result, the next obvious step for an experienced data analyst is to explore the data, say with a boxplot (see below). This immediately reveals a problem: it appears one value was reported in centimeters, 180 centimeters, not feet. After fixing this, the summary changes to an average height of 5.75 feet with a 3 inch SD.

[Figure: European Outlier (boxplot of the heights data)]
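For concreteness, here is a minimal R sketch of the toy example above, with simulated heights and one value entered in centimeters; the simulated numbers are made up for illustration and do not reproduce the exact figures quoted in the text:

```r
set.seed(1)
# Simulate male heights in feet and inject one value recorded in centimeters
heights <- rnorm(1000, mean = 5.75, sd = 0.25)
heights[1] <- 180   # 180 cm entered as if it were feet

# The naive recipe: report mean and SD
mean(heights)   # inflated by the bad value
sd(heights)     # implausibly large for heights measured in feet

# Interactive step: a boxplot immediately exposes the outlier
boxplot(heights, ylab = "Height (feet)")

# Fix the unit error and re-summarize
heights[1] <- 180 / 30.48   # convert centimeters to feet
mean(heights)   # about 5.75 feet
sd(heights)     # about 0.25 feet, i.e. roughly 3 inches
```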

Years of data analysis experience will show you that examples like this are common. Unfortunately, as data and analyses get more complex, workflow failures are harder to detect and often go unnoticed. An important principle many of us teach our trainees is to carefully check for hidden problems when data analysis leads you to unexpected results, especially when we stand to benefit professionally if the unexpected result holds up, for example by leading to a publication.

Interactive data analysis is also indispensable for the development of new methodology. For example, in my field of research, exploring the data has revealed the need for new methods and motivated new approaches to specific cases that existing workflows can't handle.

So why am I concerned? As public datasets become larger and more numerous, many funding agencies, policy makers, and industry leaders are advocating for using cloud computing to bring computing to the data. If done correctly, this would provide a great improvement over the current redundant and unsystematic approach of everybody downloading data and working with it locally. However, after looking into the details of some of these plans, I have become a bit concerned that perhaps the importance of IDA is not fully appreciated by decision makers.

As an example, consider the NIH efforts to promote data-driven discovery that center around plans for the Data Commons. The linked page describes an ecosystem with four components, one of which is “Software”. According to the description, the software component of The Commons should provide “[a]ccess to and deployment of scientific analysis tools and pipeline workflows”. There is no mention of a strategy that will grant access to the raw data. Without this, carefully checking the workflow output and developing the analysis tools and pipeline workflows of the future will be difficult.

I note that data analysis workflows are very popular in fields in which data analysis is indispensable, as is the case in biomedical research, my focus area. In this field, the data generators, who typically lead the scientific enterprise, are not always trained data analysts. And the literature is overflowing with proposed workflows. You can gauge their popularity by the vast number published in the Nature journals, as demonstrated by this Google search:

[Figure: Nature workflows (Google search results)]

In a field in which data generators are not data analysis experts, the workflow has the added allure that it removes the need to think deeply about data analysis and instead shifts the responsibility to pre-approved software. Note that these workflows are not always described with the mathematical language or computer code needed to truly understand them, but rather with a series of PowerPoint shapes. The gist of the typical data analysis workflow can be simplified into the following:

[Figure: workflows (simplified data analysis workflow)]

This simplification of the data analysis process makes it particularly worrisome that the intricacies of IDA are not fully appreciated.

As mentioned above, data analysis workflows are a necessary component of the scientific enterprise. Without them the process would slow to a halt. However, a workflow should only be implemented once consensus is reached regarding its optimality. And even then, IDA is needed to ensure that the process is performing as expected. The careers of many of my colleagues have been dedicated mostly to the development of such analysis tools. We have learned that rushing to implement workflows before they are mature enough can have widespread negative consequences. And, at least in my experience, developing rigorous tools is impossible without interactive data analysis. So I hope this post helps make the case for the importance of interactive data analysis and for keeping it a part of the scientific enterprise.

The levels of data science class

In a recent post, Nathan Yau points to a comment by Jake Porway about data science hackathons. They both say that for data science/visualization projects to be successful you have to start with an important question, not with a pile of data. This is the problem-forward, not solution-backward, approach to data science and big data. It is also the approach advocated in the really nice piece on teaching data science by Stephanie and Rafa.

I have adopted a similar approach in the data science class here at Hopkins, largely inspired by Dan Meyer’s patient problem solving for middle school math class. So instead of giving students a full problem description I give them project suggestions like:

  • Option 1: Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.
  • Option 2: Analyze the traffic fatality data to identify any geographic, time varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.
  • Option 3: Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).
  • Option 4: Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile. Develop a model for the Gail score prediction from Facebook and its uncertainty. You should include estimates of uncertainty in the predicted score due to your inferred variables.
  • Option 5: Potentially fun but super hard project: develop an algorithm for a self-driving car using this training data: http://research.comma.ai/. Build a model for predicting at every moment what direction the car should be going, whether it should be signaling, and what speed it should be going. You might consider starting with a small subsample of the (big) training set.

Each of these projects shares the characteristic that there is an interesting question - but the data may or may not be available. If it is available it may or may not have to be processed/cleaned/organized. Moreover, with the data in hand you may need to think about how it was collected or go out and collect some more data. This kind of problem is inspired by this quote from Dan’s talk - he was talking about math but it could easily have been data science:

Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have insufficient information and have to go find some?

I realize though that this is advanced data science. So I was thinking about the levels of a data science course and how you would build up a curriculum. I came up with the following courses/levels and would be interested in what others think.

  • Level 0: Background: Basic computing, some calculus with a focus on optimization, basic linear algebra.
  • Level 1: Data science thinking: How to define a question, how to turn a question into a statement about data, how to identify data sets that may be applicable, experimental design, critical thinking about data sets.
  • Level 2: Data science communication: Teaching students how to write about data science, how to express models qualitatively and in mathematical notation, explaining how to interpret results of algorithms/models. Explaining how to make figures.
  • Level 3: Data science tools: Learning the basic tools of R, loading data of various types, reading data, plotting data.
  • Level 4: Real data: Manipulating different file formats, working with “messy” data, trying to organize multiple data sets into one data set.
  • Level 5: Worked examples: Use real data examples, but work them through from start to finish as case studies, don’t make them easy clean data sets, but have a clear path from the beginning of the problem to the end.
  • Level 6: Just the question: Give students a question where you have done a little research to know that it is possible to get at least some data, but aren’t 100% sure it is the right data or that the problem can be perfectly solved. Part of the learning process here is knowing how to define success or failure and when to keep going or when to quit.
  • Level 7: The student is the scientist: Have the students come up with their own questions and answer them using data.

I think that a lot of the thought right now in biostatistics has been on level 3 and 4 courses. These are courses where we have students work with real data sets and learn about tools. To be self-sufficient as a data scientist it is clear you need to be able to work with real-world data. But what Jake/Nathan are referring to is level 5 or level 6 - cases where you have a question but the data needs a ton of work and may not even be good enough without collecting new information. Jake and Nathan have perfectly identified the ability to translate murky questions into data answers as the most valuable data skill. If I had to predict the future of data courses I would see them trending in that direction.

When do we need interpretability?

I just saw a link to an interesting article by Finale Doshi-Velez and Been Kim titled “Towards A Rigorous Science of Interpretable Machine Learning”. From the abstract:

Unfortunately, there is little consensus on what interpretability in machine learning is and how to evaluate it for benchmarking. Current interpretability evaluation typically falls into two categories. The first evaluates interpretability in the context of an application: if the system is useful in either a practical application or a simplified version of it, then it must be somehow interpretable. The second evaluates interpretability via a quantifiable proxy: a researcher might first claim that some model class—e.g. sparse linear models, rule lists, gradient boosted trees—are interpretable and then present algorithms to optimize within that class.

The paper raises a good point, which is that we don’t really have a definition of “interpretability”. We just know it when we see it. For the most part, there’s been some agreement that parametric models are “more interpretable” than other models, but that’s a relatively fuzzy statement.

I’ve long heard that complex machine learning models that lack any real interpretability are okay because there are many situations where we don’t care “how things work”. When Netflix is recommending my next movie based on my movie history and perhaps other data, the only thing that matters is that the recommendation is something I like. In other words, the user experience defines the value to me. However, in other applications, such as when we’re assessing the relationship between air pollution and lung cancer, a more interpretable model may be required.

I think the dichotomization between these two kinds of scenarios will eventually go away for a few reasons:

  1. For some applications, lack of interpretability is fine…until it’s not. In other words, what happens when things go wrong? Interpretability can help us decipher why things went wrong and how they can be modified to be fixed. In order to move the levers of a machine to fix it, we need to know exactly where the levers are. Yet another way to say this is that it’s possible to jump from one situation (interpretability not needed) to another (what the heck just happened?) very quickly.
  2. I think interpretability will become the new reproducible research, transmogrified to the machine learning and AI world. In the scientific world, reproducibility took some time to catch on (and has not quite caught on completely), but it is not so controversial now and many people in many fields accept the notion that all studies should at least be reproducible (if not necessarily correct). There was a time when people differentiated between cases that needed reproducibility (big data, computational work) and cases where it wasn’t needed. But that differentiation is slowly going away. I believe interpretability in machine learning and statistical modeling will go the same way as reproducibility in science.

Ultimately, I think it’s the success of machine learning that brings the requirement of interpretability on to the scene. Because machine learning has become ubiquitous, we as a society begin to develop expectations for what it is supposed to do. Thus, the value of the machine learning begins to be defined externally. It will no longer be good enough to simply provide a great user experience.

Model building with time series data

A nice post by Alex Smolyanskaya on the Stitch Fix blog about some of the unique challenges of model building in a time series context:

Cross validation is the process of measuring a model’s predictive power by testing it on randomly selected data that was not used for training. However, autocorrelations in time series data mean that data points are not independent from each other across time, so holding out some data points from the training set doesn’t necessarily remove all their associated information. Further, time series models contain autoregressive components to deal with the autocorrelations. These models rely on having equally spaced data points; if we leave out random subsets of the data, the training and testing sets will have holes that destroy the autoregressive components.
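To make the contrast concrete, here is a minimal R sketch of rolling-origin (forward-chaining) evaluation, one common alternative to random hold-out for autocorrelated data. This is not the Stitch Fix implementation; the AR(1) simulation, window size, and horizon are arbitrary choices for illustration:

```r
set.seed(42)
# Simulate an autocorrelated series
y <- arima.sim(model = list(ar = 0.8), n = 200)

# Rolling-origin evaluation: always train on a contiguous past window
# and test on the points immediately after it, preserving time order.
train_size <- 100
horizon    <- 10
origins    <- seq(train_size, length(y) - horizon, by = horizon)

rmse <- sapply(origins, function(t) {
  fit  <- arima(y[1:t], order = c(1, 0, 0))        # AR(1) fit on the past only
  pred <- predict(fit, n.ahead = horizon)$pred     # forecast the next block
  sqrt(mean((y[(t + 1):(t + horizon)] - pred)^2))  # out-of-sample error
})
mean(rmse)  # average error across forecast origins
```

Because each training set is a contiguous block of past observations and each test set is the block that follows it, the autoregressive structure stays intact and no future information leaks into the fit, which is exactly the failure mode of random hold-out that the quote describes.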