Sometimes the biggest challenge is applying what we already know

There’s definitely a need to innovate and develop new treatments in the area of asthma, but it’s easy to underestimate the barriers to just doing what we already know, such as making sure that people are following existing, well-established guidelines on how to treat asthma. My colleague, Elizabeth Matsui, has written about the challenges in a study that we are collaborating on:

Our group is currently conducting a study that includes implementation of national guidelines-based medical care for asthma, so that one process that we’ve had to get right is to prescribe an appropriate dose of medication and get it into the family’s hands. [emphasis added]

Seems simple, right?

Sometimes there's friction for a reason

Thinking about my post on Theranos yesterday, it occurred to me that one thing that’s great about all of the innovation and technology coming out of places like Silicon Valley is the tremendous reduction of friction in our lives. With Uber it’s much easier to get a ride because of improvements in communication and an increase in the supply of cars. With Amazon, I can get that jug of vegetable oil that I always wanted without having to leave the house, because Amazon.

So why is there all this friction? Sometimes it’s because of regulation, which may have made sense at an earlier time, but perhaps doesn’t make as much sense now. Sometimes, you need a company like Amazon to really master the logistics operation to be able to deliver anything anywhere. Otherwise, you’re just stuck driving to the grocery store to get that vegetable oil.

But sometimes there’s friction for a reason. For example, Ben Thompson talks about how previously there was quite a bit more friction involved before law enforcement could listen in on our communications. Although wiretapping had long been around (as noted by David Simon of…The Wire), the removal of all friction by the NSA made the situation quite different. Related to this idea is the massive data release from OkCupid a few weeks ago, as I discussed on the latest Not So Standard Deviations podcast episode. Sure, your OkCupid profile is visible to everyone with an account, but having someone vacuum up 70,000 profiles and dump them on the web for anyone to view is not what people signed up for; there is a qualitative difference there.

When it comes to Theranos and diagnostic testing in general, there is similarly a need for some friction in order to protect public health. John Ioannidis notes in his commentary for JAMA:

Even if the tests were accurate, when they are performed in massive scale and multiple times, the possibility of causing substantial harm from widespread testing is very real, as errors accumulate with multiple testing. Repeated testing of an individual is potentially a dangerous self-harm practice, and these individuals are destined to have some incorrect laboratory results and eventually experience harm, such as, for example, the anxiety of being labeled with a serious condition or adverse effects from increased testing and procedures to evaluate false-positive test results. Moreover, if the diagnostic testing process becomes dissociated from physicians, self-testing and self-interpretation could cause even more problems than they aim to solve.

Unlike with the NSA, where the differences in scale may be difficult to quantify because the exact extent of the program is unknown to most people, with diagnostic testing, we can precisely quantify how a diagnostic test’s characteristics will change if we apply it to 1,000 people vs. 1,000,000 people. This is why organizations like the US Preventive Services Task Force so carefully consider recommendations for testing or screening (and why they have a really tough job).
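To make the scale argument concrete, here is a minimal Python sketch. The prevalence and specificity values are hypothetical numbers picked only for illustration, not figures for any real test, and the function names are my own. It assumes repeated tests on the same person are independent, which is a simplification.

```python
def expected_false_positives(n_people, prevalence, specificity):
    """Expected number of healthy people who test positive in one round of testing."""
    healthy = n_people * (1 - prevalence)
    return healthy * (1 - specificity)


def prob_any_false_positive(n_tests, specificity):
    """Chance that a healthy person sees at least one false positive
    across n_tests tests, assuming the tests are independent."""
    return 1 - specificity ** n_tests


# Hypothetical values, chosen only for illustration.
prevalence = 0.001   # 1 in 1,000 people has the condition
specificity = 0.99   # 1% of healthy people test positive

for n_people in (1_000, 1_000_000):
    fp = expected_false_positives(n_people, prevalence, specificity)
    print(f"{n_people:>9,} people tested: about {fp:,.0f} false positives")

for n_tests in (1, 10, 50):
    p = prob_any_false_positive(n_tests, specificity)
    print(f"{n_tests:>2} tests: P(at least one false positive) = {p:.2f}")
```

The per-person error rate is the same in both settings. What changes with scale and repetition is the absolute number of people who receive an incorrect result.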

I’ll admit that a lot of the friction in our daily lives is pointless and it would be great to reduce it if possible. But in many cases, we put the friction there ourselves for a reason, and it’s sometimes good to think about why before we move to eliminate it.

Update On Theranos

I think it’s fair to say that things for Theranos, the Silicon Valley blood testing company, are not looking up. From the Wall Street Journal (via The Verge):

Theranos has voided two years of results from its Edison blood-testing machines, issuing tens of thousands of corrected reports to patients and doctors and raising the possibility that many health care decisions may have been made based on inaccurate data. The Wall Street Journal first reported the news, saying that many of the corrected tests have been run using traditional machinery. One doctor told the Journal that she sent a patient to the emergency room after seeing abnormal results from a Theranos test; the corrected report returned normal readings.

Furthermore, this commentary in JAMA from John Ioannidis emphasizes the need for caution when implementing testing on a massive scale. In particular, “The notion of patients and healthy people being repeatedly tested in supermarkets and pharmacies, or eventually in cafeterias or at home, sounds revolutionary, but little is known about the consequences” and the consequences really matter here. In addition, as the title of the commentary would indicate, research done in secret is not research at all. For there to be credibility for a company like this, data needs to be made public.

I continue to maintain that the fundamental premise on which the company is built, as stated by its founder Elizabeth Holmes, is flawed. Two claims are repeatedly made in the context of Theranos:

  • More testing is better. Anyone who stayed awake in their introduction to Bayesian statistics lecture knows this is very difficult to make true in all circumstances, no matter how accurate a test is. With rare diseases, the number of false positives is overwhelming and can have very real harmful effects on people. Combine testing on a massive scale, with repeated application over time, and you get a recipe for confusion.
  • People do not get tested because they are afraid of needles. Elizabeth Holmes makes a big deal about her personal fear of needles and its impact on her (not) getting blood tests done. I have no doubt that many people share this fear, but I have serious doubts that this is the reason people don’t get medical testing done. There are many barriers to people getting the medical care that they need, many that are non-financial in nature and do not include fear of needles. The problem of getting people the medical care that they need is one deserving of serious attention, but changing the manner in which blood is collected is not going to do it.
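The first point above, that more testing is not automatically better, follows directly from Bayes’ rule. A small Python sketch, using hypothetical numbers for a rare condition and a quite accurate test (the values and the function name are assumptions for illustration only):

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Hypothetical rare condition: 1 in 1,000 people affected,
# with a test that is 99% sensitive and 99% specific.
ppv = positive_predictive_value(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"Chance that a positive result is a true positive: {ppv:.1%}")
```

Under these assumed numbers, only about 9% of positive results are true positives, so false positives outnumber true positives roughly ten to one even though the test is wrong only 1% of the time.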

Not So Standard Deviations Episode 16 - The Silicon Valley Episode

Roger and Hilary are back, with Hilary broadcasting from the west coast. Hilary and Roger discuss the possibility of scaling data analysis and how that may or may not work for companies like Palantir. Also, the latest on Theranos and the release of data from OkCupid.

If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes.

Subscribe to the podcast on Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Show notes:

Download the audio for this episode.

What is software engineering for data science?

Editor’s note: This post is a chapter from the book Executive Data Science: A Guide to Training and Managing the Best Data Scientists, written by myself, Brian Caffo, and Jeff Leek.

Software is the generalization of a specific aspect of a data analysis. If specific parts of a data analysis require implementing or applying a number of procedures or tools together, software bundles all of these tools into a specific module or procedure that can be repeatedly applied in a variety of settings. Software allows for the systematizing and the standardizing of a procedure, so that different people can use it and understand what it’s going to do at any given time.

Software is useful because it formalizes and abstracts the functionality of a set of procedures or tools, by developing a well defined interface to the analysis. Software will have an interface, or a set of inputs and a set of outputs that are well understood. People can think about the inputs and the outputs without having to worry about the gory details of what’s going on underneath. Now, they may be interested in those details, but the application of the software at any given setting will not necessarily depend on the knowledge of those details. Rather, the knowledge of the interface to that software is important to using it in any given situation.

For example, most statistical packages will have a linear regression function which has a very well defined interface. Typically, you’ll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions work that way. And importantly, the user does not have to know exactly how the linear regression calculation is done under the hood. Rather, they only need to know that they need to specify the outcome, the predictors, and a couple of other things. The linear regression function abstracts all the details that are required to implement linear regression, so that the user can apply the tool in a variety of settings.
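As a rough illustration of what such an interface looks like, here is a toy Python function. It is a sketch written for this discussion, not the API of any real statistical package: the name linear_regression, its arguments, and the returned dictionary are all hypothetical, and it handles only a single predictor.

```python
def linear_regression(outcome, predictor, weights=None):
    """Fit outcome = intercept + slope * predictor by (weighted) least squares.

    Inputs: two equal-length lists of numbers and optional non-negative weights.
    Output: a dict with the fitted 'intercept' and 'slope'.
    """
    if len(outcome) != len(predictor):
        raise ValueError("outcome and predictor must have the same length")
    if weights is None:
        weights = [1.0] * len(outcome)
    if len(weights) != len(outcome):
        raise ValueError("weights must have the same length as outcome")

    total_weight = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, predictor)) / total_weight
    mean_y = sum(w * y for w, y in zip(weights, outcome)) / total_weight

    # Weighted cross-product and sum of squares around the weighted means.
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, predictor, outcome))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, predictor))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {"intercept": intercept, "slope": slope}


# The caller only needs the interface: pass the outcome and the predictor.
fit = linear_regression(outcome=[3.0, 5.0, 7.0, 9.0],
                        predictor=[1.0, 2.0, 3.0, 4.0])
print(fit)
```

Someone calling this function needs to know only the inputs and the shape of the output. The arithmetic inside could be replaced with a different algorithm without changing how the function is used.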

There are three levels of software that are important to consider, going from the simplest to the most abstract.

  1. At the first level you might just have some code that you wrote, and you might want to encapsulate the automation of a set of procedures using a loop (or something similar) that repeats an operation multiple times.
  2. The next step might be some sort of function. Regardless of what language you may be using, often there will be some notion of a function, which is used to encapsulate a set of instructions. And the key thing about a function is that you’ll have to define some sort of interface, which will be the inputs to the function. The function may also have a set of outputs, or it may have some side effect (for example, if it’s a plotting function). Now the user only needs to know those inputs and what the outputs will be. This is the first level of abstraction that you might encounter, where you have to actually define an interface to that function.
  3. The highest level is an actual software package, which will often be a collection of functions and other things. That will be a little bit more formal because there’ll be a very specific interface or API that a user has to understand. Often for a software package there’ll be a number of convenience features for users, like documentation, examples, or tutorials that may come with it, to help the user apply the software to many different settings. A full on software package will be most general in the sense that it should be applicable to more than one setting.
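The first two levels can be sketched in a few lines of Python. The data and the summarize function below are made up for illustration. The third level is only described in a comment, since a package is a matter of organization and documentation rather than a single snippet.

```python
from statistics import mean

# Hypothetical measurements from two sites, used only for illustration.
datasets = {"site_a": [2.0, 4.0, 6.0], "site_b": [1.0, 3.0]}

# Level 1: plain code with a loop that repeats an operation.
for name, values in datasets.items():
    print(name, len(values), mean(values))


# Level 2: a function with a defined interface.
# Input: a list of numbers. Output: a dict of summary statistics.
def summarize(values):
    """Return the count, mean, and range of a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "range": max(values) - min(values),
    }


summaries = {name: summarize(values) for name, values in datasets.items()}
print(summaries)

# Level 3 (not shown): a package would collect functions like summarize()
# in one importable module, along with documentation and examples.
```

The loop and the function compute similar things, but only the function commits to inputs and outputs that another person could rely on without reading the body.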

One question that you’ll find yourself asking is: at what point do you need to systematize common tasks and procedures across projects versus recreating code or writing new code from scratch on every new project? It depends on a variety of factors, and answering this question may require communication within your team and with people outside of your team. You may need to develop an understanding of how often a given process is repeated, or how often a given type of data analysis is done, in order to weigh the costs and benefits of investing in developing a software package or something similar.

Within your team, you may want to ask yourself, “Is the data analysis you’re going to do something that you are going to build upon for future work, or is it just going to be a one shot deal?” In our experience, there are relatively few one shot deals out there. Often you will have to do a certain analysis more than once, twice, or even three times, at which point you’ve reached the threshold where you want to write some code, write some software, or at least a function. The point at which you need to systematize a given set of procedures is going to be sooner than you think it is. The initial investment for developing more formal software will be higher, of course, but that will likely pay off in time savings down the road.

A basic rule of thumb is:

  • If you’re going to do something once (that does happen on occasion), just write some code and document it very well. The important thing is that you want to make sure that you understand what the code does, and that requires both writing the code well and writing documentation. You want to be able to reproduce it later on if you ever come back to it, or if someone else comes back to it.
  • If you’re going to do something twice, write a function. This allows you to abstract a small piece of code, and it forces you to define an interface, so you have well defined inputs and outputs.
  • If you’re going to do something three times or more, you should think about writing a small package. It doesn’t have to be commercial level software, but a small package which encapsulates the set of operations that you’re going to be doing in a given analysis. It’s also important to write some real documentation so that people can understand what’s supposed to be going on, and can apply the software to a different situation if they have to.