NOTE: This post was written by Kevin Elliott, Michigan State University; Nicole Kleinstreuer, National Institutes of Health; Patrick McMullen, ScitoVation; Gary Miller, Columbia University; Bhramar Mukherjee, University of Michigan; Roger D. Peng, Johns Hopkins University; Melissa Perry, The George Washington University; Reza Rasoulpour, Corteva Agriscience; and Elizabeth Boyle, National Academies of Sciences, Engineering, and Medicine. The full summary for the workshop on which this post is based can be obtained here.
On June 6 and 7, 2019, the National Academies of Sciences, Engineering, and Medicine (NASEM) hosted a workshop on the use of artificial intelligence (AI) in the field of environmental health. Rapid advances in machine learning are demonstrating the ability of machines to carry out repetitive “smart” tasks requiring discrete judgments. Machine learning algorithms are now being used to analyze large volumes of complex data to find patterns and make predictions, often exceeding the accuracy and efficiency of people attempting the same task. Driven by tremendous growth in data availability as well as computing power and accessibility, artificial intelligence and machine learning applications are rapidly growing in various sectors of society, including retail, with predictions of consumer purchases; the automotive industry, with self-driving cars; and health care, with advances in automated medical diagnoses.
Building upon the major themes of the NASEM workshop, in this blog post we address the following questions:
How might AI advance environmental health?
Does AI change the standards used for conducting environmental health research?
Does the use of AI allow us to change our established research principles?
How does AI impact our training programs for the next generation of environmental health scientists?
Are there barriers within the current academic incentive structures that are hindering the full potential of AI, and how might those barriers be overcome?
Environmental health is the study of how the environment affects human health. Because of the complexity of human biology and the multiplicity of environmental factors that we encounter daily, studying environmental impacts on human health presents many data challenges. Thanks to the data boom of recent years, we now have a multitude of individualized data, including data from genetic sequencing and from wearable health and activity monitors. We have also seen exponential growth in the availability of data on individual environmental exposures. Wearable sensors and personal chemical samplers are allowing for more detailed exposure models, while advancements in exposure biomonitoring in a variety of matrices, including blood and urine, are giving more granular detail about actual chemical body burdens. We have also seen an increase in available population-level data on dietary factors, the social and built environment, climate, and many other variables affected by environmental and genetic factors. While population data are booming, toxicology is concurrently creating a variety of experimental models to advance our understanding of how chemicals and environmental exposures may pose risks to human health. Large-scale high-throughput chemical safety screening efforts can now generate data on tens of thousands of chemicals in thousands of biological targets. Integrating these diverse data streams represents a new level of complexity.
AI and machine learning provide many opportunities to make this complexity more manageable, such as highly accurate prediction methods to better assess exposures and flexible approaches to allow incorporation of exposure to complex mixtures in population health analyses. Incorporating artificial intelligence and machine learning methods in environmental health research offers the potential to transform how we analyze environmental exposures and our understanding of how these myriad factors influence our health and contribute to disease.
While we think the use of AI and machine learning techniques clearly holds great promise for the advancement of environmental health research, we also believe such techniques introduce new challenges and magnify existing ones. Although the major standards by which we conduct scientific research do not change, adhering to them will require some adaptation. Transparency and repeatability are key. We must ensure that the computational reproducibility and replicability of our scientific findings do not suffer at the hands of complex algorithms and poorly assembled data pipelines. Complex data analyses that incorporate more diverse data types from varied sources stretch our ability to track, curate, and validate these data without robust data curation tools. Although some data curation tools that establish standard approaches for creating, managing, and maintaining data are available, they are usually field-specific, and currently there are no incentives or strict requirements to ensure that investigators use them.
Machine learning and artificial intelligence algorithms have proven to be very powerful. At the same time, we recognize that their complexity and general opacity can be cause for concern. Investigators may be willing to overlook the opacity of these algorithms when predictions are highly accurate and precise, but all is well only until it isn’t. When an algorithm does not work as expected, it is critical to know why it didn’t work. Because transparency and reproducibility are of utmost importance, machine learning algorithms must be applied in ways that keep investigators and data analysts accountable for their analyses and give regulators confidence in applying AI-generated results to inform public health decisions.
AI does not change established research principles such as sound study designs and understanding threats of bias. However, there is a need to create updated guidelines and implement best practices for choosing, cleaning, structuring, and sharing the data used in AI applications. Creating appropriate training datasets, engaging in ongoing processes of validation, and assessing the domain of applicability for the models that are generated are also important. As in all areas of science, it is crucial to clarify whether models solely provide accurate predictions or whether they also provide understanding of relevant mechanisms. The current Open Science movement’s emphasis on transparency is particularly relevant to the use of AI and machine learning. Users of these methods in environmental health should be looking for ways to be open about the model training data, to clarify validation methods, to create interpretable “models of the models” where possible, and to clarify their domains of applicability. Recent innovations like model cards, short documents that accompany machine learning models to share information that everyone impacted by the model should know, are one example of how model developers can communicate their models’ strengths and weaknesses in an accessible way.
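To make the idea of a model card and a domain of applicability more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a standard format: the `ModelCard` class, its field names, the example model, and the numeric ranges are all hypothetical placeholders invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """Hypothetical, minimal model card; field names are illustrative only."""

    model_name: str
    intended_use: str
    training_data: str
    validation_method: str
    known_limitations: list[str]
    # Domain of applicability, expressed as (min, max) per input feature.
    applicability_ranges: dict[str, tuple[float, float]] = field(default_factory=dict)

    def in_domain(self, inputs: dict[str, float]) -> bool:
        """Return True only if every listed feature is present and within range."""
        for name, (low, high) in self.applicability_ranges.items():
            value = inputs.get(name)
            if value is None or not (low <= value <= high):
                return False
        return True


# Placeholder values for illustration; they do not describe a real model.
card = ModelCard(
    model_name="exposure-model-demo",
    intended_use="Illustrative exposure prediction for research use only",
    training_data="Hypothetical sensor readings from a single region",
    validation_method="Held-out test set (assumed for this sketch)",
    known_limitations=["Not evaluated outside the training region"],
    applicability_ranges={
        "temperature_c": (-10.0, 35.0),
        "relative_humidity_pct": (10.0, 95.0),
    },
)

inside = card.in_domain({"temperature_c": 20.0, "relative_humidity_pct": 50.0})
outside = card.in_domain({"temperature_c": 45.0, "relative_humidity_pct": 50.0})
print(inside, outside)
```

A simple range check like this is only one crude way to express a domain of applicability. Its value lies in forcing the model developer to write the limits down where users and regulators can see them.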
As complex AI methods are increasingly applied to environmental health research, it is important to consider effective training of the workforce and its future leaders. Currently, training in the application of data science is unstandardized, as trainees learn how to apply methods to a specific research application through an apprenticeship-type model, where a trainee works with a mentor. Classroom training standardizes theory and methods, but the mentor teaches the fine details of analyzing data in a specific research area, which introduces heterogeneity into the ways in which scientists analyze data. The lack of training standards leads to a worry that analysts may apply cutting-edge computational and algorithmic approaches to data analysis without consideration of fundamental biostatistical and epidemiologic principles, such as statistical design, sampling, and inference. Fundamental questions taught in biostatistics and epidemiology courses, such as “Who is in my sample?” and “What is my target population of inference?”, are even more relevant in our current era of algorithms and machine learning. Analysts are now agnostically querying databases not designed for population-based research, such as electronic health records, medical claims, Twitter, Facebook, and Google searches, for new discoveries in environmental health. It is important to recognize that a lack of proper consideration of issues related to sampling, selection bias, correlation of multiple exposures, and exposure and outcome misclassification could lead to erroneous results and false conclusions. Training programs will need to evolve so that we do not just teach scientists and analysts how to program models and interpret their results, but also emphasize how to recognize human biases that can be inadvertently built into the data and model approaches, and the continuous need for rigor, responsibility, and reproducibility.
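The question “Who is in my sample?” can be illustrated with a small worked example. The Python sketch below uses entirely hypothetical counts and selection percentages, chosen only to keep the arithmetic easy to follow. In the full population, exposure and disease are unrelated (odds ratio of 1). But if inclusion in a database depends on both exposure and disease status, the same analysis on the database alone suggests an association that does not exist.

```python
def odds_ratio(table):
    """Odds ratio from a 2x2 table keyed by (exposure, outcome)."""
    return (
        table[("exposed", "case")] * table[("unexposed", "noncase")]
    ) / (
        table[("exposed", "noncase")] * table[("unexposed", "case")]
    )


# Hypothetical target population: disease risk is 10% regardless of exposure.
population = {
    ("exposed", "case"): 500,
    ("exposed", "noncase"): 4500,
    ("unexposed", "case"): 500,
    ("unexposed", "noncase"): 4500,
}

# Assumed percentage of each group that ends up in the database.
# Exposed cases are over-represented, as might happen if exposure
# and illness both prompt contact with the health system.
selection_pct = {
    ("exposed", "case"): 80,
    ("exposed", "noncase"): 20,
    ("unexposed", "case"): 40,
    ("unexposed", "noncase"): 20,
}

# Expected counts in the database (integer arithmetic keeps this exact).
sample = {
    cell: count * selection_pct[cell] // 100
    for cell, count in population.items()
}

population_or = odds_ratio(population)
sample_or = odds_ratio(sample)

print(f"Odds ratio in target population: {population_or:.1f}")  # 1.0
print(f"Odds ratio in selected sample:   {sample_or:.1f}")      # 2.0
```

No algorithm, however sophisticated, can detect this distortion from the selected sample alone. Recognizing it requires knowing how the data came to be collected.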
An increased focus on mathematical theory may also improve training in the application of AI to environmental health. A greater effort to develop standardized theory about how and why a specific research area analyzes data in a certain way may help adapt approaches from one research area to another. In addition, deeper mathematical exploration of AI methods will help data scientists understand when and why AI methods work well, and when they don’t.
Rigorous data science requires a team science approach to achieve a variety of functions such as developing algorithms, formalizing common data platforms and testing protocols, and properly maintaining and curating data sources. Over recent decades, we have witnessed how the power of team science has improved the understanding of critical health problems of our time, such as in unlocking the human genome and achieving major advancements in cancer treatment. These advances have demonstrated the payoff of interdisciplinary, transdisciplinary, and multidisciplinary investigations. Despite these successes, there are still barriers to large team science projects, because these projects often have goals that do not sit precisely within a single funding agency. For AI to truly advance environmental health, federal agencies and institutions that fund environmental health research need to create pathways to support the large multidisciplinary and multi-institutional teams that are conducting this research. One example could be a multi-agency, multi-institute funding consortium. A ten-year investment in a well-coordinated initiative that harnesses AI data opportunities could accelerate new findings not only on the environmental causes of disease, but also on interventions that can prevent environmentally mediated disease and improve population health.
We believe machine learning and AI methods have tremendous potential, but we also believe they cannot be used in a way that overlooks limitations or relaxes data integrity standards. With these considerations in mind, we have tempered enthusiasm for the promises of these approaches. We have to make sure that environmental health scientists stay out in front of these considerations to avoid potential pitfalls such as the allure of hype or chasing after the next new thing because it is novel rather than truly meaningful. We can do this by fostering ongoing conversations about the challenges and opportunities AI provides for environmental health research. An intentional union of the two cultures of careful (and often overly cautious) stochastic modeling and bold (and often overly optimistic) algorithmic modeling can help ensure that we are not abandoning principles of proper study design when a new technology comes along, but are instead exploring how to use the new technology to better understand the myriad ways the environment affects health and disease.