Rooting Out AI’s Biases
Can we unlock AI’s benefits—and protect people who face risks from biased outputs?
It was an inspired idea: Help America’s sickest patients get the care they need sooner.
The artificial intelligence tool was supposed to help hospitals allocate resources to the patients with the most complex health needs by predicting which patients will require extra care—such as additional nursing teams or more follow-up appointments—and flagging critical conditions before they arise. It would make the best use of hospitals’ limited resources and, more importantly, save lives.
One challenge in building such a tool: defining health needs in a way that an algorithm could easily measure.
Does health need relate more to the number of diagnosed conditions, or to their severity? Does sickness relate more to a decrease in quality of life, or to a reduction of lifespan? To train a machine-learning model, you need an easily measurable proxy for need. A 2019 Nature article describes how the developers at Optum, a health innovation company owned by UnitedHealth Group, settled on health care expenditure as a proxy measure: The more health care a person needed, the more they would spend on it. It seemed a reasonable assumption.
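The pitfall in that proxy choice can be shown in a few lines of code. This is a deliberately simplified, hypothetical illustration—not Optum’s actual model—in which two equally sick groups differ only in how much they spend on care:

```python
# Hypothetical, simplified illustration (not the real Optum model):
# if the training label is spending, and one group spends less at the
# same level of sickness, ranking patients by spending under-selects
# that group for extra care.

def pick_for_extra_care(patients, top_k):
    """Select the top_k patients by the 'spending' proxy label."""
    ranked = sorted(patients, key=lambda p: p["spending"], reverse=True)
    return ranked[:top_k]

# Two equally sick groups; group B faces access barriers, so its
# members spend far less on care at the same severity (toy numbers).
patients = []
for severity in range(1, 11):
    patients.append({"group": "A", "severity": severity,
                     "spending": severity * 1000})
    patients.append({"group": "B", "severity": severity,
                     "spending": severity * 450})

selected = pick_for_extra_care(patients, top_k=10)
share_b = sum(p["group"] == "B" for p in selected) / len(selected)
print(share_b)  # 0.3 — group B gets far less than its fair 50% share
```

Although the two groups are identically sick by construction, the proxy-trained selection routes most of the extra care to group A.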
The risk-prediction tool, used in health care systems that care for 70 million patients across the country, was part of a rising revolution in public health and medicine. Today, AI is used in hospitals, health systems, and government to predict health costs, determine which patients will likely fail to show up to their appointments, and help diagnose diseases ranging from diabetes to lung cancer.
And this is just the beginning, says Bloomberg Distinguished Professor Rama Chellappa.
In the near future, Chellappa believes, AI will transform public health in a way that will rival the biggest breakthroughs in the field. Scanning for emerging epidemics, public health algorithms will monitor millions of data sources in real time, including medical records, hospital inpatient numbers, flight records, reports of animal diseases, environmental data, sewage outflow metrics, social media posts, and more. In medicine, deep-learning models will flag when a patient is about to suffer an otherwise unexpected deterioration. Algorithms will predict outcomes after surgery with staggering accuracy.
Chellappa, PhD, a professor in the Whiting School of Engineering and the School of Medicine, speaks about the coming transformation with contagious excitement. “In a doctor’s office, in the background,” he says, “there will be a machine that’s just constantly churning, looking at the data to see if there is anything to be worried about.” AI will be able to analyze data and monitor health from the time a baby is born through to the last moments of life.
It’s a compelling vision. But to get there, we need to do more than process additional data or develop smarter algorithms. We need to be sure that AI deserves our trust.
In 2019, researchers identified serious flaws in Optum’s algorithm.
For Black and white patients with similar scores for health need, the Black patients were actually sicker. They had higher blood pressure, and their diabetes was more acute. In its assessments of risk, the algorithm was perpetuating bias. Because Black patients spent less on health care, the algorithm had learned to recommend that they receive half as much care as white patients.
As this example shows, there are good reasons not to place our health fully in AI’s hands—and they have less to do with how AI models and humans are distinct, and more to do with what we have in common.
In some ways, AI is all too human, subject to biases and discrimination. Algorithms are trained on data from a world full of deeply rooted disparities along the lines of race, gender, socioeconomic status, and other biases. And once trained, they can become black boxes, with only the most technical among us able to decipher in depth how they work. So, how will we know when to trust them? And how can we unlock the benefits of AI while protecting people who face risks from biased outputs?
Kadija Ferryman, PhD, MA, knows that AI is already threatening lives—and that the danger has nothing to do with AI taking over the world.
In the early 2010s, when Ferryman was working on her PhD on the ethical implications of using genomics for disease diagnosis, concerns about AI bias were still under the radar. But as she spoke with physicians, geneticists, and IT workers for her dissertation, she started to hear something worrying.
The federal government was encouraging the use of electronic health records by offering financial incentives to health care providers, and suddenly, there was more electronic health data than ever before. Researchers envisioned a future where EHRs could infuse medical research with more diverse data and help address the worrying lack of representation of people of color in traditional trials.
In 2014, 86% of participants in clinical trials were white—a big problem considering that medications can have different effects on people from different genetic groups. An epilepsy medication called carbamazepine, for instance, can cause a severe skin disorder in people with a gene variant that is more common in people of Asian heritage. Without representative clinical trials, treatments might not work properly for all populations, or they might be more likely to cause side effects in certain groups. Could Big Data solve the problem?
Ferryman was immediately skeptical. Datasets of patient records, she knew, were not a silver bullet. They would reflect the health disparities present in society. You only have a medical record, after all, if you’ve been able to access health care, and it’s difficult for some groups to access care because of cost, or travel time, or earned distrust. And access to care is only part of the problem.
“It’s not just getting people in the door,” says Ferryman, now an assistant professor in Health Policy and Management and core faculty in the Berman Institute of Bioethics. “It’s also how they’re treated when they’re in.” In emergency departments, 74% of white patients are given painkillers for broken bones versus 57% of Black patients. Doctors are 50% more likely to misdiagnose a heart attack if the patient is a woman. And Black children with appendicitis are less likely to receive opioid pain medication compared with white children (12% versus 34%).
These biases in diagnosis and treatment all make their way into medical records. Then, they are immortalized on the macro scale through large datasets, which can be used in the development of medical research and policy—and to train AI models.
Ferryman was concerned. How would researchers ensure that findings drawn from datasets of patient records would be free from the biases enshrined in the data?
After her PhD, this question became the focus of her research. It was a lonely field. While several researchers were looking into the ethical and social implications of Big Data and AI in criminal justice, “there just wasn’t a field of people looking at how AI was being used in medicine.”
In 2018, she co-authored a pioneering study with the nonprofit research institute Data & Society. Many of the biomedical researchers, ethicists, technologists, and patient advocates Ferryman interviewed feared biases embedded in health data could lead to misdiagnosis, a failure to treat illness, or to incorrect treatment.
For example, what if a chest pain algorithm’s recommendations reflected the racial bias that results in Black patients undergoing coronary-bypass surgery at lower rates than whites? Or if a triaging algorithm was trained on data that reflected health disparities? This would make it all the more difficult for people in those groups to get the care they needed, further entrenching inequalities.
And with algorithms trained on huge volumes of complex data, it would be hard to pry open the black box and clearly identify biases. How could doctors and patients know when to trust an AI’s outputs, and when to be skeptical?
Ferryman had flagged a potential public health emergency, but many in the health care industry didn’t want to hear it. While people easily accepted that algorithmic bias could exist in the criminal justice system, the idea that it might also be found in health care was disturbing. Many providers had assumed medicine would be a sanctuary from discrimination. “I would say: ‘We’re using a lot of Big Data in health care, maybe we also have some algorithmic bias here?’” says Ferryman. “And people would reply: ‘No, no, it’s different in health care. Here we are trying to help people!’”
The following year, though, Ferryman’s findings were confirmed when news broke about the racial bias of Optum’s algorithm.
In a paper published in Science, a team led by Ziad Obermeyer at the University of California, Berkeley, demonstrated that the algorithm was racially biased. Obermeyer had unlocked the black box and discovered its inner workings.
First, he chose a different definition of health need, one that had nothing to do with health care expenditure. He analyzed the 50,000 medical records that had trained the algorithm and used the number of a patient’s chronic conditions as the measure of their need. On this metric, Black patients tended to have greater needs than whites. Then, he looked at the risk scores the algorithm had assigned to patients. The results were astounding: Obermeyer estimated that the algorithm’s racial bias reduced the care Black patients received by over 50%.
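The shape of that audit can be sketched in code. This is a hedged, toy reconstruction of the general technique—comparing sickness across groups at equal risk scores—not the authors’ actual analysis; all field names and numbers are hypothetical:

```python
# Hedged sketch of a risk-score audit (not the Science paper's code):
# group patients into risk-score bands, then compare the average
# number of chronic conditions by group within each band. Field
# names and records are hypothetical.

from collections import defaultdict

def audit_by_score_band(records, band_width=10):
    """Average chronic-condition count per (score band, group)."""
    sums = defaultdict(lambda: [0, 0])  # (band, group) -> [total, n]
    for r in records:
        key = (int(r["risk_score"] // band_width), r["group"])
        sums[key][0] += r["chronic_conditions"]
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

# Toy records: at the same risk score, group B carries more illness.
records = [
    {"risk_score": 55, "group": "A", "chronic_conditions": 3},
    {"risk_score": 57, "group": "A", "chronic_conditions": 4},
    {"risk_score": 54, "group": "B", "chronic_conditions": 5},
    {"risk_score": 58, "group": "B", "chronic_conditions": 6},
]

averages = audit_by_score_band(records)
# In band 5 (scores 50-59), group B averages more chronic conditions
# than group A — the signature of a biased proxy label.
print(averages[(5, "A")], averages[(5, "B")])  # 3.5 5.5
```

If the algorithm were fair, equally scored patients would be equally sick regardless of group; a persistent gap within score bands is the bias showing through.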
Ferryman was not surprised. “The model was devoid of the context that health care expenditure has a racial bias to it,” she says. “In some groups, health care expenditure is a pretty good proxy for illness. But in other groups, it’s not.”
After publishing the Optum study, Obermeyer offered to help the company reduce the racial bias in the algorithm, tweaking it to place less importance on expenditure. In collaboration with the company, he managed to reduce the algorithmic bias by 84%. But Optum, while expressing gratitude for Obermeyer’s work, insisted that his conclusion was misleading and that expenditure was only one of many data elements—with the most important being a doctor’s expertise.
But then researchers started noticing a flood of other examples of AI bias.
Scientists at MIT discovered that computer-vision algorithms used to diagnose pathologies based on chest X-rays worked less well for Black people. Algorithms for detecting skin cancer were found to be less accurate on darker-skinned patients. And in the UK, researchers discovered that datasets for training algorithms to diagnose eye diseases contained a disproportionate number of patient records from Europe, North America, and China, meaning they would likely perform worse for patients from low-income countries.
The field exploded, and Ferryman felt vindicated.
Faced with these risks, would it not be safer to abandon health AI altogether?
Along with AI optimists like Chellappa, Elizabeth Chin, PhD, an assistant professor in Biostatistics, thinks the promise is too great to pass up. She believes that AI could even help provide a counterweight to clinicians’ unconscious bias. A 2021 study showed that an algorithm trained on knee X-rays was able to find physical explanations for pain reported by Black patients that would otherwise have gone unexplained and potentially been dismissed. “We know that humans are biased,” she says. “AI might be able to provide something that’s more standardized across populations.”
But first it has to earn our trust.
In 2016, a newly introduced algorithm used to determine how many hours of service disabled people needed in an Arkansas Medicaid waiver program ended up drastically underestimating people’s needs and wrongly canceled home care visits for hundreds of people. A lawyer for victims of the decision said it caused “incalculable human suffering.” Two years after the scandal, the algorithm’s designer asserted that we needed to trust the “smart people” who worked on these algorithms.
Blind faith cannot be the answer, say Ferryman and Chin. Bias is a deep social problem, and algorithm designers, no matter how well-intentioned, are fallible. So how can health care algorithms merit our confidence? “It’s a question that doesn’t have an easy answer,” says Ferryman—but she and other researchers have some ideas.
One approach is improving the quality of training data, making sure it is as diverse and representative as possible. That makes sense: Better data, so the reasoning goes, will lead to algorithms making smarter predictions.
But getting better data is easier said than done, says Chin. “A few years ago,” she says, “the predominant view was ‘garbage in, garbage out.’” Better data means better results. But when it comes to public health, the reality is far messier because critical data may not exist. It is difficult “to collect data from individuals if they never go to the hospital,” Chin says. “Or where there’s differences in health care utilization between different groups.” And even with more representative data, developers can still run the risk of perpetuating biases in their systems, she says. “AI systems are not neutral.”
Another approach is known as algorithmic transparency—turning the black box into a glass cube. Advocates for algorithmic transparency argue that developers should be sharing as much information as possible with clinicians about the AI model and the assumptions it’s based on. This would allow doctors to make informed decisions about and adjustments for any potential biases.
In 2021, Karandeep Singh, MD, MMSc, an associate professor at the University of Michigan Medical School, studied a widely used AI model for predicting sepsis before it occurred. He and his team found the model was terrible at identifying sepsis ahead of time and was producing a high number of false positives.
The AI, it turned out, had the logic backward: It was using the delivery of antibiotics—which are used in the treatment of sepsis—as a predictor of future sepsis. Because the vendor had not disclosed that the model was, irrationally, using antibiotics as a predictor, it was extremely difficult to track down the root of the problem.
“When I say ‘algorithmic transparency,’” says Singh, “I mean that we should know what an algorithm is predicting, which aspects of a patient or situation are considered by the algorithm in its decision-making, and which population the algorithm was trained and evaluated on.”
Even that’s not enough, says Ferryman. You can put the code in a tab so that it is available to clinicians, she says, “but how many clinicians are really going to be able to understand that code, and what it actually means for patients? It’s transparent, but is it really useful?”
How about tinkering with the algorithms to ensure results are fairer and more standardized? It’s a commonly used approach, but there are tradeoffs. Tweaking an algorithm to increase its fairness can diminish its overall accuracy, says Ferryman, and improving it for one group can make it less effective for others.
Take a diagnostic algorithm for skin cancer that underperforms for Black patients. If the goal is to make the algorithm perform better for Black patients while still being as fair as possible across groups, you need to adjust the model so that, for Black patients, it switches a certain number of cases it is least confident about from “low-risk” to “high-risk” (known as “leveling up”). This ensures those patients receive follow-up screening. But achieving mathematical equality between groups will also require some degree of leveling down, which means that, for other patients, a certain number of high-risk cases will have to be switched to low-risk—increasing the chance of misdiagnosis or inadequate treatment.
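In practice, leveling up often amounts to lowering the decision threshold for the underserved group. The sketch below is a hypothetical, stripped-down illustration of that idea—not any real deployed system; the scores and thresholds are toy numbers:

```python
# Simplified sketch of "leveling up" (hypothetical, not a deployed
# system): lowering the decision threshold for a group the model
# underperforms on flags more of its borderline cases as high-risk.
# Forcing strictly equal flag rates across groups can also push
# developers to raise the other group's threshold ("leveling down").

def flag_high_risk(scores, threshold):
    """Return indices of cases flagged high-risk at this threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Model confidence that a lesion is malignant (toy numbers).
group_a = [0.2, 0.45, 0.6, 0.8]   # well-represented group
group_b = [0.15, 0.35, 0.5, 0.7]  # group the model underperforms on

# At a single threshold of 0.5, two cases in each group are flagged,
# but group B's scores run systematically low, so real cancers near
# the line are missed.
# Leveling up: lower group B's threshold so its least-confident
# borderline cases are sent for follow-up screening.
flags_b = flag_high_risk(group_b, threshold=0.35)
print(len(flags_b))  # 3 cases flagged instead of 2
```

The tradeoff the researchers describe follows directly: the extra flags buy follow-up screening for group B at the cost of more false positives, and if fairness is defined as identical flag rates, the pressure is to subtract flags from group A instead.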
“The problem is the attempt to turn fairness into a measurable thing, into a strictly enforceable mathematical equation,” says Brent Mittelstadt, PhD, MA, a professor at the University of Oxford, who is concerned about developers leveling down algorithms in an attempt to eradicate bias.
The easiest way to achieve equality of accuracy tends to involve leveling down, and Mittelstadt is critical of an emerging industry that promises to make AI models fairer in a simple way. “If you buy into that, the risk is that you’re going to achieve fairness in a way that appears to be straightforward and mathematically satisfying, but that actually introduces a lot of avoidable harm for people.”
Another option is using different AI models for different population groups. But that is fraught with problems, too. Racial categories, after all, are not clear cut. “You’ll have to figure out the right tool to use for someone who is both Black and Latino,” Ferryman says. “It is not an ideal solution.”
Instead, Ferryman says that at every moment—from the initial concept or research question through development, roll out, and beyond—designers of health care algorithms should be on the lookout for possible bias.
Imagine, she says, that a developer wants to create an algorithm for diagnosing multiple sclerosis. First, they need to understand the clinical disparities that exist in the real world. This means learning that people in minority groups tend to be underdiagnosed. “That is step one: understanding that the social problems we have are part of the medical evidence, too,” she says.
Step two is checking for embedded biases in the training data. For example, data from large numbers of medical records could invite the algorithm to assign Black patients a lower risk score on the basis that fewer Black people tend to be diagnosed with MS. If this kind of bias is found, the developer will have to think carefully about how to adjust the model to make it as fair as possible. “Maybe,” she says, “we calibrate it to have more false positives.” Patients might be referred for follow-ups and further testing when they don’t have MS, but that’s better than the alternative.
Then, the model should be validated against as diverse a group of people as possible. Once in use, Ferryman proposes periodic bias reviews. This ongoing monitoring would check for algorithmic bias, but it would also determine whether the algorithm is making any difference to existing health disparities in the area. “Is it mitigating them?” she says. “Or is it exacerbating them?”
Chellappa says that AI bias, for the most part, is fixable—by improving training data to be more representative, by increasing algorithmic transparency to enable clinicians to spot unfairness more easily, and by adjusting models to reduce bias.
But, Ferryman says, we cannot afford to let our guard down. Nor can we expect health algorithms to be perfect. If we can be alert to unfairness and bias at every stage, AI has the potential to help public health and medicine achieve their main purpose: better health for everybody.