Specific Methods for Eliminating Bias in Machine Learning

The teams at Sift, Grammarly and Evidation Health continuously update their models for accuracy and inclusion.
Written by Janey Zitomer
May 12, 2021

If modern dystopian media has taught us anything in the last decade, it’s that data models reflect human input — sometimes, to the severe detriment of the human in question. So while machine learning (ML) has significant pros — including but not limited to streamlining healthcare systems and helping businesses track important metrics — the engineers who program these models must also consider and prepare for the cons.

The best way to do that? Give the models a variety of voices to learn and draw inspiration from. 

That’s exactly what the teams at Grammarly, Sift and Evidation Health have done. Built In SF spoke to members of their data science and research departments to understand why this method is so important and how they track and update models over time. 

“Monitoring data or model drift has become a focal point for many mature ML practitioners, so we are constantly looking to the industry for best practices,” Sift Head of Data Science Wei Liu said. 


 

Courtney Napoles
Language Data Manager • Grammarly

In her role as language data manager, Grammarly’s Courtney Napoles understands that the only way to identify and eliminate bias in machine learning models is to constantly update them. So that’s what she and her team do. Below, Napoles explains how internal communication channels and pulling from a variety of sources help keep their product current and comprehensive. 

 

What’s the most important thing to consider when choosing the right learning model for a given problem? And how does this help you get ahead of bias early on in the process?

Before developing a model, closely consider the underlying data to understand how representative the data are and in what ways they exhibit bias. Models that do dimensionality reduction can learn to equate correlated attributes in the data (such as two words that frequently occur together), and thereby may make false and harmful predictions. At Grammarly, we test our deep learning models with texts containing words referring to different ethnicities, for instance, to determine whether the predictions change across ethnicities.
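
As a rough illustration of that kind of substitution test (not Grammarly’s actual tooling), the sketch below swaps group terms into fixed templates and flags templates where the model’s score shifts; the templates, the term list, the threshold and the scoring function are all hypothetical placeholders.

```python
# Minimal sketch of a term-substitution check, assuming a text model exposed
# through a hypothetical scoring function `score_text(text) -> float`.
from typing import Callable, Dict, List

TEMPLATES = [
    "My {group} colleague wrote this report.",
    "The {group} author submitted a new draft.",
]
GROUP_TERMS = ["Asian", "Black", "Hispanic", "white"]  # illustrative list only


def prediction_gap(score_text: Callable[[str], float],
                   templates: List[str],
                   terms: List[str]) -> Dict[str, float]:
    """For each template, measure how far the model's score moves when only
    the group term changes. Large gaps flag candidates for manual review."""
    gaps = {}
    for template in templates:
        scores = [score_text(template.format(group=term)) for term in terms]
        gaps[template] = max(scores) - min(scores)
    return gaps


# Usage, with any function that maps text to a probability or score:
# gaps = prediction_gap(score_text, TEMPLATES, GROUP_TERMS)
# flagged = {t: g for t, g in gaps.items() if g > 0.05}
```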

First and foremost, it is important to have a diverse team developing training sets to ensure myriad perspectives are reflected in the data.’’

 

What steps do you take to ensure your training data set is diverse and representative of different groups?

First and foremost, it is important to have a diverse team developing training sets to ensure myriad perspectives are reflected in the data. A combination of manual and automatic techniques can be employed to measure the diversity of the training data, but any automatic approach should be supported by manual analysis. For example, manually augmenting the training set with metadata allows researchers to investigate potential correlations between underrepresented groups and certain characteristics.
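
A minimal sketch of what such an automatic first pass might look like, assuming a hypothetical training table annotated with group metadata and labels; the column names, group values and cutoff are invented for illustration, and any flags would still need the manual analysis described above.

```python
# Minimal sketch of an automatic first pass over annotated training data;
# the columns, group names and labels are hypothetical examples of the kind
# of metadata added by manual augmentation.
import pandas as pd

train = pd.DataFrame({
    "text":          ["...", "...", "...", "..."],
    "dialect_group": ["Group 1", "Group 2", "Group 3", "Group 2"],
    "label":         ["error", "no_error", "error", "no_error"],
})

# 1. Representation: how much of the data does each group contribute?
group_share = train["dialect_group"].value_counts(normalize=True)
underrepresented = group_share[group_share < 0.05]  # illustrative cutoff

# 2. Co-occurrence: does any group appear unusually often with a particular
#    label? A normalized contingency table is a simple starting point before
#    deeper manual review.
label_by_group = pd.crosstab(train["dialect_group"], train["label"], normalize="index")

print(group_share, underrepresented, label_by_group, sep="\n\n")
```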

 

When it comes to testing and monitoring your model over time, what’s a strategy you’ve found to be particularly useful for identifying and eliminating bias?

There’s no fail-safe method to identify and eliminate bias, so deployed models must be routinely monitored on real-world data. One approach Grammarly takes is to consider feedback from a variety of sources, including our users and trained experts in linguistics and sociology. We also have dedicated internal communication channels for employees to voice any concerns or questions related to potential bias.

 

Wei Liu
Head of Data Science • Sift

At AI company Sift, model explainability matters. With that in mind, Head of Data Science Wei Liu said his team has built tools to help their customers understand the company’s ML output. He believes that doing so, as well as pulling training data from different customer bases across verticals and geolocations, has served as a competitive advantage.

 

What’s the most important thing to consider when choosing the right learning model for a given problem? And how does this help you get ahead of bias early on in the process?

Model explainability is very important for Sift because our customers want to know how our ML model made a decision and whether they can trust it. Focusing on explainability doesn’t mean we sacrifice accuracy and pick simplistic models. Instead, we take a broader view of model explainability and build tools that help our customers understand ML output at both the transaction and population levels. Disparities in model performance on specific populations often reveal gaps in our training regime, feature engineering or input data. For example, we once found that for one customer, our ML model produced higher scores in a handful of Asian countries. Through our internal tooling, we were able to pinpoint the root cause of the discrepancy: Our customer only sent us transaction failures in those Asian countries, and our ML model learned to associate transaction failures with Asian users. This finding allowed us to communicate effectively with our customer and fix the issue early in our pipeline.
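
As a loose illustration of a population-level check of this kind (not Sift’s internal tooling), the sketch below compares each segment’s average score with the global average; the countries, scores and flagging thresholds are made up for the example.

```python
# Loose illustration of a population-level score check, comparing each
# segment's average score against the global average; countries, scores and
# the flagging thresholds are invented for the example.
import pandas as pd

scores = pd.DataFrame({
    "country": ["US", "US", "JP", "JP", "SG", "SG", "BR", "BR"],
    "score":   [0.12, 0.30, 0.81, 0.77, 0.69, 0.74, 0.25, 0.33],
})

global_mean = scores["score"].mean()
by_country = scores.groupby("country")["score"].agg(["mean", "count"])
by_country["lift_vs_global"] = by_country["mean"] / global_mean

# Segments with a large lift and enough volume become candidates for a
# root-cause review (for example, a skewed label feed from one integration).
suspicious = by_country[(by_country["lift_vs_global"] > 1.5) & (by_country["count"] >= 2)]
print(suspicious)
```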

Focusing on explainability doesn’t mean we sacrifice accuracy and pick simplistic models.’’

 

What steps do you take to ensure your training data set is diverse and representative of different groups?

We include training data from vastly different customer bases in different verticals and geolocations. The diversity of our training data is one of our core competitive advantages over other Trust and Safety solutions. We regularly inspect and perform offline analysis of our label sources and label data quality to understand the intention behind our manual reviewers. As previously mentioned, we also look at the model output on specific populations for telltale signs of bias, which often originates from biases introduced by the training data.
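
For a sense of what an offline label audit of this kind can surface, here is a small sketch; the schema, label sources and reviewer IDs are hypothetical, not Sift’s data model.

```python
# Rough sketch of an offline label-quality audit; the schema, sources and
# reviewer IDs are hypothetical.
import pandas as pd

labels = pd.DataFrame({
    "source":   ["chargeback", "manual_review", "manual_review", "chargeback"],
    "reviewer": [None, "r1", "r2", None],
    "label":    [1, 1, 0, 1],
})

# Positive rate per label source: a source that only ever reports one class
# (for example, only failures) can quietly skew the training data it feeds.
positive_rate_by_source = labels.groupby("source")["label"].mean()

# Reviewer consistency: the spread of positive rates across manual reviewers
# hints at how uniformly review guidelines are being applied.
positive_rate_by_reviewer = (
    labels.dropna(subset=["reviewer"]).groupby("reviewer")["label"].mean()
)

print(positive_rate_by_source, positive_rate_by_reviewer, sep="\n\n")
```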

 

When it comes to testing and monitoring your model over time, what’s a strategy you’ve found to be particularly useful for identifying and eliminating bias?

Data and model drift can introduce bias over time. Monitoring data or model drift has become a focal point for many mature ML practitioners, so we are constantly looking to the industry for best practices. At Sift, we also have a unique challenge to monitor ML models for a very large customer base, so we try to automate the processes for monitoring model biases. We build tools that help us to identify model biases at various stages of our pipeline, such as data ingestion, label data quality, feature value distribution, feature importance ranking and the final score distribution.
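
One common way to automate a drift check like this is a population stability index (PSI) computed between a reference window and the current window of any monitored quantity. The sketch below is a generic version under that assumption; the bin count, threshold and synthetic data are illustrative, not Sift’s implementation.

```python
# Generic sketch of one automated drift check: a population stability index
# (PSI) between a reference window and the current window of a monitored
# quantity (a feature value, a label rate or the final score).
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Compare two distributions; values above roughly 0.2 are often read
    as a sign of meaningful drift (a common rule of thumb, not a standard)."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    ref_frac = np.histogram(reference, edges)[0] / len(reference) + 1e-6
    cur_frac = np.histogram(current, edges)[0] / len(current) + 1e-6
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic example: the current window has shifted relative to the reference.
rng = np.random.default_rng(0)
print(psi(rng.normal(0.0, 1.0, 10_000), rng.normal(0.3, 1.0, 10_000)))
```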

 

Research, Analysis and Learning (ReAL) Team
Evidation Health

The Research, Analysis and Learning (ReAL) Team at Evidation Health addresses potential biases in data by identifying confounding, mediation and effect moderation or modification issues when possible, as well as by implementing counterfactual fairness metrics in predictive modeling.


What’s the most important thing to consider when choosing the right learning model for a given problem? And how does this help you get ahead of bias early on in the process?

It is important to note that bias happens before the data. Bias also happens in choosing models, both before and after seeing the data. Recognizing that there is bias in both humans and machines, and working to understand the downstream effects, is important in getting ahead of bias early in the process. 

Decisions that impact bias can happen anywhere in the machine learning pipeline: design, pre-processing and modeling. Detecting bias can be challenging, and the various approaches to mitigating it come with their own benefits and tradeoffs. In fact, there’s even a result, Chouldechova’s impossibility theorem, showing that common definitions of fairness cannot all be optimized at once.

An excerpt from “Machine intelligence in healthcare — perspectives on trustworthiness, explainability, usability and transparency” [published in the journal npj Digital Medicine] states: “Bias can be introduced into a system in a number of ways, e.g., (1) transfer of [machine intelligence] applications from one population to another with different distributions of features, often affecting protected characteristics with the potential of introducing racial, gender or social bias; (2) intentional or malicious introduction of bias in order to skew performance; (3) introduction of bias by the data, which is often caused by inadequate or narrowly defined datasets.”


 

What steps do you take to ensure your training data set is diverse and representative of different groups?

At Evidation, we strive to ensure proper epidemiological and econometric analytic handling of protected features such as race and gender. We do this across analyses by addressing confounding, mediation and effect moderation or modification whenever possible while implementing counterfactual fairness metrics in predictive modeling.

Tools that we leverage in practice include datasheets for datasets, which encourage transparency and accountability by giving the consumer information on how the data was collected, how it should be used and what potential biases exist, as well as the AI Fairness 360 toolkit.
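
For a sense of what such metrics compute, here is a hand-rolled sketch of two common group-fairness measures, statistical parity difference and disparate impact, of the kind the AI Fairness 360 toolkit provides; the predictions and the choice of privileged and unprivileged groups are assumptions for illustration.

```python
# Hand-rolled sketch of two common group-fairness measures; the data frame
# and the privileged/unprivileged split are assumptions for illustration.
import pandas as pd

preds = pd.DataFrame({
    "race":       ["A", "A", "B", "B", "B", "A"],
    "prediction": [1, 0, 1, 1, 0, 1],
})

positive_rate = preds.groupby("race")["prediction"].mean()
privileged, unprivileged = "A", "B"  # illustrative assignment

# Statistical parity difference: 0 means equal positive rates across groups.
spd = positive_rate[unprivileged] - positive_rate[privileged]
# Disparate impact: 1 means equal positive rates; values far below 1 are a flag.
di = positive_rate[unprivileged] / positive_rate[privileged]

print(spd, di)
```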

Data ethics is at the forefront of our mission.’’

 

When it comes to testing and monitoring your model over time, what’s a strategy you’ve found to be particularly useful for identifying and eliminating bias?

At Evidation, we nurture a culture where all are encouraged and empowered to raise concerns. Data ethics is at the forefront of our mission. Pairing our mission-driven ideals with our data handling, we routinely test models on sub-cohorts to see if performance holds up. This approach seeks to ensure that model performance is equal across different groups. For example, we used a group fairness approach in a COVID-19 detection model to establish that different gender and race sub-cohorts attained similar model performance (measured by the area under the receiver operating characteristic curve, in this case). However, it is important to recognize that performance parity is not a one-size-fits-all solution, as models can, unfortunately, be “fairwashed.”
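
A minimal sketch of such a sub-cohort performance check, using scikit-learn’s roc_auc_score on synthetic data; the cohort names, labels, scores and the interpretation of the gap are illustrative, not Evidation’s pipeline.

```python
# Sketch of a sub-cohort performance check on synthetic data, using
# scikit-learn's roc_auc_score; the groups, labels and scores are invented.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "group":  rng.choice(["cohort_a", "cohort_b"], size=2_000),
    "y_true": rng.integers(0, 2, size=2_000),
})
# Give positives higher scores on average so the synthetic model has signal.
df["y_score"] = np.where(df["y_true"] == 1,
                         rng.beta(3, 2, size=len(df)),
                         rng.beta(2, 3, size=len(df)))

auroc_by_group = {
    name: roc_auc_score(g["y_true"], g["y_score"])
    for name, g in df.groupby("group")
}
gap = max(auroc_by_group.values()) - min(auroc_by_group.values())

# A large gap prompts investigation; parity alone does not prove fairness.
print(auroc_by_group, gap)
```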

We believe that educating and training the team in bias, data ethics and causal inference; raising awareness of the societal impacts and issues that can arise from seemingly benign decisions; and learning how to identify design and analysis decisions that could introduce bias are all important for mitigating it. We also seek a diversity of backgrounds, experience and ideas on our team.
