Ghosts in the Machine

In a brightly lit office, Joy Buolamwini sits down at her computer and slips on a Halloween mask to trick the machine into perceiving her as white.

For Buolamwini, a black PhD student at MIT’s Center for Civic Media, electronic racial deception is sometimes the most efficient way she can do her job. Buolamwini’s research focuses on facial analysis, a suite of technologies used in everything from auto-focusing smartphone cameras to advertisements to border security. But there’s a problem with many of these algorithms—they sometimes can’t detect Buolamwini or people who look like her.

Joy Buolamwini often tests her software using a mask to overcome biases encoded in facial recognition algorithms.

That’s because facial detection algorithms made in the U.S. are frequently trained and evaluated using data sets that contain far more photos of white faces, and they’re generally tested and quality controlled by teams of engineers who aren’t likely to have dark skin. As a result, some of these algorithms are better at identifying lighter skinned people, which can lead to problems ranging from passport systems that incorrectly read Asians as having their eyes closed, to HP webcams and Microsoft Kinect systems that have a harder time recognizing black faces, to Google Photos and Flickr auto-tagging African-Americans as apes.

Coded machine bias can work against lighter-skinned people as well. Research shows that some facial analysis algorithms built in Asia tend to perform better with Asian faces than Caucasian ones. Algorithms may also show accuracy rates that vary along age or gender lines.

As computer vision systems become more widespread, these demographic effects can have serious consequences. A seminal 2012 study of three facial recognition algorithms used in law enforcement agencies found that the algorithms were 5–10% less accurate when reading black faces than white ones and showed similar discrepancies when analyzing the faces of women and younger people. A 2010 analysis by the National Institute of Standards and Technology (NIST) found that for some algorithms the opposite was true: people of color were easier to identify than Caucasians. But both studies showed that facial recognition programs aren’t equal opportunity. Bias in algorithms extends well beyond facial recognition, too, into things as disparate as car insurance rates and recommendations for criminal sentencing.

Joy Buolamwini wasn’t aware of these issues when she first encountered algorithmic bias as a computer science major at the Georgia Institute of Technology. Working on a research project that involved teaching a robot to play peek-a-boo, Buolamwini noticed that the robot had no trouble detecting the faces of her light-skinned roommates. But under the same lighting conditions, it didn’t work as well for her. She encountered the same problem in 2011 with another robot, but she didn’t think much of it until she began working more directly with facial recognition at MIT. One of Buolamwini’s early projects—a system called the Aspire Mirror that layers inspirational images, quotes, or even other faces over a reflection of the user’s face—worked well for users with lighter skin, but not so much for the woman who built it.

“I was getting frustrated. I drew a face on my palm and held it up to the camera and it detected the face on my palm. I was like ‘Oh this is ridiculous,’ ” she says. “Just being goofy, I put the white mask on to see what would happen, and lo and behold, it detected the white mask.”

Flawed Benchmarks

Facial analysis bias remains a problem in part because industry benchmarks used to gauge performance often don’t include significant age, gender, or racial diversity. For example, one popular benchmark for facial recognition is Labeled Faces in the Wild (LFW), a collection of more than 13,000 face photos. Tech giants including Google, Facebook, and Baidu—as well as a variety of smaller companies—have used the data set to measure algorithmic performance. LFW includes photos that represent a broad spectrum of lighting conditions, poses, background activity, and other metrics, but a 2014 analysis of the data set found that 83% of the photos are of white people and nearly 78% are of men. “It’s not necessarily diverse identities,” Buolamwini says.

A sample of faces from the research dataset known as Labeled Faces in the Wild.

Erik Learned-Miller, a computer science professor at the University of Massachusetts, Amherst, who co-created the data set, agrees that benchmarks like LFW “cannot be depended upon to evaluate algorithms for their fairness in face identification.” When LFW was released in 2007, he says, that was never the intention. Learned-Miller says that it’s critical for facial recognition vendors to conduct “exhaustive evaluations” of the technology’s accuracy—and not just on one group of users. But he suspects that many don’t, as there are few financial incentives to do so.

How bad is the bias problem? There isn’t a lot of research on the subject. There’s “limited evidence” of bias, racial or otherwise, in facial analysis algorithms, in part because there simply haven’t been many studies, says Patrick Grother, a computer scientist specializing in biometrics at the National Institute of Standards and Technology and lead author of the 2010 NIST study.

“There are anecdotes that certain people have trouble using them,” he adds. “But nobody has formally quantified it, and to formally quantify it, you would need a large amount of data.”

Buolamwini is one of a growing number of researchers fighting the problem. She is joined by a team of volunteers who support her nonprofit organization, the Algorithmic Justice League, which raises awareness of bias through public art and media projects, promotes transparency and accountability in algorithm design, and recruits volunteers to help test software and create inclusive data training sets. Buolamwini’s goal isn’t just to improve algorithms—it’s also to make AI more understandable and accessible to everyone.

“This domain of algorithmic bias is inhabited by the digerati or the Brahmin high priests of tech,” Buolamwini says. “But these aren’t necessarily the people who are going to be most affected by the decisions of these automated systems…What we want to do is to be able to build tools for not just researchers, but also the general public to scrutinize AI.”

Bias Busters

Exposing AI’s biases starts by scrapping the notion that machines are inherently objective, says Cathy O’Neil, a data scientist whose book, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, examines how algorithms impact everything from credit access to college admissions to job performance reviews. Even when parameters that guide algorithms are completely reasonable, she says, discriminatory choices can still sneak in through the cracks.

“When you make reasonable choices without thinking very hard about them, you’re automating the status quo, and the status quo might be unreasonable,” she says.

If, for example, a company wants to automate its hiring process, it might use an algorithm that’s taught to seek out candidates with similar profiles to successful employees—people who have stayed with the company for several years and have received multiple promotions. Both are reasonable and seemingly objective parameters, but if the company has a history of hiring and promoting men over women or white candidates over people of color, an algorithm trained on that data will favor resumes that resemble those of white men. Rejected applicants will probably never know why they didn’t make the cut, and it will be tough to hold anyone accountable since the decision was machine-made, not manmade, O’Neil says.
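The dynamic O’Neil describes can be made concrete in a few lines of code. Everything in the sketch below is hypothetical: synthetic data, invented feature names, and a deliberately simple "similarity to past successes" screener. The point it illustrates is that a model never shown a group label can still produce a skewed shortlist when the history it learns from is skewed.

```python
import random

random.seed(0)

# Toy screener: rank applicants by similarity to past "successful"
# employees. The group label is never used in scoring.

def make_candidate(group):
    # "school_x" is a stand-in proxy feature that happens to correlate
    # with group A in this synthetic history (much as a ZIP code might).
    school_x = 1 if random.random() < (0.8 if group == "A" else 0.2) else 0
    return {"group": group, "exp": random.gauss(5, 1), "school_x": school_x}

# Historical successes skew 90% group A -- the encoded status quo.
history = [make_candidate("A" if random.random() < 0.9 else "B")
           for _ in range(1000)]

# Profile of a "successful employee," built from neutral-looking features.
n = len(history)
centroid = (sum(c["exp"] for c in history) / n,
            sum(c["school_x"] for c in history) / n)

def score(c):
    # Higher score = resume looks more like past successes.
    return -((c["exp"] - centroid[0]) ** 2
             + (c["school_x"] - centroid[1]) ** 2)

# A new applicant pool that is perfectly balanced, 500 per group...
pool = [make_candidate(g) for g in ("A", "B") * 500]
# ...still yields a skewed shortlist, because the proxy feature drags
# group-B resumes away from the historical profile.
shortlist = sorted(pool, key=score, reverse=True)[:100]
share_a = sum(1 for c in shortlist if c["group"] == "A") / 100
```

Running this, roughly 80% of the shortlist comes from group A, even though the pool was split evenly and the scorer only ever saw experience and the proxy feature.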

NIST computer scientist Ross Micheals demonstrates a NIST-developed system for studying the performance of facial recognition software programs.

O’Neil adds that algorithms don’t need explicit data on race, gender, or socioeconomic status to exhibit bias. Risk assessment tools used in commercial lending and insurance, for example, may not ask direct questions about race or class identity, but the proprietary algorithms frequently incorporate other variables like ZIP code that would count against those living in poor communities.

Credit scores are another data point that can allow bias to creep into algorithms. In 2015, Consumer Reports published the results of a two-year-long investigation into car insurance pricing. They analyzed more than 2 billion price quotes across approximately 700 companies and found that a person’s financial life predicted their car insurance rate far better than their driving record did. Credit scores—which are affected by factors related to poverty but often not related to driving—factored into these algorithms so heavily that perfect drivers with low credit scores often paid substantially more than terrible drivers with high scores. In Florida, for example, an adult driver with a pristine record but a low credit score paid $1,552 more on average than a driver with great credit and a drunk driving conviction.

When practices like this are automated, it can create negative feedback loops that are hard to break, O’Neil says. Higher insurance prices for low-income people can translate to higher debt and plummeting credit scores, which can mean reduced job prospects, which allows debt to pile up, credit scores to sink lower, and insurance rates to increase in a vicious cycle. “Right now we have essentially no rules or regulations around algorithms, about the accountability of algorithms in particular,” O’Neil says.

Just knowing when an algorithm has made a mistake can be difficult. Last year, the investigative journalism nonprofit organization ProPublica released an analysis of COMPAS, a risk assessment tool that evaluates criminals to determine how likely they are to commit future crimes. They compared predicted recidivism rates of 10,000 criminal defendants in Broward County, Florida, with whether the defendants committed a crime over the next two years. The algorithm was equally accurate at predicting recidivism rates for black and white defendants, but black defendants who didn’t re-offend were nearly twice as likely to be classified as high-risk compared with similarly reformed white defendants. By contrast, white repeat offenders were twice as likely to be erroneously labeled as low-risk.
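The pattern at the heart of that dispute is easier to see with a small worked example. The counts below are illustrative, not ProPublica’s actual data; they are shaped so that overall accuracy is identical for both groups while the error types diverge in exactly the way the analysis described.

```python
# Each entry is a (predicted_high_risk, actually_reoffended) pair.

def error_rates(preds):
    """Return (accuracy, false positive rate, false negative rate)."""
    fp = sum(1 for p, y in preds if p and not y)   # flagged, didn't re-offend
    fn = sum(1 for p, y in preds if not p and y)   # cleared, did re-offend
    negatives = sum(1 for _, y in preds if not y)
    positives = sum(1 for _, y in preds if y)
    accuracy = sum(1 for p, y in preds if p == y) / len(preds)
    return accuracy, fp / negatives, fn / positives

# Hypothetical (prediction, outcome) tallies per group, expanded into pairs.
group_1 = [(1, 1)] * 300 + [(1, 0)] * 200 + [(0, 1)] * 100 + [(0, 0)] * 400
group_2 = [(1, 1)] * 200 + [(1, 0)] * 100 + [(0, 1)] * 200 + [(0, 0)] * 500

acc_1, fpr_1, fnr_1 = error_rates(group_1)   # accuracy 0.70, FPR 1/3, FNR 1/4
acc_2, fpr_2, fnr_2 = error_rates(group_2)   # accuracy 0.70, FPR 1/6, FNR 1/2
```

Both groups see 70% accuracy, yet group 1’s false positive rate is double group 2’s, and group 2’s false negative rate is double group 1’s. Whether that counts as "biased" depends entirely on which error rate you decide fairness should equalize.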

Equivant, the company behind COMPAS, rebutted ProPublica’s findings, and a separate analysis by Community Resources for Justice supported Equivant. ProPublica countered, but sussing out who’s correct underscores a major obstacle: Assessing an algorithm’s fairness depends on having an agreed-upon definition of fairness. In situations where there are multiple possible outcomes, as is the case with COMPAS, it may not be mathematically possible for an algorithm to satisfy every definition of fairness for every group at once, concluded an independent analysis by researchers from Cornell and Harvard.

“Everybody is in some sense right,” says Andrew Selbst, an attorney and postdoc at the Data and Society Research Institute who specializes in legal questions surrounding big data and machine learning. “The real issue is that we have, for a long time, been able to avoid being very clear as a society about what we mean by fairness and what we mean by discrimination.”

Law and Order and AI

There are laws that could provide some protection against algorithmic bias, but they aren’t comprehensive and have loopholes. Current anti-discrimination laws in sectors like education, housing, and employment prohibit both intentional discrimination—called “disparate treatment”—and unintentional “disparate impact,” which happens when neutral-sounding rules disproportionately affect a legally-protected group. (It’s currently against the law to unintentionally discriminate on the basis of sex, age, disability, race, national origin, religion, pregnancy, or genetic information.)

Proving disparate impact is notoriously difficult even when algorithms aren’t involved. Plaintiffs must first show that they were disproportionately and negatively affected by a policy or practice. In hiring, for example, a disparate impact is illegal only if there are alternative methods that are equally effective but less discriminatory. With algorithms, Selbst says, clear alternatives may not exist.
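As a sketch of what showing a disproportionate effect can look like in practice, the snippet below applies the EEOC’s "four-fifths rule" (a standard screening heuristic in U.S. employment law, though not one the article cites) to hypothetical hiring counts: a group whose selection rate falls below 80% of the best-off group’s rate invites scrutiny.

```python
# Four-fifths rule check over per-group hiring counts (numbers invented).

def four_fifths_check(selected, applicants):
    """selected / applicants: dicts mapping group -> count.
    Returns, per group, its selection-rate ratio to the best-off
    group and whether that ratio falls below the 0.8 threshold."""
    rates = {g: selected[g] / applicants[g] for g in applicants}
    top = max(rates.values())
    return {g: (rates[g] / top, rates[g] / top < 0.8) for g in rates}

# Hypothetical numbers: 50 of 200 group-A applicants hired,
# versus 15 of 150 group-B applicants.
result = four_fifths_check({"A": 50, "B": 15}, {"A": 200, "B": 150})
# Group B's rate (0.10) is only 40% of group A's (0.25) -> flagged.
```

A flag like this is only the first step; as the paragraph above notes, the harder legal question is whether an equally effective, less discriminatory alternative existed.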

“If someone had actually stepped up in the making of this and tested several different versions of the [software], then there would be alternatives and you should choose the one that’s the least discriminatory and most effective,” Selbst says, adding that organizations often don’t have incentive to fully evaluate software for fairness concerns. “If we want to have best practices, we should be testing a lot of versions of the software and not just relying on the first one that we’re presented with.”

Since algorithms are proprietary and frequently protected under non-disclosure agreements, organizations that use them, including both private companies and government agencies, may not have the legal right to conduct independent testing, Selbst says.

“It’s not completely clear that the companies or police departments or judiciaries that buy this software from these companies have done any testing whatsoever. In fact, it is clear in many cases that they don’t, and they’re not allowed to,” Selbst says.

The ability to audit an algorithm would answer some questions about bias, but one class of algorithms is moving beyond our current ability to analyze it. Artificial neural networks are one example. Fed huge amounts of data, they break it down into much smaller components, search for patterns, and essentially come up with their own algorithms, which can be incomprehensible to humans.

Computer scientists have also developed artificial neural networks that can write new AI programs without human input. Just this month, Google announced a major advance in this field—a system that writes its own machine learning code, one that outperformed code developed by its own makers.

Ben Shneiderman, a computer scientist at the University of Maryland, says that greater automation means greater concern over whether machine-built algorithms will exacerbate the bias problem.

Machine-built algorithms are “somewhat more concerning because if you’re automating a process, then you’re reducing the opportunities for a human being to check the bias,” he says. “The central technical improvement I’d like to see is a log of the actions of the algorithms in a way that’s interpretable and explainable.”

Explaining AI

Some researchers are working on that problem. Last May, the Defense Advanced Research Projects Agency (DARPA), home to the U.S. government’s most top-secret research and weaponry programs, launched a four-year initiative to encourage the development of new machine learning techniques that produce algorithms understandable to their users. The 12 research teams that received contracts under the Explainable AI program aim to help military forces understand the decisions made by autonomous systems on the battlefield and whether that technology should be used in the next mission, says David Gunning, Explainable AI’s program manager.

Some Explainable AI teams are adopting what Gunning calls a “model induction” approach that essentially reverse engineers how an algorithm makes a choice. Instead of starting with the algorithmic recipe itself, these teams are building software programs that run millions of experiments comparing the data going into the system with the decisions that come out, in hopes of finding patterns. From those patterns, they’ll create a model of what they think is happening in between.
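A miniature version of that probe-and-model idea can be sketched in code. All details below are invented for illustration: treat a model as a black box, query it thousands of times, and keep the simplest human-readable rule that best reproduces its answers.

```python
import random

random.seed(1)

# "Model induction" in miniature: probe an opaque model with many
# inputs, record its outputs, then fit a simple surrogate to the
# observed input -> output behavior.

def black_box(income, zip_risk):
    # Stand-in for a proprietary model whose insides we cannot see.
    return 1 if 0.7 * zip_risk - 0.3 * income > 0 else 0

# Millions of experiments in the real program; a few thousand here.
probes = [(random.random(), random.random()) for _ in range(5000)]
observed = [(x, black_box(*x)) for x in probes]

# Surrogate hypothesis space: single-feature threshold rules.
# Keep whichever rule agrees with the black box most often.
feature_names = ("income", "zip_risk")
best_acc, best_feat, best_t = 0.0, None, None
for feat in (0, 1):
    for t in [i / 100 for i in range(101)]:
        agree = sum(1 for x, y in observed if int(x[feat] > t) == y)
        acc = agree / len(observed)
        if acc > best_acc:
            best_acc, best_feat, best_t = acc, feat, t

# The recovered surrogate: "flag when zip_risk exceeds roughly 0.2" --
# a crude but inspectable picture of what the black box is doing.
```

The surrogate is only an approximation of the hidden model, which is exactly Gunning’s caveat: the explanation is a model of the model, and some behavior will always fall outside it.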

Explainable AI’s products aren’t being designed to root out bias, but what results from this program could help researchers evaluate algorithms and spot bias in the future. Gunning is quick to point out that the Explainable AI program has only just begun—right now, no single strategy looks more promising than another—and that some aspects of artificial intelligence may never be fully understood.

“I don’t think we’ll ever create perfect explanations for the most complex machine learning systems,” he says. “I think they’ll always be inherently difficult to explain.”

Overseeing the Indecipherable

That’s one reason why some are encouraging researchers and programmers to consider bias as they’re building tools. One group, the Partnership on Artificial Intelligence to Benefit People and Society, brings together academics, ethicists, and representatives from Amazon, IBM, Facebook, Microsoft, Google and Apple to hash out best practices. Other organizations like the AI Now Institute have issued recommendations for AI researchers, developers, and policy makers, while agencies like the National Institute of Standards and Technology are beefing up their programs for rooting out bias. This past February, NIST began offering evaluations of facial recognition technologies on an ongoing basis rather than every two to four years. This allows developers to continually check how their algorithms perform across diverse data sets.

Shneiderman says programs like this are a step in the right direction, but voluntary quality controls are not sufficient. He has proposed establishing a National Algorithm Safety Board which would oversee high-stakes algorithms and investigate problems. Operating similarly to the way the National Transportation Safety Board investigates vehicular accidents, Shneiderman’s safety board would be an independent agency that could require designers to assess the impact of their algorithms before deployment, provide continuous monitoring to ensure safety and stability, and conduct retrospective analyses of accidents to inform future safety procedures.

While Shneiderman’s proposed board, like the NTSB, wouldn’t have regulatory power, it would have significant investigative powers. “There needs to be some teeth” to the oversight board, Shneiderman says. “If they don’t have the power to investigate in a substantive way, if the people involved won’t talk to them, then there’s a limitation to how much they can accomplish.”

Some of these issues will probably be resolved through lengthy litigation, a process that’s already begun. Last year, the New York Supreme Court ruled that an algorithm used to evaluate a fourth-grade public school teacher’s job performance produced “arbitrary and capricious” results that were biased against teachers with both high- and low-performing students.

For the foreseeable future, Andrew Selbst says we should expect more lawsuits and regulatory activity as the field strives to establish standards for algorithmic transparency and accountability. “All of this is cutting edge research in law and technology,” he says. “It’s all sort of up in the air right now.”