Software engineer Henry Gan got a surprise last summer when he tested his team’s new facial recognition system on coworkers at startup Gfycat. The machine-learning software successfully identified most of his colleagues, but the system stumbled with one group. “It got some of our Asian employees mixed up,” says Gan, who is Asian. “Which was strange because it got everyone else correctly.”

Gan could take solace from the fact that similar problems have tripped up much larger companies. Research released last month found that facial-analysis services offered by Microsoft and IBM were at least 95 percent accurate at recognizing the gender of lighter-skinned men, but erred at least 10 times more frequently when examining photos of dark-skinned women. Both companies say they have improved their systems but declined to discuss exactly how. In January, WIRED found that Google’s Photos service is unresponsive to searches for the terms gorilla, chimpanzee, or monkey. The terms are blocked to prevent a repeat of a 2015 incident in which the service mistook photos of black people for apes.

The danger of bias in AI systems is drawing growing attention from both corporate and academic researchers. Machine learning shows promise for diverse uses such as enhancing consumer products and making companies more efficient. But evidence is accumulating that this supposedly smart software can pick up or reinforce social biases.

That’s becoming a bigger problem as research and software are shared more widely and more enterprises experiment with AI technology. The industry’s understanding of how to test, measure, and prevent bias has not kept up. “Lots of companies are now taking these things seriously, but the playbook for how to fix them is still being written,” says Meredith Whittaker, co-director of AI Now, an institute focused on ethics and artificial intelligence at New York University.

Gfycat dove into facial recognition to help people find the perfect animated GIF response when messaging friends. The company provides a search engine that trawls nearly 50 million looping clips, from kitten fails to presidential facial expressions. By adding facial recognition, executives thought they could improve the quality of searches for public figures like movie or music stars.

As a 17-person startup, Gfycat doesn’t have a giant AI lab inventing new machine learning tools. The company used open-source facial-recognition software based on research from Microsoft, and trained it with millions of photos from collections released by the Universities of Illinois and Oxford. But beyond mixing up Asian employees around the office, the system proved unable to distinguish Asian celebrities such as Constance Wu and Lucy Liu. It also performed poorly on people with darker skin tones.

Gan realized that although it was easier than ever to access powerful machine learning components, making them work equally well for different ethnic groups wasn’t easy. “Because of how the algorithm worked I expected it to be universally good,” he says. “Clearly that was not the case.”

He first tried to fix the problem by gathering more examples of the kinds of faces on which his software stumbled. The free datasets used to train the system included photos of celebrities and other public figures sourced online; they were balanced between men and women, but had a much higher percentage of white people than other races. Adding Asian and black celebrity faces from Gfycat’s own image collection helped only a little. The face recognizer still sometimes mixed up Asian faces, such as those of K-pop stars, one of the site’s most popular genres of GIFs.

The fix that finally made Gfycat’s facial recognition system safe for general consumption was to build in a kind of Asian detector. When a new photo comes in that the system determines is similar to the cluster of Asian faces in its database, it flips into a more sensitive mode, applying a stricter threshold before declaring a match. “Saying it out loud sounds a bit like prejudice, but that was the only way to get it to not mark every Asian person as Jackie Chan or something,” Gan says. The company says the system is now 98 percent accurate for white people and 93 percent accurate for Asians. Asked to explain the difference, CEO Richard Rabbat said only that “The work that Gfycat did reduced bias substantially.”
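
In code, the approach Gan describes might look something like the sketch below. The embeddings, cluster centroids, group labels, and threshold values are hypothetical placeholders, not Gfycat’s actual system:

```python
# A minimal sketch of group-aware matching, assuming precomputed face
# embeddings. All names and numbers here are illustrative placeholders.
import numpy as np

DEFAULT_THRESHOLD = 0.60   # assumed cosine-similarity cutoff for a match
STRICT_THRESHOLD = 0.75    # stricter cutoff for groups the model confuses more often

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_face(query_embedding, gallery, group_centroids, strict_groups):
    """Return the best gallery match, or None if it falls below the active threshold.

    gallery: list of (name, embedding) pairs for known faces.
    group_centroids: dict mapping a group label to the mean embedding of that group.
    strict_groups: set of group labels that trigger the stricter threshold.
    """
    # Decide which threshold applies by checking which cluster of faces
    # the incoming photo looks most like.
    nearest_group = max(
        group_centroids,
        key=lambda g: cosine_similarity(query_embedding, group_centroids[g]),
    )
    threshold = STRICT_THRESHOLD if nearest_group in strict_groups else DEFAULT_THRESHOLD

    # Standard nearest-neighbor search over the gallery of known faces.
    best_name, best_score = None, -1.0
    for name, embedding in gallery:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_name, best_score = name, score

    return best_name if best_score >= threshold else None
```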

Making software explicitly look for racial differences might seem like an odd way to make it treat different ethnic groups the same. But Erik Learned-Miller, a professor at the University of Massachusetts who works on facial recognition, says it makes sense. As facial recognition technology has been used and developed more broadly, disparities in accuracy for different groups have become more obvious, Learned-Miller says. Guiding the software by giving it an awareness of patterns of physical variation among humans, such as skin tone or facial structure, could be helpful.

That idea has some backing from research in academia and industry. Google researchers released a paper in December reporting a new accuracy milestone for software that detects smiles. They did it by building a system that also estimates whether a person is a man or a woman, and which of four racial groups they belong to. But the paper also includes an ethical disclaimer noting that AI systems shouldn’t be used to try to profile people by race, and that relying on just two categories for gender and four for race may not be appropriate in all cases.
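
The multi-task idea behind that paper can be sketched as a single network with separate output heads for smiling, gender, and a coarse group label. The layer sizes and names below are illustrative and assume a PyTorch setup; this is not the paper’s actual architecture:

```python
# A rough sketch of a multi-head model: one shared trunk feeding three
# classification heads. Everything here is a placeholder for illustration.
import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        # Shared trunk: in practice this would be a convolutional backbone;
        # a single linear layer stands in for it here.
        self.trunk = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64 * 3, embedding_dim),
            nn.ReLU(),
        )
        self.smile_head = nn.Linear(embedding_dim, 2)   # smiling / not smiling
        self.gender_head = nn.Linear(embedding_dim, 2)  # the paper's two gender categories
        self.group_head = nn.Linear(embedding_dim, 4)   # the paper's four coarse groups

    def forward(self, images):
        features = self.trunk(images)
        return (
            self.smile_head(features),
            self.gender_head(features),
            self.group_head(features),
        )

# Example: run a batch of four 64x64 RGB images of random noise through the model.
smile_logits, gender_logits, group_logits = MultiTaskFaceNet()(torch.randn(4, 3, 64, 64))
```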

Competitive pressure to deploy AI everywhere, and fast, has companies large and small grappling with tricky issues like face recognition and race on the fly.

Toronto-based Modiface provides technology that draws a virtual preview of makeup on live smartphone video. It has been used in apps from brands including Sephora; on Monday, Modiface was acquired by L’Oreal. The company has developed a careful workflow to ensure that the technology works equally well on all kinds of faces.

Modiface initially pooled several open datasets, like the ones Gfycat used, but they proved too unrepresentative, says CEO Parham Aarabi. The company had to fill out its coverage of ethnic minority groups by paying for extra images and tapping employees, friends, and family, he says. Modiface now has nearly a quarter of a million images, with at least 5,000 each from key ethnic groups such as Middle Eastern, Hispanic, and Asian. The company has also developed deep expertise in how certain facial features vary across ethnic groups. “We spend a lot of time making sure we get the eye boundaries correct across different ethnicities,” Aarabi says.
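
The kind of dataset audit those numbers imply can be sketched in a few lines. The group labels and the 5,000-image floor come from the figures Aarabi cites; everything else is a placeholder rather than Modiface’s actual tooling:

```python
# A hedged illustration of checking whether a training set meets a
# per-group minimum, with made-up counts in the example at the bottom.
from collections import Counter

MIN_IMAGES_PER_GROUP = 5000

def audit_coverage(labels):
    """labels: iterable of annotated group labels, one per training image."""
    counts = Counter(labels)
    # Groups present in the data but short of the floor, with the shortfall.
    gaps = {group: MIN_IMAGES_PER_GROUP - n
            for group, n in counts.items() if n < MIN_IMAGES_PER_GROUP}
    return counts, gaps

# Example with invented numbers:
counts, gaps = audit_coverage(
    ["Middle Eastern"] * 5200 + ["Hispanic"] * 6100 + ["Asian"] * 4300
)
print(counts)
print("Groups needing more images:", gaps)  # e.g. {'Asian': 700}
```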

Future entrepreneurs should get more resources to help them manage tricky issues like that—and may face pressure to report how their systems perform on different demographic groups. Some researchers working on bias in AI have proposed industry standards to require transparency about the limitations and performance of machine learning data and software.

Learned-Miller, of UMass, says that organizations deploying face recognition such as Facebook and the FBI should disclose statistics on the accuracy of their systems for different groups of people. “We need transparency,” he says.

A proposal released last week suggests machine learning datasets, software, and APIs like the face and image recognition services offered by cloud providers should all come with similar disclosures. The Microsoft, Georgia Tech, and NYU researchers behind the idea took inspiration from the electronics hardware industry, where components such as capacitors come with datasheets describing their limitations and operating parameters.
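
A machine-readable version of such a disclosure might look something like the sketch below. The field names are illustrative rather than the proposal’s actual schema, and the accuracy figures are the ones Gfycat reported above:

```python
# An illustrative "datasheet" record for a face recognition model, loosely
# inspired by the proposal described above; fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ModelDatasheet:
    name: str
    intended_use: str
    known_limitations: list = field(default_factory=list)
    accuracy_by_group: dict = field(default_factory=dict)  # group label -> accuracy

sheet = ModelDatasheet(
    name="example-face-matcher-v1",                       # hypothetical model name
    intended_use="Ranking GIF search results for public figures",
    known_limitations=["Not evaluated for security or law-enforcement use"],
    accuracy_by_group={"white": 0.98, "Asian": 0.93},      # figures Gfycat reported
)
print(sheet)
```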

Speaking in San Francisco Monday, Microsoft’s Timnit Gebru argued that something similar could help lessen the risk that people will accidentally make something biased or unethical with freely available machine learning tools. “They don’t come with recommended usage,” she said, at an event hosted by MIT Technology Review. “AI has a lot of opportunities but we really have to take safety and standardization and process seriously.”

Misreading Faces

  • In tests, face-analysis services from Microsoft and IBM were near perfect at identifying the gender of men with lighter skin, but frequently erred on images of women with dark skin.
  • Image collections used to train machine-learning systems associate shopping and washing with women, and coaching and shooting with men.
  • Google’s Photos service is unresponsive to searches for the terms gorilla, chimpanzee, or monkey.
