iMEdD Lab republishes William Crumpler’s article from the Technology Policy Blog of the Center for Strategic and International Studies in Washington, DC.
The accuracy of facial recognition systems, the confidence scores and the forward-looking risk management
Researchers have found that leading facial recognition algorithms have different accuracy rates for different demographic groups. The first study to demonstrate this result was a 2003 report by the National Institute of Standards and Technology (NIST), which found that female subjects were more difficult for algorithms to recognize than male subjects, and young subjects more difficult to recognize than older subjects. In 2018, researchers from MIT and Microsoft made headlines with a report showing that gender classification algorithms (which are related to, though distinct from, face identification algorithms) had error rates of just 1% for white men, but almost 35% for dark-skinned women. The most thorough investigation of this disparity was completed by NIST in 2019. Through its testing, NIST confirmed that a majority of algorithms exhibit demographic differences in both false negative rates (rejecting a correct match) and false positive rates (matching to the wrong person).
NIST found that demographic factors had a much larger effect on false positive rates, where the error rate could differ between demographic groups by a factor of ten or even one hundred, than on false negative rates, where differences were generally within a factor of three. Differences in false positive rates are generally of greater concern, as there is usually greater risk in misidentifying someone than in having someone be incorrectly rejected by a facial recognition system (as when your iPhone doesn’t log you in on the first try). NIST found that Asians, African Americans, and American Indians generally had higher false positive error rates than white individuals, women had higher false positive rates than men, and children and the elderly had higher false positive rates than middle-aged adults.
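To make the two error types concrete, here is a minimal Python sketch of how false positive and false negative rates could be computed per demographic group and then compared as a ratio. It is an illustration only, not NIST's methodology: the function names, the group labels, and every count in the toy data are assumptions invented for this example.

```python
from collections import defaultdict


def error_rates_by_group(trials):
    """Compute per-group error rates.

    trials: iterable of (group, same_person, matched) tuples, where
    same_person says whether the two images really show one person and
    matched is the system's decision.
    """
    counts = defaultdict(lambda: {"fp": 0, "impostor": 0, "fn": 0, "genuine": 0})
    for group, same_person, matched in trials:
        c = counts[group]
        if same_person:
            c["genuine"] += 1
            if not matched:
                c["fn"] += 1  # false negative: a correct match was rejected
        else:
            c["impostor"] += 1
            if matched:
                c["fp"] += 1  # false positive: matched to the wrong person
    return {
        group: {
            "fpr": c["fp"] / c["impostor"] if c["impostor"] else None,
            "fnr": c["fn"] / c["genuine"] if c["genuine"] else None,
        }
        for group, c in counts.items()
    }


def disparity_ratio(rates, metric):
    """Ratio of the highest to the lowest non-zero rate across groups."""
    values = [r[metric] for r in rates.values() if r[metric]]
    return max(values) / min(values) if values else None


def make_trials(group, impostor, false_pos, genuine, false_neg):
    """Build hypothetical trial records with the given error counts."""
    trials = [(group, False, i < false_pos) for i in range(impostor)]
    trials += [(group, True, i >= false_neg) for i in range(genuine)]
    return trials


# Invented toy data, not real measurements.
toy_trials = (
    make_trials("group_a", impostor=1000, false_pos=1, genuine=100, false_neg=2)
    + make_trials("group_b", impostor=1000, false_pos=10, genuine=100, false_neg=6)
)

rates = error_rates_by_group(toy_trials)
print("FPR disparity:", round(disparity_ratio(rates, "fpr"), 2))
print("FNR disparity:", round(disparity_ratio(rates, "fnr"), 2))
```

The toy counts were chosen so that the false positive gap between the two groups is much wider than the false negative gap, mirroring the pattern described above.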
However, NIST also came to several encouraging conclusions. The first is that differences between demographic groups were far lower in algorithms that were more accurate overall. This means that as facial recognition systems continue to improve, the effects of bias will be reduced. Even more promising was that some algorithms demonstrated no discernible bias whatsoever, indicating that bias can be eliminated entirely with the right algorithms and development processes. The most important factor in reducing bias appears to be the selection of training data used to build algorithmic models. If algorithms are trained on datasets that contain very few examples of a particular demographic group, the resulting model will be worse at accurately recognizing members of that group in real-world deployments. NIST’s researchers theorized that this may be the reason many algorithms developed in the United States performed worse on Asian faces than algorithms developed in China. Chinese teams likely used training datasets with greater representation of Asian faces, improving their performance on that group.
Because training data selection matters so much to the performance and bias of facial recognition algorithms, these datasets have become an increasingly popular target for regulatory proposals. The EU, for example, recently proposed that a regulatory framework for high-risk AI systems like facial recognition include requirements that training data be “sufficiently broad” and reflect “all relevant dimensions of gender, ethnicity and other possible grounds of prohibited discrimination.” Training data audits to confirm the quality of training datasets could become an important tool for addressing the risks of bias in facial recognition. However, an expanded audit regime could face resistance from developers, who will oppose adding time or cost to the development process or opening any part of their algorithm to third-party investigation.
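As a rough illustration of what one small piece of a training data audit might look like, the Python sketch below tallies how each demographic group is represented in a dataset's labels and flags groups that fall below a minimum share. The function name, the group labels, the toy counts, and the 10% threshold are all assumptions made for this example. No regulator or standard cited here prescribes them, and a real audit would also need to examine image quality, label accuracy, and intersecting attributes.

```python
from collections import Counter


def representation_report(labels, min_share=0.10):
    """Summarise each group's share of a dataset.

    labels: one demographic label per training image (hypothetical metadata).
    min_share: illustrative threshold below which a group is flagged.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for group, n in counts.items():
        share = n / total
        report[group] = {
            "count": n,
            "share": share,
            "under_represented": share < min_share,
        }
    return report


# Invented toy dataset: 1,000 labels across three hypothetical groups.
toy_labels = ["group_a"] * 700 + ["group_b"] * 250 + ["group_c"] * 50

for group, row in representation_report(toy_labels).items():
    flag = "FLAG" if row["under_represented"] else "ok"
    print(f"{group}: {row['count']} images, {row['share']:.1%} ({flag})")
```

A share-based check like this only catches the simplest problem, a group that is nearly absent from the data. It says nothing about whether the images for each group are of comparable quality.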
Government action will be necessary to encourage the adoption of training data audit practices. The easiest first step would be to update procurement policies at the state, local, and federal levels to ban government purchases from facial recognition vendors that have not passed an algorithmic audit that includes an evaluation of training data for bias. These audits could be undertaken by a regulator or by independent assessors accredited by the government. At a minimum, this should be required by law or policy for high-risk uses like law enforcement deployments. Federal policymakers could also help to reduce bias risks by empowering NIST to oversee the construction of public, demographically representative datasets that any facial recognition company could use for training.
However, bias can manifest not only in the algorithms being used, but also in the watchlists these systems are matching against. Even if an algorithm shows no difference in its accuracy between demographics, its use could still result in a disparate impact if certain groups are over-represented in databases. African American males, for example, are disproportionately represented in the mugshot databases many law enforcement facial recognition systems use for matching. This is the result of larger social trends, but if facial recognition becomes a common policing tool, this could mean that African American males will be more frequently identified and tracked since many are already enrolled in law enforcement databases. Unlike the question of differential accuracy, this is not a problem that can be solved with better technology.
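The arithmetic behind this point can be sketched in a few lines of Python. The example assumes a deliberately simplified model in which a person can only be identified if they are already enrolled in the database, and in which the algorithm's true match rate is identical for every group. The function name, the population sizes, the enrollment counts, and the match rate are all invented for illustration and do not describe any real database.

```python
def identification_rate(enrolled, population, true_match_rate):
    """Chance that a random member of a group is identified when scanned.

    Simplified model: only enrolled people can be matched, and the
    algorithm finds an enrolled person with probability true_match_rate.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return (enrolled / population) * true_match_rate


# Hypothetical figures: same population, same algorithm accuracy,
# but group_b is enrolled three times as often as group_a.
SAME_MATCH_RATE = 0.99
rate_a = identification_rate(enrolled=20_000, population=1_000_000,
                             true_match_rate=SAME_MATCH_RATE)
rate_b = identification_rate(enrolled=60_000, population=1_000_000,
                             true_match_rate=SAME_MATCH_RATE)

print(f"group_a identified: {rate_a:.2%}")
print(f"group_b identified: {rate_b:.2%}")
print(f"ratio: {rate_b / rate_a:.1f}x")
```

Under these assumptions the gap in identification rates comes entirely from who is enrolled, since the accuracy term is the same for both groups and cancels out of the ratio.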
This highlights the importance of shifting the conversation around the risks of facial recognition. Increasingly, the primary risks will not come from instances where the technology fails, but rather from instances where the technology works exactly as it is meant to. Continued improvements to technology and training data will slowly eliminate the existing biases of algorithms, reducing many of the technology’s current risks and expanding the benefits that can be gained from responsible use. But this will also make deployments more attractive to operators, creating new sets of concerns. As policymakers consider how best to construct governance systems to manage facial recognition, they should ensure their solutions are tailored to where the technology is heading, not where it is today. Bias in facial recognition algorithms is a problem with more than one dimension. Technical improvements are already contributing to the solution, but much will continue to depend on the decisions we make about how the technology is used and governed.
Source: William Crumpler, The Problem of Bias in Facial Recognition, Technology Policy Blog, CSIS: Technology Policy Program, May 1, 2020
*The author of the article, William Crumpler, is a research assistant with the Technology Policy Program at the Center for Strategic and International Studies in Washington, DC.
This piece is reposted. The article was originally published on the Technology Policy Blog, which is produced by the Technology Policy Program at CSIS, a private, tax-exempt institution focusing on international public policy issues. Its research is nonpartisan and nonproprietary. CSIS does not take specific policy positions. Accordingly, all views, positions, and conclusions expressed in this publication should be understood to be solely those of the author.