iMEdD Lab republishes William Crumpler’s article from the Technology Policy Blog of the Center for Strategic and International Studies in Washington, DC.
Facial recognition has improved dramatically in only a few years. As of April 2020, the best face identification algorithm has an error rate of just 0.08% compared to 4.1% for the leading algorithm in 2014, according to tests by the National Institute of Standards and Technology (NIST)1. As of 2018, NIST found that more than 30 algorithms had achieved accuracies surpassing the best performance achieved in 2014. These improvements must be taken into account when considering the best way to regulate the technology. Government action should be calculated to address the risks that come from where the technology is going, not where it is currently. Further accuracy gains will continue to reduce risks related to misidentification, and expand the benefits that can come from proper use. However, as performance improvements create incentives for more widespread deployment, the need to assure proper governance of the technology will only become more pressing.
What is facial recognition?
Facial recognition systems are a sub-field of AI technology that can identify individuals from images and video based on an analysis of their facial features. Today, facial recognition systems are powered by deep learning, a form of AI that operates by passing inputs through multiple stacked layers of simulated neurons in order to process information. These neural networks are trained on thousands or even millions of examples of the types of problems the system is likely to encounter, allowing the model to “learn” how to correctly identify patterns from the data. Facial recognition systems use this method to isolate certain features of a face that has been detected in an image—like the distance between certain features, the texture of an individual’s skin, or even the thermal profile of a face—and compare the resulting facial profile to other known faces to identify the person.
Broadly, facial recognition systems can be used to accomplish one of two different tasks: verification or identification. Verification (also known as 1:1 matching) is used to confirm that a person is who they say they are. An example of verification is when a person uses their face to unlock their smartphone, sign into a banking app, or verify their identity when boarding a plane. A sample image is taken of a person’s face during login, which is then compared to a known image of the person they claim to be. Facial recognition algorithms tend to have good accuracy on verification tasks, because the subject usually knows they are being scanned and can position themselves to give the camera a clear view of their face.
Identification (also known as 1:N or 1:many matching) is when software takes an unknown face and compares it to a large database of known faces to determine the unknown person’s identity. Identification can be used on “cooperative” subjects who know they are being scanned, or “uncooperative” ones who do not. The latter has drawn the most attention, due to the fear that law enforcement or private businesses will use the technology to remotely gather data about individuals without their knowledge. However, remote identification can also be used to identify suspects from surveillance footage, track down missing persons or the victims of kidnapping, and improve private sector services. Remote identification systems tend to have lower accuracies compared to verification systems, because it is harder for fixed cameras to take consistent, high-quality images of individuals moving freely through public spaces.
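The difference between the two tasks can be sketched in code. Real systems derive face embeddings with deep neural networks; the sketch below assumes those embeddings already exist as numeric vectors, and the 0.8 similarity threshold is an arbitrary illustrative value, not one used by any particular vendor:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two face-embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe, claimed_identity, threshold=0.8):
    """Verification (1:1): does the probe face match the claimed identity?"""
    return cosine_similarity(probe, claimed_identity) >= threshold

def identify(probe, gallery, threshold=0.8):
    """Identification (1:N): search a gallery of known faces for the best
    match; return None if nothing clears the threshold."""
    best_name = max(gallery, key=lambda name: cosine_similarity(probe, gallery[name]))
    best_score = cosine_similarity(probe, gallery[best_name])
    return (best_name, best_score) if best_score >= threshold else (None, best_score)
```

Verification compares against a single reference, while identification scans an entire database, which is one reason 1:N matching is the harder problem.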
How accurate is facial recognition?
In ideal conditions, facial recognition systems can have near-perfect accuracy. Verification algorithms used to match subjects to clear reference images (like a passport photo or mugshot) can achieve accuracy scores as high as 99.97% on standard assessments like NIST’s Facial Recognition Vendor Test (FRVT)2. This is comparable to the best results of iris scanners3. This kind of face verification has become so reliable that even banks feel comfortable relying on it to log users into their accounts.
However, this degree of accuracy is only possible in ideal conditions where there is consistency in lighting and positioning, and where the facial features of the subjects are clear and unobscured. In real-world deployments, accuracy rates tend to be far lower. For example, the FRVT found that the error rate for one leading algorithm climbed from 0.1% when matching against high-quality mugshots to 9.3% when matching instead to pictures of individuals captured “in the wild,” where the subject may not be looking directly at the camera or may be obscured by objects or shadows4. Aging is another factor that can severely impact error rates, as changes in subjects’ faces over time can make it difficult to match pictures taken many years apart. NIST’s FRVT found that many middle-tier algorithms showed error rates increasing by almost a factor of 10 when attempting to match to photos taken 18 years prior5.
Sensitivity to external factors can be most clearly seen when considering how facial recognition algorithms perform on matching faces recorded in surveillance footage. NIST’s 2017 Face in Video Evaluation (FIVE) tested algorithms’ performance when applied to video captured in settings like airport boarding gates and sports venues. The test found that when using footage of passengers entering through boarding gates—a relatively controlled setting—the best algorithm had an accuracy rate of 94.4%6. In contrast, leading algorithms identifying individuals walking through a sporting venue—a much more challenging environment—had accuracies ranging between 36% and 87%, depending on camera placement7.
The FIVE results also demonstrate another major issue with facial recognition accuracy—the wide variation between vendors. Though one top algorithm achieved 87% accuracy at the sporting venue, the median algorithm achieved just 40% accuracy working off imagery from the same camera8. NIST’s tests on image verification algorithms found that many facial recognition providers on the market may have error rates several orders of magnitude higher than the leaders9. Though some vendors have constructed highly accurate facial recognition algorithms, the average provider on the market still struggles to achieve similar reliability, and even the best algorithms are highly sensitive to external factors. According to NIST, this large accuracy range between vendors “indicates that face recognition software is far from being commoditized”.
What are confidence scores, and why are they important?
It is also important to consider the effect on accuracy when adjusting algorithms to avoid false positives. Because facial recognition will likely be used in contexts where the user will want to minimize the risk of mistakenly identifying the wrong person—like when law enforcement uses the technology to identify suspects—algorithms are often set to only report back a match if they have a certain degree of confidence in their assessment. The use of these confidence thresholds can significantly lower match rates for algorithms by forcing the system to discount correct but low-confidence matches. For example, one indicative set of algorithms tested under the FRVT had an average miss rate of 4.7% on photos “from the wild” when matching without any confidence threshold. Once a threshold requiring the algorithm to only return a result if it was 99% certain of its finding was imposed, the miss rate jumped to 35%. This means that in around 30% of cases, the algorithm identified the correct individual, but did so at below 99% confidence, and so reported back that it did not find a match.
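The mechanism described above can be illustrated with a toy simulation. The confidence scores below are an invented distribution, not NIST data; the point is only to show how a strict threshold turns correct but low-confidence matches into reported misses:

```python
import random

def miss_rate(correct_match_scores, threshold):
    """Share of genuinely correct matches the system fails to report
    because their confidence falls below the threshold."""
    misses = sum(1 for s in correct_match_scores if s < threshold)
    return misses / len(correct_match_scores)

# Hypothetical confidence scores for searches in which the right person
# is in the database and the algorithm ranked them first (illustrative
# distribution, not measured data).
random.seed(0)
scores = [random.uniform(0.7, 1.0) for _ in range(10_000)]

no_threshold = miss_rate(scores, threshold=0.0)   # every correct hit is reported
strict = miss_rate(scores, threshold=0.99)        # low-confidence hits are discarded
```

Raising the threshold never adds correct matches; it only suppresses them, which is exactly the trade made to keep false positives out.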
Introducing confidence thresholds like this is important in situations where a human is not reviewing the matches made by the algorithm, and where any mistakes could have serious effects on those being misidentified. In these cases, higher miss rates may be preferable to allowing false positives, and strict confidence thresholds should be applied to prevent adverse impacts. However, when facial recognition is used for what is often termed investigation—simply returning a list of possible candidates for human operators to review—confidence thresholds are usually reduced, as humans are checking the results and making the final decision about how to use the information that is returned. In these cases, facial recognition is just a tool to speed human identification rather than being used for identification itself. Theoretically, the rate of incorrect matches from lineups generated this way should be no higher than if the technology had not been used, since in both cases humans are the ones ultimately doing the matching. However, there are some concerns that human operators could be biased towards accepting the conclusions reached by the algorithm if certain matches were returned with higher confidence scores than others.
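The operational distinction between investigation and automated decision-making can be sketched as two modes over the same match scores. The function names and the 0.99 threshold here are illustrative assumptions, not taken from any vendor’s API:

```python
def investigate(probe, gallery, similarity, k=5):
    """Investigation mode: return the top-k candidates with their scores
    for a human examiner to review; nothing is accepted automatically."""
    ranked = sorted(((name, emb) for name, emb in gallery.items()),
                    key=lambda pair: similarity(probe, pair[1]), reverse=True)
    return [(name, similarity(probe, emb)) for name, emb in ranked[:k]]

def automated_match(probe, gallery, similarity, threshold=0.99):
    """Automated mode: report an identity only above a strict confidence
    threshold; otherwise report that no match was found."""
    (name, score), *_ = investigate(probe, gallery, similarity, k=1)
    return name if score >= threshold else None
```

The first mode hands a ranked list to a person; the second makes the call itself, which is why it warrants the far stricter threshold.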
Understanding the proper role of confidence thresholds is essential when considering the way facial recognition is being deployed. In 2018, the ACLU made headlines with their finding that Amazon’s facial recognition technology incorrectly matched 28 members of Congress with people who had been arrested. In their test, the ACLU input photos of members of Congress, and searched a database of 25,000 mug shots of arrested individuals to see if the system would return any matches. In 28 instances (around 5% of all members tested), Amazon returned a match. The ACLU argued that these results show that facial recognition is not yet accurate enough to be deployed without the serious risk of abuses caused by incorrect matches.
The ACLU ran its search using a confidence threshold of 80%, Amazon’s default threshold. This is a very low confidence level, and far below Amazon’s recommended threshold of 95% for law enforcement activities. Amazon argued that if the system had been calibrated according to its guidelines, it is likely that few if any of these matches would have been returned. The ACLU and others have noted that regardless of Amazon’s recommendations, most users will simply use the system in its default configuration without taking the time to adjust the threshold. Indeed, in 2019 the Washington County Sheriff’s Office in Oregon—a customer of Amazon’s facial recognition product—stated that they do not set or utilize confidence thresholds when using the system. This highlights the importance of ensuring that operators using facial recognition for sensitive uses have proper training and oversight to ensure systems are properly configured. If facial recognition matches are to be used as evidence, or to inform automated decision-making, a far more robust process is required to protect citizens from abuse.
However, as facial recognition is overwhelmingly used to simply generate leads, criticism of the technology based solely on instances of false matches misrepresents the risk. When facial recognition is used for investigation, most investigators know that the vast majority of matches will be false. In these cases, the point is to return a broad range of potential candidates of whom the vast majority, if not all, will be discarded by operators. This does not mean that there are no risks to the use of facial recognition for investigation, but rather that any eventual governance framework for the technology will have to account for the fact that these systems will be used in a variety of different ways and that each creates a different set of risks.
The importance of forward-looking risk management
Facial recognition technology has inspired widespread anxiety due to its potential for abuse. Understanding the reality about how accuracy affects these risks will be crucial for policymakers trying to craft protections for their citizens while preserving the benefits the technology could bring. The rapidly improving accuracy of facial recognition systems will help to prevent harms stemming from misidentification, but could also increase other risks by making the technology more attractive for those who may abuse it. At the same time, improvements to the technology will increase the potential value that can be gained by law enforcement and private businesses, creating pressures to allow adoption to continue to expand. As policymakers work to regulate the use of these technologies, it will be important for them to focus on the risks and opportunities associated with where the technology is headed, rather than fixating on concerns about where it is right now.
Measures to protect against misidentification will always be important, as facial recognition will never be 100% accurate. Today, these protections are particularly important as many vendors still do not have systems that operate at extremely high accuracies, and even the best algorithms still struggle in more challenging real-world settings. Increasingly, however, the risks of facial recognition will not stem from instances where the technology fails, but rather instances where it succeeds. Unconstrained use of facial recognition in public spaces could allow for the unprecedented collection of behavioral and movement data about individuals, creating opportunities for economic exploitation, political manipulation, discrimination, and more. However, if properly governed, facial recognition technology could also bring substantial benefits to security and accessibility. Policymakers are now facing the question of how to balance these interests for the good of their citizens, but first they must understand the true strengths, weaknesses, and potential of facial recognition systems.
1Comparing rank-1 FNIR at N=1.6M FRVT 2018 mugshot photos for 2020 Yitu-004 algorithm (0.0008) and 2014 NEC-30 algorithm (0.041). Source: Patrick Grother, Mei Ngan, and Kayee Hanaoka, “FRVT Part 2: Identification,” March 27, 2020, https://pages.nist.gov/frvt/reports/1N/frvt_1N_report.pdf and Patrick Grother, Mei Ngan, and Kayee Hanaoka, “FRVT Part 2: Identification”, November 2018, https://nvlpubs.nist.gov/nistpubs/ir/2018/NIST.IR.8238.pdf.
2yitu-003 algorithm achieved a 0.0003 (0.03%) false non-match rate (FNMR) at a false match rate (FMR) of 1e-4 for visa photos. Source: “FRVT 1:1 Leaderboard,” NIST, February 27, 2020, https://pages.nist.gov/frvt/html/frvt11.html#overview.
3NIST testing in 2018 found that the most accurate iris scans achieve accuracy rates of 99.43%, with more than 50% achieving accuracy rates above 98%. Source: George W. Quinn, Patrick Grother, and James Matey, “IREX IX Part One Performance of Iris Recognition Algorithms,” NISTIR 8207, April 2018, https://nvlpubs.nist.gov/nistpubs/ir/2018/NIST.IR.8207.pdf.
4NEC-002 FNMR at N=1.6M, R=1 on FRVT 2018 mugshots, and N=1.1M and R=1 on wild photos. Grother et al. “FRVT Part 2: Identification,” March 27, 2020, https://pages.nist.gov/frvt/reports/1N/frvt_1N_report.pdf.
5Average change in error rate for six indicative mid-tier algorithms selected by Grother et al. was 9.24x for identification mode, FPIR=0.001. Patrick Grother, Mei Ngan, and Kayee Hanaoka, “Face Recognition Vendor Test (FRVT) Part 2: Identification” NIST, NISTIR 8271, September 2019, https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8271.pdf.
6M32V FNIR of 0.056 at FP(T)=1 on dataset U. Patrick Grother, George Quinn, and Mei Ngan, “Face in Video Evaluation (FIVE): Face Recognition of Non-Cooperative Subjects,” NIST, NISTIR 8173, March 2017, https://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8173.pdf.
7M32V FNMR at low 6ft placement and door 8ft placement on dataset P. Grother et al. “FIVE: Face Recognition of Non-Cooperative Subjects”, March 2017, https://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8173.pdf.
8M32V FNMR vs. N31V FNMR at low 6ft placement on dataset P. Grother et al. “FIVE: Face Recognition of Non-Cooperative Subjects,” March 2017, https://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8173.pdf.
9The leading verification algorithm as of February 27, 2020—yitu-003—had a false non-match rate (FNMR) of 0.0003 (0.03%) for visa photos with an FMR of 1e-4. intellicloudai-001, the median algorithm, had an FNMR of 0.0064 (0.64%). Allgovision-000, which achieved the third quartile of accuracy scores, had an FNMR of 0.0210 (2.1%). Source: Patrick Grother, Mei Ngan, and Kayee Hanaoka, “Ongoing Face Recognition Vendor Test (FRVT) Part 1: Verification”, NIST, February 28, 2020, https://pages.nist.gov/frvt/reports/11/frvt_report_2020_02_27.pdf
Source: William Crumpler, How Accurate are Facial Recognition Systems – and Why Does It Matter?, Technology Policy Blog, CSIS: Technology Policy Program, April 14, 2020
*The author of the article, William Crumpler, is a research assistant with the Technology Policy Program at the Center for Strategic and International Studies in Washington, DC.
This piece is reposted. The article was originally published on the Technology Policy Blog, which is produced by the Technology Policy Program at CSIS, a private, tax-exempt institution focusing on international public policy issues. Its research is nonpartisan and nonproprietary. CSIS does not take specific policy positions. Accordingly, all views, positions, and conclusions expressed in this publication should be understood to be solely those of the author.