It’s a scene that is played over and over again on TV. A grainy image appears on screen, a person says “enhance,” a computer beeps, the image sharpens, and a red square flashes around a face. Some variation on “Got ’em!” is inevitably what’s said next as a pair of detectives grab their jackets on the way out the door to nab the newly identified suspect. But that’s not at all what happened in the days after April 15, 2013, when terrorists struck at the Boston Marathon.
Following the explosions, people immediately looked to video from surveillance cameras for answers. The two crude but deadly devices had exploded in a wealthy commercial district thick with recording devices. On top of that, the terrorists had targeted one of the most famous sporting events in the world. Thousands of people were filming and photographing the race. Somewhere in this pile of images there had to be at least a few frames that captured the bombers. Once found, it would simply be a matter of running those images through facial recognition algorithms to pluck the suspects’ names out of a virtual line-up.
If only it were that easy. To find the suspects, investigators pored over hundreds of hours of video footage and countless photographs searching for someone who stood out. The real break came from one of the victims, Jeff Bauman, who had lost both his legs below the knees and was lying in a hospital bed. He scrawled on a piece of paper, “Bag, saw the guy, looked right at me.”
With Bauman’s description, investigators were able to isolate images in which that suspect, whom we now know to be Tamerlan Tsarnaev, appeared. They followed him frame by frame until footage revealed a co-conspirator, now identified as Tamerlan’s younger brother Dzhokhar. Yet despite the dozens of images of the pair, the FBI’s sophisticated algorithms were unable to match the suspects’ faces to names. It wasn’t because they weren’t in the system—both alleged bombers had Massachusetts driver’s licenses and Tamerlan was in the FBI database. It was because automated face recognition simply wasn’t up to the task.
Instead, the brothers were identified the old-fashioned way: The FBI held a press conference, publicized photographs of the suspects, and asked for the public’s assistance. Soon tips started pouring in, including one from the Tsarnaev brothers’ aunt. In the end, it couldn’t have been more low-tech.
So why couldn’t automated face recognition identify the bombers? What about the human mind makes it so adept at picking a face out of a crowd? And what will it take for computers to match our own ability? The first question is easy to answer; the other two, less so, and they reveal the limitations not only of computers but also of our own understanding of how we recognize a face.
Face recognition outside of a controlled environment is no simple task. Most facial recognition algorithms excel in matching one image of an isolated face with another, say a driver’s license or a passport. In those situations, three important parameters—pose, illumination, and expression—are tightly controlled, says Anil Jain, a computer scientist and expert on biometric identification at Michigan State University. Pose is how the subject is positioned relative to the camera—ideally straight-on. Illumination is the lighting conditions, which should be bright enough for the camera to capture all of the individual’s features but not so bright that they are overexposed. Finally, the subject should be holding a neutral expression. Controlling these parameters helps minimize variation between any two images. “As long as we can control that, these three factors, we have very efficient face recognition systems available in the market,” Jain says. Under ideal conditions, he says, they can be up to 99% accurate.
But once those conditions start to deviate from the ideal—the lighting dims or a subject isn’t looking straight into the camera, for example—accuracy drops significantly. Skewed images, where poses are anything but straight on, are one of the biggest challenges. Accurate face recognition requires reference points so the computer can correct for differences in distance and perspective between images. If one image is tilted relative to another, the computer can eliminate that variation by lining up the reference points. The eyes are the most widely used reference points, in part because they are the easiest for a computer to extract from an image. But they’re also a limitation. “If you cannot locate the two eye points,” Jain says, “that starts causing a problem.”
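The idea of lining up eye points can be sketched in a few lines of code. The function below is purely illustrative—not the FBI’s or any commercial system’s actual implementation—and it assumes some upstream detector has already supplied the pixel coordinates of the two eyes. Given those, it computes the rotation and scaling that would map the detected eye pair onto fixed canonical positions, removing the tilt and distance variation the article describes.

```python
import numpy as np

def eye_alignment(left_eye, right_eye,
                  canonical_left=(30.0, 40.0), canonical_right=(70.0, 40.0)):
    """Illustrative sketch: given detected eye coordinates (x, y),
    return the rotation angle (degrees) and scale factor that map
    them onto canonical positions, so two differently tilted or
    differently sized face images can be compared on equal footing."""
    src = np.asarray(right_eye, dtype=float) - np.asarray(left_eye, dtype=float)
    dst = np.asarray(canonical_right) - np.asarray(canonical_left)
    # Rotation: difference between the detected and canonical eye-line angles.
    angle = np.degrees(np.arctan2(src[1], src[0]) - np.arctan2(dst[1], dst[0]))
    # Scale: ratio of canonical to detected inter-eye distance.
    scale = np.linalg.norm(dst) / np.linalg.norm(src)
    return angle, scale

# A face tilted 45 degrees with widely spaced eyes needs rotating back
# and shrinking before comparison.
angle, scale = eye_alignment((0.0, 0.0), (100.0, 100.0))
```

With only one eye visible—the profile-view problem Jain describes—`src` cannot be formed at all, which is exactly where this simple scheme, like the real systems, breaks down.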
“If you are provided a profile image in which only one eye is visible, suddenly finding the point of reference becomes problematic,” he continues. “And that’s when the performance starts dropping.” The same is true if a person is wearing sunglasses or a hat that obscures the eyes, as Tamerlan Tsarnaev was in the surveillance video. Jain’s lab is developing new methods for partial, or unconstrained, face recognition. It’s still a work in progress, though, as identification accuracy can be as low as 50% in some cases.
Video further complicates face recognition. Just finding faces in video is challenging, Jain says. Once a computer does, it has to deal with varied lighting, motion which can blur a person’s face, and other complications. Surveillance cameras also tend to record low-resolution video at choppy frame rates, which decreases the amount of image data available for processing. Facial recognition algorithms that the FBI likely employed in the hunt for the marathon bombing suspects had to contend with all of these issues.
Technical limitations may not be all that is complicating the development of reliable automated face recognition algorithms. Our own skill in recognizing faces may also be to blame, says John Gabrieli, a neuroscientist at the Massachusetts Institute of Technology. “If we understood more how people did it in an explicit, definable way, you could program it.”
We do understand some of the basics. Entire portions of the human brain specialize in distinguishing one face from another. We know that part of the process involves our minds computing ratios between different features like the eyes and nose, Gabrieli says. These vary from person to person, and while the differences are incredibly small, we excel at the task. We can also compensate for variability introduced by changes in distance, lighting, and expression. “All those things distort those metrics,” Gabrieli says. “It’s that variety, and that subtlety, of the spatial relations of the face that identify an individual that are remarkable about human face perception,” he adds. “People have found it very hard to program into machines or software.”
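Those ratios between features can be imitated in code. The sketch below is a toy, not a model of the brain or of any deployed matcher: it assumes landmark coordinates (here invented by hand) have already been found, then reduces them to a couple of distance ratios. Because ratios are scale-invariant, the same face photographed from twice as far away yields identical values—a crude stand-in for the distance compensation Gabrieli describes.

```python
import numpy as np

def face_ratios(landmarks):
    """Toy face signature: distances between landmarks, normalized by
    the inter-eye distance.  Ratios don't change when the whole face
    is scaled, mimicking how recognition survives changes in distance."""
    eye_l = np.asarray(landmarks["eye_l"], dtype=float)
    eye_r = np.asarray(landmarks["eye_r"], dtype=float)
    nose = np.asarray(landmarks["nose"], dtype=float)
    mouth = np.asarray(landmarks["mouth"], dtype=float)
    eye_dist = np.linalg.norm(eye_r - eye_l)
    eye_mid = (eye_l + eye_r) / 2.0
    return {
        "nose_drop/eye_dist": np.linalg.norm(nose - eye_mid) / eye_dist,
        "mouth_drop/eye_dist": np.linalg.norm(mouth - eye_mid) / eye_dist,
    }

# Hypothetical landmark positions in pixels; a real system would
# extract these from an image automatically.
near = {"eye_l": (30, 40), "eye_r": (70, 40), "nose": (50, 70), "mouth": (50, 90)}
far = {k: (2 * x, 2 * y) for k, (x, y) in near.items()}  # same face, twice as far
```

Lighting and expression changes, however, move the landmarks themselves, which is why real systems need far more than two ratios.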
All of these problems seem possible for computers to solve. Higher resolution cameras could allow for more accurate measurements between features, for example, and cameras that sense outside the visible spectrum could correct for bad lighting.
Rather, the insurmountable hurdle in the quest for highly accurate automated face recognition, at least for now, is how our minds resolve that information into a recognizable face. It’s nothing like how computers do it. Computer algorithms scan an image of a face with a series of windows, or filters, of different sizes. These filters are looking for subtle changes in texture—the brightness of one pixel compared with its neighbors in the window—over the entire face. Each filter creates its own data layer, and final analysis is based on the unique texture values for each point in each layer. If enough values agree with data from a reference image, then it’s declared a match.
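One concrete example of the texture comparison described above is the local binary pattern, a widely used texture descriptor. The sketch below is a minimal illustration of the general idea—comparing each pixel’s brightness with its neighbors in a small window—not a reconstruction of whatever algorithms the FBI used. Each interior pixel gets an 8-bit code recording which of its eight neighbors are brighter; histograms of such codes form the kind of per-region texture values a matcher can compare between two face images.

```python
import numpy as np

def lbp_codes(img):
    """Minimal local-binary-pattern sketch: for each interior pixel,
    compare it with its 8 neighbors and pack the brighter/darker
    results into one byte.  The resulting code map captures local
    texture, the raw material of the filter-based matching described
    in the text."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    center = img[1:h - 1, 1:w - 1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Neighbor offsets, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(np.uint8) << np.uint8(bit)
    return codes

# On a perfectly flat patch every neighbor ties with the center,
# so every code has all 8 bits set.
flat = lbp_codes(np.ones((5, 5)))
```

The fragility the article describes falls out of this picture naturally: low-resolution, blurry surveillance frames simply do not contain the fine pixel-to-pixel texture these filters depend on.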
Humans, on the other hand, understand the face holistically, Gabrieli says. We don’t break it down into its component parts, analyzing a nose’s size, shape, and color, for example. In fact, we’re terrible when we try to do it that way. “If you show a nose on a face, people do way better at recognizing that nose when it comes back with a face as opposed to the nose by itself,” Gabrieli says. An analogy, he adds, is proofreading. When we’re reading over a passage we just wrote, we’ll often miss typos because we don’t read letter by letter—we understand the word as a whole, every letter simultaneously. “Faces are like that with a vengeance.”
There is evidence that, like with computers, the human mind requires some level of training to recognize faces. Babies are born with an innate ability to pick out a friendly visage, but as we grow up, we hone our skills on the people around us. Scientists have found evidence of this in studies where people are asked to identify others of a different race. For people who have had little to no exposure to that race, differentiating between individuals is surprisingly difficult.
This suggests that computers could get better at face recognition, too, if we train them enough. Other problems that have stymied automated facial recognition, such as low-resolution footage like that of the Tsarnaev brothers at the Boston Marathon, can be solved by technological advances. But even more training and better technology may not be enough to bring computers up to our level. Until we understand exactly how humans perceive faces, there may never be an algorithm that’s as accurate as we can be.
“The fact that we have this ingrained, evolved, not-well-understood-at-all coding for faces,” Gabrieli says, “and because we haven’t been able to push too far beyond saying, ‘It has this holistic quality,’ psychologists can’t give much helpful information to computer scientists.”