Technology Spotlight

Machine Voices

Synthesize This.
Artificial voices may sound mechanical, but they’re getting better all the time, and people seem to react to them as if they’re listening to real people. Cliff Nass explains how old brains respond, exquisitely, to new technologies, as learned in working on new BMWs.

When BMW introduced its in-car navigation system in Germany, the system was a model of technological excellence, using a computer-generated voice to give highly accurate information about the car’s location and how to get to almost all city and street addresses. Unfortunately, a large number of drivers had a strong negative reaction to this technological marvel and demanded a product recall.  The problem? The navigation system had a female voice. German drivers felt uncomfortable with, and untrusting of, a “female” giving directions! BMW acquiesced and switched to a “male” synthetic voice.

This is a striking illustration of how the human brain is voice activated: Although all of the drivers knew that this was simply a computer-based voice in the car, they automatically reacted with gender stereotypes and gender expectations. Gender assignment to machine-based voices is part of a fundamental human response: People are built to behave toward and draw conclusions about voice-based technology using the same rules and expectations that they normally apply to human beings. As a result of these automatic and unconscious social responses, the psychology of interfaces that “talk” and “listen” is identical to the psychology of responding to other people. Voice interfaces are intrinsically social interfaces. In this case, the rapid categorization of the car’s voice as “female” and the invocation of a set of beliefs and expectations of appropriate behavior for “females” could not be overcome by the knowledge that “it’s only a machine.”

The role of voices in the car has expanded from talking to both speaking and hearing, and from only giving directions to providing warnings (“your wiper fluid is low” and “time to check your oil”), information (“what is the traffic on Highway 280?” or “you are driving 15 miles over the speed limit”), safety (“there is a pedestrian crossing in the middle of the road,” “there is black ice on the road”) and even controlling the car (“lower the windows” or “play the third song on my Beatles CD”).

Designers now must worry about more than having the car talk and listen. They must take into account an array of social rules and expectations in order to ensure that drivers have a safe and pleasant driving experience that is consistent with the branding of the car and the preferences of the driver.

Designer Passions

When my colleagues and I were asked to “fix” the voice interface in the BMW Five Series, we started by reviewing the social-science literature on every aspect of how one person speaks with and listens to another. That was extremely helpful and guided many of our decisions, but there were several reasons to create and interpret a number of new experiments.

First, design prescriptions can compete. For example, there are suggestions that “birds of a feather flock together” (people will like voice personalities similar to their own) … and evidence that “opposites attract” (people will like voice interfaces that complement their vocal style). Similarly, an established theory may not be relevant to every situation. For example, research suggests that friendliness is desirable. In some persuasive situations, however, seeming aloof might be more effective. Although the existence of contradictory or vague principles has charm and opens up new possibilities for research, designers want direct answers.

Second, new technologies often raise new questions. For example, no human exhibits the bizarre speech patterns associated with synthetic speech — inexplicable pauses, misplaced accents and word emphases, discontinuities between phonemes and syllables, and inconsistent prosody. No human would cheerfully announce “your radiator is overheating!”— only a machine would mix happy and upsetting in this way. Current and future voice interfaces are not likely to acquire the same capabilities as humans. So, systematic experimentation can point to clever “tricks” to disguise the system’s ignorance of or inability to obey social rules.

In sum, to ensure that we could successfully design voice interfaces not just for cars but also for voice-based e-commerce, stock checking and purchase, technical support, virtual secretaries, appliances and toys, we combined the psychology of human-human interaction with our own research on how people think and feel about voice-based interfaces. Here are some examples.

Fender Gender Bender

The fact that BMW drivers were disturbed by the “gender” of the voice in the car might seem odd. Surely they know that it’s not a tiny person inside the dashboard! But our experiments have demonstrated that this reaction is just the tip of the iceberg.

In one study, we created a system in which a voice-based computer tutored people about two subjects: love and relationships, and engineering. A different voice-based computer (with a different voice) then praised the performance of the tutoring computer. The tutoring and the evaluating computers used human voices that were either male or female. Consistent with the stereotype that praise from males is taken more seriously than praise from females, experimental participants rated the tutor computer as significantly more competent and more friendly (regardless of tutor voice) when it was praised by a “male” computer rather than a “female” computer, even though the computers always said the same thing. Furthermore, the female-voiced tutor computer was perceived as more informative about love and relationships and less informative about engineering, again even though the computers said exactly the same things!

We were stunned and disturbed by these results, but thought a simple remedy might work: Make the voices sound so clearly non-human that people would be reminded that the idea of “gender” should not be applied to inanimate machines. We therefore did a study in which we had either a female-voiced or male-voiced computer with a clearly synthetic voice make suggestions about various “choice dilemmas,” such as, “Should a person try to become a pianist, their risky first choice, or choose the safer choice of medical school?”

The results were disheartening. Not only did people succumb to gender stereotypes by being more influenced by the (clearly not human) “male” voice than the “female” voice, they also exhibited social identification. Females agreed more with the (unambiguously machine-like) “female voice,” while males agreed more with the “male voice.” Clearly, companies using a voice interface must take critical societal consequences into account before their designs blithely adhere to gender or other stereotypes.

That Voice is Not My Type!

When you think about a product that has a strong brand, as most cars do, it’s very easy to come up with a long list of adjectives to describe the item; that list can be called the brand “personality.” Psychologists are much more careful about how they use the term “personality,” tending to focus on a small, well-defined set of traits such as the four dimensions of the Myers-Briggs Type Indicator (information perception, judgment, energy consciousness, and life-management orientation) or the two dimensions of the Wiggins Interpersonal Circumplex (affiliation and extraversion).

Because the term “personality” comes from the Latin personare, to sound through, referring to the mouth opening in a mask worn by an actor, it should not be surprising that one of the most powerful ways to manifest a personality is through such voice characteristics (or “para-linguistic cues”) as volume, pitch and word speed. For example, extroverts speak louder, faster, and with a higher pitch than do introverts.

In another study, we had a book-selling Web site describe books with clearly computer-generated speech.  Using the rules that determine how people will judge introversion and extroversion in human speech, we created two different “voices” via para-linguistic cues (the language was identical). Not only could people recognize which voice was introverted and which was extroverted, but the voices strongly affected people’s reactions to the books! Not only did extroverts like the extroverted voice better, but they also were more likely to buy the book, like the descriptions more, and find the reviews more credible with an extroverted voice. Meanwhile, introverts showed stronger preferences along all of these dimensions for the books described by the introverted voice. Even more remarkably, people thought that the main character in the book was more introverted when the book was read by an introverted voice as compared with an extroverted voice.

Because both extroverts and introverts buy BMWs, the company was less concerned with matching the customer than it was with making sure that the personality of the voice was consistent with the role of the voice in the car. After deciding that the voice in the car should not be the car itself (as in KITT of the TV series “Knight Rider”), we began to consider who the voice was and what personality matched that role. Should it be a “golf buddy” (match the user’s personality), a chauffeur (obsequious, terse), a pilot (very dominant and not very friendly), a person riding “shotgun” (talkative, not very smart), or a demanding employer (hypercritical, grating voice)?

A detailed analysis of the brand positioning of the car suggested that the perfect voice would be a stereotypical co-pilot, who could take over when the driver was in trouble but who understood that the driver (pilot) was in charge: Male, not at all dominant, somewhat friendly and highly competent.  This suggested a voice that was relatively deep, medium in volume, with moderate pitch range and very little volume range, and slightly faster-than-average word speed.
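For readers who build voice interfaces, the co-pilot description above can be sketched as speech-synthesis markup. The snippet below is only an illustration: it maps the verbal description onto the prosody attributes of SSML (the W3C Speech Synthesis Markup Language). The specific attribute values, the 110% rate figure, and the sample sentence are assumptions made for this sketch, not the settings used in the car. SSML has no direct attribute for “very little volume range,” so that trait is left out, and the version and namespace attributes that a conforming speak element carries are omitted for brevity.

```python
from xml.sax.saxutils import escape, quoteattr

# Illustrative mapping of the article's verbal description onto SSML
# prosody attributes. All values are assumptions, not BMW's real settings.
CO_PILOT_PROSODY = {
    "pitch": "low",      # "relatively deep"
    "range": "medium",   # "moderate pitch range"
    "volume": "medium",  # "medium in volume"
    "rate": "110%",      # "slightly faster-than-average word speed" (assumed figure)
}


def to_ssml(text, prosody):
    """Wrap text in a simplified SSML <prosody> element with the given attributes."""
    attrs = " ".join(
        f"{name}={quoteattr(value)}" for name, value in prosody.items()
    )
    return f"<speak><prosody {attrs}>{escape(text)}</prosody></speak>"


# Hypothetical utterance: a terse statement, not a command, and without "I".
print(to_ssml("Right turn in 200 meters.", CO_PILOT_PROSODY))
```

Keeping the profile in one place makes it easier to hold the voice steady across every utterance, which matters because a voice that swings between personality types is the failure the matching is meant to avoid.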

We also carefully avoided the use of “I.” Although our experiments demonstrated that in general it is good for recorded voices (as in the BMW) to use “I” and for clearly synthetic voices to avoid the use of “I,” co-pilots try to place themselves in a subordinate role and thus avoid the use of “I.” Furthermore, the language was relatively terse, phrased as statements rather than commands (pilot) or questions (chauffeur). This careful matching of voice and role makes users feel safer, more confident and happier than when the voice is not carefully matched or it swings between personality types.

What We Have Here is a Failure to Communicate

As in the majority of the latest voice interfaces, the BMW recognized speech as well as spoke. Drivers could issue commands such as “Radio, give me station 88.5” or “Set the temperature to 20 degrees (Celsius).” Despite tremendous advances in technology’s ability to understand human speech, the combination of the noisy environment of the car and the difficulties in translating speech’s complex wave forms into usable sounds meant that many users’ utterances would not be understood and would have to be repeated. How should the car tell the user it failed to understand?

We studied this question by creating a telephone voice-interface in which people could ask questions about products. In the research, at various points in the interaction, we intentionally had the interface fail to understand the user. People worked with one of three different systems: self-blame systems (“This computer did not understand. Please repeat.” or “This computer failed to understand that.”), user-blame systems (“You must speak more clearly; please repeat.” or “You are not speaking loudly enough.”), or no-blame systems (“That was not understood; please repeat.” or “There was noise on the phone line; please repeat.”).  Consistent with the psychology of human-human interactions, the self-blame voice was liked for being modest but thought to be dumb; the user-blame system was smart but hated. The perfect choice turned out to be the no-blame system, the option that BMW selected.
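The three conditions in that study can be sketched as a small lookup table. The prompt wording is quoted from the description above; the strategy names, the function, and the idea of cycling through prompts on repeated failures are hypothetical choices made for this sketch, not a description of the actual system.

```python
# Prompts quoted from the study description; the keys are labels chosen here.
RECOVERY_PROMPTS = {
    "self_blame": [
        "This computer did not understand. Please repeat.",
        "This computer failed to understand that.",
    ],
    "user_blame": [
        "You must speak more clearly; please repeat.",
        "You are not speaking loudly enough.",
    ],
    "no_blame": [
        "That was not understood; please repeat.",
        "There was noise on the phone line; please repeat.",
    ],
}


def recovery_prompt(strategy="no_blame", attempt=0):
    """Return the prompt for a recognition failure.

    The default is "no_blame", the strategy the study favored. Cycling
    through the prompts by attempt number is an assumption of this sketch.
    """
    prompts = RECOVERY_PROMPTS[strategy]
    return prompts[attempt % len(prompts)]


print(recovery_prompt())                 # first no-blame prompt
print(recovery_prompt("user_blame", 1))  # second user-blame prompt
```

Treating the blame strategy as a single setting keeps every error message in the same social stance, so the interface does not apologize in one breath and scold the user in the next.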

Psychology Trumps Technology

These examples are drawn from more than 100 experiments and research results we took into account when we worked with BMW to design the voice of its cars, including how to mix recorded and synthetic speech, the type of accent, how and when to manifest emotion, and even when to tell a joke! All of these discoveries and the accompanying decisions raised the odds that drivers would have a safe and enjoyable experience. Even though cars are simply thousands of pounds of steel and plastic, once they or any other machine include a voice, design becomes less about physics and more about people.

Cliff Nass is a Professor at Stanford University. His primary appointment is in the Department of Communication. He also has appointments by courtesy in Computer Science, Science, Technology and Society, Sociology and Symbolic Systems (cognitive science). Dr. Nass is a director of the Social Responses to Communication Technologies (SRCT) Project at Stanford University. Recent research in this area includes how people apply social rules and heuristics to ubiquitous technologies in support of office, meeting, and classroom interaction; alignment in natural language understanding; mixed agent and avatar systems for learning; voice interfaces and driving (particularly drowsy driving); and human-robot interaction. Dr. Nass is also one of two Directors of the Kozmetsky Global Collaboratory (KGC) at Stanford University and its Real-time Venture Design Laboratory (ReVeL). Recent research in this area includes: links between personal identity and venture sustainability; information technology and development; and compression in small groups. Dr. Nass is the author of: The Media Equation: How People Treat Computers, Televisions, and New Media Like Real People and Places (New York: Cambridge University Press) - and an upcoming book: Voice Activated: The Psychology and Design of Interfaces that Talk and Listen.

EDITOR’S NOTE: There were two sources of funds for Dr. Nass’ research:  The National Science Foundation and an industrial affiliates program at the Center for the Study of Language and Information at Stanford University. The affiliates program accepts gifts from approximately 40 companies and allocates those funds to research projects without any guidance or control by the contributing companies. At the time of this research, no car companies had contributed any funds to the industrial affiliates program. As part of a consulting relationship with a third-party company, Dr. Nass provided design guidance to BMW independent of his research. All of the studies were performed in Dr. Nass’ laboratories at Stanford University in California.

Sponsored by:

National Endowment for the Humanities, William and Flora Hewlett Foundation, Ford Foundation, Rosalind P., Arthur Vining Davis Foundations, Carnegie Corporation of New York