Alan 2.0 -- Nick Campbell
Several technologies of the future are combined to create a "digital double" of Alan Alda. Nick Campbell of Interpreting Telecommunications Research Laboratories in Kyoto, Japan, took on the challenging assignment of creating speech for "Digital Alan" using a speech synthesizing system called CHATR. Campbell answered viewers' questions about this new technology and how it may be used in the future.
How much longer do you think you will have to work on this program to make the synthesized speech sound like a normal person?
Sometimes it already sounds remarkably like the person whose voice we are using, but not always. We expect to get the engineering problems worked out quite soon -- probably within the next year -- and we know that using bigger speech databases helps a lot (speech is very easy to record and we are working hard on labeling it automatically to make the index). But what is more interesting is the number of ways that a "normal person" can speak -- CHATR can laugh (if there are laughs in the database) but we don't know how to make it sound like an angry person without having an "angry" database. We already have three different emotion databases (angry, sad, and happy) for one speaker, but if you listen carefully to what people can do with their voices and the way they speak, you'll begin to realise that the human being is a VERY sophisticated organism, and that what we are doing with the computer is really only at the very beginning of being able to model all the ways that people can convey information to other people by different tones of voice.
What do you think will be the commercial applications for your program CHATR in the future?
ATR is a basic research lab for key technologies and doesn't actually make any products, but we have had a lot of suggestions from our sponsoring organisations (and from others around the world) about how CHATR could be used. We are testing it for a speech translation system and for in-car information systems, where it is important to be able to distinguish different voices easily and where the quality of the voice can be very important, but we can imagine many applications like a phone-based web-browser or an interactive television program where the speech should change according to how much detail a person wants.
We have to be very careful though, because this is a technology that could also be abused. Many people earn their livings from their voices and while we might be happy to make a lot of "talking web pages," for example, we have to find a way to do that without alienating the people whose voices we might use. At the moment, the law seems to be undecided about the use of very small samples of a person's voice -- it seems that they probably don't have ownership of the "sounds" of their speech -- so we also have to be careful about how other people can re-use what has been said somewhere else.
Until now, speech synthesis hasn't been used very much by normal people in everyday life, and we think one of the reasons for this is that many people just don't like to listen to synthesised speech that sounds like a robot, especially when they're driving a car or using a phone, when they just need to get simple information quickly. We know that there are also many people who aren't able to use their own voices, perhaps because they have suffered in an accident, and hope that we will be able to offer them a new voice, with a personality that they can choose so that it suits the way they want to sound. And we can think of many other interesting situations where people who don't have any difficulty speaking might like to make use of another voice sometimes.
I was pretty amazed at how you were able to change the intonation in that one phrase of Alan's by having it "copy" what you said. Can you give us an explanation of how the program does that?
In text-to-speech synthesis, the computer has to calculate an intonation contour (changing the pitch and timing information and loudness, etc.) for each sentence it has to speak. However, at ATR we are researching speech processing with the goal of interpreting between two human speakers, and that environment gives us much more information about how to say something than is usually available to a simple text-to-speech synthesiser. For example, we can take the input speech (which might be in Japanese) and find out which parts have been said with emphasis, or with a rising voice (as in a question: okay?) and use that information to give extra meaning to the speech that we synthesise (which might be in English or some other language).
For Alan's voice it was relatively easy because both the input and the output were in the same language. We used the analysis of my input voice directly as parameters for the synthesis. For each sound in the speech sequence (the word "cat" for example is said with the sounds /k/ /a/ /t/) we measured the pitch, the duration, and the loudness, and then converted them to be in the same range as Alan's voice (which is a bit lower than mine) and used those values instead of the ones that the computer had calculated for the "text" of each sentence. In some cases there was no difference, but it can be VERY hard for a computer to guess how any given text should be said, because it doesn't usually have the same background knowledge that a human speaker does. For example, even a simple sentence like "My name is Nick" can be said in many different ways, depending on whether it is answering the question "What's your name?," "Is your name Rick?," "Whose name is Nick?," and so on.
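The range conversion described above can be illustrated with a minimal sketch. The answer does not say how the values were converted, so the method here (shifting and scaling by each speaker's mean and spread) is an assumption, and the function name and the numbers are hypothetical.

```python
from statistics import mean, stdev

def map_to_target_range(source_values, target_mean, target_std):
    """Rescale one speaker's per-sound measurements (pitch, duration or
    loudness) into another speaker's range.

    Assumption: a simple mean/spread mapping; the real system may differ.
    """
    if len(source_values) < 2:
        # Too few points to estimate a spread: fall back to the target mean.
        return [target_mean for _ in source_values]
    src_mean = mean(source_values)
    src_std = stdev(source_values)
    if src_std == 0:
        # A perfectly flat contour stays flat at the target's mean.
        return [target_mean for _ in source_values]
    return [target_mean + (v - src_mean) / src_std * target_std
            for v in source_values]

# Hypothetical pitch values (Hz) measured on three voiced sounds of the
# input speaker, mapped onto a lower-pitched target voice.
input_pitch = [120.0, 140.0, 160.0]
print(map_to_target_range(input_pitch, target_mean=100.0, target_std=10.0))
```

The shape of the contour (which sounds are higher or lower than their neighbours) is kept, while the absolute values move into the target speaker's range.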
To make the computer say a sentence differently, we have to search in the database for just the right sequence of sounds that also have just the right tone of voice. The perfect match for every sentence probably won't be found in a normal-sized database, but we have ways of automatically selecting the nearest match, so that the resulting speech sounds close to how we think it should be said. That's probably the most sophisticated part of our synthesis system.
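The idea of selecting the nearest match from a database can be sketched as follows. This is not CHATR's code: the `Unit` record, the two cost functions and their weights are all hypothetical, chosen only to show how a sequence of recorded sounds might be picked so that each one is close to the wanted tone of voice and also joins smoothly with its neighbours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phone: str       # sound label, e.g. "k"
    pitch: float     # Hz
    duration: float  # seconds

def target_cost(unit, want):
    # How far a recorded unit is from the wanted pitch and duration.
    return abs(unit.pitch - want.pitch) / 100.0 + abs(unit.duration - want.duration) * 10.0

def join_cost(prev, nxt):
    # Penalise pitch jumps where two units are joined.
    return abs(prev.pitch - nxt.pitch) / 100.0

def select_units(targets, database):
    """Pick one recorded unit per target sound with the lowest total cost
    (dynamic programming over all candidate sequences)."""
    candidates = [[u for u in database if u.phone == t.phone] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for one of the sounds")
    costs = [target_cost(u, targets[0]) for u in candidates[0]]
    back = []
    for i in range(1, len(targets)):
        prev = candidates[i - 1]
        new_costs, pointers = [], []
        for u in candidates[i]:
            k = min(range(len(prev)), key=lambda j: costs[j] + join_cost(prev[j], u))
            new_costs.append(costs[k] + join_cost(prev[k], u) + target_cost(u, targets[i]))
            pointers.append(k)
        costs = new_costs
        back.append(pointers)
    # Trace the cheapest path back from the last sound to the first.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

# Hypothetical database with two recordings of /a/ at different pitches.
database = [Unit("k", 140, 0.05), Unit("a", 100, 0.10),
            Unit("a", 150, 0.10), Unit("t", 140, 0.06)]
targets = [Unit("k", 140, 0.05), Unit("a", 150, 0.10), Unit("t", 140, 0.06)]
chosen = select_units(targets, database)
print([(u.phone, u.pitch) for u in chosen])
```

With a larger database there are more candidates per sound, so a closer match is more likely, which fits the remark that a perfect match probably won't be found in a normal-sized database.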
Are there versions of CHATR or similar programs available to the public now? If not, when do you think they will be?
CHATR isn't being sold as a commercial product yet because there is still a lot of work to be done to make it easy to use for non-professionals, but we are always interested to hear from people who might be interested in working with us, or who have ideas about the way that it could be put to good use. I think we can expect to see many information systems and applications that incorporate this kind of speech re-synthesis starting in the year 2000.
As the technology gets better, what is going to prevent people from using it to fabricate scenes regarding non-fiction issues like government leaders? In the movie "Contact" scenes of Pres. Clinton were edited into scenes to make it appear that he was discussing alien transmissions. What if someone made a computer character of the Pres. declaring war or bringing dead people back to life?
This is a very real danger. At the moment the technology is still in its early stages, but given enough speech data (and there are surely plenty of recordings available of this president's and many other people's voices) then it might be possible even now to fool somebody for a short time into believing that a virtual person is speaking. This is a perennial problem of science: should we develop a technology that has potential bad uses? I think the answer has to be yes, but that while we are doing so, we must also take on the responsibility of educating people into the good uses of that technology, and of informing as many people as possible of its potential dangers (which, when you come to think of it, is exactly what we are doing now :-)
The example that we often use when talking about this is 'ordering a pizza' -- if my neighbourhood store delivers a hundred pizzas to all my friends because my voice rang up and asked them to do it, who has the responsibility when it comes to paying for everything? Legally, this is a tricky issue, but in practice they've got used to hearing me order pizzas on the phone and would probably trust my voice the first time it happened. Thankfully presidents don't usually declare war with a single telephone call!
But something very similar to what you suggest has actually happened already -- in England in the 1950's there was a television program made from H.G. Wells' book [War of the Worlds] and the initial 'news announcement' apparently frightened many people into believing that it was really happening. Maybe our generation is becoming more skeptical? We have become used to the manipulation of visual images from television commercials and from some very clever filming and editing techniques. I guess we'll soon get used to believing our ears as little as we believe our eyes.
As for bringing dead people back to life, we did that already. In Japan, Matsushita Konosuke (the founder of the National Panasonic empire) placed a recording of his voice in a time-capsule buried under Osaka Castle. We were given a copy of that tape by his successors and were able to recreate his voice in an uncannily realistic way. Actually, it was very disturbing to listen to such a well-known voice from the past talking about present-day events. We all agreed that it was not a very nice thing to do, and that database has not been used again since the first tests.
How long did you have to go to school, and what were the most important things you studied, to be able to do the great work you are doing now?
Wow, I've almost forgotten! I went through six years of elementary school and seven years of secondary school at a grammar school in Canterbury in England. I seem to remember that I didn't do an awful lot. I failed Latin because I knew more Church Latin than textbook Latin, and I failed French because I had a French girlfriend and we spoke a colloquial version that wasn't the same as was being taught in the textbooks! I enjoyed English though, because we had a master who would read us all the juicy bits out of Chaucer that had been edited from our school versions of the Canterbury Tales. And I enjoyed maths because it was so clean.
When I left school I trained to be a teacher, but I think the best thing I ever did was give up teaching and go back to school myself. I really enjoyed the opportunity to work closely with some very bright people and gradually became more and more caught up in the puzzle that is science. Now 'work' for me is more like doing a crossword -- looking at lots of clues and trying to find out what the hidden answers might be. It can be a lot of fun, but it can also be very frustrating!
What was the most important thing I studied? Perhaps it was to listen to people. Not just to what they say, but also to 'how' they say it. Language can be very rich and expressive -- and the flavouring that we add to language through the overlay of prosody and intonation is almost magical in what it can do. The same words can be said in an amazingly diverse number of ways. Fortunately, the computer is still decades away from being able to mimic all of the variety in the human voice, but on the other hand, that just means that I've still got a lifetime's worth of work ahead of me!
Nick, loved the piece about "Alan 2.0". I am a science teacher in St. Petersburg, Fla. My students really enjoyed the incredible technology that can seemingly make a double of another human. Most of the girls in my class want to know if you can make a "copy" of the young heart-throb in the Titanic film. I, however, want to know if actors, like Alan, are getting nervous about their future. Seems like it won't be long before new "digital actors" will come alive and put some of the human actors (many of whom come with a hefty price tag) out to pasture. This has happened, to a certain extent, with robotics in the car-manufacturing business. What are your thoughts?
Yes, I think I can, but I'd need a lot of speech samples from the voice that we'd want to copy. Scientific American Frontiers sent me about an hour of Alan speaking, and if you listen carefully, you can hear that there are still many parts of the synthesised speech that sound mechanical. But with all the recordings for the sound track of a film, for example, there'd probably be enough. It also depends a bit on the language -- working with Japanese is easier than English because there are only five vowel sounds in that language -- English has about fifteen. And in Japanese, there is usually at most only one consonant between any pair of vowels. In English there can be as many as seven (think of joining words like 'strengths' and 'striped', for example).
I am sure that digital voices will be used in place of human ones in many situations in the future, but they needn't put people out of work. The robots in the car factories can actually free people from the boring repetitious work, and give them more time to do work that is a bit more enjoyable. Having people in or out of work is not a matter decided by technology but has more to do with high-level economic policy, and that is usually decided by governments and large organisations.
I do think though that actors are beginning to get nervous, and I share their worries. I believe that we will soon need some changes in the present laws to protect people's voices. Currently (so I'm told) the sound of a voice is legally similar to the colour of a painting or the words in a book. We can't copyright colours or words, only the shapes or ideas that are made up by sequences of them. In the past, this wasn't much of a problem, but now sampling technology has come a long way, and the laws will probably soon begin to change.
Actually, I used to be a teacher myself, and I often wished I could find someone to talk to my class for me while I did something else. While we're discussing the dangers of voice re-creation, let's also think a little of how your voice could be used to instruct individual students, leaving you (and your teaching assistant) free to monitor how each one is doing. I'm sure they'd be able to tell the real you, but they might benefit a lot from being able to have an instructional conversation with a 'virtual you' while the real one is busy talking to someone else.
Is it "easier" to work with some voices than others -- e.g., men's voices or women's, American or English accents? Was Alan Alda's voice easy or difficult to work with?
Yes, it certainly is. I believe that softer voices, like those of young children, and breathier ones, especially those of women, are easier to join in the CHATR process (You can find examples of these under our home page at www.itl.atr.co.jp/chatr). A harsher voice can sound quite 'choppy' when small segments are joined together. This was actually quite a surprise to me, because in the past it has been the middle-aged male voice that was most used for speech synthesis. That was probably because the pitch of a child's voice is usually much higher than that of an adult (similar differences can be found between men's voices and women's) and the signal processing needed for re-creating a higher-pitched voice was and still is much more difficult to do. CHATR doesn't use any signal processing to make the voice so we win on this point.
Another point that I think we win on is the variety of accents and dialects found between different places and speakers. CHATR doesn't try to re-create a voice, it just re-uses it. The subtle differences that we as humans use to recognise a speaker or a dialect are all preserved by the use of speech waveform segments in our synthesis method. If you go to our home page, you'll see that we can apply the system to many languages -- some that we don't even know how to speak! Many people recognised Alan not just from the sound of his voice, but from the way that he spoke. Much of his (and other people's) speaking style can be preserved automatically in a system of synthesis like CHATR. You might have noticed that I speak with a British accent, but when we used my voice to give the intonation to Alan's speech, what came out sounded more like an American than an Englishman speaking (at least to me, anyway ;-)
Maybe one day you'll find out how your own voice comes across in speech synthesis -- I hope so, but I must say that I don't always enjoy what I hear the people in my lab making my voice say.
Is the "digital double" only able to speak when it has been programmed to, or could someone hold a "normal" conversation with it? If not possible now, could this be a reality one day?
The "digital double" is just a combination of speech synthesis and graphics animation, so it will only do what it is programmed to do. In our test with graphics and speech for the Alan 2.0 program we just tried out a few short sentences, but there is really no reason why it shouldn't have been possible to do a lot more.
A speech synthesiser simply says whatever is sent to it -- and that may be text typed by a human or something generated by a computer program. There are already some very clever computer programs that you can hold a conversation with -- Weizenbaum's famous program "Eliza" was one of the first -- but they are not really intelligent, and after a while you realise that they are just taking parts of what you said previously to make a new question so that you will say something else for them to make a new question out of. But isn't that just what a lot of successful conversationalists do too? I suppose the answer depends on what you mean by "normal."
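The trick of turning part of what you said into a new question can be shown in a few lines. This is a minimal sketch in the spirit of such programs, not Weizenbaum's actual script; the word table and the single pattern are hypothetical and far smaller than anything a real program would use.

```python
import re

# Hypothetical word swaps so that the reply addresses the speaker.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

def reflect(fragment):
    """Swap first- and second-person words in a fragment of the input."""
    return " ".join(REFLECTIONS.get(w, w) for w in fragment.lower().split())

def respond(statement):
    """Build a question out of the user's own words, or fall back to a prompt."""
    text = statement.strip().rstrip(".!?")
    match = re.match(r"i (feel|think|want) (.+)", text, re.IGNORECASE)
    if match:
        return f"Why do you {match.group(1).lower()} {reflect(match.group(2))}?"
    return "Can you tell me more about that?"

print(respond("I think my voice sounds strange."))
print(respond("Hello"))
```

Nothing here understands the sentence; the program only rearranges the input, which is why such conversations feel hollow after a while.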
I think it is normal to ask someone who has a lot of information to tell you about something that they know. As we are now entering the "Information Age," with so much information available on the world-wide web that we literally do not know what to do with it all, computers are soon going to become our 'information agents', finding out and telling us naturally what we want to know.
Most people now have to use a keyboard and screen in order to access information from the computer, but very few people enjoy typing and most people seem to enjoy talking more. Talking with computers is soon going to be a very 'normal' way of finding out what is going on in the world, and a way that we can use anywhere, safely -- so if you have a phone, then you'll be able to stay informed.
Is there any way to distinguish high-quality synthesized speech (created from a person's own speech sounds) from his or her "real" speech? I'm curious about the future legal implications for material like tape recorded conversations as evidence, etc.
Yes, if you look at the speech waveform, or better still, make a spectrogram of it, then you'll see that there are many places where evidence of joins in the speech can be found. I can imagine a day not too far in the future when those joins will be imperceptible, but even then it would be very easy to "watermark" the speech, in a way similar to the pictures you can see in the paper of banknotes when you hold them up to the light. Watermarking is a way of embedding a kind of signature in a digital message so that someone who knows how to read it can verify where that signal came from. For example, speech data files are usually very large, and if I set every hundredth "bit" in the speech signal to be part of a special sequence, then the effect on the sound of the speech would probably not be noticeable. However, even if the listener couldn't notice that the speech had something written into it, you would still be able to read the hidden message back from the original signal and decode it. I think it might be a good idea in the future for people who sell synthesisers that make use of human speech to put watermarks in the output as a way of stopping the kind of misuse that you are thinking of. But for the time being, I wouldn't worry about it too much -- it would need a very long recording to make a synthesis database from which the joins were really hard to find, and there are not many of those yet.
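The watermarking idea can be sketched in code. The answer speaks of setting every hundredth "bit"; as a simplifying assumption, this sketch instead writes one signature bit into the lowest bit of every hundredth sample of integer audio. The function names and the signature are hypothetical, and a practical watermark would need to survive copying and compression, which this one would not.

```python
def embed_watermark(samples, signature_bits, step=100):
    """Return a copy of the samples with signature bits written into the
    least significant bit of every `step`-th sample."""
    marked = list(samples)
    for i, pos in enumerate(range(0, len(marked), step)):
        bit = signature_bits[i % len(signature_bits)]
        # Clear the lowest bit, then set it to the signature bit.
        marked[pos] = (marked[pos] & ~1) | bit
    return marked

def read_watermark(samples, n_bits, step=100):
    """Read the hidden bits back from every `step`-th sample."""
    return [samples[pos] & 1 for pos in range(0, len(samples), step)][:n_bits]

# Hypothetical signal: 800 integer samples, enough for an 8-bit signature.
signal = list(range(-400, 400))
signature = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_watermark(signal, signature)
print(read_watermark(marked, len(signature)) == signature)
# No sample changes by more than one quantisation step.
print(max(abs(a - b) for a, b in zip(signal, marked)))
```

Because each marked sample moves by at most one step out of tens of thousands in 16-bit audio, the change is unlikely to be heard, yet anyone who knows the spacing and the signature can check for it.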
I guess the result of all this new technology is that we could "bring back" some of our favorite stars like John Wayne and create new roles and dialogue for them. Do you know of any plans to do this with some of our late and great movie actors?
Let's just say that I know some people who are thinking about it quite a lot! I think that the race is now on between the speech people and the graphics people to see who can get their part finished first. Alan 2.0 was an impressive test, but it would take a lot of time to make a whole film in the same way. However, both the graphics and the speech can be automated, and that is what we are trying to do now.
With completely automatic speech synthesis though, you can hear that the voice "doesn't know what it's talking about." That's because a computer can't yet understand the text of what is being synthesised, so it can't predict a really natural-sounding intonation. But just as they used human originals to give the motion to the figures in the "Titanic" sequences, so we can also use human speakers to give the emotion and feeling to the computer voice.
It will probably not be done before the next century, but I don't think we'll have to wait very long to see films using this technology. However, I bet it will be cartoon characters that are tried out first, because their faces will be easier to do!
Nick, what do you think you will do next, once you perfect this speech software?
I think it is still very far from perfection! If you stop for a moment to think of all the ways that we humans can use speech, then you'll realise that what we are doing when making Alan 2.0 talk is just at the very beginning of all the things that we'd like to be able to do. For example, if a person who is talking to you is smiling, you can usually hear the smile in the voice -- but we can't do that with our synthesis yet. We can use laughter and other noises in the speech to make it seem more natural, but we can't make it sound convincingly happy, or angry, or sad. We are collecting more and more speech data so that we can include these emotions, but I think it will be a long time before we can really make a voice that can do all the things a human can do.
Perhaps what interests me most at the moment is the kind of speech that you'd need for interaction with a web browser -- you wouldn't want a synthesiser to just read out everything on a web page for you -- that would be very slow and probably quite boring (this page of course is an exception ;-), but if you could talk with a computer and ask it questions about what's where on the web, then you'd probably want a very different style of speech than just reading. I'm also interested in ways to liven up the speech and put a sparkle into the voice. Computers should be fun to talk to, and even more fun to listen to!
Scientific American Frontiers
Fall 1990 to Spring 2000
Sponsored by GTE Corporation,
now a part of Verizon Communications Inc.