HAL's Legacy
 
Raymond Kurzweil Interview
Interviewers: Dr. David G. Stork & Michael O'Connell
Stork: Now tell us about your translating phone project.
Kurzweil: This translating telephone demonstration system combines three different technologies, which is a pattern we’re going to see more of: worthwhile systems actually combine multiple AI technologies. It’s combining speech recognition, which recognizes what the person is saying, then automatic language translation, which translates it from one language to another, then speech synthesis in the target language, which takes the translated text and generates a human-sounding voice. The weakest part is the translation. The speech recognition can be reasonably accurate if it is trained on that person’s voice. The synthesis sounds quite human, though not with human levels of naturalness and inflection, but it’s readily understandable. The language translation makes mistakes but is good enough for small talk or carefully structured business communications. This type of system will be seen in routine business applications, international communication, this sort of thing.
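
The three-stage chain described in this answer can be sketched in a few lines of Python. This is only an illustration, not the demonstration system itself: the functions `recognize`, `translate`, and `synthesize` and the `PHRASES` table are hypothetical stand-ins, and the "audio" is plain bytes rather than a real waveform.

```python
# Hypothetical stand-ins for the three stages. A real system would wrap an
# actual speech recognizer, machine translator, and speech synthesizer.

def recognize(audio):
    """Pretend recognizer: the 'audio' here is just UTF-8 encoded text."""
    return audio.decode("utf-8")

# Toy English-to-German phrase table (invented for illustration).
PHRASES = {"good morning": "guten Morgen", "thank you": "danke"}

def translate(text):
    """Pretend translator: look up whole phrases, pass unknown text through."""
    return PHRASES.get(text.lower().strip(), text)

def synthesize(text):
    """Pretend synthesizer: return bytes standing in for a waveform."""
    return text.encode("utf-8")

def translating_phone(audio, stages=(recognize, translate, synthesize)):
    """Feed the output of each stage into the next one."""
    result = audio
    for stage in stages:
        result = stage(result)
    return result

print(translating_phone(b"Good morning"))
```

Because each stage only sees the previous stage's output, an error in the weakest stage (translation, in this account) passes straight through to the synthesized voice.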

Stork: HAL can recognize the crew’s emotion from their speech. How would systems approach that problem?
Kurzweil: We get a lot of information from speech in addition to the words being spoken. We can infer the speaker’s emotion, intent, and aspects of their psychology. We don’t know how we do that. HAL is able to do that, perhaps even better than the crew members themselves.
The first thing we do in speech recognition is capture the sounds and digitize them. We then have to eliminate background noises, like door slams and other extraneous sounds. We can’t do that as well as humans, but we do a pretty good job of eliminating non-speech sounds. We then need to transform that information, analogous to what the brain does, in ways that emphasize the information-bearing aspects of the speech signal. We then look for specific features, like the burst of sound when a person makes a "p" sound. All that information, the transformed speech signals and the recognized features, goes into a pattern matcher that hypothesizes where specific words are. In a particular region of the speech it might say, "Here’s a word, it might be ‘speech’ or it could be ‘beach’, we’re not really sure yet." It makes all of these hypotheses of words or word fragments. If at a later stage in the processing it determines that the words leading up to that were ‘I am going to the’, it would then reason that ‘beach’ makes more sense than ‘speech’. That is how it resolves ambiguities.
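
The last step, choosing between 'speech' and 'beach' from the preceding words, can be illustrated with a small Python sketch. It is a simplified illustration, not the recognizer described in the interview: the `acoustic` scores, the `context_model` table, and the `resolve` function are hypothetical, and all the numbers are made up.

```python
# Acoustic hypotheses for one region of speech: word -> acoustic score.
# Invented numbers; here the pattern matcher slightly prefers "speech".
acoustic = {"speech": 0.55, "beach": 0.45}

# Toy context model: how plausible each word is after the last three words.
# Values are invented for illustration only.
context_model = {
    ("going", "to", "the"): {"beach": 0.30, "speech": 0.02},
}

def resolve(left_context, hypotheses, model, floor=0.01):
    """Pick the hypothesis with the best combined acoustic and context score."""
    key = tuple(left_context[-3:])
    plausibility = model.get(key, {})
    scored = {
        word: score * plausibility.get(word, floor)
        for word, score in hypotheses.items()
    }
    return max(scored, key=scored.get)

words = "I am going to the".lower().split()
print(resolve(words, acoustic, context_model))  # context favors "beach"
print(resolve([], acoustic, context_model))     # no context: acoustics decide
```

With the context the combined score favors 'beach'. Without it, the slightly higher acoustic score for 'speech' wins, which is the ambiguity the later processing stage is there to resolve.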

Stork: Why did you go into this field?
Kurzweil: I've been fascinated with pattern recognition since high school. And it does represent, in my opinion, the heart of human intelligence - most of our brain is devoted to recognizing patterns. That is in fact how human beings play chess. We don't actually think through all the millions of possibilities. We can't think through a million moves in a few seconds, whereas a machine is able to do that. So how do we play chess at all? Well, we recognize patterns. We do this by saying, 'Oh, this is like that situation grandmaster so-and-so encountered a few years ago.' We are constantly recognizing situations that we have thought about before. I became fascinated with this problem more than thirty years ago. I worked on character recognition in the ’70s, because the computers of that era were good enough for that particular problem. In the ’80s computers became more powerful, and I felt it was feasible to do early generations of speech recognition. It’s been a difficult problem. We’ve been working on it for twenty years. The systems are getting better because the computers are getting more powerful, but we still have a long way to go to match human levels of performance. The systems are already useful. People use them on the phone, and they use them to create documents. I wrote my book using one. So it is a useful technology.

Stork: Can you talk about how understanding is necessary for intelligence?
Kurzweil: There's a difference between just recognizing speech, which is turning a sequence of spoken language into a transcription of what the person has said, and actually understanding what it means. And again, natural language understanding is not a yes-or-no thing, so there are different levels of it. We do have systems today that can understand human speech within a very constrained domain, for example making a travel reservation or buying a product. So within a narrow domain, systems can have some facility with understanding. The language translation systems are able to model what people are saying to some extent. As we gather more real-world knowledge, and as we learn more about how human intelligence works (and we have a long way to go in terms of reverse engineering and understanding the human brain), we can develop systems that have more and more sophisticated models of the world and can understand language within broader and broader domains, and ultimately with the flexibility of human intelligence - but that's going to take several more decades.

O'Connell: Can you describe the text-to-speech system?
Kurzweil: The original Kurzweil Reading Machine, which we introduced in 1976, combined several technologies. It was actually the first CCD flatbed scanner, which scanned the page. We then had to find where the letters on the page were and recognize them, regardless of their type font - and it actually did that with pattern recognition. It didn't just match pixels, which is called template matching; it would look for the abstract qualities of what makes an 'a' an 'a', the fact that it has a concave region ... and so it would recognize each of the letters and then group them into words. It then had to figure out how to pronounce them. It had general pronunciation rules and then an exception dictionary for when the rules didn't work. It then had a model of the human vocal tract, where it would actually mathematically simulate how the vocal tract works and actually generate spoken language, so it would go all the way from the scanned image of print to spoken language. And it did so with sufficient accuracy that, even though it made mistakes, a blind person could actually read ordinary printed material.
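
The pronunciation step, general rules backed by an exception dictionary, can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the Reading Machine's actual rule set: the `EXCEPTIONS`, `DIGRAPHS`, and `LETTERS` tables are hypothetical and tiny, and the phoneme symbols are loosely ARPAbet-style labels chosen for readability.

```python
# Exception dictionary: words the general rules would get wrong (hypothetical).
EXCEPTIONS = {"one": "W AH N", "two": "T UW"}

# General letter-to-sound rules: two-letter patterns are tried before
# single letters. These tables are a tiny invented subset.
DIGRAPHS = {"sh": "SH", "ch": "CH", "th": "TH",
            "ee": "IY", "ea": "IY", "oo": "UW"}
LETTERS = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "h": "HH",
           "i": "IH", "n": "N", "o": "AA", "p": "P", "s": "S", "t": "T",
           "w": "W"}

def pronounce(word):
    """Return a space-separated phoneme string for one word."""
    word = word.lower()
    # The exception dictionary takes priority over the general rules.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            # Letters missing from the toy table fall back to themselves.
            phones.append(LETTERS.get(word[i], word[i].upper()))
            i += 1
    return " ".join(phones)

for w in ("beach", "speech", "one"):
    print(w, "->", pronounce(w))
```

Applied to 'one', the toy rules alone would produce "AA N EH", so the exception dictionary overrides them. A real rule set would be far larger and context-sensitive.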

Stork: You’ve spent your life making technology that helps people. Does the big picture worry you?
Kurzweil: Well, intelligent machines represent a very powerful technology, and already the level of intelligence that machines have today, although primitive, is amplifying our intelligence. And it is a power that is used by humans for all of human purposes, some of which are creative and some of which are destructive. We've certainly seen technology this century amplify both those tendencies. Technology is accelerating at an exponential rate and is going to get more and more powerful, and there are some dangers as we sort of empower individuals to do a great deal of harm. It is a major challenge for the human race. I think there is going to be tremendous benefit in terms of wealth, overcoming afflictions and disabilities, and becoming more productive and creative through technology. But there are also tremendous dangers, and dealing with those dangers in a way that preserves our privacy and freedom is probably the biggest challenge we have as a civilization.
