TechTalks Transcript

Voice Recognition Technologies--Is It Time?

March 21, 2002

Judith Boettcher [JB]
Howard Strauss [HS]
John Morris [JM]

JB: Welcome to the CREN Tech Talk series for Spring of 2002 and to this session on Voice Recognition Technologies--Is It Time? You are here because it's time to discuss the core technologies for your future campus. This is Judith Boettcher, your CREN host for today, and our session is coming to you today with the support of the CREN member institutions. If your institution is not a CREN member, please go to our website and become a member and become a supporting institution. Let me welcome first, right away, Howard Strauss of Princeton. Howard is a well-known web technology expert and portal expert and Howard now has, I would say, a very recognizable voice. Welcome, Howard.

HS: Okay, in fact, you're going to hear another very recognizable voice, I hope, sometimes during this. As Judith said, I'm the technology anchor for the Tech Talk series of technology webcasts. Today, we'll engage our guest expert, John Morris from Drexel University, in a lively technical dialogue that will answer your questions about voice recognition and will ask those very important follow-up questions. You can ask your own questions anytime during this webcast by sending e-mail to expert@cren.net. If we don't get to your questions during the webcast, we'll provide an answer in the webcast archives. Human beings have a long history of wanting to converse with things that are not human. People talk to their pets, commune with nature and verbally urge their cars and boats to go ever faster. An even fonder dream is not just to chat with things but to verbally control them, to make them do our bidding. We like to have our horses obey our commands, have our dogs fetch our slippers and have our dolphins jump through hoops. During the industrial revolution, our dream was extended to controlling machines. At first we controlled them with knobs, levers and buttons but the real vision was to talk to a machine and have it understand us. Perhaps the earliest and best known example of what we would call voice recognition today is in Arthur C. Clarke and Stanley Kubrick's 2001: A Space Odyssey. When Dave and Frank are plotting to turn off the HAL 9000 computer, it not only understands them but--hold it! Listen! "I know that you and Frank were planning to disconnect me. And I'm afraid that's something I cannot allow to happen." This illustrates a possible danger of having machines that understand us. HAL actually does quite incredible voice recognition. When it hears Frank or Dave or anyone speak, it determines who the speaker is and based upon that, decides what the speaker is authorized to do. Obviously, no one is authorized to turn HAL off.
One aspect of voice recognition uses the fact that everyone's voice generates unique sound patterns that can be used for identification, making VR or voice recognition a potential replacement for IDs and passwords. HAL also changes speech to text for real-time analysis. Converting a sentence such as "The sky is blue" from speech to text requires very little analysis, but dealing with something really messy such as, "Mr. Wright would like you to write about finding the Mr. Right rite of passage right now" requires a high level of contextual understanding to get things right. It not only uses four homonyms of the sound "right" but also uses the phrase Mr. (W)Right twice with different spellings and quite different meanings. A VR program needs to deal with accents, regional differences in how words are spoken, voices altered by colds and allergies and the normal individual differences in speech that make authentication possible but recognition difficult. Even when the program is good enough to do that well, it still must analyze the syntax of a sentence. When a voice recognition program hears the word "right," it must guess which of the possible spellings to use. As the program gets more context, it may need to change its choice and of course, if the program encounters a word that is not in its dictionary, it will almost definitely get things wrong. At every step of the VR process, there is the potential for errors. How good does VR have to be to make it useful? That will depend upon what we want to use it for. If VR is integrated into an office application such as a text processor or spreadsheet, some editable errors may be okay, but probably not an error rate of even five percent. If VR will be used to steer a car or control a nuclear reactor, we'd want a very high accuracy. In MIT's Artificial Intelligence Lab, director Rodney Brooks talks to robots that his lab is building. "Computer, turn on the lights," he says as he enters a room, but no lights go on.
He tries again to get the busy computer's attention and it admonishes him by saying, "I'm already listening." VR has certainly gotten Bill Gates' attention. It is an integral part of Microsoft Office XP and it has gotten the attention of Drexel University where a large VR experiment has already taken place. You may not be building robots, and HAL--which according to the movie was built in Urbana, Illinois, on the twelfth of January, 1992 and is now quite old technology--was not available for this Tech Talk webcast. So to see what use VR might have on your campus, we'll have to settle for talking with John Morris who actually does voice recognition much better than any of today's VR systems. We'll see how long that will remain true on today's webcast of Tech Talk. Judith?

JB: Oh, thank you, Howard, and it was a delight hearing HAL's voice again, one of the more famous sentences in all of movie history, I think.

HS: I have lots of them!

JB: Do you? That's great! I love that one!

HS: We'll just skip the rest of this and play the rest of the movie.

JB: Yeah, we can--that's great. Well, you know what, I often find myself talking to my computer, saying things like, "Why are you doing that?" But you know, we won't--not all [inaudible] is yes. So let me introduce our expert. John Morris is the Coordinator of Academic Technology for the Department of Information Resources and Technology at Drexel. After 20-plus years as a professor of architecture and architectural engineering, he is now a champion--I love that term!--a champion for institutionalizing web-based education and technology use. His focus is on the appropriate use of technology in education and the development of best practices in web-based learning. I suspect that one of the issues we'll be talking about today is just how--what kind of role voice recognition has in best practices in teaching and learning. Welcome to the CREN Tech Talks, John.

JM: Thank you very much. I hear it's going to be a hard act to follow, I'll tell you!

JB: All right, well, we have lots of questions!

HS: I have lots of stuff from HAL so if you think you're going to get a little bit light. I don't think they're going to--

JM: Actually, if you look at some of our new Tech TV stuff, you'll find some of HAL in there also.

HS: Okay! John, to get started, I mentioned really two possible uses of voice recognition. I talked about voice recognition to authenticate somebody to determine who we are and the speech to text thing. Can you tell us a little bit about both of those and tell us how are they being used today? Are both those things being done today?

JM: Okay, I think it's maybe best to look at the differences or the similarities of these two things because they're all based on the same basic core technology. And the analogy is, picture a field of haystacks and the haystacks represent the way we might say a particular word. One haystack is mine and the way I say a word, another haystack is yours and the way you say a word. The way authentication works is that we note that all haystacks are alike. It's all the same word, so there's a certain set of similarities, but within those haystacks, when we're looking at authentication we're looking at the differences between those. We're looking at what fine differences one haystack has vs. another. On the other side of that, though, from the voice-to-text side, we're looking for the similarities between all those haystacks so that my dialect or my country of origin or my inflection and all is sort of minimized across all of those different views of the same thing. So the authentication is basically looking for the differences within a set of tonal qualities where the voice-to-text is looking for the similarities.
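
The haystack analogy can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two-dimensional "acoustic features," the speaker names, and the nearest-centroid matching are all hypothetical stand-ins chosen for clarity, not a description of how Naturally Speaking, ViaVoice, or any real product works. Each (word, speaker) pair is one haystack. Speech-to-text pools the haystacks for a word and looks for what they share; authentication stays inside one word and looks for what separates the speakers.

```python
from math import dist
from statistics import fmean

# Hypothetical 2-D feature vectors; every (word, speaker) pair is one "haystack".
samples = {
    ("roof", "alice"): [(1.0, 1.0), (1.2, 0.8)],
    ("roof", "bob"):   [(2.0, 1.0), (2.2, 1.2)],
    ("home", "alice"): [(8.0, 8.0), (8.2, 7.8)],
    ("home", "bob"):   [(9.0, 8.0), (9.2, 8.2)],
}

def centroid(points):
    """Average a list of feature vectors coordinate by coordinate."""
    return tuple(fmean(axis) for axis in zip(*points))

def recognize_word(features):
    """Speech-to-text side: pool every speaker's samples for each word
    (the similarities across haystacks) and pick the nearest word."""
    pooled = {}
    for (word, _speaker), points in samples.items():
        pooled.setdefault(word, []).extend(points)
    return min(pooled, key=lambda w: dist(features, centroid(pooled[w])))

def identify_speaker(features, word):
    """Authentication side: within one word, compare the individual
    haystacks (the differences) and pick the nearest speaker."""
    per_speaker = {
        speaker: centroid(points)
        for (w, speaker), points in samples.items()
        if w == word
    }
    return min(per_speaker, key=lambda s: dist(features, per_speaker[s]))

utterance = (2.1, 1.1)
word = recognize_word(utterance)
print(word, identify_speaker(utterance, word))
```

The same utterance feeds both functions; only the grouping changes, which is the point of the analogy.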

HS: John, Arthur Clarke imagined that HAL was built back in January of 1992. Maybe you didn't know the date. But when we really look at the history of voice recognition, when did things get started?

JM: Well, my first use of voice recognition was in 1983 in the computer lab, what was called the Architectural Computing Lab at University of Michigan and we were dealing with control of machines. We were trying to do basically command line interfaces to [inaudible] systems.

HS: When you say "machines," do you mean computers that are trying to control some kind of software or something like that, not trying to do gears and wheels and widgets?

JM: That's correct. We were trying to do basically computer commands and instead of having to type those things in, just go through voice recognition. And very, very limited vocabularies.

HS: [inaudible]--

JB: Yeah, what was the driver at that time? I mean, why were you trying to do that, just--

JM: I think it was varied. There was a--it was an interesting technology. I think it was a technology that embodied some of the things that AI was trying to deal with. I think that we were looking at room controls, being able to walk into a space, you know, verbalize a series of commands and have the computer--

HS: Tell it to turn the lights on and have it ignore you.

JM: Yeah. I mean, as simple as that as well as looking at system alerts or the sort of--there was a whole verbal input-output recognition interface that we were looking at.

HS: How did things go back then? Probably not too--

JM: Not too good.

HS: [inaudible].

JB: I was going to say, did you say that it was going to be in full deployment in five years?

JM: Oh, yeah, right! Well, of course, everybody was saying that about AI, too. It was an interesting technology, an extremely limited set of vocabulary and recognition levels or articulation levels were very, very low. And so it was more of something of interest and you know, we've got to wait for the technology to mature a bit before you go anywhere with it.

HS: So what's happened since? Where are we now?

JM: Well, I like to think of it as the 80/20 rule. We've spent 20% of our time getting to 80% of the solution and yet we're going to have to spend 80% of the time to get to the last 20%. I mean, it's an asymptotic kind of situation. Some of our products are getting pretty good, even with relatively large vocabularies.

HS: What's the 20% that needs to be done? What are the things that aren't working?

JM: The things generally, dialects can really throw the system off quite a bit. Again, like the haystack example, when you're looking across a field of haystacks and each one, we know, is different, but we're looking for the similarities in order to choose a particular word, that particular part, dialects can throw that off significantly. "Roof" and "ruff" or, you know, I might say "roof" here. We go down south, they might say "ruff." That can completely throw off a system. The write, the Mr. Wright and the rite can certainly throw off a system, so there's both sound recognition issues and then contextual issues at the same time. So we're getting better in context but the contextual management is probably an even more daunting problem than the dialect management.

HS: What about having to train systems? I know that when we pick up a telephone and we get one of these things that says "Say one or say a department number" or something like that. They seem to recognize any speaker. But when we buy one of these voice recognition systems, they make us sit around and get trained. What's the difference between these systems and why do we have to train?

JM: Yeah, well, it's mainly the size of the vocabulary. I mean, if we're talking about a system where it's saying you have probably a vocabulary of the numbers one through ten, it's fairly easy to distinguish one from two or from three, even across great dialects or changes in dialects. The vocabulary level just isn't--and there's no linking between those. It isn't one, three, five, seven, nine.

HS: Okay.

JM: There's no sentence there. So the amount of computing and the amount of structure and the amount of power can really be minimal. So that's the--the major difference is the size of the vocabulary.

JB: So what is--give us a sense, perhaps, of some of the voice systems that are--you know, these menu-driven--what is a normal size of vocabulary [inaudible] represent?

JM: It'd be 150 words or so.

JB: Okay, so--

JM: Yeah, it--

JB: And that's probably up from a room control system, those, what? Thirty-something?

JM: Yeah. It--you really don't need--as an example of room control system, even if you're trying to control AV materials with that, you know, play, record, stop, pause. You'll find that you really had a relatively small set of vocabulary and the smaller the set of vocabulary, the higher the probability is that any user could use that particular VR system with less and less and less training to--you know, the less the vocabulary, the smaller the training necessary.

HS: How long does it take to train one of these things?

JM: Well, most of the systems now--as an example, Naturally Speaking from ScanSoft or ViaVoice or any of what I would call consumer systems are down from three hours or more about three or four years ago to about a half an hour.

HS: And so I could--when I once tried one of these things, also it wanted me to say words with little pauses between each word. Can these systems now let you speak continuously?

JM: Yes, and I think that's one of the significant changes. Back in '83, you basically had to use the pauses between words because the amount of computing and all necessary in order to get an individual word far exceeded the capacity of natural or continuous speech. So now, with the buffering capabilities and the high speed--or at least compared to years ago--the basic laptop or desktop computer, you can do an awful lot of computing in the background and so continuous speech is possible.

HS: And a half hour training is probably enough to get one of these things to know you?

JM: Well, it's enough to get started. The other thing that has happened is that initially you'd spend three hours teaching and then the system itself did not have much capability to learn beyond that. Modern systems now, or more modern systems of today are basically allowing you to put in that half an hour now but as the system goes on and maybe it fumbles over a word or two and wants you to clarify those particular words, it learns further. Its learning doesn't stop at that half hour.

HS: It's learning while you're actually using it which is kind of nice.

JM: Yes.

JB: That sounds pretty great. You know, on some of the links on the website, you know, go out to a couple of the reviews of some of the latest and greatest voice recognition systems. John, how good are the systems that are out right now?

JM: Well, if you're looking at some of the reviews, they're talking about upwards of 98% recognition and that's [inaudible] pretty phenomenal. It's really--I'm not sure that you're going to get that for everybody or anybody who would use the system. In the systems that I've used, and I can tell you that I have not used the latest and greatest versions of either ViaVoice--which apparently has just come out with another version--or Naturally Speaking, the Dragon system which has just come out recently with another version. But we're finding in general for the people who are continuing to use these systems somewhere between 85 and 95%.
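
For a rough sense of scale, the recognition rates quoted here can be turned into corrections per page. This is back-of-the-envelope arithmetic only, and the 250-word page is an assumption for illustration, not a figure from the talk.

```python
def errors_per_page(accuracy, words_per_page=250):
    """Expected misrecognized words per page at a given word accuracy (0 to 1).

    words_per_page=250 is an assumed page length, not a measured value.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * words_per_page

for rate in (0.85, 0.95, 0.98):
    print(f"{rate:.0%} accuracy: about {errors_per_page(rate):.1f} errors per page")
```

Under that assumption, the quoted 85-to-98% range works out to somewhere between about 5 and 38 words to fix on every page.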

JB: Still, that's really quite phenomenal.

HS: That's still really quite good. Yeah, we have a question from Molly Ruggles who's at MIT and Molly says, "Voice recognition could conceivably replace the keyboard as an input device. However," she says, "one of the advantages of the keyboard was its quiet and non-intrusive nature. It could be used unobtrusively in the presence of others." She says, "If business and education evolve toward increased use of voice recognition, do you foresee the necessity to redesign work and study areas so that speech can be better muffled or absorbed?"

JM: And it's really quite a good question, but I'm not sure that the redesign of the space is necessarily the way to handle it. And I think that most of us who are around cell-phone users--

HS: Cell phone [inaudible].

JM: --with the headphone sets and all, walk down the street and have what appears to be a dialogue with themselves, is we really have to address the problem at the source which means that maybe if we ever reach that particular point, we're looking at some form of cup over the mouth, a transducer over the mouth, almost like a gas mask kind of a thing.

HS: I hope not!

JM: Well, I understand, but you know, here I'm just talking off the top of the head to--

JB: Won't have to worry about lipstick or anything then!

JM: [inaudible]--

HS: I'm not going to worry much about lipstick, Judith.

JM: But you see people walking around cities with the little white masks on, you know, to keep the bad air out so it may be that we find the same thing. And none of us, certainly, in the technology area want to see more facilities. We don't want--you know, we're going to wireless because we don't want to have to wire.

JB: Well, probably because we don't want facilities. What about actually at Drexel, you're real pioneers in a large VR deployment. What did you see, what happened as far as the question from Molly regarding the noise--one might say the noise pollution with people talking and all of that? Did that become a problem or did people solve it?

JM: Well, I don't think it became a problem. We deployed about 20,000 copies.

HS: When was this?

JM: In 2000, September of 2000. About 20,000 copies of Dragon Naturally Speaking.

HS: And who had them? Students, faculty, staff, everybody?

JM: Everybody. Everybody.

HS: So everybody in the whole place had one of these things. What did you do about headsets? Did you just attach them to the machines or just give them out to everybody?

JM: The headsets were purchasable at a greatly reduced rate, in some cases depending--some faculty got headsets, but in general, the students get the raw end of the deal. They had to purchase their headsets, but they were only a few dollars apiece.

HS: And why did you think you were doing this? I mean, what was the goal of this project?

JM: There really were a number of exigencies that sort of pushed us in this direction. One was a general--well, actually Dragon had come to us and said, "Look, we really never had a wide-scale deployment and whether or not--" The Dragon is also Drexel's mascot, I [inaudible] to say. So I hate to say that we had these because of that--

JB: That's cool!

JM: But I'm sure that there was a little PR involved in that also. And we try to be a rather technologically advanced school. And I think that from an interest standpoint, we're all--GUI interfaces and other kinds of interfaces for the computational environment is something that intrigues everybody.

HS: So you did it because the planets were all aligned correctly?

JM: I think that that's probably the biggest reason. And so we rolled out 20,000 copies on CD and basically had training sessions for people. We had like a couple-hour training so that we could train them how to train the system, train them how to edit and go through that process. This was on Naturally Speaking version 3 so we're a few versions beyond that now. And so we rolled this out and we sort of stood back to see what was going to happen. The original question was, well, what kind of impact relative to Molly's question? And we really didn't see a lot of impact. Students didn't come into the classrooms and begin to take notes verbally. Students didn't go into the library and sit there and take notes verbally. We did have actually quite a few professors who would use it to dictate outlines and to do things like that. I've used it to actually convert lectures that we've taped into text, so in other words, to create close-caption kind of things without even training the systems. So it was somewhat of a success on the faculty side. It was somewhat of a success on sort of my IT side, from the classroom technology side. It was probably more of just something of great interest on the student side.

HS: How's it being used now at Drexel? It's now a couple years later.

JM: Well, we rolled out Microsoft Office XP this last September and so it is basically, VR is integrated in that. What we've seen is pretty much the same kind of thing. People--the editing is still one of the biggest problems that--if you don't have great amounts of recognition, then the amount of time to edit and say what you were going to say far exceeds the amount of time it would have taken to type it in the first place. And if you have to stop and edit, your train of thought gets all messed up. So it really--people play around with it for a while and when they get frustrated enough, they drop it. But having said that, we just so happen to have--one of my bosses, who is the Associate VP of IT here--she had to go in for carpal tunnel on a hand. This was really about ten days ago. And she--we decided--she decided that she would use Naturally Speaking to do all of her work after she got her hand operated on. And I asked her the other day about this and I actually have a quote from her.

JB: Yes?

JM: If you would like me to read that.

HS: Sure!

JB: Sure! Tell us--

JM: Her name is Jan Biros, Dr. Jan Biros and again, she's the Associate VP of IT here at Drexel. And she says, "In anticipation of having surgery on my hand, I wanted to try Dragon Naturally Speaking so I could remain productive during my recovery." And that's in quotes.

JB: Um-hum.

JM: "It only took about a half an hour to both train the software and to learn the basics of how to use it. I began using it immediately to dictate e-mails and to compose longer memos and reports. I was able to speak normally, with a good speed and flow, which let me compose just as I would if I were typing. There were minimal edits and corrections to make to the draft. Dragon helped me continue working and not lose any time or productivity and it was fun to use." So her experience was really quite good.

HS: We have a question actually that came in that's sort of related, I think, to--what's her name again? I'm sorry.

JM: Jan.

HS: Jan?

JM: Jan Biros.

HS: And the question is from Russ Munton at Carleton University in Ottawa, Canada. And Russ says, "Do you know if anyone is experimenting with voice recognition creating real-time captioning for the hearing impaired that could be used in a video transmission?" I'd like to generalize Russ's question, if I could, just to talk about the use of this thing generally for people with disabilities.

JM: Right. We're actually actively working with our ADA group here, our disabilities group, sort of across the board and this is one of those things that's cropped up quite a bit. Usually for the visually impaired, as an example, web pages and distance learning is relatively difficult unless you have a special reader that will read the web pages back. Of course, that's now text to speech. That's the other way around. But we've experimented--an example--to do just that. I had mentioned that we had taken recordings and used Naturally Speaking to go and create basically the captions for those things. In those circumstances, you don't really mind editing too much because you're not the one saying it and it's already been recorded, it's already on tape. It sort of, it doesn't matter if you're doing it live or if you're doing it on tape. The recognition levels might be somewhat low, 70%, 75% for an untrained system. But then again, when you see some of the close-captioning that's going on on TV and other things, it can be pretty awful. So it's a great start to do, to support those ADA kinds of efforts.

JB: Going back to Russ's question, do you know of someone who is, in fact, using it in that way right now or are you at Drexel using it that way?

JM: Yeah, we are using it that way right now.

JB: Okay.

JM: Not extensively, but we are using it--when we have a call for that, as an example, if we know that there's going to be somebody with a disability in a class that has a web component to it, we'll close-caption as much as we can.

HS: John, you said that XP, Office XP had voice recognition integrated into it. How is that compared to the other systems that you can buy? If somebody has Office XP, for example, is that what they should use rather than buy one of the other products?

JM: Boy, that puts me on the spot!

HS: I just wonder how good it is. I haven't played with it. I noticed that it was there once you told me it was there and--

JM: Yeah, and that's kind of the funny thing. Microsoft doesn't particularly advertise it that awfully strongly as a component.

HS: But you're using it at Drexel so you--have you had some experience with it?

JM: I--to tell you the truth, not a whole lot. I have some minimal personal experience with it but I don't have any anecdotal experience with it. It's one of those things that I'm not involved directly with the choice of software that goes out on our CD. And I didn't really learn that it had that built in until basically around the first of the year, at which point I sort of turned it on, experimented with it for a while and said, "Okay, well, so what?" It doesn't seem particularly any better. It does have a relatively short training time. I think the training time was actually less than a half an hour. It was something on the order of about 20 minutes. One of its pros is that it does work across the whole fleet of software that Microsoft has in Office XP so it could work in Excel or it could work in Word. So there are some advantages to that. Other systems--I'm not sure about, again, the latest versions of ViaVoice and the latest versions of Naturally Speaking, but those tended to be limited to very specific applications.

HS: How powerful a computer do you need to do good voice recognition?

JM: Well, most of these systems, going back to like the version 3, if you had a 200-megahertz Pentium PC you were probably doing okay. I would imagine that the--power is going to be related to what you see and how fast, not only the actual word recognition occurs, but the actual whole sentence recognition occurs.

HS: All the syntactical stuff.

JM: Yes.

HS: Looking words up in the dictionary and all that.

JM: Yeah, and the thing--one of the things that frustrates me is that when I dictate, I tend to watch the screen and I'm looking for--unfortunately, I probably shouldn't do this--but I'm looking for whether it's actually picking up what I'm saying or not. So if I'm speaking in a continuous voice and a natural flow, it may be a whole sentence behind me and so I may tend to slow down a little bit, waiting for it to catch up.

HS: And that's on your desktop?

JM: That's on your desktop, yeah.

HS: How does this stuff ever work on PDAs, then, which have a fraction of the horsepower of a desktop machine?

JM: Well, it may be, again, a more limited vocabulary. I mean, from the desktop, we could probably get by with probably--maybe a thousand words or so for most of our interactions, if you really think about it. And it--having a vocabulary of 200,000 words, which some of these do, is wonderful. But now, I can pull open a dictionary and probably open to any page and probably find [illegible fraction] of the words on those pages, you know, are words I never use. So just having the fact that those words are there doesn't necessarily mean that they're words that you use all the time. So it might be that you could select a list or a small subset, you know, of the vocabulary that you want to use that wouldn't require nearly as much processing power. And, I mean, just the changes in the last couple years with the mobile computing capabilities has been rather significant.

HS: Are PDAs doing voice recognition of--beyond being able to say something like, you know, numbers and simple things like that?

JM: Ahh--

HS: I mean, is this an alternative to Graffiti?

JM: I believe--yeah.

HS: Well, you might as well [inaudible].

JB: Graffiti works!

JM: Well, the interesting thing about Graffiti--and again, I think, the translation is that Graffiti is a shorthand. And it's not, you know, in some cases you're drawing the letters and in some cases, you're not. So the natural next step for that is sort of the Chinese characters, which is kind of a Graffiti so you draw a character and it really is a whole sentence or a whole thought or a whole concept. Well, we're seeing changes now in terms of use and how to use these interfaces over to extensions of XML which will allow some level of voice interaction or voice interface to PDA or web pages and all, so there are some reasons to see in the future that we will interact with these devices. As an example, your typical cell phone, when you're telling it to call home, it's had to recognize "home" vs. "office," again, a small vocabulary. But it is still recognizing those differences in order to automatically dial for you.

HS: Um-hum.

JM: And that's a fairly small processor. I mean, we're not talking about much processing capabilities at all there.

JB: I'd like to remind folks that we've gotten some good questions in, and also if you have your question, to go ahead and send questions into expert@cren.net.

HS: In fact, we'll take one now.

JB: Okay, go for it, Howard.

HS: We'll take one from Bill at billnet.org which is really the department of Computer Science at the University of Georgia, and I do wonder how Bill ever got to be on Billnet. But we'll discuss that later. Bill's question--

JB: Maybe he'll tell us.

HS: Maybe he knows and will tell us. Send us a note, Bill! At any rate, he says, "My research is in the area of speech services so I am very interested in capturing different language variations or dialects in synthesized speech. Is there a good bit of dialectology work being done in speech recognition?"

JM: Well, I'd like to say that I was really an expert on that but I don't really know. When you look--again, going back to the haystack stuff, the haystack analogy looks across the field and says, "What's similar to all of these haystacks?" in order to create a base pattern that is recognizable no matter what the dialect is. So dialectology is basically looking from a base set out and saying, "What additional patterns can be overlaid or extracted or subtracted from our speech pattern that leaves us with something, leaves us with the base speech so that we could compare it to a set of vocabulary?" But I don't know the research in this particular area.

HS: Okay, we have a bunch of questions from our friend Ed Goray and I'm going to take a couple from the middle, if I can, because there's a couple that really strike me here, Ed. So if we can take these totally out of order--one of his questions is, he says, "Should we spend precious CPU and battery power on PDAs running voice recognition applications?" Again, I'd like to interpret the question and the question is, I think he's saying, is VR a frill? Really, is this just a toy or something we're playing around with?

JM: Well, I think there's some obvious places where it's very important. Again, back to the cell phone and now, certainly, the technology in the telecom area. I'm certainly waiting for my PDA-cell phone-Palm, all one unit. I mean, it only makes sense that we'll go there. But I know that if I'm on a cell phone and I'm driving and I'm trying to dial, I'm a hazard. There's no doubt about it.

HS: Oh, I think in most states, you're more than a hazard. You're doing something illegal.

JM: Well, it depends on which state you're in.

HS: In New York--

JM: Well, not in Pennsylvania, not yet, at least.

HS: Don't drive across that state line!

JM: No, I understand. I understand, and I think it should be illegal, but that's--see, the purpose of having some of this is really quite strong. But as a--are we going to see people walking down the street with this little muzzle mask on, connected up to the PDA and taking memos and writing things as they're walking down the street when they could actually also use a little voice recorder or something like that in the same way, or something else? There's not a lot of reason for that. It's kind of fun, it's kind of interesting, but would you do that every day? I don't know. I don't think so.

JB: I don't know, I would kind of think that we're creating our own little, like a little office bubble wherever we go. We do it, in fact, we see people in Starbucks and other coffee places and they have their computer and they've got their cell phone and they--you know, people are setting up little offices wherever they are and I think that--don't you think voice recognition just takes us one more step down that road?

JM: Oh, it absolutely does! Now, the question is, we have also--I look at the number of hours that I work a week and I work more now than I did five years ago. We know that we spend less time, less quality time at many things. We appreciate our vacations a lot more! But I mean, there's--we can also kill ourselves with all this stuff. I mean, yes, I love the technology. I mean, I'm really--I really love all of the toys. But I didn't have--I refused to have a computer at home until almost the mid-'90s and I've been doing this since 1970, so it's an invasion. So how much, how many more balls and chain, you know? I have my cell phone and my PDA. You know, how much more connected do we really want to be? And I sort of think that that's a social issue.

JB: I think that's a really good question. I do think that maybe it's a good segue to go into--we have another question coming in from the [inaudible] Organization of the University of Wisconsin, from Blair Bundy and he's saying that he's hoping you could discuss VR and its uses and future of captioning on the web and I think that the whole issue--we mentioned this during our prep session, John, that a whole area of voice recognition may, in fact, be very important for meeting ADA standards on the web.

JM: Right.

JB: Would you mind saying a little bit about where perhaps Drexel is on this and where it�s going?

JM: Sure. We have actually played around with this probably for the last four or five years. One of the things that certainly got us interested is there's a product called Hear Me which was--it was a voice chat, it was a voice over IP kind of application where you could basically drop in an object on a web page and it would connect up to the vendor's server and you could get into a chat session, a little private chat session with other people. And so we were playing around with that, but again, the technology was relatively young at that particular time and the quality of the interactions--in fact, limitations on the number of people that could come into the conference and all were rather limited. But that's--being able to create another interface to people provides--I mean, that's one of the advantages when you put it in an ADA arena, that it doesn't have to be typed. It doesn't necessarily have to be visual, that you can use the sensory connections that the individual has, depending on what they have available to them. We moved from there to maybe just a year or two ago where we used a product by Convers*, which we call Conversa, and it was basically an embedded voice ML type of technology. And they added an ActiveX control on top of Internet Explorer which would basically read the XML tagging. It would look at--since it was running on your local browser, one, it had access to all the tags that were there anyway and could actually read out all of your reference tags. And then also could take your voice input to move around and navigate from page to page, to scroll and all, without actually having to put any tags on the page. So that was really quite nice. Now, what's happened is with that company, as happens to a lot of companies, they put out technology free to get people interested in seeing what kind of impact it might have or what kind of audience that they have and they used that to sort of develop and direct and focus their product.
And once they reached a particular stage there, then they sort of pulled the rug a little bit and said, "Okay, now we're going to charge you for that product." Well, I mean, that's fine, but at the university scale, we can't pay too much. And then they go out and find a corporate environment in order to try to sell that product in. And that kind of a product is very, very, very intriguing for having--I mean, some of the applications and all, you could have a menu as an example on the web and you could voice your menu selections and it would take it into the form and send it out to the local pizza parlor and end up delivering pizzas to you. That would be one. Another would be, again, web-based education. If you can't use a keyboard but you needed to take a quiz, you could vocalize your answers and have those answers as part of the response. So I'm seeing or beginning to see a significant potential in voice interaction within the Internet, the web and all. So that's--
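
The idea John describes, a voice layer that uses the tags already on a page so that nothing extra has to be added, can be illustrated with a minimal Python sketch. This is not how the Conversa product worked internally; the class `LinkCollector`, the function `resolve_spoken_link`, and the sample page are all hypothetical, and the spoken phrase is assumed to have already been converted to text by some recognizer.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkCollector(HTMLParser):
    """Collects the visible text of each <a> tag and maps it to its href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: Dict[str, str] = {}
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace and case so spoken phrases can match.
            label = " ".join("".join(self._text).split()).lower()
            if label:
                self.links[label] = self._href
            self._href = None


def resolve_spoken_link(page_html: str, spoken: str) -> Optional[str]:
    """Return the href whose link text matches the spoken phrase, if any."""
    collector = LinkCollector()
    collector.feed(page_html)
    return collector.links.get(" ".join(spoken.lower().split()))


# Hypothetical page: the link text doubles as the voice vocabulary.
page = ('<p><a href="/menu.html">Pizza Menu</a> '
        '<a href="/quiz.html">Take the quiz</a></p>')
print(resolve_spoken_link(page, "pizza menu"))
```

The point of the sketch is only that the link text on an ordinary page can serve as the command vocabulary, which is why no voice-specific tagging is needed for simple navigation.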

HS: What is the status of voice XML? Is there a standard now or are a bunch of people using--

JM: For the longest time, there wasn't a standard. There is a standard at this particular point, though I think some would say that there's actually a couple of competing standards. There are some fairly significant companies that are putting out tagging systems and voicing systems in this area. Again, it's oriented towards corporate clients but we haven't seen much at the academic level.

HS: I mean, I do see that W3C is doing some stuff.

JM: Right.

HS: With voice XML and with something called CCXML, so a couple of these kinds of things are [inaudible].

JM: Right, right. But I think it's again a relatively immature technology. I think we're going to see quite a bit more of it. I think that those kinds of technologies are becoming embedded technologies. Again, the cell phone makes it very easy to embed basically a mini web server in a cell phone and interact with it, or cell phone/PDA, as these things are going now.

JB: John, if we go back to looking at the title for our session today was Voice Recognition Systems--Is It Time? And we found out they've been around for 20 years and yet they still seem to be much more of a niche application rather than really widely deployed. Are [inaudible] that's actually going to change soon or is it going to continue a long time in the niche market, do you think?

JM: Well, I haven't seen the killer app associated with this yet. I mean, we've seen some of the interactive capabilities on the web, but just thinking about the web itself, that took off so quickly. People saw that, the light went off and everybody jumped.

JB: Um-hum.

JM: This is certainly much more evolutionary than revolutionary and I don't see that killer app coming out of the closet anytime soon. My belief is it's going to continue to be evolutionary.

HS: But it looks like Microsoft has included that as part of their operating system. I mean, I'm not asking you to figure out what they're doing.

JM: No.

HS: But obviously, for them to do that, they must--I would guess they believe somehow this is going to be mainstream. What I wondered about is, I've heard that Microsoft is coming out with tablet-based machines that don't have keyboards. Is this possibly one of the reasons they're trying to include this? I mean, if you had a tablet-based device without a keyboard, perhaps you'd be more likely to want VR on that thing.

JM: Well, I think that that's--I can see the line of reasoning and it's possibly true, but think of how many PDA users go out and buy a tablet--well, I mean, go out and buy a keyboard for it.

HS: And actually use it.

JM: Yeah.

HS: A lot of people buy it and never use it.

JM: Well, I've had tablet interfaces to computers. Is that laptop really just a big PDA, like a big iPAQ Pocket PC? I mean, are you going to use it in the same way? I mean, you may have the full power of a PC but it's sort of a so-what?

HS: How do you edit with VR? I mean, I take my mouse and I select some text. And I select text because I look at the screen and say, "Hmm, I want to select text starting here and ending there." If I had to describe that verbally, how do you really? I mean, because if you're going to edit this stuff, that's the kind of thing you're going to want to do, right? You're going to want to select text and copy it and edit it and move it around and change a word. And I do those things pretty quickly, I think because I've learned how to use a mouse. But how does one do that with VR?

JM: Tediously! Editing, going back to edit a word, there's a short--there's a sort of language subset that once you're in an editing mode, it knows two words. It knows a couple different modes. It knows a recording mode where it's trying to interpret what you're saying, and then you can turn it into an edit mode. Well, that's a very different mode, so it's listening to commands that are moving you around, so a command of an Up Line, Down Line, Back-back-back-back-back.

HS: So that's the kind of thing I'd really do, [inaudible]?

JM: That's what you have to do.

HS: I'm going to Down Line, Down Another Line.

JM: Yeah.

HS: Over a couple.

JM: Right.

HS: And then somehow tell it, "Okay, that's the anchor point for the beginning of a text selection."

JM: Right. Now, I mean, later on, if you want to take the next logical step, you put an iris tracker on there and you watch--look where you're watching, you know, and then--

HS: But today. Today, it's a matter of--it's almost the equivalent of giving you the cursor control keys.

JM: That's correct.

HS: And so you move the cursor around and somehow you can anchor the thing so then you can move the cursor to somewhere else and somehow tell it, "It's everything in between that."

JM: Right. Most people I know will go back and go into--get out of it to get into editing mode and then edit along the way. You know, just replace words or just type over or insert or whatever.

HS: But that would mean we'd--you couldn't get rid of a keyboard then.

JM: That's right.

HS: If a thing were to have a device with no keyboard, you'd want editing to be fairly simple.

JM: Right. Um-hum.

HS: But you're saying today--

JM: Right, and the other option is, again, going back to that tablet. You might have, like IBM ThinkPads have that little red button on it which is a little joystick. You could potentially use a little joystick or a pointing device to go and point to the word or sentence or fragment that you want to operate on. But voicing up to it, you know, it could be very tedious.

HS: Yeah, I mean, that seems like one place where a mouse is actually easier once you learn to use it than to try to describe the thing.

JM: Well, I said, I think, from the beginning that recognition is fairly decent but the frustration usually comes in the editing.
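
The two-mode editing discussed above (a dictation mode that interprets speech as text, and an edit mode that listens for cursor-style commands and an anchor point) can be sketched as a toy state machine in Python. This is an illustration only, not any product's actual command set: the class `VoiceEditor`, the command phrases ("edit mode", "back", "set anchor", "delete selection"), and the assumption that mode-switch phrases are always treated as commands are all hypothetical, and the cursor moves by words rather than by lines to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VoiceEditor:
    """Toy two-mode editor: 'dictation' inserts words, 'command' moves a cursor."""
    words: List[str] = field(default_factory=list)
    mode: str = "dictation"
    cursor: int = 0                 # index of the gap before words[cursor]
    anchor: Optional[int] = None    # start of a selection, if one was set

    def hear(self, utterance: str) -> None:
        phrase = utterance.strip().lower()
        # Assumption: mode-switch phrases are always treated as commands.
        if phrase == "edit mode":
            self.mode = "command"
        elif phrase == "dictation mode":
            self.mode = "dictation"
        elif self.mode == "dictation":
            new_words = utterance.split()
            self.words[self.cursor:self.cursor] = new_words
            self.cursor += len(new_words)
        else:
            self._command(phrase)

    def _command(self, phrase: str) -> None:
        if phrase == "back":
            self.cursor = max(0, self.cursor - 1)
        elif phrase == "forward":
            self.cursor = min(len(self.words), self.cursor + 1)
        elif phrase == "set anchor":
            self.anchor = self.cursor
        elif phrase == "delete selection" and self.anchor is not None:
            start, end = sorted((self.anchor, self.cursor))
            del self.words[start:end]
            self.cursor, self.anchor = start, None
        # Unrecognized commands are ignored in this sketch.


editor = VoiceEditor()
for spoken in ["the quick brown red fox", "edit mode", "back", "set anchor",
               "back", "delete selection", "dictation mode"]:
    editor.hear(spoken)
print(" ".join(editor.words))  # prints "the quick brown fox"
```

In this sketch, removing one wrong word takes six spoken commands after the dictation itself, which gives a rough sense of why voice-only editing feels tedious compared with a mouse or keyboard.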

JB: I think one of the things we talked about before is the idea that perhaps if you're a person who is really very facile with the keyboard, that it might not be as good a tool as for someone who is having more difficulty with the keyboarding. Is that something that you have found?

JM: Well, I personally am a very fast typist. I can probably do 60, 70 words a minute with pretty decent accuracy, and because of that, I expect the voice recognition to be as good as I am relative to typing and again, that just increases the frustration. If I get 90% of 1,000 words, that means that 10% of that 1,000 words, or 100 words, are wrong--and that's not really including the contextual issues that might exist--so that's a fairly significant edit.
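
John's arithmetic can be worked through in a few lines of Python. The 90% accuracy, the 1,000 words, and the 60 to 70 words a minute come from his remarks; the 10 seconds per correction is purely an assumed figure for illustration, and the function names are hypothetical.

```python
def expected_corrections(total_words: int, accuracy: float) -> int:
    """Number of misrecognized words at a given word-level accuracy."""
    return round(total_words * (1.0 - accuracy))


def correction_minutes(errors: int, seconds_per_fix: float) -> float:
    """Total time spent fixing errors, in minutes."""
    return errors * seconds_per_fix / 60.0


errors = expected_corrections(1000, 0.90)
print(errors)  # 100 words to fix

# Assumed: each voice-driven correction takes about 10 seconds.
fix_time = correction_minutes(errors, seconds_per_fix=10.0)

# Typing the same 1,000 words at 65 words a minute (midpoint of 60 to 70).
typing_time = 1000 / 65

print(f"correcting: {fix_time:.1f} min, typing from scratch: {typing_time:.1f} min")
```

Under that assumed correction time, fixing the errors alone takes about as long as typing the whole passage, which is consistent with the frustration a fast typist describes.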

HS: Isn't there also another issue, and that is when I keyboard something or when most people keyboard something, they keyboard actual sentences and things. They don't have a lot of ums and ers and the end of the sentence usually sort of matches the front of the sentence.

JB: I think you're making a lot of assumptions there!

HS: But you hear people speak, I mean, people conversationally speak quite differently than they write and I know I've tried to use these little microcassette recorders and I say, "Well, I'll just say things into that and later I'll have somebody type that." Well, I've had people do it and I can't believe I said things like that! And so I mean, I really think that perhaps you get ideas out quite differently when you speak them than when you keyboard them, and without a lot of training to dictate into this thing, you're going to have a lot more editing to do than perhaps you even think.

JM: Yeah.

JB: John, we're getting a little close to the end here and we have time maybe for one or two more questions, but before we actually do those, from your experience right now, if an institution would come to you and say, "Really, John, I don't have much--we're not going to do a lot with voice but we're going to do something," what would be the first or second thing that you would recommend institutions to do?

JM: Well, I think that we had been talking about the ADA and Section 508 kinds of issues, and I think that while most of the universities aren't under a gun right now, they're under a gun to support their students. But they're not under the gun from a website or an Internet site to do a whole lot. I think that we're eventually going to see that come down the pipe and so I would certainly put that on the event horizon, especially if they're doing any distance ed or using any web-based support for classes.

JB: Um-hum.

JM: The other thing is that it is an interesting technology, that we're a completely wireless campus. We're looking at merging all kinds of technologies into our system, PDA, Messenger, pager type messengers and all, and I think that there will be potentials there. So it's always, I think, always going to be--it's been there, as I said, certainly for me since '83. And it's constantly on that horizon. It's just that when does it become really a necessity? It's become--rather than something that's rather interesting.

JB: And what about the impact of wireless and PDAs? Do you think that's going to push the VR?

JM: Well, I think it already has and I think that's why a lot of VR work right now is being done in embedded systems. You know, the voice recognition in the cell phone, maybe limited vocabulary work that could be done in a PDA. Those kinds of things, I think, are already finding their niche and I think that those will expand. Then, you know, the notebook computer, the flat screen thing--the Mira, I guess it's called, the one that Microsoft just showed, and of course, we've been talking about that kind of notebook, notepad kind of computer for quite a long time. But if that kind of thing takes off, then maybe again there is much more urgency to move towards this kind of technology.

JB: Okay, good.

HS: On another note from Bill at billnet.org.

JM: Bill.org now?

JB: Billnet.

JM: Oh, Billnet.

HS: Billnet.org and first of all, thank you, Bill, for telling me--he says Billnet.org is his personal domain name which I'm kind of envious because I don't have a personal domain name. But he--Bill raises, I think, a very interesting question about voice recognition, one that I think we haven't even considered here. I know this is late in the day here, but he says, "You were mentioning disabilities. I would like to see speech recognition used in speech pathology to diagnose and help correct speech problems. Do you see this happening in the near future?"

JM: Well, it actually is. It's just that it may be a bit primitive at this point. I mean, most speech pathologists, the ones that I know and ones that are speech therapists, for years and all, they used an oscilloscope to look at patterns and to study those things.

HS: Kind of primitive.

JM: Voice recognition is pattern recognition. And once a pattern can be discerned, that pattern can be compared to other patterns, so I actually believe that those kinds of things are already there, or already being done. Now, to what extent, I don't know since that's not my field. But it seems like a natural.

JB: Okay. Howard, do you have a final question for John?

HS: Yeah. I do, and I should--first, we should apologize to Ed Goray for not answering all the questions he sent us. We love to get your questions, Ed. There's just more than we can possibly handle, but keep them coming! John, if I decide I want to go out and I want to do some VR on campus and I actually need permission to do it, what kind of case do I make to my vice president that I want to do some more of this stuff or can I just kind of do it secretly here?

JM: Clandestine!

JB: Behind closed doors, Howard.

HS: [inaudible] secretly anymore [inaudible].

JM: Well, it depends on the case that you want to make. I mean, is it just because it's there, you know, it's sort of, because we built it, did they come? Just because it's out there, do we have wide use? And I think that the uses right now are fairly niche. I think it's hard to make a general case for widespread distribution. That doesn't mean that it isn't useful, it's just that unless you're not paying anything for it, it makes that--I just don't see the real reasoning behind it.

HS: Now, my real last question, because we just keep having a real last question as long as you want to keep [inaudible]. But what about pitfalls when you use this kind of stuff? I mean, is not giving the right headset a disaster or--what are the things you really have to watch out for, John?

JB: Wow, we haven't even talked about the headsets!

HS: To make this thing successful.

JM: Well, that's a really major one. One of the things that Dragon did is they boxed a headset from Andrea Electronics, I think, in with theirs and it actually improves the quality. I mean, it was--the headset that was there was a much higher quality headset than you would typically get with a PC. Certainly, background noise and noise environment all play a significant role in the quality of the--how much is actually understood by the system. I mean, a noisy environment, you're picking up all of that noise. It makes it much more difficult for the computer to separate out noise from the actual speech pattern.

HS: What about speaking slower or faster? If you slow down, does that help recognition?

JM: No, and in fact, it may actually hurt you because remember that these systems are asking you to learn it. They want to learn your typical, normal speech pattern so by slowing down and speeding up, you're actually diverting from the pattern that you've already set. So while certain words, it won't really matter because the patterns are fairly distinct. Other words that it might have--might be on the edge, okay, it can understand when you're speaking normally but if you slow down or speed up, it might not get those words at all.

HS: Okay, Judith?

JB: Okay, well, as we wrap up here on this very important technology, voice recognition, I'd like to just remind folks that there's some great links on the website that are linked off to reviews of NaturallySpeaking from Dragon and also the ViaVoice from IBM and encourage you to take a look at them. There's obviously some more interest right now because of the new releases of these products. And with that, I'd like to thank all of our audience participants and all the folks that have sent so many questions for us today. Thanks for doing that! Thanks also very much for being with us here generally and please join us again in two weeks, on April fourth, for a conversation with Julie Little and John Peters from the University of Tennessee at Knoxville. Our topic is going to be Collaboration Technologies for Teaching and Learning. Many thanks to the CREN member institutions for their support of CREN Tech Talks. Thanks to our Tech Talk expert for today, John Morris; to technology anchor, Howard Strauss; to Terry Calhoun, who is our Tech Talk web guru; to Jason Russell, to Bonnie Boyles and the support team at Merit; to Susie Berneis, who does all of our audio file transcribing; and finally, a thanks again to all of you for being here. You were here because it's time. Bye, John. Bye, Howard.

HS: Bye, Judith.

JB: See you all on April fourth. Check out the archives on the cren.net site.

END OF WEBCAST