June 14, 2001

How to: Understand voice recognition applications

In the film Star Trek IV – The Voyage Home, one of my favourite moments is when James Doohan, playing the part of chief engineer Scotty, tries talking to a 1986 vintage desktop PC and his colleague Dr “Bones” McCoy says, “Scotty, I think you have to use the keyboard.” Scotty replies, “A keyboard? Oh! How quaint!”

The truth of the matter is that at the time the Star Trek film was made, speech recognition technology was still in its infancy and was, on the whole, confined to highly specialised markets. For example, most medical dramas at some stage show a pathology examination with the doctor dictating notes into a microphone hanging over a body. In practice, this was one of the specialist markets for voice recognition technology.

A brief history
In 1985, Apricot Computers produced a portable PC fitted with a hinged microphone at the side of the display. Users of the computer were able to perform simple dictated tasks in DOS (the early operating system). However, the system could at best only be described as imperfect.

Kurzweil were the first company to produce usable PC-based speech recognition software, whilst IBM were also developing products along similar lines. Both of these software products required specialist hardware to be included within the PC build and, additionally, both systems required “discrete” dictation, i.e. where – every – word – needs – to – be – said – separately.

A leap forward was made in 1996, with the introduction of continuous speech recognition. Continuous speech recognition allows you to dictate in a more natural manner, and dictation speeds of up to 120 words per minute are achievable. From the computer’s point of view there is a lot more work involved in the process, and current developments in speech recognition technology are highly reliant upon increased computer processing power.

When they were originally developed, speech recognition programmes were designed to run within simple word processing applications. In many cases this meant Windows Notepad or WordPad; indeed, even with modern speech recognition software, users may find that these simple word processors still give the best results. More recently, however, the software has not only become able to cope with more complex word processing applications such as Word 2000, but also offers “command and control” of other PC applications, so that, for example, you are able to browse the web, build an Excel spreadsheet or operate peripheral equipment such as a scanner, all by voice control.

How it works
Modern speech recognition software is based on something called pattern recognition: you say something into a microphone and your speech is converted into digital data, which is then compared to the patterns stored in the programme’s memory. It naturally follows that the more memory available to the software, the higher the number of words it can recognise. It also follows that the faster the processing speed available to the software, the higher the speed of dictation available. Do not for one moment believe that your computer actually understands what you are saying, though!
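
To make the idea of comparing digitised speech against stored patterns concrete, here is a minimal sketch in Python. It is an illustration only, not how any commercial package works: the three-number "feature vectors", the word list and the function names are all invented for the example, and real software derives far richer data from the audio.

```python
import math

# Hypothetical stored patterns: each known word maps to a short list of
# numbers standing in for the digitised sound of that word.
TEMPLATES = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
    "stop": [0.4, 0.4, 0.9],
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognise(features, templates=TEMPLATES):
    """Return the stored word whose pattern is closest to the input."""
    return min(templates, key=lambda word: distance(features, templates[word]))


if __name__ == "__main__":
    # A slightly "noisy" version of the stored pattern for "yes".
    print(recognise([0.85, 0.15, 0.25]))
```

Note that the programme never understands the word; it simply picks the nearest stored pattern, which is why a larger store of patterns means more memory and more comparisons.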

The comparison process uses complicated mathematical formulae known as “algorithms”. The algorithms are based on detailed relational statistical techniques for predictive modelling, known as the Hidden Markov Model or HMM. In effect, the process makes educated guesses about the audio sound pattern of your voice to predict the words that you might be using.
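
The “educated guess” made with a Hidden Markov Model can be sketched with the standard Viterbi algorithm, which finds the most probable sequence of hidden states (here, words) for a sequence of observed sounds. The example below is a toy under stated assumptions: the two sound labels ("AY", "SIY"), the four candidate words and every probability are made up for illustration, and real products use far larger models.

```python
# Toy model: all states, sound labels and probabilities are hypothetical.
STATES = ["I", "eye", "see", "sea"]
START_P = {"I": 0.6, "eye": 0.2, "see": 0.1, "sea": 0.1}
TRANS_P = {
    "I": {"I": 0.1, "eye": 0.1, "see": 0.7, "sea": 0.1},
    "eye": {"I": 0.3, "eye": 0.3, "see": 0.2, "sea": 0.2},
    "see": {"I": 0.25, "eye": 0.25, "see": 0.25, "sea": 0.25},
    "sea": {"I": 0.25, "eye": 0.25, "see": 0.25, "sea": 0.25},
}
# "I" and "eye" sound alike, as do "see" and "sea", so their emission
# probabilities are identical; only the word-to-word odds separate them.
EMIT_P = {
    "I": {"AY": 1.0, "SIY": 0.0},
    "eye": {"AY": 1.0, "SIY": 0.0},
    "see": {"AY": 0.0, "SIY": 1.0},
    "sea": {"AY": 0.0, "SIY": 1.0},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # Each column maps state -> (best probability so far, previous state).
    columns = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        columns.append(column)
    # Start from the best final state and follow the back-pointers.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["AY", "SIY"], STATES, START_P, TRANS_P, EMIT_P))
```

In this toy the sounds alone cannot separate “I” from “eye”; the choice comes entirely from the made-up odds of which word starts a sentence and which word tends to follow it.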

The scenario is further complicated by words which sound the same but may have different meanings and spellings. Speech recognition programmes have to be able to differentiate the different contextual uses of words in sentences. For example, “annexe” and “annex” are two similar-sounding and similarly spelled words with different meanings.
Annexe is a noun meaning “a building added to, or used as an addition to another building”.
Annex is a verb meaning “to take possession of”.
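
As a rough illustration of using context to pick between the two spellings, the sketch below looks only at the word immediately before the ambiguous sound. This is a deliberately crude stand-in for the statistical methods real programmes use: the cue-word lists, the `<ANNEX>` placeholder and the function names are all assumptions made for the example.

```python
# Hypothetical cue words: a determiner suggests a noun follows,
# while "to" or a modal verb suggests a verb follows.
NOUN_CUES = {"the", "an", "a", "this", "new"}
VERB_CUES = {"to", "will", "would", "may"}


def choose_spelling(previous_word):
    """Pick "annexe" (noun) or "annex" (verb) from the preceding word."""
    word = previous_word.lower()
    if word in NOUN_CUES:
        return "annexe"
    if word in VERB_CUES:
        return "annex"
    return "annex"  # arbitrary fallback when the context gives no clue


def transcribe(tokens):
    """Replace each <ANNEX> placeholder using the word before it."""
    output = []
    for token in tokens:
        if token == "<ANNEX>":
            previous = output[-1] if output else ""
            output.append(choose_spelling(previous))
        else:
            output.append(token)
    return " ".join(output)


if __name__ == "__main__":
    print(transcribe("they built the <ANNEX>".split()))
    print(transcribe("they plan to <ANNEX> the land".split()))
```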

Users of speech recognition software are therefore best advised not to look at the screen whilst dictating, as the comparisons being made in the software take time, and the displayed text can change several sentences after you spoke the words!

Training
Your voice is different from my voice. If we were both to be using the same voice recognition software package, then that same package would need to understand both your voice and my voice. Your speech recognition programme needs to be trained to recognise your voice. The process is known as enrolment. Your PC can only successfully recognise your words when it is fully acquainted with your pronunciation and the way that you speak.

In many ways, it could be said that enrolment is a never-ending process, because the more you use the speech recognition software the more accurate it will become. Every time you use a new word, it has to be added to the vocabulary. Initial enrolment may take anything up to two hours and consists of the new user reading prescribed passages into the microphone, while the new user’s voice patterns are compared to the programme’s standard patterns. It’s hardly an exciting process for the new user, and I would advise them to keep a glass of water close by! However, once this has been finished, it can take up to another hour or so for the programme to analyse and compare the voice data patterns.

With modern voice recognition programmes, accuracy rates of up to 98% are achievable, provided the user is prepared to invest time in training the software.

Performance and hardware
Modern PC performance is not usually a problem for voice recognition software. Processor speeds and memory capacities are generally quite ample to achieve satisfactory accuracy. However, it is worth remembering that voice recognition software relies on the use of a suitable sound card. Whilst most sound cards in production today will not present any difficulties, investment in a decent quality sound card will help overall fidelity. Another thing to remember when using speech recognition software is that special sound effects such as 3-D and surround-sound should be turned off.

Most current speech recognition programmes have some 50,000 to 100,000 words in their vocabulary, and whilst most of these can be crunched with 32MB of memory, increasing the memory to 128MB or even 256MB would be a worthwhile investment. This is especially true if the voice recognition software is being used with more memory-intensive applications such as Word 2000 or MS Access.

As with your hi-fi system, the end result that you achieve will be dependent upon the quality of the equipment that you use. Central to the use of speech recognition software are microphones and headphones. Most recognition software packages nowadays include a minimum standard headset microphone; some of these have integral earphones.

High sensitivity and fidelity in microphones are far more important than quality earphones. The microphones need to be sensitive to sound and sound variations and, to some extent, have a degree of directional concentration. Even in the quietest locations there is inevitably going to be some element of background noise. The ability of the voice recognition software to separate dictated input from background noise is essential. Trying to use such software in a busy office environment where conversations are going on in the background, telephones are ringing, printers are humming and doors are slamming is virtually impossible.

When should it be used?
Most business uses of voice recognition software stem from three main situations:
1. The need to increase productive capacity in keyboard use
2. The need to prevent injury or further injury from keyboard use
3. Conversion of recorded notes from digital recorders – they used to be called ‘Dictaphones’!

In the first case, a two-finger typing ability can severely restrict a person’s ability to produce documents, reports, spreadsheets etc. After the short period of training required for the software, someone who swaps over from the keyboard to the microphone will find that his or her productivity can substantially increase. Don’t expect overnight results, however. Whilst it may only take a couple of hours for the person to enrol their voice with the software, changing an individual’s habits so that they feel happy talking into a microphone in front of a screen can be a much lengthier process.

In the second case, a person diagnosed with conditions such as repetitive strain injury (RSI) or arthritis may find that keyboard use is not only painful but can aggravate their condition. The use of voice recognition software in such circumstances can be an absolute godsend. It is also worth remembering that organisations these days have responsibilities under the Disability Discrimination Act, and if they don’t make suitable provision for people with such injuries or conditions they could find themselves facing claims for compensation.