An apparatus and method for the robust recognition of speech during a call in a noisy environment are presented. Specific background noise models are created to model the various background noises that may interfere with error-free recognition of speech. These background noise models are then used to determine which noise characteristics a particular call has. Once the background noise in a given call has been determined, speech recognition is carried out using the appropriate background noise model.
0. 27. A method comprising:
sampling a noise signal to yield a sampled noise signal;
searching a database for a noise model matching the sampled noise signal to yield a matching noise model; and
applying the matching noise model to a speech recognition process.
9. A method for improving recognition of speech subjected to noise, the method comprising the steps of:
sampling a connection noise to yield sampled connection noise;
searching a database for a noise model that matches the sampled connection noise to yield a matching noise model; and
applying the matching noise model to a speech recognition process.
1. A method for the robust recognition of speech in a noisy environment, comprising the steps of:
receiving the speech;
recording an amount of data related to a noisy environment, to yield recorded data;
analyzing the recorded data;
selecting a background noise model based on the recorded data, to yield a selected background noise model; and
performing speech recognition with the selected background noise model.
0. 23. A speech recognition apparatus comprising:
a database having stored thereon templates of a plurality of background noises; and
a controller that identifies a background noise template, from the templates of the plurality of background noises, that matches background noise from a received input signal, to yield a matching background noise template, and supplies the matching background noise template to a speech recognizer.
0. 34. A speech recognition method, comprising:
identifying a background noise component from an input signal;
comparing the background noise component to a plurality of previously-stored noise models, to yield a comparison;
selecting a noise model from the plurality of previously-stored noise models based on the comparison, to yield a selected noise model; and
performing speech recognition on the input signal with reference to the selected noise model.
0. 14. A speech recognition apparatus comprising:
a speech recognizer;
a database having stored thereon templates of a plurality of background noises; and
an identifier that identifies, via a processor, a background noise template from the plurality of background noise templates, the background noise template matching a background noise from an input signal, to yield a matching background noise template, wherein the speech recognizer recognizes speech from the input signal with reference to the matching background noise template.
2. The method of claim 1, further comprising the step of:
modeling a background noise in the noisy environment to create the background noise model.
3. The method of claim 1, further comprising the step of:
determining a correctness of the selected background noise model, wherein when the selected background noise model is determined to be incorrect, the method comprises loading another background noise model for use in the step of performing speech recognition.
4. The method of claim 1, further comprising the step of:
constructing a background noise database for use in analyzing the recorded data on the noisy environment.
5. The method of
6. The method of
7. The method of
8. The method of
10. The method of
11. The method of
12. The method of
13. The method of
0. 15. The speech recognition apparatus of claim 14, wherein the identifier compares hidden Markov models of the plurality of background noise templates to a hidden Markov model of the background noise from the input signal.
0. 16. The speech recognition apparatus of claim 14, wherein the identifier identifies a portion of the input signal that is unlikely to contain speech, to yield an identified portion, wherein the identified portion is used as the background noise.
0. 17. The speech recognition apparatus of claim 14, wherein the identifier, when a plurality of background noise templates match the background noise, selects a template selected in a prior iteration as the matching background noise template.
0. 18. The speech recognition apparatus of claim 14, further comprising:
a restrictor that restricts a number of candidate templates based on geographic information associated with the input signal to yield restricted candidate templates;
a comparer that compares the background noise to the restricted candidate templates to yield a comparison; and
a selector that selects the matching background noise template based on the comparison.
0. 19. The speech recognition apparatus of claim 14, further comprising:
a restrictor that restricts a number of candidate templates based on time of day information associated with the input signal to yield restricted candidate templates;
a comparer that compares the background noise to the restricted candidate templates to yield a comparison; and
a selector that selects the matching background noise template based on the comparison.
0. 20. The speech recognition apparatus of claim 14, further comprising:
a restrictor that restricts a number of candidate templates based on an identifier of a user at a location from which the input signal is captured to yield restricted candidate templates;
a comparer that compares the background noise to the restricted candidate templates to yield a comparison; and
a selector that selects the matching background noise template based on the comparison.
0. 21. The speech recognition apparatus of claim 14, further comprising a microphone to capture the input signal.
0. 22. The speech recognition apparatus of claim 14, further comprising a telephone to capture the input signal.
0. 24. The speech recognition apparatus of claim 23, further comprising the speech recognizer.
0. 25. The speech recognition apparatus of claim 23, further comprising a microphone to capture the input signal.
0. 26. The speech recognition apparatus of claim 23, further comprising a telephone to capture the input signal.
0. 28. The method of claim 27, wherein the searching comprises comparing hidden Markov models in the database to a hidden Markov model of the sampled noise signal.
0. 29. The method of claim 27, further comprising, prior to the sampling, isolating the noise signal from an input signal.
0. 30. The method of claim 27, further comprising, when a plurality of stored noise models match the sampled noise signal, selecting one of the plurality of stored noise models as the matching noise model according to a selection made in a prior iteration.
0. 31. The method of claim 27, wherein the searching comprises:
restricting a set of candidate noise models based on geographic information associated with the sampled noise signal, to yield a restricted set of candidate noise models;
comparing the sampled noise signal to the restricted set of candidate noise models, to yield a comparison; and
selecting the matching noise model based on the comparison.
0. 32. The method of claim 27, wherein the searching comprises:
restricting a set of candidate noise models based on time of day information associated with the sampled noise signal, to yield a restricted set of candidate noise models;
comparing the sampled noise signal to the restricted set of candidate noise models, to yield a comparison; and
selecting the matching noise model based on the comparison.
0. 33. The method of claim 27, wherein the searching comprises:
restricting a set of candidate noise models based on an identifier of a user at a location from which the sampled noise signal is captured, to yield a restricted set of candidate noise models;
comparing the sampled noise signal to the restricted set of candidate noise models, to yield a comparison; and
selecting the matching noise model based on the comparison.
0. 35. The speech recognition method of claim 34, further comprising:
identifying a subsequent background noise component from the input signal;
comparing the subsequent background noise component to the plurality of previously-stored noise models, to yield a second comparison;
selecting a second noise model from the plurality of previously-stored noise models based on the second comparison, to yield a second selected noise model; and
performing speech recognition on the input signal with reference to the second selected noise model.
0. 36. The speech recognition method of claim 34, further comprising:
when speech recognition fails, selecting a second noise model from the plurality of previously-stored noise models based on the second comparison, to yield a second selected noise model; and
performing speech recognition on the input signal with reference to the second selected noise model.
0. 37. The speech recognition method of claim 34, wherein the identifying occurs while prompting a user with an introductory message.
0. 38. The speech recognition method of claim 34, wherein the comparing uses hidden Markov models of the plurality of previously-stored noise models and a hidden Markov model of the background noise component.
0. 39. The speech recognition method of claim 34, further comprising, when a plurality of noise models from the plurality of previously-stored noise models match the background noise component, selecting one of the plurality of previously-stored noise models as a most closely matching noise model according to a selection made in a prior iteration.
0. 40. The speech recognition method of claim 34, wherein the comparing and selecting comprise:
restricting a set of candidate noise models based on geographic information associated with the background noise component, to yield a restricted set of candidate noise models;
comparing the background noise component to the restricted set of candidate noise models, to yield a second comparison; and
selecting the matching noise model based on the second comparison.
0. 41. The speech recognition method of claim 34, wherein the comparing and selecting comprise:
restricting a set of candidate noise models based on time of day information associated with the background noise component, to yield a restricted set of candidate noise models;
comparing the background noise component to the restricted set of candidate noise models, to yield a second comparison; and
selecting the matching noise model based on the second comparison.
0. 42. The speech recognition method of claim 34, wherein the comparing and selecting comprise:
restricting a set of candidate noise models based on an identifier of a user at a location from which the input signal is captured, to yield a restricted set of candidate noise models;
comparing the background noise component to the restricted set of candidate noise models, to yield a second comparison; and
selecting a closely matching noise model based on the second comparison.
The present invention relates to the robust recognition of speech in noisy environments using specific noise environment models and recognizers, and more particularly, to selective noise/channel/coding models and recognizers for automatic speech recognition.
Many speech recognition applications in use today have difficulty properly recognizing speech against a noisy background environment. Or, if a speech recognition application works well in one noisy background environment, it may not work well in another. That is, when a speaker is speaking into a pick-up microphone or telephone with a background filled with extraneous noise, the speech recognition application may incorrectly recognize the speech and is thus prone to error. Time and effort are therefore wasted by the speaker, and the goals of the speech recognition application are often not achieved. In telephone applications it is then often necessary for a human operator to have the speaker repeat what has been previously spoken, or to attempt to decipher what has been recorded.
Thus, there has been a need for speech recognition applications to be able to correctly assess what has been spoken in a noisy background environment. U.S. Pat. No. 5,148,489, issued Sep. 15, 1992 to Erell et al., relates to the preprocessing of noisy speech to minimize the likelihood of errors. The speech is preprocessed by calculating, for each vector of speech in the presence of noise, an estimate of the clean speech. The calculations use minimum-mean-log-spectral-distance estimation based on mixture models and Markov models. However, the preprocessing calculations rely on the basic assumption that the clean speech can be modeled because the speech and noise are uncorrelated. As this assumption may not hold in all cases, errors may still occur.
U.S. Pat. No. 4,933,973, issued Jun. 12, 1990 to Porter, relates to the recognition of incoming speech signals in noise. Pre-stored templates of noise-free speech are modified to have the estimated spectral values of noise and the same signal-to-noise ratio as the incoming signal. Once modified, the templates are compared within a processor by a recognition algorithm. Thus recognition is dependent upon proper modification of the noise-free templates. If modification is incorrectly carried out, errors may still be present in the speech recognition.
U.S. Pat. No. 4,720,802, issued Jan. 19, 1988 to Damoulakis et al., relates to a noise compensation arrangement. Speech recognition is carried out by extracting an estimate of the background noise during unknown speech input. The noise estimate is then used to modify pre-stored noiseless speech reference signals for comparison with the unknown speech input. The comparison is accomplished by averaging values and generating sets of probability density signals. Correct recognition of the unknown speech thus relies upon the proper estimation of the background noise and proper selection of the speech reference signals. Improper estimation and selection may cause errors to occur in the speech recognition.
Thus, as can be seen, the industry has not yet provided a system of robust speech recognition which can function effectively in various noisy backgrounds.
In response to the above-noted and other deficiencies, the present invention provides a method and an apparatus for robust speech recognition in various noisy environments. Thus the speech recognition system of the present invention is capable of higher performance than currently known methods in both noisy and other environments. Additionally, the present invention provides noise models, created to handle specific background noises, which can quickly be matched to the background noise of a specific call.
To achieve the foregoing, and in accordance with the purposes of the present invention, as embodied and broadly described herein, the present invention is directed to the robust recognition of speech in noisy environments using specific noise environment models and recognizers. Thus models of various noise environments are created to handle specific background noises. A real-time system then analyzes the background noise of an incoming call, loads the appropriate noise model and performs the speech recognition task with the model.
The background noise models themselves are created for each type of background noise that may be encountered. Examples of the background noises to be sampled as models include city noise, motor vehicle noise, truck noise, airport noise, subway train noise, cellular interference noise, etc. Obviously, the models need not be limited to simple background noise. For instance, various models may represent different channel conditions, different telephone microphone characteristics, various cellular coding techniques, Internet connections, and other noises associated with the placement of a call in which speech recognition is to be used. Further, a complete set of sub-word models can be created for each characteristic by mixing different background noise characteristics.
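As a rough illustration only, the small sketch below enumerates a few hypothetical background, channel, and coding characteristics and forms one model identifier per combination; the specific labels, and the idea of keying a model catalog by such tuples, are assumptions made for this example rather than structures required by the invention.

```python
from itertools import product

# Hypothetical characteristic axes; the labels below are illustrative only.
BACKGROUNDS = ["city", "motor_vehicle", "truck", "airport", "subway", "cellular"]
CHANNELS = ["landline", "cellular_channel", "internet"]
CODINGS = ["pcm", "gsm", "cdma"]

# One model set per combination of background, channel, and coding characteristics.
model_catalog = {
    (bg, ch, cod): f"noise_model_{bg}_{ch}_{cod}"
    for bg, ch, cod in product(BACKGROUNDS, CHANNELS, CODINGS)
}

print(len(model_catalog), "candidate model sets")          # 6 * 3 * 3 = 54
print(model_catalog[("airport", "cellular_channel", "gsm")])
```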
Actual creation and collection of the models can be accomplished in any known manner, or any manner hereafter to become known, as long as the sampled noise can be loaded into a speech recognizer. For instance, models can be created by recording background noise and clean speech separately and later combining the two. Or, models can be created by recording speech with the various background noise environments present. Or, for example, the models can be created using signal processing of recorded speech to alter it as if it had been recorded in the noisy background.
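A minimal sketch of the first approach, assuming the noise and clean speech are already available as sample arrays: the noise is scaled to a chosen signal-to-noise ratio and added to the clean recording. The 10 dB target and the synthetic stand-in signals are placeholders, not values taken from the invention.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to clean speech after scaling the noise to the requested SNR (dB)."""
    noise = noise[: len(clean)]                       # align lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12             # guard against silence
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean_speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 8000)   # stand-in for recorded speech
city_noise = rng.normal(size=16000)                                # stand-in for recorded noise
noisy_training_utterance = mix_at_snr(clean_speech, city_noise, snr_db=10.0)
```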
Which model to use is determined by the speech recognition apparatus. At the beginning of a call, a sample of the surrounding background environment from which the call is being placed is recorded. As introductory prompts or other such messages are being played to the caller, the system analyzes the recorded background noise. Different methods of analysis may be used. Once the appropriate noise model has been chosen on the basis of the analysis, speech recognition is performed with that model. The system can also constantly monitor the speech recognition function, and if it determines that speech recognition is not at an acceptable level, it can replace the chosen model with another.
The present invention and its features and advantages will become more apparent from the following detailed description with reference to the accompanying drawings.
Referring to
The recorded background noise is then modeled to create hidden Markov models for use in speech recognizers. Modeling is performed in the modeling device 10 using known modeling techniques. In this embodiment, the recorded background noise and pre-labeled speech data are put through algorithms that pick out phonemes, creating, in essence, statistical background noise models. In this embodiment, then, the models are created by recording background noise and clean speech separately and later combining the two.
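For illustration, the sketch below fits a small Gaussian hidden Markov model to feature frames extracted from a noise recording; the use of the third-party hmmlearn library, the random placeholder features, and the three-state topology are all assumptions standing in for the phoneme-based modeling described above.

```python
import numpy as np
from hmmlearn import hmm   # assumed available; any HMM toolkit could stand in

rng = np.random.default_rng(0)
# Placeholder feature frames (e.g., cepstral vectors) from recorded background noise.
noise_features = rng.normal(size=(500, 13))

# Fit a small Gaussian HMM as the statistical model of this noise environment.
noise_model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
noise_model.fit(noise_features)

# Later, a candidate noise sample can be scored against the stored model.
sample = rng.normal(size=(80, 13))
print("log-likelihood under this noise model:", noise_model.score(sample))
```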
Of course, it is to be recognized that any method capable of creating noise models which can be uploaded into a speech recognizer can be used in the present invention. For instance, models can be created by recording speech with the various background noise environments present. Or, for example, the models can be created using signal processing of the recorded speech to alter it as if it had been recorded in the noisy background.
The modeled background noise is then stored in an appropriate storage device 20. The storage device 20 itself may be located at a central network hub, or it may be reproduced and distributed locally. The various stored background noise models 1, . . . , n, n+1 are then appropriately accessed from the storage device 20 by a speech recognition unit 30 when a call is placed by the telephone user 40. There may, of course, be more than one speech recognition unit 30 used for any given call. Further, the present invention will work equally well with any technique of speech recognition using the background noise models.
Referring to
Analysis of the background noise may be accomplished in one or more ways. Signal information, such as the type of signals (ANI, DNIS, SS7 signals, etc.), the channel port number, or the trunk line number, may be used to help narrow down what the background noise is, and thus what background noise model would be most suitable. For example, the system may determine that a call received over a particular trunk line number is more likely than not from India, as that trunk line number is the designated trunk for receiving calls from India. Further, the location of the call may be recognized from the caller's account number, the time the call is placed, or other known information about the caller and/or the call. Such information could be used as a preliminary indicator of the existence and type of background noise.
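A hedged sketch of using such signalling information as a preliminary filter follows; the metadata fields, the trunk-line priors, and the time-of-day rule are invented for illustration and are not part of the invention itself.

```python
from dataclasses import dataclass

@dataclass
class CallInfo:
    trunk_line: str
    hour_of_day: int

# Hypothetical prior knowledge: which noise models are plausible on each trunk line.
TRUNK_PRIORS = {
    "trunk_07": {"city", "subway"},        # e.g., a metropolitan exchange
    "trunk_12": {"airport", "cellular"},   # e.g., a mobile gateway
}

def restrict_candidates(call: CallInfo, all_models: set) -> set:
    """Narrow the candidate noise models using call signalling metadata."""
    candidates = TRUNK_PRIORS.get(call.trunk_line, all_models)
    if call.hour_of_day >= 22 or call.hour_of_day < 6:
        candidates = candidates - {"airport"}          # illustrative time-of-day rule
    return candidates & all_models

print(restrict_candidates(CallInfo("trunk_07", 23), {"city", "subway", "airport"}))
```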
Alternatively, or in conjunction with the preceding method, a series of questions or instructions may be posed to the caller, with corresponding answers given by the caller. These answers may then be analyzed using each model (or a pre-determined maximum number of models) to determine which models have a higher correct match percentage. For example, the system may carry on a dialog with the caller and instruct the caller to say “NS437W”, “Boston”, and “July 1st”. The system will then analyze each response using the various background noise models. The model(s) with the correct match for each response by the caller can then be used in the speech recognition application. An illustration of the above analysis method is found in
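One way to picture this dialog-based selection is as a per-model match count over the prompted utterances, sketched below; recognize_with_model is a hypothetical stand-in for whatever recognizer the system actually runs, and the toy responses exist only to make the example executable.

```python
PROMPTED_TRUTH = ["NS437W", "Boston", "July 1st"]       # what the caller was told to say

def recognize_with_model(response, noise_model):
    """Hypothetical stand-in: a real system would run its recognizer here."""
    return response.get(noise_model, "")

def match_percentage(responses, noise_model):
    hits = sum(recognize_with_model(r, noise_model) == truth
               for r, truth in zip(responses, PROMPTED_TRUTH))
    return 100.0 * hits / len(PROMPTED_TRUTH)

# Toy responses pretending each model decoded the three utterances differently.
responses = [
    {"city": "NS437W", "airport": "NS437W"},
    {"city": "Boston", "airport": "Austin"},
    {"city": "July 1st", "airport": "July 1st"},
]
scores = {m: match_percentage(responses, m) for m in ("city", "airport")}
print(scores)   # the model(s) with the highest match percentage would be used
```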
Also, if the system is unable to definitively decide which model and/or models yield the best performance in the speech recognition application, the system may either guess, use more than one model by using more than one speech recognizer, or compare parameters of the call's recorded background noise to parameters contained in each background noise model.
Once a call from a particular location has been matched to a background noise model, the system can store that information in a database. Thus in step 135, a database of which background noise models are most successful in the proper analysis of the call's background noise can be created and stored. This database can later be accessed when another incoming call is received from the same location. For example, it has previously been determined, and stored in the database, that a call from a particular location should use the city noise background noise model in the speech recognition application, because that model results in the highest percentage of correct speech recognitions. Thus the most appropriate model is used. Of course, the system can dynamically update itself by constantly re-analyzing the call's recorded background noise to detect potential changes in the background noise environment.
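The per-location history described for step 135 could be as simple as the mapping sketched below; the location key, the success-rate bookkeeping, and the caller identifier are assumptions about one possible realization, not the database layout of the invention.

```python
from collections import defaultdict

# location identifier -> noise model name -> [successful recognitions, attempts]
history = defaultdict(lambda: defaultdict(lambda: [0, 0]))

def record_outcome(location, model, success):
    stats = history[location][model]
    stats[0] += int(success)
    stats[1] += 1

def best_model_for(location):
    """Return the stored model with the highest observed success rate, if any."""
    stats = history.get(location)
    if not stats:
        return None
    return max(stats, key=lambda m: stats[m][0] / stats[m][1])

record_outcome("caller_555_0100", "city", True)
record_outcome("caller_555_0100", "city", True)
record_outcome("caller_555_0100", "truck", False)
print(best_model_for("caller_555_0100"))   # -> "city"
```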
Once the call's recorded background noise has been analyzed, or the database has been accessed to determine where the call is coming from and which model is most appropriate, in step 140 the most appropriate background noise model is selected and recalled from the storage device 20. Further, alternative background noise models may be ordered on a standby basis in case speech recognition fails with the selected model. With the most appropriate background noise model having been selected, and other models ordered on standby, the system proceeds in step 150 to the speech recognition application using the selected model.
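Selecting the most appropriate model while keeping alternatives on standby can be pictured as a simple ranking by analysis score, as in the sketch below; the numeric scores are placeholders for whatever similarity measure the analysis of step 130 actually produces.

```python
# Hypothetical analysis scores: how well each stored model explains the recorded noise.
analysis_scores = {"city": -310.5, "airport": -295.2, "subway": -402.8}

ranked = sorted(analysis_scores, key=analysis_scores.get, reverse=True)
primary_model = ranked[0]      # loaded into the recognizer first (step 160)
standby_models = ranked[1:]    # tried in order if recognition later fails (step 185)

print("primary:", primary_model)     # airport
print("standby:", standby_models)    # ['city', 'subway']
```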
Referring to
Correctness of the speech recognition in step 180 may be determined in several ways. If more than one speech recognition unit 30 is being used, the correct recognition of the speech utterance may be determined by using a voter scheme. That is, each speech recognition unit 30, using a set of models with different background noise characteristics, will analyze the speech utterance. A vote determines which analysis is correct. For example, if fifty recognizers determine that “Boston” has been said by the caller, and twenty recognizers determine that “Baltimore” has been said, then the system determines in step 180 that “Boston” must be the correct speech utterance. Alternatively, or in conjunction with the above method, the system can ask the caller to validate the determined speech utterance. For example, the system can prompt the caller by asking “Is this correct?” A determination of correctness in step 180 can thus be made on the basis of the most correct validations by the caller and/or the lowest number of rejections (the rejection threshold could be set high).
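The voter scheme amounts to a majority vote over the hypotheses returned by the parallel recognizers, as in this short sketch; the vote counts simply mirror the Boston/Baltimore example above.

```python
from collections import Counter

# One hypothesis per recognizer unit, each running a different noise model set.
hypotheses = ["Boston"] * 50 + ["Baltimore"] * 20

winner, votes = Counter(hypotheses).most_common(1)[0]
print(f"accepted utterance: {winner} ({votes} of {len(hypotheses)} votes)")
```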
If the minimal criterion of correctness is not met, and thus the most appropriate background noise model loaded in step 160 is determined to be an unsuitable choice, a new model can be loaded. Thus in step 185, the system returns to step 160 to load a new model, perhaps the model which was previously determined in step 140 to be next in order. The minimal criterion of correctness may be set at any level deemed appropriate and will most often be experimentally determined on the basis of each individual system and its own characteristics.
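Steps 160 through 185 can be read as the retry loop sketched below, which loads the next standby model whenever the observed correctness falls short of a threshold; the threshold value and the correctness callback are assumptions, since the text leaves both to be determined per system.

```python
def run_with_fallback(models_in_order, recognize, correctness_of, threshold=0.8):
    """Try each background noise model until recognition meets the minimal criterion."""
    for model in models_in_order:                  # step 160: load a model
        result = recognize(model)                  # step 170: recognize speech with it
        if correctness_of(result) >= threshold:    # step 180: check correctness
            return model, result
        # step 185: criterion not met -> fall through and load the next standby model
    return None, None

# Toy usage with stand-in callbacks.
fake_quality = {"airport": 0.55, "city": 0.9}
model, _ = run_with_fallback(
    ["airport", "city"],
    recognize=lambda m: m,
    correctness_of=lambda r: fake_quality[r],
)
print("model finally used:", model)   # -> city
```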
If the determination in step 180 is that speech recognition is proceeding at an acceptable level, then the system can proceed to carry out the caller's desired functions, as shown in step 190.
As such, the present invention has many advantageous uses. For instance, the system is able to provide robust speech recognition in a variety of noisy environments. In other words, the present invention works well over a gamut of different noisy environments and is thus easy to implement. Not only that, but the speech recognition system is capable of a higher performance and a lower error rate than current systems. Even when the error rate begins to approach an unacceptable level, the present system automatically corrects itself by switching to a different model(s).
It is to be understood and expected that variations in the principles of construction and methodology herein disclosed in an embodiment may be made by one skilled in the art and it is intended that such modifications, changes, and substitutions are to be included within the scope of the present invention.
Goldberg, Randy G., Rosen, Kenneth H., Winthrop, Joel A., Sachs, Richard M.
Patent | Priority | Assignee | Title |
4610023, | Jun 04 1982 | Nissan Motor Company, Limited | Speech recognition system and method for variable noise environment |
4720802, | Jul 26 1983 | Lear Siegler | Noise compensation arrangement |
4933973, | Feb 29 1988 | ITT Corporation | Apparatus and methods for the selective addition of noise to templates employed in automatic speech recognition systems |
5148489, | Feb 28 1990 | SRI International | Method for spectral estimation to improve noise robustness for speech recognition |
5222190, | Jun 11 1991 | Texas Instruments Incorporated | Apparatus and method for identifying a speech pattern |
5386492, | Jun 29 1992 | Nuance Communications, Inc | Speech recognition system utilizing vocabulary model preselection |
5509104, | May 17 1989 | AT&T Corp. | Speech recognition employing key word modeling and non-key word modeling |
5617509, | Mar 29 1995 | Motorola, Inc. | Method, apparatus, and radio optimizing Hidden Markov Model speech recognition |
5649055, | Mar 26 1993 | U S BANK NATIONAL ASSOCIATION | Voice activity detector for speech signals in variable background noise |
5649057, | May 17 1989 | THE CHASE MANHATTAN BANK, AS COLLATERAL AGENT | Speech recognition employing key word modeling and non-key word modeling |
5721808, | Mar 06 1995 | Nippon Telegraph and Telephone Corporation | Method for the composition of noise-resistant hidden markov models for speech recognition and speech recognizer using the same |
5749067, | Nov 23 1993 | LG Electronics Inc | Voice activity detector |
5749068, | Mar 25 1996 | Mitsubishi Denki Kabushiki Kaisha | Speech recognition apparatus and method in noisy circumstances |
5761639, | Mar 13 1989 | Kabushiki Kaisha Toshiba | Method and apparatus for time series signal recognition with signal variation proof learning |
5778342, | Feb 01 1996 | WIRELESS IP LTD | Pattern recognition system and method |
5854999, | Jun 23 1995 | NEC Electronics Corporation | Method and system for speech recognition with compensation for variations in the speech environment |
5860062, | Jun 21 1996 | Matsushita Electric Industrial Co., Ltd. | Speech recognition apparatus and speech recognition method |
6078884, | Aug 24 1995 | British Telecommunications public limited company | Pattern recognition |