A voice-controlled multi-station network has both speaker-dependent and speaker-independent speech recognition. Conditionally to recognizing items of an applicable vocabulary, the network executes a particular function. The method receives a call from a particular origin and executes speaker-independent speech recognition on the call. In an improvement procedure, in case of successful determination of what has been said, a template associated to the recognized speech item is stored and assigned to the origin. Next, speaker-dependent recognition is applied if feasible, for speech received from the same origin, using one or more templates associated to that station. Further, a fallback procedure to speaker-independent recognition is maintained for any particular station in order to cater for failure of the speaker-dependent recognition, while allowing reverting to the improvement procedure.
1. A method for activating a voice-controlled function in a multi-station network by using both speaker-dependent and speaker-independent speech recognition facilities, and conditionally to recognizing one or more items of an applicable vocabulary, driving one or more network parts to activate said function, wherein said method comprises the following steps:
receiving a station-initiated call containing one or more initial speech items from the vocabulary, executing speaker-independent recognition on said initial speech items through one or more general templates, whilst in a speech recognition improvement procedure, in case of successful ascertainment of what had been actually spoken, storing a particular speaker-specific template derived from the initial speech item so recognized and assigned to an origin of the call in question, said speaker-specific template being cyclically retained for subsequent speaker-dependent recognition of additional speech items having the same origin; following said speech recognition improvement procedure, applying speaker-dependent recognition as an initial type of speech recognition if feasible for additional speech items received from the same origin, through one or more particular templates associated to that origin and only subsequently applying speaker-independent recognition as a fallback procedure if the recognition of the additional speech items cannot be ascertained by speaker-dependent recognition, wherein speaker-independent recognition is a first response for new or unidentified users of the voice-controlled function, and speaker-dependent recognition based on said speech recognition improvement procedure is a first response for repeat users of the voice-controlled function, with a reversion to speaker-independent recognition if the additional speech items are not recognized.
2. The method as claimed in
3. The method as claimed in
4. The method as claimed in
5. The method as claimed in
6. The method as claimed in
8. The method as claimed in
9. The method as claimed in
10. A device being arranged for executing the method as claimed in
The invention relates to a method as claimed in the preamble of claim 1. Pertinent art that combines both speaker-dependent and speaker-independent recognition facilities in a single system has been disclosed in U.S. Pat. No. 5,165,095. Here, speaker-independent recognition is used for terms and phrases that are considered common to many speakers, such as various commands for effecting dialling and various other functions. Generally, the functions use the network, but need not be restricted to the network itself. Furthermore, speaker-dependent recognition is used to recognize private terms such as personal names and the like. Generally, speaker-independent recognition must access a larger template base to recognize a particular term, but even then is often less successful. Speaker-dependent recognition generally has fewer failures, so it would be preferable to be able to resort to speaker-dependent recognition in most cases. However, for using speaker-dependent recognition, the system must identify the actual speaker. Further, users experience the training of the system as a tedious task.
In consequence, amongst other things it is an object of the present invention to allow the system to gradually and reversibly improve to speaker-dependent recognition if feasible. Now therefore, according to one of its aspects the invention is characterized according to the characterizing part of claim 1.
The invention also relates to a device arranged for executing the method according to the invention. Further advantageous aspects of the invention are recited in dependent claims.
These and further aspects and advantages of the invention will be discussed more in detail hereinafter with reference to the disclosure of preferred embodiments, and in particular with reference to the appended Figures that show:
In modern telecommunication a key function is directory search using automatic speech recognition, and including the facility for fast introduction of new entries into the directory. No lengthy training is considered feasible.
The technique used here is whole word recognition of any entry, using sparse initial training and automatic additional training, using the CLI (Caller Line Identity) to identify the origin of the call. The approach is particularly advantageous for portable telephones. Alternatively, the caller may be recognized by executing speaker recognition through using the received speech itself, thereby allowing a user person to freely move between a plurality of stations. Other speech recognition techniques than whole word recognition are feasible, such as recognition on the level of phonemes or of diphones.
In word recognition each word must be trained with several examples. To recognize a particular speech item, a speaker-dependent system needs only a few examples or templates therefor from that speaker. A speaker-independent system requires many examples from many speakers. Typically some 100 speakers for each gender are required for a reliable speaker-independent system. Most known speaker-independent recognition systems use separate models for male and female speech. Using more speakers will improve the reliability still further.
To alleviate training requirements for a speaker-independent system, the invention uses an adaptive strategy. Initially the system is trained with only a few examples, but during actual usage further examples are collected and used for automatic improvement. The aim is to ensure that a user is recognized at least the second time he enters a particular utterance into the system, such utterance being based on the above speech items.
The criteria used for selecting a training method are user oriented. A distinction is made between initial performance, performance during upgrading, and eventual performance after long adaptation.
For the final performance a balance has been found between overall performance and performance for each individual user taken separately. If only overall performance, as measured solely on the total number of recognitions, were optimized, the system would be trained foremost on frequent users. This would result in a system that would serve only a group of such frequent users. However, the principal aim of a directory system is to replace a printed directory that is needed in particular for extension numbers that are used seldom. This is exactly the opposite of frequent users/usage.
A user will want the system to adapt quickly to faulty recognitions. If an utterance is not recognized at first use, then from the second time onward its chance of being recognized should improve considerably. This calls for a strategy wherein faulty recognitions are used to extend the body of templates.
The most general templates are acquired using a uniform distribution of the training data over the speakers. Contrariwise, using all recorded material for training will foremostly benefit frequent users.
Now, according to the invention, in an environment with a restricted user group, such as a medium size office, getting both optimal performance for each individual user, and also good performance over the whole directory for all users is best acquired if the speaker is known to the system (by Calling Line Identity or otherwise). Two types of templates are now used simultaneously: general templates and user specific templates.
The user-specific templates can be updated quickly, which will result in a good performance for the associated individual user. The drawback is that only utterances already used by a speaker are available for training for that particular user.
The general templates will give a reasonable overall performance directly, but to get enough samples for all entries will take much time. Training of these templates is done with lower priority.
The strategy used for training the user-specific templates is:
No initial training, and adaptation by cyclic retaining of N (typically in the order of 5) recordings for each item; every use of such item is recorded. Cyclic retraining will continuously adapt the system.
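The cyclic retaining of N recordings per item can be sketched as a bounded buffer per origin and item. This is only a minimal illustration: the class name `UserTemplateStore`, the string placeholders for recordings, and the use of the call origin as a dictionary key are assumptions made for the example, not details given in the text.

```python
from collections import defaultdict, deque

N = 5  # recordings retained per item; the text suggests "in the order of 5"


class UserTemplateStore:
    """Hypothetical store keeping the N most recent recordings per (origin, item)."""

    def __init__(self, n=N):
        # deque(maxlen=n) discards the oldest entry once n recordings are held,
        # which gives the cyclic behaviour described above
        self._store = defaultdict(lambda: deque(maxlen=n))

    def record(self, origin, item, recording):
        # Every use of an item is recorded for the origin of the call
        self._store[(origin, item)].append(recording)

    def templates(self, origin, item):
        # .get() avoids creating an empty entry for unknown origins
        return list(self._store.get((origin, item), ()))
```

With N set to 3 and five recordings of the same item from one origin, only the last three would remain available as templates.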
The general templates will benefit most from a uniform distribution over all users. However, in the initial phase only a few recordings are available, so the way to reach a uniform distribution must be specified. The easiest way to get an initial training base is to use one (or a few) speaker(s) per gender. In this way only a few persons will be bothered with the initial training.
The preferred approach is:
Initial training with one speaker per gender
Use all recordings, but at most M (such as five) per caller
Cyclic refreshing of M recordings per user person, resulting in continuous adaptation. Here M is the maximum capacity for training recordings divided by the maximum number of users.
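The per-caller quota for the general templates can be sketched as follows. The sketch assumes integer division for M and keeps one bounded buffer per caller; the names `per_caller_quota` and `GeneralTrainingPool` are hypothetical, and the per-item columns of the template store are left out for brevity.

```python
from collections import deque


def per_caller_quota(capacity, max_users):
    """M = maximum capacity for training recordings / maximum number of users."""
    return capacity // max_users


class GeneralTrainingPool:
    """Hypothetical pool holding at most M recordings per caller."""

    def __init__(self, capacity, max_users):
        self.m = per_caller_quota(capacity, max_users)
        self._by_caller = {}

    def add(self, caller, recording):
        # Each caller gets a bounded buffer; the oldest recording is dropped
        # once M are stored (cyclic refreshing)
        buf = self._by_caller.setdefault(caller, deque(maxlen=self.m))
        buf.append(recording)

    def training_set(self):
        # No caller contributes more than M recordings, so frequent
        # callers cannot dominate the general templates
        return [r for buf in self._by_caller.values() for r in buf]
```

For instance, a capacity of 500 recordings and at most 100 users gives M = 5, so a caller who has used the system eight times still contributes only five recordings.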
For a set of parallel users, the proposed approach increases the number of templates actually matched by 50% (one user-specific template, plus a male and a female template). However, the overall performance will be much better than with a completely speaker-independent system. Over a period of time the system will evolve from a "one speaker"-dependent system, via a speaker-independent system, to eventually a combination of a speaker-dependent system for all frequent users with a speaker-independent system for novice or accidental users.
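The 50% figure follows if the baseline is taken to be the two general templates (male and female) matched per item, with one user-specific template added for an identified caller. That baseline is an interpretation of the parenthetical remark rather than something stated outright:

```latex
\frac{(1 + 2) - 2}{2} = \frac{1}{2} = 50\%
```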
If occasionally the CLI is unknown and/or the speaker identity cannot be recognized otherwise, an extra default "user" may be introduced. The system will handle this default user as a frequent user. In advanced systems, however, an alternative strategy for adapting in the absence of a user identity can be chosen. Using all recordings for training will always result in over-representation of frequent user(s). Alternatively, using only the failed recognitions will result in performance oscillation, but all users will be able to use the system. A balance between these two extremes has been chosen through evaluating the two strategies. The proposed scenario for adaptation without CLI is:
Use each Kth good recognition, wherein K is about 5, and furthermore use all failed recognitions for updating the stored templates.
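A minimal sketch of this selection rule follows. It assumes a single running counter of good recognitions across all calls without identity; the class name `NoIdentityUpdatePolicy` and its method are hypothetical.

```python
class NoIdentityUpdatePolicy:
    """Hypothetical rule: keep every K-th good recognition and every failure."""

    def __init__(self, k=5):
        self.k = k  # the text suggests K of about 5
        self._good_seen = 0

    def should_store(self, recognized_correctly):
        # Failed recognitions are always used for updating the templates
        if not recognized_correctly:
            return True
        # Good recognitions are thinned out to one in every K
        self._good_seen += 1
        return self._good_seen % self.k == 0
```

Thinning the good recognitions limits the over-representation of frequent users, while keeping all failures preserves the quick adaptation to faulty recognitions.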
The filling of the respective blocks has been indicated supra. First, the system is trained with one speaker for each gender, thereby wholly or partially filling the lower two rows. Next in practice, all further utterances will be used, but in each column at most M per caller; these are stored in the row of that caller. These templates will be cyclically refreshed. The recognition presumably knows the caller identity, and therefore takes into account the content of the associated row and furthermore, the content of the lowest two rows. The latter cater for speaker-independent recognition. Also for the speaker-independent templates on the lower two rows the training is continued.
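The row-and-column layout and the order of recognition attempts can be sketched as below. This is an illustration under stated assumptions only: templates are represented as short numeric feature vectors, matching uses plain Euclidean distance with a fixed acceptance threshold, and the row names, function names, and threshold value are all hypothetical. The text does not specify the matching technique.

```python
import math

# Rows holding the general (speaker-independent) templates, one per gender
GENERAL_ROWS = ("general-male", "general-female")


def best_match(utterance, rows, matrix):
    """Return (item, distance) of the closest template found in the given rows."""
    best = (None, math.inf)
    for row in rows:
        for item, templates in matrix.get(row, {}).items():
            for template in templates:
                d = math.dist(utterance, template)
                if d < best[1]:
                    best = (item, d)
    return best


def recognize(utterance, caller, matrix, threshold=1.0):
    """Try the caller's own row first, then fall back to the general rows."""
    if caller in matrix:
        item, d = best_match(utterance, [caller], matrix)
        if d <= threshold:
            return item, "speaker-dependent"
    # Fallback, also used for new or unidentified callers
    item, d = best_match(utterance, GENERAL_ROWS, matrix)
    if d <= threshold:
        return item, "speaker-independent"
    return None, "unrecognized"
```

Here `matrix` maps a row name (a caller identity or one of the general rows) to a dictionary from item to its list of stored templates. An unrecognized result is the case where a higher-level check would be needed to ascertain what was said.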
The system may incorporate higher-level measures for ascertaining whether or not recognition was correct, thereby externally defining an appropriate speech item. One is to put an additional question to the user that must be answered by yes/no only. Another is to build in a check by key actuation, or to allow keying in of a particular phrase. This allows the stored body of templates to be updated automatically, so that the performance of the system improves continually. In fact, an unrecognized speech item together with the subsequent ascertaining of its meaning will serve to update the stored body of templates. The training with templates that correspond to immediately recognized speech items, on the basis of the speech itself, will cater for slow drifts in the manner the speech in question is actually produced.
Hesdahl, Piet B., Dams, Franciscus J. L., Van Velden, Jeroen G.
Patent | Priority | Assignee | Title |
4922538, | Feb 10 1987 | British Telecommunications public limited company | Multi-user speech recognition system |
5073939, | Jun 08 1989 | ITT Corporation | Dynamic time warping (DTW) apparatus for use in speech recognition systems |
5091947, | Jun 04 1987 | Ricoh Company, Ltd. | Speech recognition method and apparatus |
5163081, | Nov 05 1990 | AT&T Bell Laboratories | Automated dual-party-relay telephone system |
5165095, | Sep 28 1990 | Texas Instruments Incorporated | Voice telephone dialing |
5297183, | Apr 13 1992 | Nuance Communications, Inc | Speech recognition system for electronic switches in a cellular telephone or personal communication network |
5353376, | Mar 20 1992 | Texas Instruments Incorporated; TEXAS INSTRUMENTS INCORPORATED A CORP OF DELAWARE | System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment |
5384833, | Apr 27 1988 | Cisco Technology, Inc | Voice-operated service |
5475791, | Aug 13 1993 | Nuance Communications, Inc | Method for recognizing a spoken word in the presence of interfering speech |
5511111, | Nov 01 1993 | Intellectual Ventures I LLC | Caller name and identification communication system with caller screening option |
5553119, | Jul 07 1994 | GOOGLE LLC | Intelligent recognition of speech signals using caller demographics |
5719921, | Feb 29 1996 | GOOGLE LLC | Methods and apparatus for activating telephone services in response to speech |
5724481, | Mar 30 1995 | Alcatel-Lucent USA Inc | Method for automatic speech recognition of arbitrary spoken words |
5913192, | Aug 22 1997 | Nuance Communications, Inc | Speaker identification with user-selected password phrases |
6076054, | Feb 29 1996 | GOOGLE LLC | Methods and apparatus for generating and using out of vocabulary word models for speaker dependent speech recognition |
6088669, | Feb 02 1996 | International Business Machines, Corporation; IBM Corporation | Speech recognition with attempted speaker recognition for speaker model prefetching or alternative speech modeling |
EP661690, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 09 1998 | DAMS, FRANCISCUS J L | U S PHILIPS CORPORATION | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 009514 | /0861 | |
Sep 09 1998 | HESDAHL, PIET B | U S PHILIPS CORPORATION | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 009514 | /0861 | |
Sep 09 1998 | VAN VELDEN, JEROEN G | U S PHILIPS CORPORATION | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 009514 | /0861 | |
Oct 07 1998 | Koninklijke Philips Electronics N.V. | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Mar 24 2008 | REM: Maintenance Fee Reminder Mailed. |
Sep 14 2008 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |
Date | Maintenance Schedule |
Sep 14 2007 | 4 years fee payment window open |
Mar 14 2008 | 6 months grace period start (w surcharge) |
Sep 14 2008 | patent expiry (for year 4) |
Sep 14 2010 | 2 years to revive unintentionally abandoned end. (for year 4) |
Sep 14 2011 | 8 years fee payment window open |
Mar 14 2012 | 6 months grace period start (w surcharge) |
Sep 14 2012 | patent expiry (for year 8) |
Sep 14 2014 | 2 years to revive unintentionally abandoned end. (for year 8) |
Sep 14 2015 | 12 years fee payment window open |
Mar 14 2016 | 6 months grace period start (w surcharge) |
Sep 14 2016 | patent expiry (for year 12) |
Sep 14 2018 | 2 years to revive unintentionally abandoned end. (for year 12) |