The lexical network of a large-vocabulary speech recognition system is structured to enable the rapid and efficient addition of words to the system's active vocabulary. The lexical network includes phonetic constraint nodes, which organize the inter-word phonetic information in the network, and word class nodes, which organize the syntactic/semantic information in the network. Network fragments, corresponding to phoneme pronunciations and labeled to specify permitted interconnections to each other and to phonetic constraint nodes, are precompiled to facilitate the rapid generation of pronunciations for new words and thereby enhance the rapid addition of words to the vocabulary even during speech recognition. Functions defined in accordance with linguistic constraints may be utilized during recognition. Different language models and different vocabularies for different portions of a discourse may also be invoked depending, in part, on the discourse history.
|
15. A method for producing a word pronunciation network for a given word to augment a pre-established active vocabulary of a speech recognition vocabulary, wherein the given word is selected from a baseform pronunciation or spelling vocabularies of the speech recognition vocabulary, comprising:
A. storing a set of phonemes for forming words of the speech recognition vocabulary; B. for each phoneme, storing a set of phonetic fragments formed in accordance with predetermined rules that define alternative pronunciations of the phoneme dependent on phonemes with which it may be connected; C. selecting and linking permissible phonetic fragments for each phoneme in the given word; D. associating the linked phonetic fragments with a word class, from a plurality of word classes, of the pre-established active vocabulary of the speech recognition vocabulary, the word class being indicative of semantic or syntactic information of the given word; and E. selectively associating the given word with at least one of the active vocabulary word classes based on the active vocabulary word classes and the associated word class.
8. A speech recognition system for dynamically augmenting an active vocabulary with a word from a total vocabulary, comprising:
A. a lexical network comprising phonemic sequences defining words of the active vocabulary; B. constraint nodes defining constraints on permitted interconnection between selected phonemic sequences of words of the active vocabulary, the constraint nodes including phonetic constraint nodes defining phonetic constraints on adjacent words and word class nodes defining semantic or syntactic classes of words corresponding to the word class nodes, the word class nodes being associated with phonetic constraint nodes; and C. means for establishing links between pronunciation networks of words in the active vocabulary through said constraint nodes and for establishing a link between a first word, having a pre-associated word class, and a second word based on said word class nodes and said pre-associated word class, said first word being selected from a baseform pronunciation or a spelling portion of the total vocabulary and being an addition to the active vocabulary, wherein the link between the first word and the second word is formed by forming a pronunciation network for the first word and connecting at least one of beginning and ending phonemes of the pronunciation network of the first word to one of the word class nodes.
1. A speech recognition system for dynamically adding words to an active portion of a total vocabulary, wherein interword relationships between pairs of words of the total vocabulary are defined by a lexical network, the speech recognition system comprising:
a plurality of nodes corresponding to interword relationships between adjacent words, the interword relationships being formed in accordance with a predefined congruence, wherein the nodes comprise: A. a plurality of phonetic constraint nodes characterized by a phonetic constraint tuple (x, y, . . . ) of order two or greater, where x, y, . . . are phonetic constraints on adjacent words; B. a plurality of connection nodes, wherein a set of phonetic constraint nodes corresponds to each connection node for identifying adjacent words from the active vocabulary satisfying the phonetic constraints of the set of phonetic constraint nodes; and C. a plurality of word class nodes each associated with a word class indicative of syntactic or semantic information of at least one word associated with each word class node, and each word class node being associated with a phonetic constraint node; and an apparatus configured to add a word to the active vocabulary portion of the total vocabulary, wherein the word to be added is selectively specified as a phonemic baseform pronunciation or a spelling and wherein the word has at least beginning and ending phonemes and a pre-associated word class, wherein the apparatus is configured to add the word by determining phonetic constraint nodes for the word corresponding to its phonemes and determining interword relationships for the word based on the connection nodes corresponding to the determined phonetic constraint nodes and based on the plurality of word class nodes and the pre-associated word class and by connecting a pronunciation network of the word to be added to at least one of the phonetic constraint nodes and the word class nodes.
2. The speech recognition system according to
3. The speech recognition system according to
wherein the apparatus configured to add the word is configured to, if the word to be added to the active vocabulary has at least one word class, determine word class nodes corresponding to the word classes of the word to be added, and wherein the interword relationships for the word are determined by the connection nodes corresponding to the determined phonetic constraint nodes and word class nodes for the word.
4. The speech recognition system according to
5. The system of
6. The system of
9. The speech recognition system according to
10. The speech recognition system according to
11. The speech recognition system according to
12. A speech recognition system according to
13. The speech recognition system according to
14. The speech recognition system according to
17. The method of
18. The method of
22. The method of
23. The method of
24. The method of
25. The method of
26. The method of
|
This is a continuation application of Ser. No. 08/451,448, filed May 26, 1995, now abandoned.
1. Field of the Invention
The invention relates to speech recognition systems and, more particularly, to large-vocabulary speech recognition systems. The system described herein is suitable for use in systems providing interactive natural language discourse.
2. The Prior Art
Speech recognition systems convert spoken language to a form that is tractable by a computer. The resultant data string may be used to control a physical system, may be output by the computer in textual form, or may be used in other ways.
An increasingly popular use of speech recognition systems is to automate transactions requiring interactive exchanges. An example of a system with limited interaction is a telephone directory response system in which the user supplies information of a restricted nature such as the name and address of a telephone subscriber and receives in return the telephone number of that subscriber. An example of a substantially more complex such system is a catalogue sales system in which the user supplies information specific to himself or herself (e.g., name, address, telephone number, special identification number, credit card number, etc.) as well as further information (e.g., nature of item desired, size, color, etc.) and the system in return provides information to the user concerning the desired transaction (e.g., price, availability, shipping date, etc.).
Recognition of natural, unconstrained speech is very difficult. The difficulty is increased when there is environmental background noise or a noisy channel (e.g., a telephone line). Computer speech recognition systems typically require the task to be simplified in various ways. For example, they may require the speech to be noise-free (e.g., by using a good microphone), they may require the speaker to pause between words, or they may limit the vocabulary to a small number of words. Even in large-vocabulary systems, the vocabulary is typically defined in advance. The ability to add words to the vocabulary dynamically (i.e., during a discourse) is typically limited, or even nonexistent, due to the significant computing capabilities required to accomplish the task on a real-time basis. The difficulty of real-time speech recognition is dramatically compounded in very large-vocabulary applications (e.g., tens of thousands of words or more).
One example of an interactive speech recognition system under current development is the SUMMIT speech recognition system being developed at M.I.T. This system is described in Zue, V., Seneff, S., Polifroni, J., Phillips, M., Pao, C., Goddeau, D., Glass, J., and Brill, E. "The MIT ATIS System: December 1993 Progress Report." Proc. ARPA Human Language Technology Workshop, Princeton, N.J. March 1994, among other papers. Unlike most other systems, which are frame-based (the unit of the frame typically being a 10 ms portion of speech), the SUMMIT speech recognition system is segment-based, the segment typically being a speech sound or phone.
In the SUMMIT system, the acoustic signal representing a speaker's utterances is first converted into an electrical signal for signal processing. The processing may include filtering to enhance subsequent recognizability of the signal, remove unwanted noise, etc. The signal is converted to a spectral representation, then divided into segments corresponding to hypothesized boundaries of individual speech sounds (segments). The network of hypothesized segments is then passed to a phonetic classifier whose purpose is to seek to associate each segment with a known "phone" or speech sound identity. Because of uncertainties in the recognition process, each segment is typically associated with a list of several phones, with probabilities associated with each phone. Both the segmentation and the classification are performed in accordance with acoustic models for the possible speech sounds.
The end product of the phonetic classifier is a "lattice" of phones, each phone having a probability associated therewith. The actual words spoken at the input to the recognizer should form a path through this lattice. Because of the uncertainties of the process, there are usually on the order of millions of possible paths to be considered, each of different overall probability. A major task of the speech recognizer is to associate the segments along paths in the phoneme lattice with words in the recognizer vocabulary to thereby find the best path.
In prior art systems, such as the SUMMIT system, the vocabulary or lexical representation is a "network" that encodes all possible words that the recognizer can identify, all possible pronunciations of these words, and all possible connections between these words. This vocabulary is usually defined in advance, that is, prior to attempting to recognize a given utterance, and is usually fixed during the recognition process. Thus, if a word not already in the system's vocabulary is spoken during a recognition session, the word will not successfully be recognized.
The structure of current lexical representation networks does not readily lend itself to rapid updating when large vocabularies are involved, even when done on an "off-line" basis, that is, in the absence of speech input. In particular, in prior art lexical representations of the type exemplified by the SUMMIT recognition system, the lexical network is formed as a number of separate pronunciation networks for each word in the vocabulary, together with links establishing the possible connections between words. The links are placed based on phonetic rules. In order to add a word to the network, all words presently in the vocabulary must be checked in order to establish phonetic compatibility between the respective nodes before the links are established. This is a computationally intensive problem whose difficulty increases as the size of the vocabulary increases. Thus, the word addition problem is a significant issue in phonetically-based speech recognition systems.
In present speech recognition systems, a precomputed language model is employed during the search through the lexical network to favor sequences of words which are likely to occur in spoken language. The language model can provide the constraint to make a large vocabulary task tractable. This language model is generally precomputed based on the predefined vocabulary, and thus is generally inappropriate for use after adding words to the vocabulary.
A. Objects of the Invention
Accordingly, it is an object of the invention to provide an improved speech recognition system.
A further object of the invention is to provide a speech recognition system which facilitates the rapid addition of words to the vocabulary of the system.
Still a further object of the invention is to provide an improved speech recognition system which facilitates vocabulary addition during the speech recognition process without appreciably slowing the speech recognition process or disallowing use of a language model.
Yet another object of the invention is to provide a speech recognition system which is particularly suited to active vocabularies on the order of thousands of words and greater and total vocabularies of millions of words and greater.
Still a further object of the invention is to provide a speech recognition system which can use constraints from large databases without appreciably slowing the speech recognition process.
In accordance with the present invention, the lexical network containing the vocabulary that the system is capable of recognizing includes a number of constructs (defined herein as "word class" nodes, "phonetic constraint" nodes, and "connection" nodes) in addition to the word begin and end nodes commonly found in speech recognition systems. (A node is a connection point within the lexical network. Nodes may be joined by arcs to form paths through the network. Some of the arcs between nodes specify speech segments, i.e., phones.) These constructs effectively precompile and organize both phonetic and syntactic/semantic information and store it in a readily accessible form in the recognition network. This enables the rapid and efficient addition of words to the vocabulary, even in a large vocabulary system (i.e., thousands of active words) and even on a real-time basis, i.e., during interaction with the user. The present invention preserves the ability to comply with phonetic constraints between words and use a language model in searching the network to thereby enhance recognition accuracy. Thus, a large vocabulary interactive system (i.e., one in which inputs by the speaker elicit responses from the system which in turn elicit further input from the speaker) such as a catalogue sales system can be constructed. The effective vocabulary can be very large (i.e., millions of words) without requiring a correspondingly large active (random access) memory because not all the words in it need be "active" (that is, connected into the lexical recognition network) at once.
In accordance with the present invention, the vocabulary is categorized into three classes. The most frequently used words are precompiled into the lexical network; typically, there will be several hundred of such words, connected into the lexical network with their phonetically permissible variations. Words of lesser frequency are stored as phonemic baseforms. A baseform represents an idealized pronunciation of a word, without the variations which in fact occur from one speaker to another and in varying context. The present invention may incorporate several hundred thousand of such baseforms, from which a word network may rapidly be constructed in accordance with the present invention. The least frequently used words are stored as spellings. New words are entered into the system as spellings (e.g., from an electronic database which is updated periodically). To make either one of the least frequently used words or a completely new word active, the system first creates a phonemic baseform from the spelling. It then generates a pronunciation network from the phonemic baseform in the manner taught by the present invention.
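By way of illustration only, the three-class lookup described above might be sketched in Python as follows. This is a minimal sketch and not the disclosed implementation: the names `TieredVocabulary`, `activate`, and `letter_to_sound` are hypothetical, and the spelling-to-baseform converter is supplied by the caller because the text does not specify how baseforms are derived from spellings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class TieredVocabulary:
    precompiled: Set[str]                  # most frequent words, already in the network
    baseforms: Dict[str, List[str]]        # less frequent words -> phonemic baseform
    spellings: Set[str]                    # least frequent or new words, spelling only
    letter_to_sound: Callable[[str], List[str]]  # assumed spelling-to-baseform converter
    active: Set[str] = field(default_factory=set)

    def __post_init__(self):
        # Precompiled words are active from the start.
        self.active |= self.precompiled

    def activate(self, word: str) -> str:
        """Make a word active and report which class of the vocabulary supplied it."""
        if word in self.active:
            return "already active"
        if word in self.baseforms:
            source = "baseform"
        elif word in self.spellings:
            # First create a phonemic baseform from the spelling.
            self.baseforms[word] = self.letter_to_sound(word)
            source = "spelling"
        else:
            raise KeyError(word)
        # A full system would now build the pronunciation network from
        # self.baseforms[word] and link it into the lexical network.
        self.active.add(word)
        return source
```

In this sketch only the precompiled words occupy the active set initially; baseforms and spellings are consulted on demand, which mirrors the memory argument made above.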
The phonetic constraint nodes (referred to hereinafter as PC nodes or PCNs) organize the inter-word phonetic information in the network. A PC node is a tuple, PC (x, y, z . . . ) where the x, y, z are constraints on words that are, or can be, connected to the particular node. For example, x may specify the end phone of a word; y the beginning phone of a word with which it may be connected in accordance with defined phonetic constraints; and z a level of stress required on the following syllable. While tuples of any desired order (the order being the number of constraints specified for the particular PCN) may be used, the invention is most simply described by tuples of order two, e.g., PCN (x, y). Thus, PCN (null, n) may specify a PCN to which a word with a "null" ending (e.g., the dropped "ne" in the word "phone") is connected and which in turn will ultimately connect to words beginning with the phoneme /n/.
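By way of illustration only, an order-two PCN may be represented as a small immutable record. The following Python sketch is not part of the disclosed system; the names `PCN` and `admits`, and the phone label "ae" used for the first sound of "and", are assumptions made for the example.

```python
from typing import NamedTuple


class PCN(NamedTuple):
    """Order-two phonetic constraint node, PCN(x, y)."""
    end_phone: str    # x: end phone of the preceding word
    begin_phone: str  # y: begin phone of the following word


def admits(pcn: PCN, end_phone: str, begin_phone: str) -> bool:
    """True if a word pair with these boundary phones satisfies the PCN."""
    return pcn == (end_phone, begin_phone)


NULL_N = PCN("null", "n")
# "pho(ne) number": null ending followed by /n/ satisfies PCN(null, n).
PHONE_NUMBER_OK = admits(NULL_N, "null", "n")
# "pho(ne) and": null ending followed by a vowel does not.
PHONE_AND_OK = admits(NULL_N, "null", "ae")
```

Because the record is hashable, it can serve directly as a dictionary key when words are indexed by the PCN to which they attach.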
Word Class Nodes (referred to hereinafter as WC nodes or WCNs) organize the syntactic/semantic information in the lexical network and further facilitate adding words to the system vocabulary. Examples of word class nodes are parts-of-speech (e.g., noun, pronoun, verb) or semantic classes (e.g., "last name", "street name", or "zip code"). Both the words that form the base vocabulary of the speech recognition system (and therefore are resident in the lexical network to define the vocabulary that the system can recognize), as well as those that are to be added to this vocabulary are associated with predefined word classes.
Words are incorporated into the lexical network by connecting their begin and end nodes to WC nodes. The WCNs divide the set of words satisfying a particular PCN constraint into word classes. There may be a general set of these word classes, e.g., nouns, pronouns, verbs, "last name", "street name", "zip code", etc. available for connection to the various PCNs. On connecting a specific instance of a set member (e.g., "noun") to a PCN, it is differentiated by associating a further, more specific characteristic to it, e.g., "noun ending in /n/", "noun ending in "null"", etc. Each specific instance of a WCN connects to only one particular PCN. So, for example, there may be a "noun" WCN connected to the (null, n) PCN which is separate from a "noun" WCN connected to the (vowel, n) PCN. To qualify for connection to a given WC node, a word must not only be of the same word class as the WC node to which it is to be connected, but must also satisfy the phonetic constraint of the PC node to which that WC node is connected, e.g., noun ending in "null".
The PC nodes are interconnected through word connection nodes (hereinafter referred to as CONN nodes) which define the allowable path between the end node of a word and the begin node of a following word. Effectively, CONN nodes serve as concentrators, that is, they link those PC nodes which terminate a word with those PC nodes which begin a succeeding word which may follow the preceding word under the phonetic constraints that are applicable. These constraints are effectively embedded in the WC nodes, the PC nodes, the CONN nodes, and their interconnections.
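By way of illustration only, the arrangement of WC nodes, PC nodes, and CONN nodes can be approximated with hash-indexed tables, so that adding a word touches only the nodes named by that word's own PCNs. The Python sketch below is an assumption-laden simplification, not the disclosed network: the class `LexicalNetwork`, its methods, the right-side PCN labels ("begin-n", "begin-vowel", "begin-f"), and the wildcard "*" are all hypothetical, and each CONN node is flattened into a direct mapping from left-side PCNs to right-side PCNs.

```python
from collections import defaultdict


class LexicalNetwork:
    def __init__(self):
        # right-side PCN id -> word class -> words whose begin node attaches there
        self.begin_index = defaultdict(lambda: defaultdict(set))
        # left-side PCN id -> word class -> words whose end node attaches there
        self.end_index = defaultdict(lambda: defaultdict(set))
        # CONN nodes, flattened: left-side PCN id -> reachable right-side PCN ids
        self.conn = defaultdict(set)

    def connect(self, left_pcn, right_pcn):
        self.conn[left_pcn].add(right_pcn)

    def add_word(self, word, word_class, end_pcns, begin_pcns):
        # The cost depends on the word's own PCNs, not on the vocabulary size.
        for pcn in end_pcns:
            self.end_index[pcn][word_class].add(word)
        for pcn in begin_pcns:
            self.begin_index[pcn][word_class].add(word)

    def followers(self, left_pcn):
        """Words that may follow a word end attached to left_pcn."""
        found = set()
        for right_pcn in self.conn.get(left_pcn, ()):
            for words in self.begin_index.get(right_pcn, {}).values():
                found |= words
        return found


net = LexicalNetwork()
net.connect(("null", "n"), "begin-n")      # dropped /n/ may precede only /n/
net.connect(("n", "*"), "begin-n")         # pronounced /n/ may precede /n/ ...
net.connect(("n", "*"), "begin-vowel")     # ... or a vowel
net.add_word("phone", "noun", [("null", "n"), ("n", "*")], ["begin-f"])
net.add_word("number", "noun", [("r", "*")], ["begin-n"])
net.add_word("and", "conjunction", [("d", "*")], ["begin-vowel"])
```

With these toy entries, `net.followers(("null", "n"))` yields only "number", while `net.followers(("n", "*"))` yields both "number" and "and"; no existing word is inspected when a new word is added.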
In order to add a word to the lexical network, it is necessary first to create a pronunciation network for that word. A given word will typically be subject to a number of different pronunciations, due in part to the phonetic context in which it appears (e.g., the end phoneme /n/ in the word "phone" may be dropped when the following word begins with an /n/, as in "number"), and in part to other factors such as speaker dialect, etc. Variations in pronunciation which are due to differing phonetic context are commonly modeled by standard rules which define, for each phoneme, the ways in which it may be pronounced, depending on the surrounding context. In the present invention, network fragments corresponding to the operation of these rules on each phoneme are precompiled into binary form and stored in the system, indexed by phoneme. The precompiled network fragments include labels specifying allowed connections to other fragments and associations with PCNs. These labels are of two types: the first refers to the phoneme indexes of other pronunciation networks; the second refers to specific branches within the pronunciation networks which are allowed to connect to the first pronunciation network. Pronunciation networks for phonemes precompiled according to this method allow the rapid generation of pronunciations for new words to thereby facilitate word addition dynamically, i.e., during the speech recognition process itself.
In adding a word to the lexical network, the word is associated with a phonemic baseform and a word class. Its pronunciation network is generated by choosing the network fragments associated with the phonemes in the phonemic baseform of the word and then interconnecting the fragments according to the constraints at their end nodes. The ensuing structure is a pronunciation network typically having a multiplicity of word begin and word end nodes to allow for variation in the words which precede and follow. The resultant pronunciation network is linked to the word class nodes in the manner described above.
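By way of illustration only, the generation of a pronunciation network from fragments indexed by phoneme can be sketched in Python as follows. The fragment table is a toy assumption covering only the word "phone", the names `FRAGMENTS`, `pronunciations`, and `end_phones` are hypothetical, and the sketch enumerates whole pronunciations rather than building a shared network, ignoring the connection labels that the precompiled fragments carry.

```python
from itertools import product

# Toy fragment table, indexed by phoneme. Each entry lists the phones
# by which the phoneme may be realized (assumed values for illustration).
FRAGMENTS = {
    "f": ["f"],
    "ow": ["ow"],
    "n": ["n", "null"],  # final /n/ may be dropped before a following /n/
}


def pronunciations(baseform, fragments=FRAGMENTS):
    """Look up one fragment per phoneme and string the alternatives together."""
    return list(product(*(fragments[phoneme] for phoneme in baseform)))


def end_phones(baseform, fragments=FRAGMENTS):
    """Distinct final phones, i.e., the word end nodes the network would need."""
    return sorted({p[-1] for p in pronunciations(baseform, fragments)})


PHONE_PRONUNCIATIONS = pronunciations(["f", "ow", "n"])
PHONE_END_PHONES = end_phones(["f", "ow", "n"])
```

For the baseform /f ow n/ the sketch produces two pronunciations and two distinct end phones, corresponding to the multiple word end nodes mentioned above.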
In the present invention, the words are organized by word class, and each added word is required to be associated with a predefined word class in order to allow use of a language model based on word classes during the search of the lexical network, even with added words. Predefined words are not required to belong to a word class; they may be treated individually. The language model comprises functions which define the increment to the score of a path on leaving a particular word class node or word end node or on arriving at a particular word class node or word begin node. A function may depend on both the source node and the destination node.
In accordance with the present invention, constraints from electronic databases are used to make the large vocabulary task tractable. The discourse history of speech frequently can also provide useful information as to the likely identification of words yet to be uttered. In the present invention, the discourse history is used in conjunction with a database to invoke different language models and different vocabularies for different portions of the discourse. In many applications, the system will first pose a question to the user with words drawn from a small-vocabulary domain. The user's response to the question is then used to narrow the vocabulary that needs to be searched for a subsequent discourse involving a large domain vocabulary. As an example, in a catalogue sales system, the system may need to determine the user's address. The system will first ask: "What is your zip code?", and then use the response to fill in the "street name" word class from street names found in the database that have the same zip code as that of the stated address. The street names so determined are quickly added to the system vocabulary, and street names previously in the vocabulary are removed to provide the requisite room in active memory for the street names to be added and to reduce the size of the network to be searched. The language model, i.e., the probabilities assigned to the various street names so selected, may be established a priori or may be based on other data within the system database, e.g., the number of households on each street, or a combination of these and other information items. Similarly, the system may ask for the caller's phone number first, then use the response and an electronic phonebook database to add to the vocabulary the names and addresses corresponding to the hypothesized phone numbers.
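By way of illustration only, the zip code example above can be sketched in Python as follows. The database contents are made up, the names `ActiveVocabulary`, `refill`, `HOUSEHOLDS_BY_ZIP`, and `on_zip_code` are hypothetical, and weighting each street by its share of households (as a base-10 log probability) is only one of the options the text mentions.

```python
import math


class ActiveVocabulary:
    def __init__(self):
        # word class -> {word: log10 probability within the class}
        self.words_by_class = {}

    def refill(self, word_class, weights):
        """Replace the words of one class; return the words that were removed."""
        removed = set(self.words_by_class.get(word_class, {}))
        total = sum(weights.values())
        self.words_by_class[word_class] = {
            word: math.log10(count / total) for word, count in weights.items()
        }
        return removed


# Made-up database extract: zip code -> {street name: number of households}
HOUSEHOLDS_BY_ZIP = {
    "00001": {"Elm Street": 90, "Oak Street": 10},
    "00002": {"Pine Street": 50},
}


def on_zip_code(vocab, zip_code, database=HOUSEHOLDS_BY_ZIP):
    """Fill the "street name" class from the recognized zip code."""
    return vocab.refill("street name", database.get(zip_code, {}))
```

Calling `on_zip_code` a second time with a different zip code removes the earlier street names before adding the new ones, so the active network stays small.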
The extensive use of large electronic databases while interacting with the user necessitates an efficient database search strategy, so that the recognition process is not slowed appreciably. In accordance with the present invention, hash tables are employed to index the records in the database, and only that information which is needed for the task at hand is stored with the hash tables.
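By way of illustration only, a hash-table index that stores just the fields needed for the task at hand might look like the following Python sketch. The records are made up, and the names `build_index`, `RECORDS`, and `BY_PHONE` are hypothetical; a Python dictionary is used here as the hash table.

```python
def build_index(records, key_field, keep_fields):
    """Index records by key_field, keeping only the fields in keep_fields."""
    index = {}
    for record in records:
        trimmed = {name: record[name] for name in keep_fields}
        index.setdefault(record[key_field], []).append(trimmed)
    return index


# Made-up phonebook records; "notes" stands for data the task does not need.
RECORDS = [
    {"phone": "555-0100", "name": "A. Smith", "street": "Elm Street", "notes": "x"},
    {"phone": "555-0100", "name": "B. Smith", "street": "Elm Street", "notes": "y"},
    {"phone": "555-0101", "name": "C. Jones", "street": "Oak Street", "notes": "z"},
]

BY_PHONE = build_index(RECORDS, "phone", ["name", "street"])
```

A lookup such as `BY_PHONE.get("555-0100", [])` then returns the candidate names and streets for a hypothesized phone number in expected constant time, without scanning the database.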
For a fuller understanding of the nature and objects of the invention, reference should be had to the following detailed description of the invention, taken in connection with the accompanying drawings, in which:
In
The output of the segmenter is applied to a phonetic classifier 18 which generates a phonetic lattice representation of the acoustic input. This lattice describes the various phones corresponding to the hypothesized segments, and the probabilities associated with each. Paths through this lattice, each of which represents a hypothesis of a sequence of phones corresponding to the acoustic input, are compared with possible paths in a corresponding recognition network 20 in a search stage 22. Language models 24 guide the search for the best match between the phonetic paths generated by classifier 18 and the paths traced through recognition network 20. The resultant is an output 26 representing a recognized communication, e.g., a sentence.
The permitted connections between words are shown by links connecting the end nodes of one word with the begin nodes of other words. For example, link 60 connects the end node 54 of the word "phone" to the begin node 62 of the word "number", while link 64 connects the end node 54 of the word "phone" to the begin node 66 of the word "and".
It will be noticed that the null end node 56 of the word "phone" has a connection to the begin node 62 of the word "number" via link 68, but has no connection to the begin node 66 of the word "and". This indicates that a pronunciation which drops the final phoneme (/n/) in "phone" when pronouncing the successive words "pho(ne)" and "and" is not permitted, but such a pronunciation would be permitted in pronouncing the string "pho(ne) number".
It should be understood that
In order to add words to a vocabulary structured in the manner shown in
Word Class nodes 92 and 96, in turn, are connected to a phonetic constraint node 100 via arcs 102, 104, respectively. Phonetic constraint nodes embody phonetic constraints. For example, phonetic constraint node 100 may comprise the constraint pair (null, n), indicating that nodes connected into it from the left are characterized by a "null" final phoneme, while those to which it connects to the right are characterized by an initial /n/. In most instances word end nodes are connected to phonetic constraint nodes through word class nodes. However, some words in the lexical network may in effect form their own word class. Such words are directly connected to a phonetic constraint node. Such is the case with the word "can", whose end node 86 is directly connected to a phonetic constraint node 106 via a link 108.
Phonetic constraint nodes 100 and 110 feed into a word connection node 120 via links 122, 124, respectively. Similarly, node 106 feeds into connection node 126 via an arc 128. The pattern outward from the connection nodes is the reverse of the pattern inward to the nodes, that is, connection node 120 expands to phonetic constraint nodes 130, 132, 134 via links 136, 138 and 140, respectively. Similarly, connection node 126 expands to phonetic constraint node 142 via a link 144, as well as to nodes 143, 145 via links 147, 149, respectively. Phonetic constraint node 130 is connected to a Word Class node 150 via a link 152 and then is connected to the begin node 154 of a qualified word pronunciation (e.g., a typical pronunciation of the word "number") via a link 156. The word "qualified" indicates that the begin node to which a connection node leads satisfies the phonetic constraints associated with the end nodes feeding into that connection node. In
The structure of the lexical networks shown in
Words which are not precompiled in the baseform lexical network are added to the network dynamically by first forming the pronunciation network of the word (described in more detail in connection with
Each word end and word begin node contains the index of the PCN it connects to (through a WCN). The present invention performs a simple lookup (thereby eliminating the need for run-time computations) to establish phonetic compatibility between the end nodes and begin nodes of the words being added to the system vocabulary and thus enables the addition of the words in "real time", that is, concurrent with interaction with the user of the system. This in turn can dramatically reduce the size of the vocabulary that must be precompiled and stored in active (random access) memory, and thus enables expansion of the effective range of the vocabulary for a given system, since the needed vocabulary can be formed "on the fly" from word spellings or from phonemic baseforms requiring rather limited storage capacity in comparison to that required for a vocabulary which is completely precompiled and integrated into the lexical network.
In accordance with the present invention, the phonetic fragments are precompiled (that is, in advance of speech recognition) and stored in binary form. They are indexed by phoneme for rapid retrieval. Given a word's phonemic baseform, the appropriate phonetic fragments are strung together to form words. It will be noted from
The phonetic constraints according to which the fragments are linked to each other and to PCNs (through WCNs) are expressed in terms of the phoneme indexes. In some cases, further restrictions may need to be imposed on the connections of the word into the lexical network. For example, the phoneme /ax/ ("schwa" in ARPABET notation) may at some times be pronounced as the phone [ax] and at other times as the phone [null]. Similarly, the phoneme /l/ may at times be pronounced as the phone [l] and at other times as the phone [el] (syllabic /l/). (An example is the differing pronunciations of the word "vowel" as [v aw ax l] or alternatively as [v aw el].) The conditions under which the respective pronunciations are permissible are shown in FIG. 4B. For the standard English pronunciation of /ax l/, the combination [ax][l] is permissible, as is the combination [null][el], but the combinations [ax][el] and [null][l] are not. In accordance with the present invention, the permissible combinations may be formed, and the impermissible combinations prevented, by "tagging" the "ends" of a given fragment and restricting the connection of a fragment end to those nodes having not only the permitted phonetic constraint but also the same tag as the fragment end to which they are to be connected.
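By way of illustration only, the tagging of fragment ends for the /ax l/ example can be sketched in Python as follows. The tag names "plain" and "syllabic" and the names `AX_FRAGMENT`, `L_FRAGMENT`, and `link` are hypothetical; only the permitted and forbidden phone combinations are taken from the text.

```python
# Each alternative is (phone, tag). For /ax/ the tag labels the right-hand
# end of the branch; for /l/ it labels the left-hand end. Tag names are assumed.
AX_FRAGMENT = [("ax", "plain"), ("null", "syllabic")]
L_FRAGMENT = [("l", "plain"), ("el", "syllabic")]


def link(left_alternatives, right_alternatives):
    """Join only those branch ends that carry the same tag."""
    return [
        (left_phone, right_phone)
        for left_phone, left_tag in left_alternatives
        for right_phone, right_tag in right_alternatives
        if left_tag == right_tag
    ]


AX_L_COMBINATIONS = link(AX_FRAGMENT, L_FRAGMENT)
```

The result contains [ax][l] and [null][el] and excludes [ax][el] and [null][l], since the mismatched branch ends carry different tags.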
This is illustrated more clearly in
During speech recognition, the system searches through the lexical network to find matches between the sounds emitted by the speaker and words stored in the system. As illustrated in
Turning now to
As an example, refer again to FIG. 3. The end nodes 80, 82, 84, and 88 have "scores" associated with them of -75, -50, -100, and -80, respectively. These numbers are proportional to the log of the probabilities at the respective nodes of the most probable paths into those nodes. The task is to propagate those probabilities through the lexical network to the "begin" nodes 154, 166, etc. This is accomplished as follows:
The first step (
Next (
The maximum phonetic constraint node scores into the connection nodes are then determined (
The next step (
The scores into the right side word class nodes are then maximized (
The score which is being maximized can be expressed as sr+fr(u) where sr is the score of the right side phonetic constraint node which is connected to either the word class node or the begin node, and fr(u) is a function that specifies a score increment (typically a log probability) associated with the node u (either word class node or begin node) to which the phonetic constraint node in question is connected. For example, the function fr(u) may be defined as log (C(u)/N), where C(u) is the word count in a representative text of the word or word class associated with the particular node u and N is the total number of words in the representative text.
For purposes of illustration, assume that word class node 150 comprises the class of singular nouns beginning with the letter "n" and that in a sample text of one million words, ten thousand such nouns were found. The function fr(u)=fr(150) would then be evaluated as fr(150)=log (10,000/1,000,000)=-2. Accordingly, the score at this node is given as sr+fr(u)=-50+(-2)=-52. This score is shown at node 150 in
Phonetic constraint node 142 is directly linked to begin node 176, and does not pass through any intermediate word class node. Accordingly, the score of begin node 176 is evaluated via the function sr+fr(u). As shown in
Once the maximum scores from all PCNs into the begin nodes or WCNs are evaluated, the system then determines whether any bigram statistics for the word pairs in question exist in the language model. In accordance with the present invention, the maximum score for each bigram pair (f, t) is determined as sf+g(f,t), where sf is the score of a "from" node (i.e., a word class node in cases where an end node is connected through a word class node, and an end node otherwise) on the left side of a connection node (see
As a specific example, consider the relationship between "from" node 96 (e.g., the word class of nouns ending in "n" but pronounced with the "n" omitted) and "to" node 150 (the word class of singular nouns starting with "n"). The function g(f,t) may be defined as log [C(f,t)/C(f)], where C(f,t) is the count of the word class pair f,t and C(f) is the count of word class f in some representative text. Assume that it has been determined that g(96,150)=log[C(96,150)/C(96)]=-1. The value of sf at node 96 is -50, as shown. Thus, sf+g(f,t)=-50+(-1)=-51. Since this is greater than the prior value, -52, associated with node 150, node 150 takes on the new value -51. Assuming that all other paths to node 150 through CONN node 120 lead to lower scores, -51 is the final value for node 150.
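The bigram update in this example can be sketched as follows. The function names and the dictionary layout are assumptions made for illustration, as are the raw counts passed to `bigram_increment`; only the scores -52 and -50 and the increment -1 come from the example above. A base-10 logarithm is assumed, consistent with the unigram example.

```python
import math
from typing import Dict


def bigram_increment(pair_count: int, from_count: int) -> float:
    """g(f, t) = log[C(f,t) / C(f)], with base 10 assumed."""
    return math.log10(pair_count / from_count)


def update_with_bigrams(current: float,
                        from_scores: Dict[int, float],
                        g_into_node: Dict[int, float]) -> float:
    """Return the maximum of the current score and every sf + g(f, t).

    from_scores maps each "from" node id to its score sf; g_into_node maps
    a "from" node id to g(f, t) for the "to" node being updated. "From"
    nodes with no bigram statistic are skipped.
    """
    best = current
    for f, s_f in from_scores.items():
        if f in g_into_node:
            best = max(best, s_f + g_into_node[f])
    return best


# Node 150 holds -52 from the unigram step; "from" node 96 has sf = -50
# and g(96, 150) = -1, as stated in the text.
final_150 = update_with_bigrams(-52, {96: -50}, {96: -1})
print(final_150)

# Hypothetical counts that would yield g = -1 under a base-10 log.
g_example = bigram_increment(100, 1000)
```

Taking the maximum means a bigram statistic can only raise the score obtained from the unigram step, which matches the replacement of -52 by -51 at node 150.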
In cases where the begin nodes are reached through right-side word class nodes, as opposed to directly from right-side phonetic constraint nodes, a final adjustment to the score is performed (
Having determined the maximum scores at the begin nodes, the paths which defined those scores are selected as the most likely connection between the words terminating at the end nodes and the following words beginning at the begin nodes. The search process is then continued in similar manner until the complete path through the lexical network is determined.
The ability to add words dynamically (i.e., concurrent with speech recognition) to the system vocabulary enables the system to use to advantage various constraints which are inherent in a domain-specific database, and thereby greatly facilitates recognition of speech input. For example, consider the case of an interactive sales-order system to which the present system may be applied. In such a database, the vocabulary includes, at least in part, records of associated fields containing elements of information pertinent to the user. During discourse with the user, the system prompts the user for various spoken inputs and, on recognizing the inputs, provides information or further prompts to the user in order to process an order. Because of the large number of elements (e.g., "customer names") that may need to be included in the vocabulary to be recognized, it may be impractical to have all potential elements of a field "active" (i.e., available for recognition) at one time. This problem is addressed in accordance with the present invention by using a related field in the record to limit the words which must be added to the active lexical network in order to perform the requisite spoken word recognition.
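The idea of using a related field to limit the words that must be made active can be sketched as below. Everything specific here is hypothetical: the record layout, the field names `zip_code` and `customer_name`, the sample data, and the `ActiveVocabulary` class stand in for whatever the database and lexical network actually provide.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical sales-order records; field names and values are illustrative.
RECORDS: List[Dict[str, str]] = [
    {"zip_code": "02139", "customer_name": "Acme Tools"},
    {"zip_code": "02139", "customer_name": "Baker Supply"},
    {"zip_code": "94301", "customer_name": "Coast Hardware"},
]


class ActiveVocabulary:
    """Stand-in for the set of words currently available for recognition."""

    def __init__(self) -> None:
        self.words: Set[str] = set()

    def add(self, new_words: Iterable[str]) -> None:
        # In the described system this step would link precompiled
        # fragments into the lexical network; here it is a set update.
        self.words.update(new_words)


def words_for_related_field(records: List[Dict[str, str]],
                            related_field: str,
                            related_value: str,
                            target_field: str) -> Set[str]:
    """Collect target-field values from records matching the related field."""
    return {
        r[target_field] for r in records if r[related_field] == related_value
    }


vocab = ActiveVocabulary()
# Once the related field has been recognized, only matching names are added.
vocab.add(words_for_related_field(RECORDS, "zip_code", "02139", "customer_name"))
print(sorted(vocab.words))
```

Only two of the three customer names become active in this sketch, which illustrates how a related field can keep the active network small.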
For example, as shown in
Nguyen, John N., Phillips, Michael S.