The present invention discloses converting a text form into a speech. In the present invention, partial word lists of a data source are obtained by parsing the data source in parallel or in series. The partial word lists are then compiled to obtain phoneme graphs corresponding, respectively, to the partial word lists, and then the obtained phoneme graphs are combined. Speech recognition is then conducted according to the combination results. According to the present invention, computational complexity may be reduced and recognition efficiency may be improved during speech recognition.

Patent: 8650032
Priority: Nov 30 2010
Filed: Nov 02 2011
Issued: Feb 11 2014
Expiry: Nov 02 2031
Terminal disclaimer: TERM.DISCL.
Original assignee entity: Large
Status: currently ok
1. A system for converting a text form into a phoneme tree, comprising:
a data parser, configured to parse a data source in the text form to obtain a plurality of partial word lists of the data source, the data source comprising content of a web page;
one or more compilers, configured to compile the plurality of partial word lists to create a plurality of phoneme graphs, each phoneme graph corresponding to a respective partial word list, each phoneme graph including a root node, a plurality of phonemes, and an end node;
a combiner, configured to combine the created phoneme graphs to form the phoneme tree, wherein the phoneme tree includes at least a first phoneme graph and a second phoneme graph sharing a common root node and a common end node.
2. The system according to claim 1, further comprising:
a judger, configured to judge whether the phoneme tree has sufficient information to conduct speech recognition, and wherein responsive to judging that there is sufficient information, the speech recognizer conducts the speech recognition using the phoneme tree.
3. The system according to claim 1, wherein under a circumstance that the data source changes responsive to a user clicking on a link of a rendering of the web page, the combiner continues combination of the created phoneme graphs into the phoneme tree and caches the phoneme tree for use by the speech recognizer while the data parser parses the changed data source to obtain updated partial word lists, the one or more compilers compile the updated partial word lists to create updated phoneme graphs corresponding, respectively, to the updated partial word lists, and the combiner combines the created updated phoneme graphs into an updated phoneme tree that is then used by the speech recognizer to conduct the speech recognition of the text form in the changed data source.
4. The system according to claim 1, further comprising:
a segmenter, configured to segment the data source to obtain segments of the data source, wherein a plurality of data parsers parses the data source segments, in parallel, to obtain the plurality of partial word lists of the data source segments for use by the one or more compilers.
5. The system according to claim 1, wherein the data parser parses the data source, in series, to obtain the plurality of partial word lists of the data source for use by the one or more compilers.
6. The system according to claim 1, further comprising:
an optimizer, configured to optimize the phoneme tree responsive to determining that at least two branches of the phoneme tree contain identical nodes.
7. The system according to claim 1, wherein at least one of the one or more compilers comprises:
a grammar obtainer, configured to apply a grammar template with respect to a selected one of the partial word lists to create a grammar corresponding to the selected partial word list, the created grammar specifying a sequence of words determined to be present in the selected partial word list;
a determiner, configured to determine a phoneme list of the grammar, the phoneme list specifying, for each of at least one pronunciation of each word in the specified sequence of words, a list of phonemes corresponding to the each pronunciation;
a creator, configured to create the phoneme graph corresponding to the determined phoneme list; and
an optimizer, configured to optimize the phoneme graph.
8. The system according to claim 1, wherein the combiner, when combining the plurality of phoneme graphs to form the phoneme tree, is further configured to optimize the phoneme tree by eliminating from the second phoneme graph a list of phonemes appearing in the first phoneme graph, the combiner further configured to create, at either the first or last phoneme in the list, a common intermediate node between the first phoneme graph and the second phoneme graph in the phoneme tree.

The present invention relates to information technology, and more particularly to converting text into speech for speech recognition.

Up to now, LVCSR (Large Vocabulary Continuous Speech Recognition) and NLU (Natural Language Understanding) still cannot meet the accuracy and performance requirements of human-machine speech communication in real life.

When the data source content changes in a speech-enabled application, for example, in a web-page-based speech control application, the grammar must be generated dynamically according to the data source content.

During speech recognition, how to reduce computational complexity and improve the recognition efficiency is a problem to be confronted.

According to a first aspect of the present invention, the present invention provides a method of converting a text into a speech, comprising: parsing a data source in a text form to obtain partial word lists of the data source; compiling the partial word lists to obtain phoneme graphs corresponding, respectively, to the partial word lists; combining the obtained phoneme graphs; and conducting speech recognition according to the combination results.

According to a second aspect of the present invention, the present invention provides a system for converting a text form into a speech, comprising: a data parser, configured to parse a data source in the text form to obtain partial word lists of the data source; one or more compilers, configured to compile the partial word lists to obtain phoneme graphs corresponding, respectively, to the partial word lists; a combiner, configured to combine the obtained phoneme graphs; and a speech recognizer, configured to conduct speech recognition according to the combination results.

According to the present invention, computational complexity may be reduced and the recognition efficiency may be improved during speech recognition.

Other objectives and effects of the present invention will become clearer and easier to understand with more comprehensive understanding of the present invention in conjunction with the explanations of the following accompanying drawings, wherein:

FIG. 1 illustrates a system of converting a text into a speech according to a first embodiment of the present invention;

FIG. 2 illustrates a system of converting a text into a speech according to a second embodiment of the present invention;

FIG. 3 illustrates a flowchart of a method of converting a text into a speech according to a third embodiment of the present invention;

FIG. 4 illustrates a flowchart of a method of converting a text into a speech according to a fourth embodiment of the present invention; and

FIG. 5 illustrates specific examples of converting a text into a speech according to an embodiment of the present invention.

In all of the above figures, like reference numbers denote identical, similar, or corresponding features or functions.

Specific embodiments of the present invention are described herein with reference to the drawings.

The basic idea of the present invention is obtaining partial word lists of a data source in a text form by parsing the data source in parallel or in series; then compiling the partial word lists to obtain phoneme graphs corresponding, respectively, to the partial word lists; then combining the obtained phoneme graphs; and then conducting speech recognition according to the combination results.
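By way of illustration only, the parse, compile, and combine steps of the basic idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names `LEXICON`, `parse`, `compile_graph`, `combine`, and `build` are hypothetical, one partial word list is assumed per line of text, a phoneme graph is assumed to be a set of edges, and the phoneme symbols are illustrative.

```python
from typing import Dict, Iterator, List, Set, Tuple

# Hypothetical lexicon: the phoneme symbols are illustrative assumptions.
LEXICON: Dict[str, List[str]] = {
    "red": ["R", "EH", "DD"],
    "flag": ["F", "L", "AE", "GD"],
    "white": ["W", "AY", "TD"],
    "house": ["HH", "AW", "S"],
}

Edge = Tuple[str, str]


def parse(source: str) -> Iterator[List[str]]:
    """Yield one partial word list per non-empty line of the text source."""
    for line in source.splitlines():
        words = line.lower().split()
        if words:
            yield words


def compile_graph(words: List[str], tag: str) -> Set[Edge]:
    """Compile a partial word list into a linear phoneme graph (edge set)."""
    phonemes = [p for w in words for p in LEXICON[w]]
    # Node names carry the graph tag and position so that graphs stay distinct;
    # only ROOT and END use the same name in every graph.
    nodes = ["ROOT"] + [f"{tag}/{i}:{p}" for i, p in enumerate(phonemes)] + ["END"]
    return set(zip(nodes, nodes[1:]))


def combine(combined: Set[Edge], graph: Set[Edge]) -> Set[Edge]:
    """Combine graphs by edge union; the shared ROOT and END names merge."""
    return combined | graph


def build(source: str) -> Set[Edge]:
    """Parse in series, compiling and combining each partial word list in turn."""
    combined: Set[Edge] = set()
    for n, words in enumerate(parse(source)):
        combined = combine(combined, compile_graph(words, tag=f"g{n}"))
    return combined


graph = build("Red flag\nWhite house\n")
# The two branches leave the single shared root node.
print(sorted(dst for src, dst in graph if src == "ROOT"))
```

In this sketch each partial word list is compiled independently, so only the newest graph has to be computed before it is merged into the running result.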

FIG. 1 illustrates a system of converting a text into a speech according to a first embodiment of the present invention.

As shown in FIG. 1, the system 100 comprises a data parser 110 for parsing a data source in the text form to obtain partial word lists of the data source; a plurality of compilers 120-1, 120-2, 120-3, . . . 120-N−1, 120-N for compiling the partial word lists to obtain phoneme graphs corresponding, respectively, to the partial word lists; a combiner 130 for combining the obtained phoneme graphs; and a speech recognizer 140 for conducting speech recognition according to the combination results.

It should be understood by those skilled in the art that the term “word”, as used herein, represents a common grammatical element such as a character, a word, and/or a phrase. A partial word list refers to part of the word list obtained by parsing the whole data source.

In the first embodiment of the present invention, the data parser 110 parses the data source in series, after obtaining a partial word list of the data source, the data parser 110 invokes a compiler (e.g., the compiler 120-1) to compile the partial word list to obtain a phoneme graph, and then the data parser 110 continues to parse the remaining data source to obtain a next partial word list.

In the first embodiment, the system 100 further comprises a judger 160 for judging whether combination results have sufficient information to conduct speech recognition. If there is sufficient information—for example, if there is a complete sentence—then the speech recognizer 140 starts the speech recognition.
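The role of the judger can be sketched, purely as an illustration, as a check that runs after each combination step. The sufficiency criterion used here (a minimum number of complete phrases) and the names `has_enough` and `combine_incrementally` are assumptions made for the sketch; the description leaves the actual criterion open.

```python
from typing import Callable, Iterable, List, Optional

Phrase = List[str]


def has_enough(phrases: List[Phrase], minimum: int = 2) -> bool:
    """Hypothetical sufficiency test: at least `minimum` complete phrases."""
    return len(phrases) >= minimum


def combine_incrementally(
    partial_lists: Iterable[Phrase],
    judge: Callable[[List[Phrase]], bool] = has_enough,
) -> Optional[int]:
    """Return how many partial lists were combined when recognition could start.

    Returns None if the judge never reports sufficient information.
    """
    combined: List[Phrase] = []
    started_at: Optional[int] = None
    for phrase in partial_lists:
        combined.append(phrase)  # stand-in for combining a phoneme graph
        if started_at is None and judge(combined):
            started_at = len(combined)  # recognition may begin here
    return started_at


lists = [["red", "flag"], ["white", "house"], ["yellow", "flag"]]
print(combine_incrementally(lists))  # prints 2
```

Combination continues after recognition has started, so later partial word lists still extend the result.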

In the first embodiment, before the combiner 130 finishes combining the obtained phoneme graphs, if the current data source changes, then the combiner 130 continues combining the obtained phoneme graphs and caches the combination results.

In the first embodiment, the system can further comprise an optimizer 150 for optimizing the combination results. For example, the optimizer 150 may combine identical nodes in the combination results, as described below.

In the first embodiment, each compiler 120 in the compilers 120-1, 120-2, 120-3, . . . 120-N−1, 120-N can comprise a grammar obtainer 1201 for applying a grammar template with respect to a partial word list to obtain a grammar corresponding to the partial word list; a determiner 1202 for determining a phoneme list of the grammar; a creator 1203 for creating a corresponding phoneme tree according to the phoneme list; and an optimizer 1204 for optimizing the phoneme tree to obtain a corresponding phoneme graph.

The grammar template generally comprises contents describing what the grammar should be like.

The following is an example of a grammar template:

The grammar specifically describes the contents of the partial word list, for example, whether the contents of the partial word list are English or Chinese, and whether the English (if any) is American English or British English.

An example of the grammar of a partial word list is presented as below:

The phoneme list describes how the word is pronounced. The phoneme list can be determined from a phoneme pool according to the grammar.

The following is an example of a phoneme list:

In short, those skilled in the art will fully understand the meaning of terms such as the grammar template, the grammar, the phoneme list, and the phoneme tree. For the sake of conciseness, they are not described in more detail herein.

During optimization of the phoneme tree, identical nodes in the phoneme tree are combined so as to obtain the phoneme graph.
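The compiler stages described above (grammar obtainer, determiner, and creator) can be sketched as three small functions. This is an illustrative sketch only: the template format `GRAMMAR_TEMPLATE`, the pronunciation pool `PHONEME_POOL`, the phoneme symbols, and all function names are assumptions and are not taken from the present description. The final optimization stage is omitted here.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical template and phoneme pool; neither comes from the description.
GRAMMAR_TEMPLATE = "language={lang}; phrase={phrase}"
PHONEME_POOL: Dict[str, List[List[str]]] = {
    "red": [["R", "EH", "DD"]],
    "flag": [["F", "L", "AE", "GD"]],
    "the": [["DH", "AH"], ["DH", "IY"]],  # a word with two pronunciations
}


def obtain_grammar(words: List[str], lang: str = "en-US") -> str:
    """Stage 1: apply the grammar template to a partial word list."""
    return GRAMMAR_TEMPLATE.format(lang=lang, phrase=" ".join(words))


def determine_phoneme_list(grammar: str) -> List[Tuple[str, ...]]:
    """Stage 2: one phoneme sequence per combination of word pronunciations."""
    words = grammar.split("phrase=", 1)[1].split()
    options = [PHONEME_POOL[w] for w in words]
    return [
        tuple(p for word in choice for p in word) for choice in product(*options)
    ]


def create_tree(phoneme_list: List[Tuple[str, ...]]) -> dict:
    """Stage 3: build a prefix tree of nested dicts; "END" closes a path."""
    root: dict = {}
    for sequence in phoneme_list:
        node = root
        for phoneme in sequence:
            node = node.setdefault(phoneme, {})
        node["END"] = {}
    return root


grammar = obtain_grammar(["the", "flag"])
tree = create_tree(determine_phoneme_list(grammar))
print(grammar)
print(sorted(tree["DH"]))  # the two pronunciations branch after DH
```

Representing the tree as nested dictionaries keeps shared prefixes in one branch, which is the property the later optimization builds on.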

Certainly, those skilled in the art can understand that in the first embodiment, if the processing speed of the compiler is fast enough, i.e., if compilation of the preceding partial word list has already finished by the time the data parser 110 sends the next partial word list to the compiler, then only one compiler is needed.

In addition, those skilled in the art can understand that the optimizer is not necessary in the compiler 120 in some cases.

FIG. 2 illustrates a system of converting a text into a speech according to a second embodiment of the present invention.

The system 200 differs from the system 100 shown in FIG. 1 in that the system 200 comprises a segmenter 210 for segmenting a data source to obtain segments of the data source, and a plurality of data parsers 110-1, 110-2, 110-3, . . . 110-N−1, 110-N which parse the data source segments in parallel to obtain partial word lists of the data source segments.
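The segmenter and the parallel data parsers of the second embodiment might be sketched as follows. The sketch assumes that the data source is line-oriented text and that a segment is a fixed number of lines; the names `segment`, `parse_segment`, and `parse_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def segment(source: str, size: int) -> List[List[str]]:
    """Split the source lines into segments of at most `size` lines each."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parse_segment(lines: List[str]) -> List[List[str]]:
    """Parse one segment into partial word lists (one per line)."""
    return [ln.lower().split() for ln in lines]


def parse_in_parallel(source: str, size: int = 2, workers: int = 4) -> List[List[str]]:
    """Parse all segments concurrently and flatten results in source order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of its inputs.
        per_segment = list(pool.map(parse_segment, segment(source, size)))
    return [words for seg in per_segment for words in seg]


page = "Red flag\nWhite house\nYellow flag\n"
print(parse_in_parallel(page))
```

A thread pool is used only to keep the sketch short; for CPU-bound parsing in CPython, a `ProcessPoolExecutor` would be the more natural choice.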

FIG. 3 illustrates a flowchart of a method of converting a text into a speech according to a third embodiment of the present invention.

As shown in FIG. 3, the method 300 comprises step S310 for parsing a data source in the text form to obtain partial word lists of the data source; step S320 for compiling the partial word lists to obtain phoneme graphs corresponding, respectively, to the partial word lists; step S330 for combining the obtained phoneme graphs; and S340 for conducting speech recognition according to the combination results.

In the third embodiment, the data source is parsed in series, and after a partial word list is obtained, the partial word list is compiled to obtain a phoneme graph. The remaining data source is then parsed continually to obtain a next partial word list.

In the third embodiment, the method further comprises step S360 for judging whether combination results have sufficient information to conduct speech recognition before combination of the phoneme graphs is finished. If there is sufficient information—for example, if there is a complete sentence—then speech recognition will be started.

In the third embodiment, before combination of the obtained phoneme graphs is finished, if the current data source changes, then combination of the obtained phoneme graphs is continued and the combination results are cached.

In the third embodiment, the method can further comprise step S350 for optimizing the combination results.

In the third embodiment, the step of compiling partial word lists to obtain phoneme graphs corresponding, respectively, to the partial word lists can comprise step S3201 of applying a grammar template with respect to a partial word list to obtain a grammar corresponding to the partial word list; step S3202 of determining a phoneme list of the grammar; step S3203 of creating a corresponding phoneme tree according to the phoneme list; and step S3204 of optimizing the phoneme tree to obtain a corresponding phoneme graph.

Certainly, those skilled in the art can understand that the step of optimizing the phoneme tree to obtain a corresponding phoneme graph is not necessary in some cases.

FIG. 4 illustrates a flowchart of a method of converting a text into a speech according to a fourth embodiment of the present invention.

The method 400 differs from the method 300 shown in FIG. 3 in that the method 400 comprises a step 405 for segmenting a data source to obtain segments of the data source, and a step 410 in which the data source segments are parsed in parallel to obtain partial word lists of the data source segments.

Embodiments of the present invention will now be described in more detail with reference to an example.

In this example, the data parser 110 parses the data source in series. After obtaining a first partial word list which includes the content “Red flag”, the data parser 110 invokes a first compiler 120-1 to compile the first partial word list to obtain a first phoneme graph.

In this example, the grammar obtained by the first compiler 120-1 is as below:

The determined phoneme list is as below:

The created phoneme tree is as shown in FIG. 5A.

Since in this simple example, the phoneme tree shown in FIG. 5A is already optimized, it needn't be optimized by an optimizer component.

Further, since this is a first obtained phoneme graph, no combination occurs at this time.

In addition, provided that the first phoneme graph does not have sufficient information for speech recognition, speech recognition does not happen at this time.

As the data parser 110 continues to parse the remaining data source, it obtains a second partial word list which includes the content “White house”, and then invokes a second compiler 120-2 to compile the second partial word list to obtain a second phoneme graph.

In this example, the grammar obtained by the second compiler 120-2 is as below:

The determined phoneme list is as below:

The created phoneme tree is as shown in FIG. 5B.

Since in this simple example, the phoneme tree shown in FIG. 5B is already optimized, it needn't be optimized by an optimizer component.

Further, since this is a second obtained phoneme graph and the first phoneme graph was obtained previously, combination occurs at this time. The combination result is as shown in FIG. 5C. Since each of the first phoneme graph and the second phoneme graph has a root node and an end node (and the two graphs have no other nodes in common), the combination of the phoneme graphs is relatively simple, i.e., the root nodes and end nodes of the first phoneme graph and the second phoneme graph are combined.

In addition, provided that the result from combining the first phoneme graph and the second phoneme graph already has sufficient information for speech recognition, speech recognition happens at this time.

Besides, in this example, the combination result of the first phoneme graph and the second phoneme graph is already optimized at this time, so optimization will not be conducted for the combination result at this time.

As the data parser 110 continues to parse the remaining data source, it obtains a third partial word list (the last one) which includes the content “Yellow flag”, and then it invokes a third compiler 120-3 to compile the third partial word list to obtain a third phoneme graph.

In this example, the grammar obtained by the third compiler 120-3 is as below:

The determined phoneme list is as below:

The created phoneme tree is as shown in FIG. 5D.

Since in this simple example, the phoneme tree shown in FIG. 5D is already optimized, it needn't be optimized by an optimizer component.

Further, since this is a third obtained phoneme graph and the first and second phoneme graphs were obtained previously, combination occurs at this time. The combination result is as shown in FIG. 5E.

In addition, at this time the combination result of the first phoneme graph, the second phoneme graph, and the third phoneme graph is not optimal because two branches have identical nodes F, L, AE, and GD. Therefore, the combination result is optimized at this time. The optimized combination result is as shown in FIG. 5F.
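The optimization of the combination result can be illustrated by merging nodes that are followed by the same remaining phonemes. This is a sketch of one possible merging rule, not necessarily the rule used by the optimizer; apart from F, L, AE, and GD, the phoneme symbols are illustrative assumptions, and the names `PATHS`, `count_nodes_unmerged`, and `merge_common_suffixes` are hypothetical.

```python
from typing import List, Set, Tuple

Path = Tuple[str, ...]

# Illustrative phoneme symbols; only F, L, AE, and GD are named in the text.
PATHS: List[Path] = [
    ("R", "EH", "DD", "F", "L", "AE", "GD"),        # Red flag
    ("W", "AY", "TD", "HH", "AW", "S"),             # White house
    ("Y", "EH", "L", "OW", "F", "L", "AE", "GD"),   # Yellow flag
]


def count_nodes_unmerged(paths: List[Path]) -> int:
    """Every branch keeps its own phoneme nodes (root and end excluded)."""
    return sum(len(p) for p in paths)


def merge_common_suffixes(paths: List[Path]) -> Set[Tuple[Path, Path]]:
    """Identify each node by the phonemes from it to the end node.

    Two nodes with the same remaining suffix become one node, so branches
    that end in the same phonemes share those nodes. Returns the edge set.
    """
    edges: Set[Tuple[Path, Path]] = set()
    for path in paths:
        prev: Path = ("ROOT",)
        for i in range(len(path)):
            node = path[i:]  # node identity = remaining suffix
            edges.add((prev, node))
            prev = node
        edges.add((prev, ("END",)))
    return edges


edges = merge_common_suffixes(PATHS)
merged_nodes = {n for edge in edges for n in edge} - {("ROOT",), ("END",)}
print(count_nodes_unmerged(PATHS), len(merged_nodes))  # prints 21 17
```

The four nodes saved are exactly the F, L, AE, and GD nodes that the two "flag" branches would otherwise duplicate, and the set of root-to-end phoneme sequences is unchanged.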

Additionally, before the third obtained phoneme graph is combined, if the data source changes (for example, if the data source is a web page and a user clicks a link on the web page), then combination of the third obtained phoneme graph is continued and the combination result is cached, so that when the user returns to the above web page, the combined phoneme graph can continue to be used.
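The caching behavior can be sketched as a small cache keyed by an identifier of the data source. This is an assumption-laden illustration: the class `CombinationCache`, the use of a page identifier as the key, and the stand-in `Graph` type are hypothetical and are not specified by the description.

```python
from typing import Callable, Dict, List

Graph = List[str]  # stand-in type for a combined phoneme graph


class CombinationCache:
    """Cache finished combination results per data source (e.g., per page)."""

    def __init__(self, build: Callable[[str], Graph]) -> None:
        self._build = build
        self._cache: Dict[str, Graph] = {}
        self.builds = 0  # counts how often a graph was actually built

    def get(self, source_id: str) -> Graph:
        """Return the cached result, building it only on the first request."""
        if source_id not in self._cache:
            self.builds += 1
            self._cache[source_id] = self._build(source_id)
        return self._cache[source_id]


cache = CombinationCache(lambda source_id: [f"graph for {source_id}"])
cache.get("page-a")   # first visit: the combination result is built and cached
cache.get("page-b")   # the user follows a link to another page
cache.get("page-a")   # the user returns: the cached result is reused
print(cache.builds)   # prints 2
```

Keeping the finished result per data source means a return visit costs a dictionary lookup rather than a new parse, compile, and combine pass.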

According to the present invention, only the phoneme graph of a partial word list needs to be determined each time, and therefore the computational complexity can be reduced. Further, before combination of the obtained phoneme graphs is finished, once a combination result already has sufficient information for speech recognition, the speech recognition will begin, thereby improving the performance of the speech recognition.

It should be noted that some more specific technological details which are publicly known for those skilled in the art and are requisite for realization of the present invention are omitted in the above description to make the present invention more easily understood.

The description of the present invention is provided for illustration and depiction purpose, not for listing all the embodiments or limiting the present invention to the disclosed form. It is understood by those skilled in the art that many modifications and variations are obvious based on the teachings provided herein.

Therefore, the above preferred embodiments are selected and described to better illustrate principles of the present invention and actual applications thereof, and to enable those having ordinary skill in the art to understand that without departure from the essence of the present invention, all the modifications and variations fall within the scope of protection of the present invention as defined by the appended claims.

Inventors: Liu, Ying; Jia, Bin; Fu, Guo Kang; Han, Zhao Bing

Executed on | Assignor | Assignee | Conveyance | Reel/Frame
Nov 02 2011 | | Nuance Communications, Inc. | (assignment on the face of the patent) |
Nov 02 2011 | JIA, BIN | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 027162/0710
Nov 02 2011 | HAN, ZHAO BING | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 027162/0710
Nov 02 2011 | FU, GUO KANG | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 027162/0710
Nov 02 2011 | LIU, YING | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 027162/0710
Mar 29 2013 | International Business Machines Corporation | Nuance Communications, Inc | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 030323/0965
Sep 30 2019 | Nuance Communications, Inc | Cerence Operating Company | CORRECTIVE ASSIGNMENT TO CORRECT THE REPLACE THE CONVEYANCE DOCUMENT WITH THE NEW ASSIGNMENT PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT | 059804/0186
Sep 30 2019 | Nuance Communications, Inc | Cerence Operating Company | CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE NAME PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE INTELLECTUAL PROPERTY AGREEMENT | 050871/0001
Sep 30 2019 | Nuance Communications, Inc | CERENCE INC | INTELLECTUAL PROPERTY AGREEMENT | 050836/0191
Oct 01 2019 | Cerence Operating Company | BARCLAYS BANK PLC | SECURITY AGREEMENT | 050953/0133
Jun 12 2020 | Cerence Operating Company | WELLS FARGO BANK, N A | SECURITY AGREEMENT | 052935/0584
Jun 12 2020 | BARCLAYS BANK PLC | Cerence Operating Company | RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS) | 052927/0335
Date | Maintenance Fee Events
Aug 09 2017 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity.
Jul 28 2021 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity.

Date | Maintenance Schedule
Feb 11 2017 | 4 years fee payment window open
Aug 11 2017 | 6 months grace period start (w surcharge)
Feb 11 2018 | patent expiry (for year 4)
Feb 11 2020 | 2 years to revive unintentionally abandoned end (for year 4)
Feb 11 2021 | 8 years fee payment window open
Aug 11 2021 | 6 months grace period start (w surcharge)
Feb 11 2022 | patent expiry (for year 8)
Feb 11 2024 | 2 years to revive unintentionally abandoned end (for year 8)
Feb 11 2025 | 12 years fee payment window open
Aug 11 2025 | 6 months grace period start (w surcharge)
Feb 11 2026 | patent expiry (for year 12)
Feb 11 2028 | 2 years to revive unintentionally abandoned end (for year 12)