An articulator shape input section detects movements of an articulator and generates feature data of speech. A speech mode detection section of a speech mode input section detects the mode of the speech. A kind of standard pattern is selected in accordance with the detected speech mode or with a speech mode that is specified manually through a speech mode manual input section. A comparison section detects the speech by comparing the selected kind of standard pattern with the input feature data.
21. A speech detection apparatus comprising:
an input section for generating, based on speech of a speaker, input data representing a feature of the speech; a speech mode input section for allowing input of a speech mode of the speaker; and a speech detection section for detecting the speech by comparing the input data generated by the input section based on the speech of the speaker with a standard pattern that is prepared in advance, wherein a function to be performed in connection with input speech data is switched in accordance with the speech mode that is input through the speech mode input section.
20. A speech detection apparatus comprising:
an input section for generating, based on speech of a speaker, input data representing a feature of the speech; a speech mode input section for allowing input of a speech mode of the speaker; and a speech detection section for detecting the speech by comparing the input data generated by the input section based on the speech of the speaker with one kind of standard pattern that is prepared in advance, wherein a speech recognition process is executed only when the speech mode that is input through the speech mode input section coincides with the speech mode of the one kind of standard pattern.
1. A speech detection apparatus comprising:
an articulator shape input section for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section for allowing input of a speech mode of the speaker; and a speech detection section for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with one kind of standard pattern that is prepared in advance, wherein a speech recognition process is executed only when the speech mode that is input through the speech mode input section coincides with the speech mode of the one kind of standard pattern.
19. A speech detection apparatus comprising:
an articulator shape input section for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section for allowing input of a speech mode of the speaker; and a speech detection section for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with a standard pattern for voiceless speech that is prepared in advance, wherein a function to be performed in connection with input speech data is switched in accordance with the speech mode that is input through the speech mode input section.
11. A speech detection apparatus comprising:
an articulator shape input section for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section for allowing input of a speech mode of the speaker; a speech detection section for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with one of plural kinds of standard patterns that are prepared in advance; and a standard pattern selection section for selecting one kind of standard pattern of a speech mode that coincides with the speech mode that is input through the speech mode input section, a speech detection process being executed upon selection of the one kind of standard pattern, wherein a function to be performed in connection with input speech data is switched in accordance with the speech mode that is input through the speech mode input section.
16. A speech detection apparatus comprising:
an articulator shape input section for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section for allowing input of a speech mode of the speaker; a speech detection section for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with two or more of plural kinds of standard patterns that are prepared in advance; and a standard pattern selection section for selecting two or more kinds of standard patterns corresponding to speech modes that include a speech mode that coincides with the speech mode that is input through the speech mode input section, a speech detection process being executed upon selection of the two or more kinds of standard patterns, wherein a function to be performed in connection with input speech data is switched in accordance with the speech mode that is input through the speech mode input section.
Claims 2 to 10, 12 to 15, 17 and 18 are dependent claims, each beginning "The speech detection apparatus according to".
1. Field of the Invention
The present invention relates to a speech input and detection technique that is not affected by noise in a noisy environment or in a situation where many people speak simultaneously. The invention also relates to a speech detection apparatus that detects speech information from movements of a human articulator and outputs it to information equipment such as a computer or a word processor.
The invention further relates to a technique that enables detection of speech information both in voiced speech and in voiceless, mouthed speech. Therefore, the technique of the invention can be utilized not only in offices and the like where silence is required and the use of related speech input techniques is not suitable, but also for input of content that the user does not want to be heard by other people. As such, the invention greatly increases the range of use of speech detection apparatus. Further, the invention can be utilized in a speech detection apparatus for providing barrier-free equipment that enables deaf people, people having difficulty in hearing, and aged people to communicate information smoothly.
2. Description of the Related Art
The goal of a speech detection apparatus (machine) is to enable the user's speech to be input correctly and quickly in any environment. An ordinary speech detection apparatus employs a speech recognition technique that recognizes and processes speech information by analyzing the frequencies of a voice as a sound. To this end, the cepstrum analysis method or the like is utilized, which enables separation and extraction of the spectrum envelope or the spectrum fine structure of a voice. However, this speech recognition technique has a principle-related disadvantage: it cannot detect speech information unless it receives sound information generated by vocalization. That is, such a speech detection apparatus cannot be used in offices, libraries, etc. where silence is required, because during speech input the speaker's voice is annoying to nearby people. This type of apparatus is also unsuitable for input of content that the user does not want to be heard by nearby people. Further, users feel psychologically reluctant to murmur alone to a machine, and this reluctance is enhanced when other people are nearby. These disadvantages limit the range of use of speech recognition apparatus and are major obstacles to the spread of speech input apparatus. Another obstacle is that continuing to speak is a surprisingly heavy physical burden; continuing voice input for hours, as when manipulating a keyboard, is considered likely to make the user's voice hoarse and hurt the vocal cords.
On the other hand, studies of acquiring speech information from information other than sound have long been made. The vocal organs directly relating to human vocalization are the lungs 901 as an air flow mechanism, the larynx 902 as a vocalization mechanism, the oral cavity 903 and the nasal cavity 904 that assume the mouth/nasal cavity function, and the lips 905 that assume the articulation function, though the classification method varies from one technical book to another.
Among speech recognition techniques using visual information of the lips, a technique that performs image processing on an image input from a video camera is employed most frequently; Japanese Unexamined Patent Publication No. Hei. 6-43897 discloses an example of this type. However, reliably detecting the positions of the lips from such an image is difficult.
To solve this problem, Japanese Unexamined Patent Publication No. Sho. 60-3793 proposed a lip information analyzing apparatus in which four high-luminance markers such as light-emitting diodes are attached to the lips to facilitate the marker position detection, movements of the markers themselves are imaged by a video camera, and pattern recognition is performed on a voltage waveform obtained by a position sensor called a high-speed multi-point X-Y tracker. However, even with this technique, when it is attempted to detect speech in a bright room, means are needed to prevent noise caused by high-luminance reflection light components coming from the glasses, a gold tooth, etc. of a speaker. Although this requires preprocessing and a feature extraction technique for the two-dimensional image input from a television camera, publication No. Sho. 60-3793 discloses no such technique.
Several methods have been proposed in which features of a vocal organ are extracted by capturing an image of the lips and a portion around them directly, without using markers, and performing image processing on the image. For example, in Japanese Unexamined Patent Publication No. Hei. 6-12483, an image of the lips and a portion around them is captured by a camera and vocalized words are estimated by a back propagation method from an outline image obtained by image processing. Japanese Unexamined Patent Publication No. Sho. 62-239231 proposed a technique of using a lip opening area and a lip aspect ratio to simplify lip image information. Japanese Unexamined Patent Publication No. Hei. 3-40177 discloses a speech recognition apparatus that retains, as a database, correlation between vocalized sounds and lip movements to perform recognition for indefinite speakers. Japanese Unexamined Patent Publication No. Hei. 9-325793 proposed to lower the load on a speech recognition computer by decreasing the number of candidate words based on speech-period mouth shape information obtained from an image of the mouth of a speaker. However, since these related methods utilize positional information obtained from a two-dimensional image of the lips and a portion around them, a speaker is required to open and close his lips clearly for correct input of image information. It is difficult to detect movements of the lips and a portion around them in speech with a small degree of lip opening/closure and no voice output (hereinafter referred to as "voiceless speech") and in speech with a small voice, let alone in speech with almost no lip movements as in the case of ventriloquism. Further, the above-cited references do not refer to any speech detection technique that utilizes speech modes, such as a voiceless speech mode, to improve the recognition rate by paying attention to differences between the ordinary speech mode and other modes. The "speech mode", indicating a speech state, will be described in detail in the "Summary of the Invention" section.
Several methods have been proposed that do not use a video camera, such as a technique of extracting speech information from a myoelectric potential waveform of the lips and a portion around them. For example, Japanese Unexamined Patent Publication No. Hei. 6-12483 discloses an apparatus that utilizes binary information of a myoelectric potential waveform to provide means that replaces image processing. Kurita et al. proposed a model for calculating a lip shape based on a myoelectric signal ("Physiological Model for Realizing an Articulation Operation of the Lips", The Journal of the Acoustical Society of Japan, Vol. 50, No. 6, pp. 465-473, 1994). However, speech information extraction using myoelectric potentials has the problem that a heavy load is imposed on a speaker because electrodes having measurement cords need to be attached to the lips and a portion around them.
Several inventions have been made in which tongue movements associated with speech of a speaker are detected by mounting an artificial palate to obtain a palatograph signal, and the detection result is used in a speech detection apparatus. For example, Japanese Unexamined Patent Publication No. Sho. 55-121499 proposed means for converting the presence/absence of contact between the tongue and transmission electrodes incorporated in an artificial palate into an electrical signal. Japanese Unexamined Patent Publication No. Sho. 57-160440 devised a method of improving the touch on the tongue by decreasing the number of electrodes incorporated in an artificial palate. Japanese Unexamined Patent Publication No. Hei. 4-257900 made it possible to deal with indefinite speakers by passing a palatograph photodetection signal through a neural network.
An apparatus that does not utilize tongue movements was proposed in Japanese Unexamined Patent Publication No. Sho. 64-62123, in which vibration of the soft palate is observed by bringing the tip portion of a bush rod into contact with the soft palate. Further, a study was made of the relationship between the articulator shape and speech by mounting a plurality of metal pellets on a vocal organ, in particular the tongue in the oral cavity, and using an X-ray micro-beam instrument that measures the positions of the metal pellets (Takeshi Token, Kiyoshi Honda, and Yoichi Higashikura, "3-D Observation of Tongue Articulatory Movement for Chinese Vowels", Technical Report of IEICE, SP97-11, 1997-06). A similar study investigated the relationship between the articulatory movement locus and speech by mounting magnetic sensors on a vocal organ in the oral cavity and using a magnetic sensor system that measures the positions of the magnetic sensors (Tsuyoshi Okadome, Tokihiko Kaburagi, Shin Suzuki, and Masahiko Honda, "From Text to Articulatory Movement," Acoustical Society of Japan 1998 Spring Research Presentation Conference, Presentation No. 3-7-10, March 1998). However, these techniques have the problems that natural vocalization may be obstructed and a heavy load is imposed on a speaker because devices need to be attached to an inside part of the body. Nor do these references refer to any speech detection technique that utilizes speech modes, such as a voiceless speech mode, to improve the recognition rate by paying attention to differences between the ordinary speech mode and other modes.
U.S. Pat. No. 3,192,321 proposed, as a technique for detecting speech information more easily than the above techniques, a speech recognition system that combines a speech recognition technique with a technique of directly applying a light beam to the lips and an integument portion around them and detecting speech based on the state of diffused reflection light coming from the skin and the way the lips interrupt the light beam. Japanese Unexamined Patent Publication No. Hei. 7-306692 proposed a similar technique in which speech information of a speaker is detected by applying a light beam to the lips and a portion around them, detecting diffused reflection light coming from the surface of the integument with a photodetector, and measuring an intensity variation of the diffused reflection light. However, neither reflection plates such as markers nor specular reflection plates are attached to the lips and a portion around them, and since the relationship between the intensity of reflection light and the positions and movements of the lips is not necessarily clear, a neural network is used for the recognition process. As described in its specification, this technique is low in speech detection accuracy and serves to roughly categorize phonemes as an auxiliary means of a speech recognition technique. Japanese Unexamined Patent Publication No. Hei. 8-187368 discloses, as an example of use of this technique, a game that involves limited situations in which conversations are expected to occur. Japanese Unexamined Patent Publication No. Hei. 10-11089 proposed a technique of detecting speech by measuring the blood amount in the lips and a portion around them by a similar method in which the detector is limited to an infrared detecting device. These techniques are effective only for speech with large movements of the lips and a portion around them, and are difficult to apply to input of voiceless or small voice speech in which the degree of opening/closure of the lips is small. The specifications do not refer to speech modes such as a voiceless speech mode.
As for the above-described related techniques that are intended to detect speech from the shape of an articulator, methods and apparatus for correlating speech with a certain kind of signal obtained from the articulator are described in detail. However, the above-cited references do not refer, in a specific manner, to voiceless speech or to relationships between speech and signals associated with different speech modes. Further, no related reference clearly shows the problems caused by speech mode differences or countermeasures against them. Although there exists a related reference that refers to speech without voice output (Japanese Unexamined Patent Publication No. Hei. 6-12483), it does not describe the handling of speech modes, which is most important for improvement of the recognition rate.
Problems to be solved by the speech input technique of the invention are as follows. These problems cannot be solved in principle by the related speech recognition techniques, and have not been dealt with in a specific manner by the related techniques that are intended to detect speech from shape information of an articulator.
(1) A speech detection apparatus cannot be used in offices, libraries, etc. where silence is required, because during speech input a voice of a speaker is annoying to nearby people.
(2) Related techniques are not suitable for input of a content that a speaker does not want to be heard by nearby people.
(3) There is psychological reluctance to speaking alone to a machine.
(4) A speaker who continues to speak with voice output bears a physical burden.
To solve the above problems, it is necessary to enable speech detection in a voiceless speech mode with entirely no voice output as well as in a speech mode with voice output (hereinafter referred to as a voiced speech mode). If this becomes possible, problems (1) to (3) are solved because no voice is output to the environment in the voiceless speech mode, in which there is almost no respiratory air flow and the vocal cords do not vibrate. Problem (4) is also alleviated because voiceless speech requires only small degrees of mouth opening and closure and does not cause vibration of the vocal cords, reducing the physical load accordingly. Speech modes used in the invention are classified in FIG. 3.
It has been described above that the related techniques do not deal with, in a specific manner, voiceless speech or speech modes in general. Naturally, as for related speech input techniques, no studies have been made of the voiceless, whisper, and small voice speech modes. On the other hand, in techniques of detecting speech from the shape of an articulator, it has become clear through experiments that the speech mode is an extremely important concept. In particular, it has turned out that even for speech of the same phoneme or syllable, the signal obtained from the shape of an articulator varies with the speech mode (voiceless speech, small voice speech, ordinary speech, or loud voice speech), and that the recognition rate of phonemes and syllables may greatly decrease unless sufficient care is taken of the speech mode. An object of the present invention is to solve this reduction in recognition rate caused by speech mode differences, which has not been addressed by the related techniques, and particularly to increase the recognition rate of voiceless speech, which has not been discussed seriously in speech input techniques. To this end, the invention employs the following features.
To increase the rate of speech recognition based on input shape information of an articulator,
(1) at least one standard pattern is given to each speech mode;
(2) there is provided means for inputting, to the speech detection apparatus, information on the speech mode of an intended speech input; and
(3) a standard pattern corresponding to input speech mode information is selected and then input speech is detected by executing a recognition process.
The above-mentioned problems can be solved if the speech modes include a voiceless speech mode. Naturally it is necessary to accept input of speech with voice output, and a speech recognition apparatus is required to switch among standard patterns in accordance with the speech mode.
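The selection mechanism of features (1) to (3) amounts to keying a collection of standard patterns on the speech mode before any comparison is made. The following Python sketch is a minimal illustration of that flow, not an implementation from the patent; all names and the placeholder feature vectors are hypothetical.

```python
import numpy as np

# Hypothetical standard patterns: a list of (label, feature vector) pairs per
# speech mode, corresponding to feature (1) above. Real patterns would be
# recorded articulator-shape data; these vectors are placeholders.
STANDARD_PATTERNS = {
    "voiceless": [("a", np.array([0.1, 0.8, 0.3])), ("u", np.array([0.7, 0.2, 0.5]))],
    "ordinary":  [("a", np.array([0.2, 0.9, 0.4])), ("u", np.array([0.8, 0.3, 0.6]))],
}

def similarity(f, g):
    # Normalized degree of similarity (cosine of the angle between vectors).
    return float(np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g)))

def detect_speech(input_features, speech_mode):
    # Features (2) and (3): the externally supplied speech mode selects which
    # standard patterns are compared against the input data.
    candidates = STANDARD_PATTERNS[speech_mode]
    label, _vec = max(candidates, key=lambda lv: similarity(input_features, lv[1]))
    return label

print(detect_speech(np.array([0.15, 0.85, 0.35]), "voiceless"))  # -> a
```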
The invention will be described below in more detail.
To solve the above-mentioned problems, the invention provides a speech detection apparatus comprising an articulator shape input section 101 for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section 102 for allowing input of a speech mode of the speaker; and a speech detection section 103 for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with one kind of standard pattern that is prepared in advance, a speech recognition process being executed only when the input speech mode coincides with the speech mode of the one kind of standard pattern.
In this configuration, speech detection is performed only in a prescribed speech mode and hence in a manner suitable for the situation. In particular, if the setting is such that detection is performed only in a voiceless speech mode, the speech detection apparatus is advantageous for use in offices and in terms of the load imposed on a user.
Speech detection that is most suitable for each situation can be performed by preparing plural kinds of standard patterns and switching the detection mode in accordance with the speech mode. In this case, the plural kinds of standard patterns may include standard patterns of a voiceless speech mode, a voiced speech mode, and an unvoiced speech mode. Alternatively, the plural kinds of standard patterns may include standard patterns of a voiceless speech mode and a voiced speech mode.
The speech mode may be determined based on the volume and the noise level of speech of a speaker. In this case, the noise level measurement time may be set at a short period t0 or the noise level may be an average noise level over a long period. Or the noise level may be determined by combining the above two methods.
Where plural kinds of standard patterns corresponding to a plurality of speech modes, respectively, are prepared, speech detection may be performed by selecting two or more kinds of speech modes and using two or more kinds of standard patterns corresponding to the selected speech modes.
In this case, one kind of speech mode may be selected based on a noise level measured in a short period t0 and another kind of speech mode may be selected based on an average noise level measured over a long period. (The two selections may coincide, in which case one kind of speech mode is selected in a duplicated manner.)
Standard patterns of a plurality of voiced speech modes, distinguished by the loudness, pitch, or length of voice, may also be used.
The function to be performed in connection with input speech data is switched in accordance with the speech mode that is input through the speech mode input section. For example, the functions corresponding to the respective speech modes may be a function of allowing input of coded text information, a function of giving an instruction relating to a particular operation, and a function of stopping input. Further, switching may be made automatically, in accordance with the speech mode, among plural kinds of application software.
According to another aspect of the invention, to solve the above-mentioned problems, there is provided a speech detection apparatus comprising an articulator shape input section 101 for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section 102 for allowing input of a speech mode of the speaker; and a speech detection section 103 for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with a standard pattern for voiceless speech that is prepared in advance.
With this configuration, speech detection can be performed without the speaker emitting any sound and without imposing an undue load on the speaker. In particular, speech can be detected with high accuracy in the voiceless speech mode because the shape variation of an articulator caused by speech is restricted and hence the deviation of the shape is small.
The manner of measuring features of speech is not limited to the use of articulator movements; the features can be measured in various other manners.
The present invention will be hereinafter described in detail.
A description will be made of a technique called a specular reflection light spot method that is mainly used in specific embodiments of the invention, that is, in an apparatus for measuring shape information of an articulator. The specular reflection light spot method will be described in detail in the specific embodiments. This technique has already been proposed by the present assignee in Japanese Unexamined Patent Publication No. Hei. 10-243498.
This measuring method improves the articulator shape detection accuracy by attaching specular reflection plates to an articulator of a speaker and an integument around it, enabling measurement of very small angular variations and positional variations of the articulator by a geometrical-optics-based technique. Specifically, a light beam 13 emitted from a light source section 10 is reflected by the specular reflection plates 12, and the reflected beams form specular reflection light spots 69 and 70 on a position detection sensor 16.
In this configuration, as the speaker 15 speaks, the positions and angles of the specular reflection plates 12 that are attached to the articulator and the integument around it vary.
Therefore, the light beam 13 applied by the light source section 10 is uniquely reflected by the specular reflection plates 12 according to the law of reflection, and the directions of the reflected light beams vary accordingly. The position detection sensor 16 detects the specular reflection light spots 69 and 70 that move on its surface, whereby the positions of the specular reflection light spots 69 and 70, which correspond to the varying shape of the articulator and the portion around it caused by vocalization of the speaker 15, are obtained. Features of a temporal variation and a positional variation are extracted from coordinate-represented information of the detected specular reflection light spot positions. Input speech is classified by comparing those features with a stored standard pattern indicating features of a temporal variation and a positional variation of a speech signal. Because it can measure information including very small angular and positional variations of an articulator, the specular reflection light spot method can detect shape information of an articulator with high accuracy even in small voice speech and voiceless speech, in which the variation of the articulator shape is small.
However, in the invention, the measuring method is not limited to the specular reflection light spot method. Naturally it is possible to employ the measuring techniques described in the "Background of the Invention" section, the optical flow method described in Kenji Mase and Alex Pentland, "Lip Reading: Automatic Visual Recognition of Spoken Words", Optical Society of America, Proc. Image Understanding and Machine Vision, 1989, and other measuring techniques, as long as they can detect movements of an articulator in small voice speech and voiceless speech.
Shape information of the articulator thus obtained is measured, and the resulting data and the speech mode at that time point are input to the speech detection section 103. The kind of speech mode may be input either manually or automatically by using a method that will be described in the fourth embodiment. It is also possible to use other methods that are not described in the fourth embodiment (e.g., speech mode switching performed based on an instruction that is supplied externally via a network or the like, or from the inside of the computer). The speech detection section 103 may be an independent apparatus that outputs detected speech information to a personal computer or a word processor. Alternatively, part of the functions of the speech detection section 103 may be incorporated in a personal computer as hardware or installed in a personal computer as software.
Standard patterns stored in the standard patterns storage section 111 of the speech detection section 103 are prepared in advance for each speech mode to be used.
A technical basis of the above general embodiment, a specific configuration and operation, and examples of improvement in the functional aspect will be described in the first embodiment.
This embodiment shows how information obtained from the shape of an articulator varies with the speech mode, thereby showing why the invention uses standard patterns corresponding to the respective speech modes. In an apparatus for performing measurements for this purpose, this embodiment uses, in the articulator shape input section 101, a video camera instead of the position detection sensor 16.
A light beam 13 emitted from the light source section 10 illuminates the specular reflection plate 12 that is located at point e (58) in a lower jaw side portion (see FIG. 8). A specular reflection light beam 14 coming from the specular reflection plate 12 and traveling in a direction that depends on the position and the angle of point e is projected onto the screen 62 to form a specular reflection light spot 63. An image of the specular reflection light spot 63 is taken by the two-dimensional CCD camera 61, and resulting signals 74, that is, signals in the fast and slow scanning directions of a two-dimensional CCD sensor 71, are output to an output coordinates calculation section 73 via a CCD driving circuit 72 as shown in FIG. 7. Since the specular reflection light spot 63 projected on the screen 62 is several times brighter than the unilluminated portion, the influence of ambient room light can easily be eliminated by setting a threshold value, so that only the specular reflection light spot 63 is extracted. An X-axis coordinate of the specular reflection light spot 63 is determined from the time measured with respect to the time point of a fast scanning start signal of the two-dimensional CCD sensor 71, and its Y-axis coordinate is determined from the time of the slow scanning.
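The threshold-based extraction can be illustrated in miniature: given a digitized frame, the spot coordinates are simply the centroid of the pixels that exceed the brightness threshold. The following Python sketch is hypothetical (the embodiment itself derives the X and Y coordinates from CCD scan timing rather than from a stored frame):

```python
import numpy as np

def light_spot_coordinates(frame, threshold):
    """Return the (x, y) centroid of pixels brighter than `threshold`.

    The specular reflection light spot is several times brighter than the
    unilluminated screen, so a single threshold rejects ambient room light."""
    ys, xs = np.nonzero(frame > threshold)
    if xs.size == 0:
        return None  # no light spot in this frame
    return float(xs.mean()), float(ys.mean())

frame = np.zeros((480, 640))
frame[100:103, 200:203] = 255.0            # bright 3x3 spot near (201, 101)
print(light_spot_coordinates(frame, 128))  # -> (201.0, 101.0)
```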
With the above-described apparatus and the arrangement of the specular reflection plate 12, information of the articulator shape was recorded by having a male subject in his 40s pronounce a vowel /a/ while changing the speech mode among a voiceless speech mode, a small voice speech mode, an ordinary speech mode, and a loud voice speech mode.
The locus of the specular reflection light spot 63 mainly reflects information of the angular variation of point e (58) in the lower jaw side portion where the reflection plate 12 is attached (see FIG. 8). Therefore, differences in the shape information of the articulator among speech modes can be detected similarly even by other measuring methods as long as the angular variation information of point e (58) in the lower jaw side portion or information that can replace such angular variation information is measured. As described above, this embodiment succeeded in finding the importance of the speech mode that had not been dealt with by the related techniques. This is the basis of using standard patterns corresponding to respective speech modes in the invention.
Incidentally, the measurement point is not limited to point e (58) in the lower jaw side portion; any other portion in which the shape of the articulator, such as the lips, chin, or cheeks, is reflected is applicable.
Effects of using standard patterns for respective speech modes and a relationship among the speech modes will be described in the second embodiment.
This embodiment will clarify differences among the respective speech modes. To this end, the degree of similarity will be determined between data of the same speech mode and between data of different speech modes, and a relationship between the recognition rate and the speech mode will be shown. An experiment was conducted by using an apparatus having the same configuration as in the first embodiment, which is based on the specular reflection light spot method. The subject was a male person in his 40s and the measurement point was point e (58) in a lower jaw side portion (see FIG. 8).
A set of phonemes (or syllables) (/a/, /u/, /za/, /ra/, /ya/, /ma/) was employed in speech; their recognition rates were particularly low in a previously conducted experiment (not described in this specification) in which an average recognition rate of 92.8% was obtained for 19 kinds of phonemes (or syllables). (The 19 kinds of phonemes (or syllables) are the same as will be shown in the fifth embodiment.)
A set of the above phonemes (or syllables) was acquired two times repeatedly while the speech mode was changed. Four kinds of speech modes were employed: a voiceless speech mode, a small voice speech mode, an ordinary speech mode, and a loud voice speech mode. To calculate the degree of similarity, X-coordinate components of the first-time data were used as standard patterns and those of the second-time data were used as input data.
Next, the recognition method using the degree of similarity, which is the most frequently used method, will be described briefly.
With the notation that f(x) represents an input data waveform and g(x) represents a standard pattern waveform, a normalized degree of similarity R (also called a correlation value) is given by one of the following two equations:

R = ∫ f(x) g(x) dx / ( √(∫ f(x)² dx) · √(∫ g(x)² dx) )    (1)

R = (f · g) / ( |f| |g| )    (2)

Equation (1) is applicable in a case where the waveforms have continuous values and Equation (2) is applicable in a case where the waveforms have discrete values (vectors). In Equation (2), |f| denotes the norm of the vector f, that is, the square root of the sum of the squares of its elements f(xi), and indicates the distance from the origin in the n-dimensional space.
This method was applied to the specular reflection light spot method in the following manner. Light spot data obtained from speech of each syllable is represented by a set (vector) of discrete values of respective time frames; therefore, Equation (2) was employed in the experiment. To compare two data sets by using Equation (2), it is necessary to equalize the phases and the lengths of the speech data. The reference time 151 for the phase equalization was set at the center of the width taken at the ⅓ value from the bottom of the fall portion of the Y-coordinate component of speech data (see FIG. 15). The data length of a phoneme (or syllable) was unified into a 31-dimensional vector consisting of the first-half 16 frames and the second-half 15 frames, including the reference time. The degree of similarity was determined between input data and templates of 48 phonemes (or syllables) in total, covering the four speech modes, and the closest phoneme (or syllable) among the standard patterns was employed as the recognition result of the input data.
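The alignment and matching just described can be sketched as follows (hypothetical Python; only the 16 + 15 frame window constants follow the description above):

```python
import numpy as np

FIRST_HALF, SECOND_HALF = 16, 15   # 31 frames per phoneme (or syllable)

def align(x_coords, ref_index):
    # Cut a 31-frame window around the reference time so that Equation (2)
    # compares phase-aligned vectors of equal length.
    return np.asarray(x_coords[ref_index - FIRST_HALF + 1 : ref_index + SECOND_HALF + 1])

def similarity(f, g):
    # Normalized degree of similarity of Equation (2).
    return float(np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g)))

def recognize(input_vec, templates):
    # `templates` maps a (phoneme, speech_mode) key to a 31-dimensional
    # vector; the key of the closest template is the recognition result.
    return max(templates, key=lambda key: similarity(input_vec, templates[key]))
```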
However, the relationship that matching the speech mode of the standard pattern to that of the input data yields the best accuracy was not clear in the small voice speech mode and the ordinary speech mode; the detection accuracy of speech information was relatively low even when the standard pattern of the same speech mode as the input data was used. There were cases where a higher recognition rate was obtained with a standard pattern of a speech mode (small voice, ordinary, or loud voice) different from the speech mode of the input data, as long as the speech mode used was a voiced one. These points will be described in the third embodiment.
To summarize the above, it has been confirmed that in the voiceless speech mode and the loud voice speech mode the speech detection accuracy can be improved by equalizing the speech modes of input data and a standard pattern. In particular, this embodiment has shown that when the input data is of the voiceless speech mode, by providing the means for switching the speech mode of a standard pattern (to the voiceless speech mode), the recognition rate of phonemes (or syllables) can be made two times that of the case where the speech mode of a standard pattern is fixed to the ordinary speech mode (see FIG. 21). Also, in the cases of the small voice speech mode and the ordinary speech mode, although selecting a standard pattern of the same speech mode as that of input data and using the selected standard pattern for the recognition did not produce the best result, it is meaningful in that it is better than using a standard pattern of a much different speech mode. It goes without saying that making the speech modes of input data and a standard pattern the loud speech mode is included in the invention though its value in terms of industrial use is low because of annoyance to nearby people. It is also included in the invention to equalize the speech modes of input data and a standard pattern in accordance with the speech mode of a person whose articulator is disordered.
As shown in the second embodiment, the average recognition rate obtained with input data and a standard pattern of the same speech mode is high in the loud voice speech mode and the voiceless speech mode, and the average recognition rate is low when a standard pattern of a speech mode different from that of the input data is used. This is explained as follows. In loud voice speech, although the speech length can be adjusted, the loudness is difficult to adjust because it is at the saturation level, and there is no sufficient margin to produce a variation in pitch. In the voiceless speech mode, because the vocal cords themselves do not vibrate, the pitch and the loudness of a voice cannot be adjusted. That is, it is considered that the average recognition rate is high in these two speech modes because the shape of the articulator during speech is restricted and hence a large deviation in the articulator shape is less likely to occur. In contrast, in speech of small voice or ordinary voice, or voiced speech whose volume is around that of such speech, emotional and expressive speech can be performed by freely changing the loudness, pitch, and length of voice (called super-phoneme elements). It is considered that, as a result, a large deviation occurs in the articulator shape in such speech and the average recognition rate decreases even if the speech modes of the standard pattern and the input data are the same. However, it is considered that for the same reason the average recognition rate does not decrease to a large extent even if the speech modes of the standard pattern and the input data are somewhat different from each other. A super-phoneme element is a feature of speech that appears in such a manner as to bridge successive single sounds (segmental elements). Although the super-phoneme element specifically means a tone, intonation, accent, or length of a sound, it boils down to (is decomposed into) the pitch, loudness, and length of a voice.
In this embodiment, a measurement is made of how the speech feature called the super-phoneme element influences the pattern of speech data in the ordinary speech mode, and the measurement result is applied to the speech detection apparatus of the invention. The same apparatus as in the first and second embodiments was used as the articulator shape input section 101, and an experiment was conducted according to the specular reflection light spot method (see FIG. 8). The subject was a male person in his 40s and the measurement point was point e (58) in a lower jaw side portion. The sets of Japanese words used were (aka (red), aka (dirt)), (saku (rip), saku (fence)), (hashi (chopsticks), hashi (bridge)), and (hyo (leopard), hyo (table)). Although patterns of speech data belonging to the same parenthesized set are similar, they had large differences in shape due to differences in pitch.
In the invention, to absorb a pattern deviation due to differences of super-phoneme elements, it is proposed to intentionally use a plurality of standard patterns having different characteristics for the same speech mode. As an example of this proposal, it will be shown below that the recognition rate can be increased by using standard patterns that are different in loudness (ordinary speech and small voice speech) in a composite manner. While these standard patterns are varied in loudness, it is considered that other super-phoneme elements also vary within their ranges of deviation.
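As a sketch of this composite use (hypothetical, reusing the similarity function from the matching sketch above), the candidate set simply pools the standard patterns of the selected speech modes, so whichever loudness variant fits best can win:

```python
def recognize_composite(input_vec, templates_by_mode, modes=("ordinary", "small voice")):
    # Pool the standard patterns of several speech modes, absorbing pattern
    # deviation due to super-phoneme elements such as loudness.
    pooled = {(label, mode): vec
              for mode in modes
              for label, vec in templates_by_mode[mode].items()}
    label, _mode = max(pooled, key=lambda key: similarity(input_vec, pooled[key]))
    return label
```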
The second and third embodiments showed the effects of switching the speech mode of the standard pattern used in the speech detection section 103, or of using standard patterns of a plurality of speech modes there, in accordance with the speech mode of input data. To perform such an operation, it is necessary to input the speech mode of the input data to the speech detection section 103 in advance. The simplest and most reliable method for this is for the speaker to input a speech mode manually by, for example, depressing a prescribed key of a keyboard. Where speech is input to a personal computer 200, one conceivable method is that the computer 200 designates the mode of speech to be input next. However, in actual use of the speech detection apparatus of the invention, it is cumbersome and inconvenient to perform such a manipulation or operation each time. In view of this, this embodiment proposes an apparatus for automatically detecting a speech mode and describes a new function using that apparatus (see FIG. 2).
As a first method for the above purpose, a specific example of a method of measuring the loudness of a voice by setting a microphone 202 in front of the mouth of the speaker and detecting a speech mode based on a measurement level will be described.
To determine a speech mode based on the volume of speech obtained, it is first necessary to determine the loudness of voice corresponding to each mode. To this end, indices for evaluation of the volume of speech were determined.
Another characteristic to be used for detecting a speech mode was also prepared.
A description will be made of a method of detecting a speech mode from an output level obtained by the microphone 202 of the headset 201 after the above-described preparation. In this example, for the sake of convenience, a converted value of an output of the microphone 202 as calibrated by an output value of the noise meter was used as a noise level (phon).
Although usually a noise level is indicated in terms of an average value of a noise, in actual detection of a speech mode there is no time to use an average value because a real-time response is required. Therefore, a short-term noise level detection method was used in which a maximum value of the noise level during a short period (t0: 200 ms in this embodiment) in the initial stage of voice input is detected and the speech mode is switched if the detected maximum value exceeds a predetermined threshold value.
Specifically, the speech mode was switched according to a flowchart-based procedure implementing this short-term detection method.
An experiment was conducted by using this method.
On the other hand, in ordinary speech it is unlikely that the speech mode is changed suddenly during continuous speech; a sudden change is expected in most cases to be due to large noise from the environment. To prevent erroneous switching of the speech mode due to noise, a long-term average noise level detection method is employed. In this method, while speech is being input continuously, a speech mode is selected by calculating and monitoring an average noise level Lm. Specifically, if during speech the noise level does not fall below a fifth threshold value (L5: 10 phons in this embodiment) for a predetermined period (t1: 500 ms in this embodiment), the average value of the noise level is calculated continuously and a speech mode is determined based on that average value. Threshold values for the speech mode determination can be set by using, as references, the values shown in FIG. 28. If the noise level does not exceed the fifth threshold value L5 during the predetermined period t1 and no articulator shape information is input, it is judged that the speech has been suspended and an average noise level is newly calculated.
The method of detecting a speech mode from a short-term noise level and the method of detecting a speech mode from an average noise level over a long period have been described above. A method as a combination of these two methods is also included in the invention. Specifically, if speech modes selected based on a short-term noise level and a long-term average noise level are different from each other, standard patterns of a plurality of speech modes are used for the recognition of input speech as described in the third embodiment. Information of input speech can be recognized correctly by automatically detecting a speech mode of speech being input by using any of the above methods and making switching to the standard pattern corresponding to the detected speech mode.
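The two detectors and their combination can be sketched as follows. Only t0, t1, and L5 follow the example values given above; the per-mode level bounds are hypothetical placeholders (the embodiment sets them by reference to FIG. 28):

```python
import numpy as np

T0_MS, T1_MS = 200, 500        # short period t0 and suspension period t1
L5_PHON = 10                   # fifth threshold value L5
# Hypothetical upper bounds (phons) for each speech mode.
MODE_BOUNDS = [(30.0, "voiceless"), (50.0, "small voice"), (70.0, "ordinary")]

def mode_from_level(level_phon):
    for upper, mode in MODE_BOUNDS:
        if level_phon < upper:
            return mode
    return "loud voice"

def short_term_mode(initial_levels):
    # Maximum noise level during the initial t0 of voice input.
    return mode_from_level(max(initial_levels))

def long_term_mode(running_levels):
    # Average noise level Lm while speech continues above L5.
    return mode_from_level(float(np.mean(running_levels)))

def modes_for_recognition(initial_levels, running_levels):
    # Combination method: if the two detectors disagree, standard patterns
    # of both speech modes are used (as in the third embodiment).
    return {short_term_mode(initial_levels), long_term_mode(running_levels)}
```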
The above threshold values and time intervals are ones determined for the experimental conditions of this embodiment; the threshold values and time intervals need to be adjusted in accordance with the speaker and the noise level of the environment where the apparatus is used. The speech modes need not always be the four speech modes used above and may be changed in accordance with the mode of use.
It is also proposed to switch the input destination of input information and the function to be performed in connection with input information in accordance with the speech mode, in addition to switching the standard pattern in accordance with the speech mode. A specific example is such that information of the articulator shape is used for speech input in the case of voiceless speech, and a function of giving instructions to correct or edit input speech information is effected in the case of small voice speech. If an ordinary speech mode is detected, a function of stopping the speech input mode is effected, with a judgment that a conversation with another person has started. Further, it is possible to switch automatically, in accordance with the speech mode, the application program (e.g., spreadsheet software or document processing software) to be used. In this case, a related speech recognition technique may be employed. By making it possible to switch among such functions in accordance with the speech mode for each user, the invention can make the speech input function much more versatile than in the related techniques.
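In implementation terms, this function switching is a dispatch table keyed on the detected speech mode. A minimal sketch (all function names hypothetical):

```python
def input_text(recognized):
    print("input:", recognized)        # voiceless speech: coded text input

def edit_command(recognized):
    print("edit:", recognized)         # small voice: correction/editing instruction

def stop_speech_input(_recognized):
    print("speech input stopped")      # ordinary voice: a conversation has started

MODE_FUNCTIONS = {
    "voiceless": input_text,
    "small voice": edit_command,
    "ordinary": stop_speech_input,
}

def handle(recognized, speech_mode):
    MODE_FUNCTIONS[speech_mode](recognized)

handle("a", "voiceless")               # -> input: a
```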
This embodiment is directed to detection of a speech mode difference based on articulator shape information that is obtained by a measuring method other than the specular reflection light spot method, and will show that the invention is effective without being limited by the measuring method.
In this embodiment, the articulator shape information measuring method disclosed in Japanese Unexamined Patent Publication No. Sho. 52-112205 was used. In this method, markers are attached to the lips and a portion around them, movements of the markers are imaged, and the coordinates of the marker positions are detected.
In this embodiment, an apparatus implementing this method was used.
With the above apparatus configuration and imaging environment, the markers M4 and M5 in the lip portion of one male subject in his 30s were imaged and the coordinates of their positions were detected. The contents of speech used for input were the following 19 kinds of Japanese phonemes or syllables:
/a/, /i/, /u/, /e/, /o/, /ka/, /sa/, /ta/, /na/, /ha/, /ma/, /ya/, /ra/, /wa/, /ga/, /za/, /da/, /ba/, /pa/
Each of these phonemes (or syllables) was vocalized randomly two times in each of the voiceless speech mode and the ordinary speech mode. Data of the first-time speech was used as a standard pattern and data of the second-time speech was used as input data.
Among the data obtained, example data of /a/ in the voiceless speech mode is shown in FIG. 34 and example data of /a/ in the ordinary speech mode is shown in FIG. 35. It is seen from these data that the marker movement patterns differ between the two speech modes.
The distance between each of the standard patterns of the voiceless speech mode and the ordinary speech mode and each of the input data of the voiceless speech mode and the ordinary speech mode was calculated by using the dynamic programming method. The phoneme (or syllable) having the shortest distance from the input data was employed as the recognition result of that input data. To make the detection results easy to understand, they were shown graphically.
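The dynamic programming distance used for this matching is, in modern terms, a dynamic time warping (DTW) distance. The following is a textbook formulation, not code from the patent:

```python
import numpy as np

def dtw_distance(a, b):
    # Dynamic-programming distance allowing non-linear time alignment
    # between an input coordinate sequence and a template.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.subtract(a[i - 1], b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def recognize(input_seq, templates):
    # The phoneme (or syllable) whose template has the shortest distance
    # from the input data is the recognition result.
    return min(templates, key=lambda key: dtw_distance(input_seq, templates[key]))
```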
As described above, in the technology of detecting speech from, for example, shape information of an articulator, the invention makes it possible to increase the speech detection accuracy by restricting the speech modes of input data and a standard pattern. In particular, the invention makes it possible to greatly improve the speech detection accuracy by using the voiceless speech mode as well as to detect speech contents without requiring a speaker to emit a sound. This solves the principle-related disadvantage of the related speech recognition techniques that speech input cannot be made unless the speaker emits a sound. Not requiring emission of a sound, the invention greatly increases the application range of speech detection apparatus. Further, by switching the speech mode, the invention makes it possible to switch the function to be performed in connection with speech-input information and to increase the recognition rate of input speech in which a plurality of speech modes exist in mixed form. This allows speech input apparatus to have far more versatile functions than in the related techniques.
Inventors: Masaaki Harada, Shin Takeuchi. Assignee: Fuji Xerox Co., Ltd.
References Cited

U.S. Patent Documents:
U.S. Pat. No. 4,757,541 (Beadles): "Audio visual speech recognition."
U.S. Pat. No. 5,893,058 (Canon Kabushiki Kaisha): "Speech recognition method and apparatus for recognizing phonemes using a plurality of speech analyzing and recognizing methods for each kind of phoneme."
U.S. Pat. No. 5,911,128: "Method and apparatus for performing speech frame encoding mode selection in a variable rate encoding system."
U.S. Pat. No. 5,913,188 (Canon Kabushiki Kaisha): "Apparatus and method for determining articulatory-operation speech parameters."
U.S. Pat. No. 6,272,466 (Fuji Xerox Co., Ltd.): "Speech detection apparatus using specularly reflected light."

Japanese Unexamined Patent Publications:
Hei. 10-11089; Hei. 3-40177; Hei. 4-257900; Sho. 52-112205; Sho. 55-121499; Sho. 57-160440; Sho. 60-3793; Hei. 6-12483; Sho. 62-239231; Hei. 6-43897; Hei. 7-306692; Hei. 8-187368; Hei. 9-325793.