A method and apparatus for distinguishing a voice region from a non-voice region in an environment where various types of noise and voice are mixed together are provided. The method includes the steps of converting an input voice signal into a frequency domain signal by preprocessing the input voice signal, performing sigmoid compression on the converted signal, transforming a spectrum vector generated by the sigmoid compression into a voice detection parameter in scalar form, and detecting the voice region using the parameter.
1. A method of detecting a voice region with a voice region detecting apparatus, the method comprising:
converting an input voice signal representing at least a physical voice into a frequency domain signal by preprocessing the input voice signal;
performing sigmoid compression on the converted signal;
transforming at least one component of a spectrum vector generated by the sigmoid compression into a scalar voice detection parameter, wherein the transforming is performed using the equation

P(x) = −Σk yk·log yk

where yk is a component of the sigmoid-compressed spectrum vector, and P(x) is a scalar voice detection parameter;
detecting the voice region by comparing the scalar voice detection parameter with a threshold and determining that a region in which the scalar voice detection parameter exceeds the threshold is the voice region; and
outputting a voice signal in the detected voice region, wherein the method is performed using the voice region detecting apparatus.
17. A non-transitory computer-readable storage medium storing computer-readable code for implementing a method of detecting a voice region, the method comprising:
converting an input voice signal representing at least a physical voice into a frequency domain signal by preprocessing the input voice signal;
performing sigmoid compression on the converted signal;
transforming at least one component of a spectrum vector generated by the sigmoid compression into a scalar voice detection parameter, wherein the transforming is performed using the equation

P(x) = −Σk yk·log yk

where yk is a component of the sigmoid-compressed spectrum vector, and P(x) is a scalar voice detection parameter;
detecting the voice region by comparing the scalar voice detection parameter with a threshold and determining that a region in which the scalar voice detection parameter exceeds the threshold is the voice region; and
outputting a voice signal in the determined voice region.
9. An apparatus for detecting a voice region including a processor having computing device-executable instructions, the apparatus comprising:
a pre-processing unit for converting an input voice signal into a frequency domain signal by preprocessing the input voice signal;
a sigmoid compression unit for performing sigmoid compression on the converted signal;
a parameter generation unit for transforming a spectrum vector generated by the sigmoid compression into a scalar voice detection parameter, wherein the parameter generation unit performs a vector-to-scalar transformation using the equation

P(x) = −Σk yk·log yk

where yk is a component of the sigmoid-compressed spectrum vector, and P(x) is a scalar voice detection parameter; and
a voice region detection unit, executing on the processor, for detecting the voice region by comparing the scalar voice detection parameter with a threshold and determining that a region in which the scalar voice detection parameter exceeds the threshold is the voice region.
2. The method as set forth in
3. The method as set forth in
pre-emphasizing the input voice signal;
applying a predetermined window to the pre-emphasized signal; and
Fourier transforming the signal to which the window has been applied.
4. The method as set forth in
F(x) = α/(α + e^(−β·(x−μ)))

where x is a component of a spectrum vector which is composed of low-pass-filtered samples, F(x) is a spectrum vector generated as a result of the sigmoid compression, μ is a component of a vector which is composed of average values for respective components, and α and β are predetermined constant values.
5. The method as set forth in
6. The method as set forth in
7. The method as set forth in
8. The method as set forth in
10. The apparatus as set forth in
11. The apparatus as set forth in
12. The apparatus as set forth in
F(x) = α/(α + e^(−β·(x−μ)))

where x is a component of a spectrum vector which is composed of low-pass-filtered samples, F(x) is a spectrum vector generated as a result of sigmoid compression, μ is a component of a vector which is composed of average values for respective components, and α and β are predetermined constants.
13. The apparatus as set forth in
14. The apparatus as set forth in
15. The apparatus as set forth in
16. The apparatus as set forth in
This application claims priority from Korean Patent Application No. 10-2005-0010598 filed on Feb. 4, 2005 in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference in its entirety.
1. Field of the Disclosure
The present disclosure relates generally to voice recognition technology, and more particularly, to a method and apparatus for distinguishing a voice region from a non-voice region in an environment where various types of noise and voice are mixed together.
2. Description of the Related Art
Recently, with the development of computers and the advancement of communication technology, various multimedia-related technologies have been developed, including technology for generating and editing various types of multimedia data, technology for recognizing video/voice among input multimedia data, and technology for compressing video/voice more efficiently. Among these, the technology for detecting a voice region in a noisy environment is a basic technology essential to various fields, such as voice recognition and voice compression. However, it is not easy to detect a voice region because voice is mixed with various types of noise, and the noise itself varies, including continuous noise and burst noise. Accordingly, in such an arbitrary environment, it is not easy to detect the region in which voice exists and then extract the voice.
As a result, the accurate detection of a voice region in a noisy environment plays an important role in improving voice recognition and enhancing user convenience. The technology for distinguishing a voice region from a non-voice region and detecting the voice region mainly includes a field using frame energy, as in U.S. Pat. No. 6,658,380; a field using time-axis filtering, as in U.S. Pat. No. 6,782,363 (hereinafter referred to as "patent '363"); a field using frequency filtering, as in U.S. Pat. No. 6,574,592 (hereinafter referred to as "patent '592"); and a field using the linear transformation of frequency information, as in U.S. Pat. No. 6,778,954 (hereinafter referred to as "patent '954").
Like patent '954, the present invention pertains to the field using the linear transformation of frequency information, but it differs in that it is not based on a probabilistic model and instead uses a rule-based approach.
Patent '363 calculates voice region detection parameters by filtering energy-based one-dimensional feature parameters, and has a filter for edge detection. Furthermore, patent '363 is configured to detect a voice region using a finite state machine. The technology disclosed in patent '363 is advantageous in that only a small amount of calculation is required and end points are detected regardless of noise level, but it is problematic in that it offers no solution for burst noise because energy-based one-dimensional feature parameters are used.
Furthermore, patent '592 discloses a technology for detecting voices using the energy of an output signal that has passed through a band pass filter that is adjusted to the voice frequency band. In this process, both length and size information are used. Patent '592 is advantageous in that a voice region can be detected using a relatively small amount of calculation, but is problematic in that it is impossible to detect a voice signal having low energy and the start portion of a consonant having low energy in the voice signal, and it is difficult to determine a threshold value, and variation in the threshold value affects the performance thereof.
Meanwhile, patent '954 discloses a technology for performing real-time modeling of noise and voice using a Gaussian distribution, updating the models by estimating voice and noise even when they are mixed with each other, and removing noise based on a Signal-to-Noise Ratio (SNR) estimated through the modeling. However, patent '954 uses a single noise-source model, so it is considerably affected by input energy.
The problems of the conventional technologies are summarized as follows. First, a parameter value varies depending on the amount of noise. Second, a threshold value must be varied according to the energy of a noise signal.
Accordingly, the present invention has been made keeping in mind the above problems occurring in the prior art, and an object of the present invention is to provide a method and apparatus for efficiently distinguishing a voice region from a non-voice region in an environment where various types of noise and voices are mixed with each other.
In order to accomplish the above object, the present invention provides a method of detecting a voice region, including the steps of (a) converting an input voice signal into a frequency domain signal by preprocessing the input voice signal; (b) performing sigmoid compression on the converted signal; (c) transforming a spectrum vector generated by the sigmoid compression into a voice detection parameter in scalar form; and (d) detecting the voice region using the parameter.
The above and other objects, features and advantages of the present invention will be more clearly understood from the following detailed exemplary description taken in conjunction with the accompanying drawings, in which:
Reference should now be made to the drawings, in which the same reference numerals are used throughout the different drawings to designate the same or similar components.
The present invention is characterized by representing a signal with a vector that distinguishes the signal from noise through smoothing and sigmoid compression processes with respect to a power spectrum, converting the vector into a scalar value, and using the scalar value as a voice detection parameter.
First, a preprocessing unit 105 converts an input voice signal into a frequency domain signal by preprocessing the input voice signal. The preprocessing unit 105 may include a pre-emphasis unit 110, a windowing unit 120 and a Fourier transform unit 130.
The pre-emphasis unit 110 performs pre-emphasis on the input voice signal. Assuming that the voice signal is s(n) and the m-th frame signal is d(m,n) when the signal s(n) is divided into a plurality of frames, the signal d(m,n) and the signal d(m,D+n), which is pre-emphasized and overlaps the rear portion of the previous frame, are expressed by Equation (1):

d(m,n) = d(m−1, L+n), 0 ≤ n ≤ D
d(m,D+n) = s(n) + ζ·s(n−1), 0 ≤ n ≤ L    (1)

where D is the length by which the frame overlaps the previous frame, L is the frame length, and ζ is a constant used in the pre-emphasis process.
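The framing and pre-emphasis of Equation (1) can be sketched in Python as follows; the frame length L, overlap D, and the pre-emphasis constant ζ below are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

def pre_emphasize_frame(prev_frame, s, L=256, D=64, zeta=-0.97):
    """Build one analysis frame per Equation (1): the first D samples
    overlap the tail of the previous frame, d(m,n) = d(m-1, L+n); the
    remaining L samples are the pre-emphasized new input s(n) + zeta*s(n-1)."""
    frame = np.empty(D + L)
    frame[:D] = prev_frame[L:L + D]                      # overlap with previous frame
    delayed = np.concatenate(([0.0], s[:L - 1]))         # s(n-1), with s(-1) taken as 0
    frame[D:] = s[:L] + zeta * delayed                   # pre-emphasis
    return frame
```

A typical choice such as ζ = −0.97 boosts high frequencies relative to the low-frequency bulk of the voice signal before the spectrum is computed.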
The windowing unit 120 applies a predetermined window (for example, a Hamming window) to the pre-emphasized signal. The signal y(n), to which the predetermined window has been applied, is discrete-Fourier transformed into a frequency domain signal using Equation (2):

Ym(k) = Σ (n = 0 to N−1) y(n)·e^(−j2πnk/N)    (2)

where Ym(k) is divided into a real part and an imaginary part, and N is the transform length.
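The windowing and transform steps might look as follows in Python; the Hamming window follows the example in the text, while the FFT length is an assumption.

```python
import numpy as np

def to_power_spectrum(frame, n_fft=512):
    """Apply a Hamming window, then a discrete Fourier transform (Equation (2)).
    Y_m(k) is complex (real + imaginary parts); the squared magnitude is
    returned so the later smoothing stage works on a power spectrum."""
    y = frame * np.hamming(len(frame))
    Y = np.fft.rfft(y, n=n_fft)            # real-input FFT: bins 0 .. n_fft/2
    return np.abs(Y) ** 2
```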
A low-pass filtering unit 140 low-pass-filters the transformed frequency domain signal. This low-pass filtering process removes relatively high frequency components. The reason for performing low-pass filtering is to prevent a spectrum from being affected by pitch harmonics as well as to acquire a smooth spectrum. In this case, the term “pitch” refers to the fundamental frequency of a voice signal and the term “harmonic” refers to a frequency that is an integer multiple of the fundamental frequency.
Furthermore, low-pass filtering helps consonants maintain parameter values similar to those of vowels. Vowels are mainly composed of low frequency components, so that the voice signals thereof are smooth, but relative to vowels, the consonants have many high frequency components, so that the voice signals thereof are not smooth. The present invention distinguishes voice from non-voice noise based on a single determination criterion (parameter) regardless of vowels and consonants, and thus, uses low-pass filtering.
The present invention uses a Chebyshev low-pass filter as one example of the low-pass filter. The normalized cutoff frequency of the Chebyshev low-pass filter is 0.1, and its order is 3. For the Chebyshev low-pass filter, a magnitude graph for respective frequencies is shown in
After the low-pass filtering process, a sub-sampling process is performed, if necessary. Sub-sampling is a process of decreasing the number of samples; for example, if there are 2n samples, ½ sub-sampling halves the amount of data to n samples. Sub-sampling decreases the amount of calculation, so it is suitable for distinguishing voice from non-voice noise on equipment with limited system performance.
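A minimal sketch of this smoothing stage, using SciPy's Chebyshev type-I design with the stated order 3 and normalized cutoff 0.1; the passband-ripple value and the sub-sampling factor are assumptions.

```python
import numpy as np
from scipy.signal import cheby1, filtfilt

def smooth_spectrum(power_spectrum, subsample=2):
    """Low-pass-filter the spectrum along the frequency axis to suppress
    pitch harmonics and obtain a smooth spectrum, then keep every
    `subsample`-th sample to reduce the amount of later calculation."""
    b, a = cheby1(N=3, rp=0.5, Wn=0.1)     # order 3, normalized cutoff 0.1
    smoothed = filtfilt(b, a, power_spectrum)
    return smoothed[::subsample]
```

Zero-phase filtering (`filtfilt`) is used here so the smoothing does not shift spectral features along the frequency axis; a plain `lfilter` pass would also fit the description.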
A sigmoid compression unit 150 performs sigmoid compression on the low-pass-filtered signal. The spectral peaks of the input signal have different values, and when passed through the sigmoid compression process, the peaks of the spectrum become uniform.
For sigmoid compression, the sigmoid compression unit 150 applies a sigmoid compression equation, such as the following Equation (3), to each frequency:

F(x) = α/(α + e^(−β·(x−μ)))    (3)

Here, x is a component (sample) of a spectrum vector composed of the low-pass-filtered samples, F(x) is a spectrum vector generated by the sigmoid compression, and μ is a component (sample) of a vector composed of average values (hereinafter referred to as "sample averages") for the respective samples. The value μ may be acquired in one of two ways: by taking a sample average over current frames regardless of whether they comprise a voice region (first method), or by taking a sample average for each frequency over consecutive frames in a non-voice region (second method). The first method yields a single μ, whereas the second yields a vector of different μ values for the respective frequencies, so the second method is very efficient when the noise is colored.
The constant α determines the value acquired when x is identical to the average value, namely α/(α+1). If α is set to 1, this value is 0.5. Since values close to the average value are likely to represent non-voice signals, it is preferred that α be determined so that the sigmoid compression value at the average is small; as a result, it is preferable that α be smaller than 1.
Furthermore, β represents the extent to which a spectrum x affects the sigmoid function, that is, the extent of influence of the sigmoid function. Thus, when β is adjusted, it is possible to adjust the gain of the sigmoid function.
In the present invention, β may appropriately be the inverse of the average of the spectrum, including voices. For example, when the sample average is 3000, it is appropriate that β be about 0.0003.
A result value (hereinafter referred to as a “sigmoid value”) generated by the sigmoid compression has an approximately intermediate value for silence. For voice, the sigmoid value is approximately 1 when x is much larger than the sample average, and is approximately 0 when x is much smaller than the sample average.
As described above, sigmoid compression performs the role of roughly classifying x into values which approximate the three values: 0, α/(α+1) and 1.
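Under the reading of Equation (3) implied by the α/(α+1) midpoint property above, the compression can be sketched as follows; the α and β values are illustrative (β chosen as roughly the inverse of a sample average of 3000, per the example in the text).

```python
import numpy as np

def sigmoid_compress(x, mu, alpha=0.8, beta=3e-4):
    """F(x) = alpha / (alpha + exp(-beta * (x - mu))).
    At x == mu this yields alpha/(alpha+1); far above the sample average it
    approaches 1, and far below it approaches 0, so the spectrum is roughly
    classified into three levels."""
    return alpha / (alpha + np.exp(-beta * (np.asarray(x, dtype=float) - mu)))
```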
For example, when sigmoid compression is performed using the signal shown in
A parameter generation unit 160 generates a scalar voice detection parameter (hereinafter referred to as a "parameter") that can represent a spectrum vector (that is, F(x)) by transforming the spectrum vector that has passed through the sigmoid compression process. The transformation is performed in a manner similar to summing an entropy-like contribution from each spectrum vector component, through which the vector value is transformed into a scalar value.
If one component of any compressed spectrum vector F(x) is expressed as yk (F(x) is composed of the components {y0, y1, . . . , yn−1}), the parameter is calculated using Equation (4):

P(x) = −Σ (k = 0 to n−1) yk·log yk    (4)
As described above, since the parameter is generated through a vector-to-scalar transformation, one spectrum vector can be reduced to a single number. Voices, which form a broadband signal, have information up to 6 kHz and may have different spectrum shapes depending on voice features. Using the parameter, however, it is possible to make a numerical determination regardless of the input signal band, spectrum shape, or the like.
One thing that differs from general entropy computation is the removal of the limitation that the vector components must sum to 1.
When the signal resulting from sigmoid compression, as shown in
Meanwhile, a voice region determination unit 170 determines that the region in which the parameter exceeds a predetermined value is a voice region by comparing the generated parameter with the predetermined value. In
Each component of
The method of detecting a voice region includes step S5 of converting an input voice signal into a frequency domain signal by preprocessing the input voice signal, step S60 of performing sigmoid compression on the converted signal, step S70 of transforming a spectrum vector generated by the sigmoid compression into a voice detection parameter in scalar form, and step S80 of extracting the voice region using the parameter, and may further include step S40 of low-pass-filtering the converted frequency domain signal and providing it as an input for sigmoid compression.
Furthermore, step S40 may include sub-sampling step S50 of decreasing the number of samples.
In this case, step S5 is an example, and may be further divided into step S10 of pre-emphasizing the input voice signal, step S20 of applying a predetermined window to the pre-emphasized signal, and step S30 of Fourier transforming the signal to which the window has been applied.
As described above, step S60 may be performed according to Equation (3), and step S70 may be performed according to Equation (4).
Furthermore, step S80 is performed by comparing the parameter with a predetermined threshold value and determining that the region in which the parameter exceeds the threshold value is a voice region.
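Step S80 reduces to a per-frame threshold comparison, as described above; in the sketch below the threshold value is an assumption that would be tuned in practice.

```python
import numpy as np

def detect_voice_regions(params, threshold=1.5):
    """Step S80: mark each frame whose scalar voice detection parameter
    exceeds the threshold as belonging to a voice region.
    `params` holds one parameter value per frame."""
    params = np.asarray(params, dtype=float)
    return params > threshold
```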
Several experiments using the present invention were performed and the results are described below. Assuming that a clean voice signal as shown in
Upon observation of the results, it can be appreciated that the respective figures show conspicuous peaks in the voice region, and that the parameter values in the non-voice region do not vary even though the SNR varies.
The present invention is also resistant to burst noise.
Referring to
Voice region detection is a necessary element for a voice recognition system in a terminal having insufficient calculation capacity, and it directly improves voice recognition performance and user convenience.
In accordance with the present invention, parameters that are attained through a small amount of calculation and that enable the detection of a voice region, are provided for voice region detection.
Furthermore, in accordance with the present invention, a voice region detection method is provided whose determination logic is not altered depending on noise and that is resistant to various types of noise such as burst noise and continuous noise.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims.
Oh, Kwang-cheol, Park, Ki-young
Patent | Priority | Assignee | Title |
4959865, | Dec 21 1987 | DSP GROUP, INC , THE | A method for indicating the presence of speech in an audio signal |
5611019, | May 19 1993 | Matsushita Electric Industrial Co., Ltd. | Method and an apparatus for speech detection for determining whether an input signal is speech or nonspeech |
6023671, | Apr 15 1996 | Sony Corporation | Voiced/unvoiced decision using a plurality of sigmoid-transformed parameters for speech coding |
6031915, | Jul 19 1995 | Olympus Optical Co., Ltd. | Voice start recording apparatus |
6411925, | Oct 20 1998 | Canon Kabushiki Kaisha | Speech processing apparatus and method for noise masking |
6427134, | Jul 03 1996 | British Telecommunications public limited company | Voice activity detector for calculating spectral irregularity measure on the basis of spectral difference measurements |
6453291, | Feb 04 1999 | Google Technology Holdings LLC | Apparatus and method for voice activity detection in a communication system |
6574592, | Mar 19 1999 | Kabushiki Kaisha Toshiba | Voice detecting and voice control system |
6658380, | Sep 18 1997 | Microsoft Technology Licensing, LLC | Method for detecting speech activity |
6778954, | Aug 28 1999 | SAMSUNG ELECTRONICS CO , LTD | Speech enhancement method |
6782363, | May 04 2001 | WSOU Investments, LLC | Method and apparatus for performing real-time endpoint detection in automatic speech recognition |
7412376, | Sep 10 2003 | Microsoft Technology Licensing, LLC | System and method for real-time detection and preservation of speech onset in a signal |
7440892, | Mar 11 2004 | Denso Corporation | Method, device and program for extracting and recognizing voice |
20020116189, | |||
20040030544, | |||
20050131689, | |||
EP909442, | |||
KR100450787, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jan 25 2006 | OH, KWANG-CHEOL | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 017515 | /0893 | |
Jan 25 2006 | PARK, KI-YOUNG | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 017515 | /0893 | |
Jan 27 2006 | Samsung Electronics Co., Ltd. | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Dec 21 2011 | ASPN: Payor Number Assigned. |
Jan 30 2015 | REM: Maintenance Fee Reminder Mailed. |
Jun 21 2015 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |
Jul 20 2015 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |