A speech coding system employs measurements of robust features of speech frames whose distributions are not strongly affected by noise/levels to make voicing decisions for input speech occurring in a noisy environment. Linear programming analysis of the robust features and respective weights are used to determine an optimum linear combination of these features. The input speech vectors are matched to a vocabulary of codewords in order to select the corresponding, optimally matching codeword. Adaptive vector quantization is used in which a vocabulary of words obtained in a quiet environment is updated based upon a noise estimate of a noisy environment in which the input speech occurs, and the "noisy" vocabulary is then searched for the best match with an input speech vector. The corresponding clean codeword index is then selected for transmission and for synthesis at the receiver end. The results are better spectral reproduction and significant intelligibility enhancement over prior coding approaches. Robust features found to allow robust voicing decisions include: low-band energy; zero-crossing counts adapted for noise level; AMDF ratio (speech periodicity) measure; low-pass filtered backward correlation; low-pass filtered forward correlation; inverse-filtered backward correlation; and inverse-filtered pitch prediction gain measure.
14. A method for speech coding for transmission comprising:
providing a first vector quantization codebook of reference vectors, each reference vector representing spectral characteristics of a corresponding time frame of a reference speech signal;
accepting an input audio signal;
updating the reference vectors of the first vector quantization codebook to produce a corresponding second vector quantization codebook, including updating each reference vector to represent spectral characteristics of a combination of the corresponding time frame of the reference speech signal and a background noise present in the input audio signal; and
quantizing a time frame of the input audio signal according to the second vector quantization codebook, selecting the most similar updated reference vector to spectral characteristics of said time frame of the input audio signal.
12. In a method of low-bit-rate speech coding of input speech occurring in a noisy environment, for a system which employs linear predictive coding (LPC) analysis of input speech frames to generate reflection coefficients, conversion of the reflection coefficients to vectors representing spectral parameters of the input speech frames, and matching of the spectral parameter vectors against reference vectors of a vocabulary of codewords generated in a training sequence in order to select the corresponding index of an optimally matching codeword for transmission,
the improvement comprising the steps of:
selecting a set of features which are characterized by a probability distribution which is not strongly affected in the noisy environment and which allow discrimination between voiced and unvoiced input speech;
measuring the selected features for input speech frames;
using said feature measurements to make voiced/unvoiced speech decisions in order to select the voiced/unvoiced excitation for speech synthesis in the receiver; and
using noise estimates to update the reference vectors of the vocabulary of codewords, wherein new reference vectors are generated corresponding to said vocabulary of codewords in the noisy environment, said noise estimates including noise amplitude and noise reflection coefficients, wherein said noise estimate for speech frame i is performed only if the ith speech frame is unvoiced and more than a given number L of continuous unvoiced speech frames are accumulated, in order to prevent using voiced or unvoiced speech in the noise estimate.
1. In a method of low-bit-rate speech coding of input speech occurring in a noisy environment, for a system which employs linear predictive coding (LPC) analysis of input speech frames to generate reflection coefficients, conversion of the reflection coefficients to vectors representing spectral parameters of the input speech frames, and matching of the spectral parameter vectors against reference vectors of a vocabulary of codewords generated in a training sequence in order to select the corresponding index of an optimally matching codeword for transmission,
the improvement comprising the steps of:
selecting a set of at least two features which are characterized by a probability distribution which is not strongly affected in the noisy environment and which allow discrimination between voiced and unvoiced input speech, wherein said selected features include the feature of zero-crossing counts which are based on average noise energy;
measuring the selected features for input speech frames;
using said feature measurements to make voiced/unvoiced speech decisions in order to select the voiced/unvoiced excitation for speech synthesis in the receiver; and
using noise estimates to update the reference vectors of the vocabulary of codewords, wherein new reference vectors are generated corresponding to said vocabulary of codewords in the noisy environment, said noise estimates including noise amplitude and noise reflection coefficients, wherein said noise estimate for speech frame i is performed only if the ith speech frame is unvoiced and more than a given number L of continuous unvoiced speech frames are accumulated, in order to prevent using voiced or unvoiced speech in the noise estimate.
24. A speech encoder comprising:
an input processor for accepting an input audio signal;
a background noise estimator coupled to the input processor for estimating a characteristic of a background noise present in the input audio signal;
a first vector quantization codebook including reference vectors, each reference vector representing spectral characteristics of a corresponding time frame of a reference speech signal;
a second vector quantization codebook of updated reference vectors, each updated reference vector corresponding to a different one of the reference vectors of the first vector quantization codebook, and each updated reference vector representing spectral characteristics of a combination of the corresponding time frame of the reference speech signal and the background noise present in the input audio signal;
a codebook updater coupled to the first vector quantization codebook, the background noise estimator, and the second vector quantization codebook, configured to accept reference vectors of the first vector quantization codebook and the characteristic of the background noise and produce the reference vectors of the second vector quantization codebook;
a vector quantizer coupled to the second vector quantization codebook and the input processor, configured to quantize a time frame of the input audio signal according to the second vector quantization codebook by selecting the most similar reference vector of said second codebook to spectral characteristics of said time frame of the input audio signal; and
a transmitter coupled to the vector quantizer to send an index of the reference vector of the first vector quantization codebook that corresponds to the selected most similar reference vector of the second vector quantization codebook.
2. A low-bit-rate speech coding method according to
3. A low-bit-rate speech coding method according to
4. A low-bit-rate speech coding method according to
5. A low-bit-rate speech coding method according to
6. A low-bit-rate speech coding method according to
7. A low-bit-rate speech coding method according to
8. A low-bit-rate speech coding method according to
9. A low-bit-rate speech coding method according to
10. A low-bit-rate speech coding method according to
11. A low-bit-rate speech coding method according to
13. A low-bit-rate speech coding method according to
15. The method of
transmitting an index of a reference vector of the first vector quantization codebook that corresponds to the selected updated reference vector of the second vector quantization codebook; receiving the transmitted index; and synthesizing a time frame of an output signal, including selecting a reference vector of the first vector quantization codebook according to the received index.
16. The method of
17. The method of
18. The method of
determining whether any of a consecutive series of a predetermined number of time frames of the input audio signal includes voiced speech; and if none of the series of time frames includes voiced speech, estimating the characteristic of the background noise in at least one of said series of time frames.
19. The method of
20. The method of
21. The method of
22. The method of
estimating the characteristic of the background noise includes determining autocorrelation coefficients of the background noise; and updating each reference vector includes determining an autocorrelation coefficient representation of the spectral characteristics associated with the reference vector, and combining said autocorrelation coefficients with the autocorrelation coefficients of the background noise to produce an autocorrelation coefficient representation of the spectral characteristics represented by the updated reference vector.
23. The method of
This is a continuation of application Ser. No. 07/695,571 filed May 3, 1991 now abandoned.
The United States Government has rights in this invention pursuant to RADC Contract F30602-89-C-0118 awarded by the Department of the Air Force.
The present invention relates to enhanced speech coding techniques for low-rate speech coders, and particularly, to improved speech frame analysis and vector quantization methods.
A low-bit-rate speech coder is disclosed in U.S. Pat. No. 4,975,956, issued to Y. J. Liu and J. H. Rothweiler, entitled "Low-Bit-Rate Speech Coder Using LPC Data Reduction Processing", which is incorporated herein by reference. This speech coder employs linear predictive coding (LPC) analysis to generate reflection coefficients for the input speech frames and pitch and gain parameters. To obtain a low bit rate of 400 bps, these parameters are further compressed. The reflection coefficients are first converted to line spectrum frequencies (LSFs) and formants. For even frames, these spectral parameters are vector-quantized into clean codeword indices. Odd frames are omitted, and are regenerated by interpolation at the decoder end. The vector quantization module compares the spectral parameters for an input word against a vocabulary of codewords for which vector indices have been generated and stored during a training sequence, and the optimally matching codeword is selected for transmission. Pitch and gain bits are quantized using trellis coding. Output speech is reconstructed from the regenerated vector-quantization indices using a matching codebook at the decoder end.
In a quiet background, this 400-bps speech coder has a high intelligibility for a low-bit-rate transmission. However, in a background of high noise, such as in a helicopter or jet, the encoded speech becomes unintelligible. A detailed study has shown that conversion of voicing and spectral parameters in the high-noise environment is the key to the loss of intelligibility. The LPC conversion causes a majority of voiced frames to become unvoiced. The result is a whispering LPC speech and an almost inaudible low-rate voice. Even if the voicing is correct, spectral distortion causes the low-rate voice to be significantly muffled and buzzy. Although the pitch has no audible errors, the gain has a predominantly annoying effect.
It is therefore a principal object of the invention to provide an improved low-bit-rate speech coder capable of high quality speech coding in a high-noise environment. In accordance with the invention, a two-step approach to conversion of voicing and spectral parameters is taken. In the first step, robust speech frame features whose distributions are not strongly affected by noise levels are generated. In the second step, linear programming is used to determine an optimum combination of these features. A technique of adaptive vector quantization is also used in which a clean codebook is updated based upon an estimate of the background noise levels, and the "noisy" codebook is then searched for the best match with an input speech vector. The corresponding clean codeword is then selected for transmission and for synthesis at the receiver end. The results are better spectral reproduction and significant intelligibility enhancement over the previous coding approach.
In a preferred implementation of the system for the environment of helicopter noise, it is found that the following features are well distributed to allow good discrimination between voiced and unvoiced speech: (1) low-band energy; (2) zero-crossing counts adapted for noise level; (3) AMDF ratio (speech periodicity) measure; (4) low-pass filtered, backward correlation; (5) low-pass filtered, forward correlation; (6) inverse-filtered backward correlation; and (7) inverse-filtered pitch prediction gain measure. By linear programming analysis, five of these robust features are determined to significantly improve voicing decisions in the speech coder system. Adaptive vector quantization, using estimates of the average noise amplitude and average noise reflection coefficients to update codebook vectors, significantly improves input vector matching.
The above principles and further features and advantages of the invention are described in detail below in conjunction with the drawings, of which:
Referring to
In
To identify speech parameters crucial for intelligibility in a high-noise environment, such as helicopter noise, several listening tests were performed comparing the performance of a clean speech file with a noisy speech file through LPC analysis. The listening tests showed that the voicing and spectrum parameters of LPC conversion must be enhanced to obtain intelligible speech coding. Also, the gain parameter requires correction to eliminate an annoying noise effect.
In the following preferred embodiments of the invention, enhanced techniques for low-bit-rate coding are applied to a 400-bps speech coder in the environment of helicopter noise. However, the principles of the invention illustrated herein are applicable for other low bit rates of transmission and to other types of noisy environments as well.
To achieve the low bit rate of 400 bps, spectral parameters are not quantized with every speech frame. As described in the aforementioned U.S. Pat. No. 4,975,956, vector quantization is performed for every even frame, while interpolation is performed for every odd frame. For the odd frame, interpolation bits are sent representing an interpolation factor used for the combination of the spectral codeword of its previous frame and future frame. Based upon a frame period of 22.5 msec used in a standard encoder, the preferred bit allocations are illustrated in Table I.
TABLE I

  Parameter       Even Frame   Odd Frame   Two Frames
  Spectral            10           0           10
  Gain                 2           2            4
  Pitch                1           1            2
  Interpolation        0           2            2
  Total:              13           5           18
For even frames, a total of 13 bits are allocated. For odd frames, only 5 bits are allocated. For every pair of even and odd frames, a total of 18 bits are used. Assuming a 45 msec period for every two frames, this bit allocation scheme fits within the 400 bits/second requirement.
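As an illustrative check (not part of the coder itself), the Table I allocation can be verified against the 400-bps budget:

```python
# Verify the Table I allocation: 18 bits per even/odd frame pair,
# one pair every 45 msec (two 22.5 msec frames), gives exactly 400 bps.
FRAME_MS = 22.5
even_bits = 10 + 2 + 1 + 0   # spectral + gain + pitch + interpolation
odd_bits = 0 + 2 + 1 + 2

bits_per_pair = even_bits + odd_bits                # 18 bits per pair
bit_rate = bits_per_pair * 1000 / (2 * FRAME_MS)    # bits per second
print(bit_rate)
```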
The major operations for obtaining robust voicing decisions include preliminary processing, robust feature extraction, voicing classification, and voicing smoothing. The specific parameters of these processing steps depend upon the different applications and environments. In the described example, voicing decisions are made every half frame or 11.25 msec. To enable robust voicing decisions, feature distributions without strong dependence on noise levels are necessary. The selected features are then combined using optimum weights in a linear combination.
Following the usual operations in LPC analysis, the preliminary processing includes high-pass filtering, voicing-window decisions, and low-pass filtering. The low-pass filtering is particularly important for robust voicing decisions in a high noise environment. Even though real-world noise, such as helicopter noise, is usually distributed in characteristic patterns, the spectral strength is normally weak in the low frequency band. A typical spectrum of helicopter noise is shown in
Voicing decisions are the determination of fundamental periodicity in the input speech. For human speech, the fundamental frequency is usually below 400 Hz. Therefore, a good choice of the cut-off frequency is about 420 Hz. Using the Remez exchange algorithm, a low-pass filter with cut-off frequency at 420 Hz and transition frequency at 650 Hz is used. This filter is selected to be even-symmetric with 40 taps. Typical values for the first 20 taps, hk, k=0, . . . , 19, are illustrated in Table II.
TABLE II

  Tap    Value
  h0      0.01787624
  h1      0.02237480
  h2      0.002685766
  h3      0.01303141
  h4     -0.0001381086
  h5     -0.001044893
  h6     -0.01218479
  h7     -0.01683313
  h8     -0.02370618
  h9     -0.02454394
  h10    -0.02252495
  h11    -0.01385341
  h12    -0.003387984
  h13     0.01871256
  h14     0.04112903
  h15     0.0654924
  h16     0.08902424
  h17     0.109489
  h18     0.124534
  h19     0.132543
The next 20 tap values are determined from symmetry and are given as follows: hk = h39-k, k = 20, . . . , 39.
All the features are extracted in the low-frequency band to minimize the noise corruption. The filtered speech can be computed as follows, where the input speech after high-pass filtering is sn:

ln = h0 sn + h1 sn-1 + . . . + h39 sn-39
A spectral plot of the effect of the low-pass filter is illustrated in
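As an illustrative sketch (floating-point Python rather than the coder's DSP implementation), the 40-tap even-symmetric filter can be assembled from the Table II values and applied by direct convolution:

```python
# First 20 taps from Table II; the remaining 20 follow from even
# symmetry, h[39 - k] = h[k].
taps = [0.01787624, 0.02237480, 0.002685766, 0.01303141, -0.0001381086,
        -0.001044893, -0.01218479, -0.01683313, -0.02370618, -0.02454394,
        -0.02252495, -0.01385341, -0.003387984, 0.01871256, 0.04112903,
        0.0654924, 0.08902424, 0.109489, 0.124534, 0.132543]
h = taps + taps[::-1]   # 40-tap even-symmetric impulse response

def lowpass(s):
    """l_n = sum_k h_k * s_{n-k}; samples before the start are taken as 0."""
    return [sum(h[k] * s[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(s))]
```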
Two major criteria for good robust features are that their distributions must not strongly depend upon noise levels and that they must have good voiced/unvoiced discrimination. Speech samples were evaluated for male and female speakers in a quiet environment with a signal-to-noise ratio of 30 dB, and in a noisy environment with a signal-to-noise ratio of -10 dB. Robust features were then selected on the basis of both low-frequency distributions and voiced/unvoiced discriminations, using low-band energy measurements, zero-crossing rate, and selected correlation calculations as factors. The processing steps for the enhancement techniques of the present invention, including extraction of the robust features, their use for robust voicing decisions, noise estimation, and updating a clean codebook, are illustrated in the block diagram of FIG. 5.
Low-band energy distribution is a measure of energy in the low-frequency band. Typically, voiced speech has higher low-band energy than unvoiced speech. For normalization purposes, this energy is divided by the average voiced energy, as represented by the following equation, wherein ln represents the speech signal after 100 Hz high-pass filtering and 420 Hz low-pass filtering, and LEA represents the average voiced energy in the low band:
Another feature found to have robustness for good voicing decisions is measurement of the zero-crossing rate, i.e., the number of times the input signal crosses a zero (or reference) axis. In effect, it is a count of the high frequency content in the signal. Typically, unvoiced speech has a higher zero-crossing count than voiced speech. The zero-crossing count is accumulated by counting changes in sign of ln, which is defined as positive if ln > D, and negative if ln < -D.
To make the zero-crossing count robust in a noisy environment, it is counted in the low-frequency band, and the dither D is appropriately adjusted in noise. The low-band energy is computed according to the following equation:
For the jth frame, this energy is indicated by Ej. The low-band noise energy is first estimated by assuming there are always available 16 frames without speech activity. Using these 16 frames, the average low-band noise energy EN is computed as the mean of their low-band energies, EN = (E1 + E2 + . . . + E16)/16.
After these 16 frames, the low-band noise energy is updated at frame k if three conditions are satisfied. First, this frame must be unvoiced. Second, there must already be an accumulation of 16 continuous unvoiced frames before this current frame. Third, the ratio of current low-band energy to average low-band noise energy is less than 1.6. If all three conditions are satisfied at frame k, the average low-band noise energy is updated as follows:
To adapt the coefficient D to noise, a quantity a is defined as follows:
After evaluating a, the minimum of a and 20 is selected. Next, the quantity b, which is the maximum of that minimum and 10, is obtained. Mathematically, b is given by the following equation:

b = max(10, min(a, 20)),

where max represents the maximum and min represents the minimum. The adaptation coefficient D is updated as follows:
The newest value of D for frame k is then used to compute the sign of every low-pass filtered sample. The zero-crossing count then follows the procedure mentioned above. The performance of the zero-crossing count is indicated in
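The noise-adapted zero-crossing count described above can be sketched as follows. Two assumptions are labeled in the comments: samples inside the dither band are simply skipped, and the adaptation quantity a (whose exact definition is not reproduced here) is taken as precomputed:

```python
def zero_crossings(l, D):
    # A sample is counted as positive only if l_n > D and negative only if
    # l_n < -D; samples inside the dither band are skipped (an assumption --
    # the source does not state how in-band samples are treated).
    signs = [1 if x > D else -1 for x in l if abs(x) > D]
    return sum(1 for s0, s1 in zip(signs, signs[1:]) if s0 != s1)

def clip_adaptation(a, lo=10.0, hi=20.0):
    # b = max(10, min(a, 20)), with `a` assumed precomputed from the
    # low-band noise energy as described in the text
    return max(lo, min(a, hi))
```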
Another feature found to have robustness for speech coding in a noisy environment is a measure of the periodicity of speech, referred to herein as the AMDF (average magnitude difference function) measure. Typically, voiced speech has smaller AMDF values than unvoiced speech. The AMDF computation is done using inverse-filtered speech obtained by passing the low-pass signal through a second-order LPC filter. If vi represents the inverse-filtered speech sample, the AMDF value is computed as follows:
where τ represents the 60 possible pitch lags ranging from 20 samples to 156 samples. These 60 possible lags are searched to find a maximum and a minimum. This feature is then computed as the ratio of maximum AMDF to minimum AMDF, i.e., R=max(AMDF)/min(AMDF). The performance of the AMDF ratio measure is demonstrated in
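The AMDF ratio feature can be sketched as below. A standard mean-absolute-difference form of the AMDF is assumed, since the exact summation limits are not reproduced here, and the set of candidate lags is passed in explicitly:

```python
def amdf(v, tau, n):
    # mean absolute difference between the inverse-filtered signal and
    # itself delayed by tau, averaged over n samples (assumed AMDF form)
    return sum(abs(v[i] - v[i - tau]) for i in range(tau, tau + n)) / n

def amdf_ratio(v, lags, n=60):
    # ratio of maximum to minimum AMDF over the candidate pitch lags;
    # large ratios indicate strong periodicity (voiced speech)
    vals = [amdf(v, t, n) for t in lags]
    return max(vals) / max(min(vals), 1e-12)
```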
A fourth robust feature for voicing decisions in speech coding is a measure of correlation strength at the pitch period, which is a low-pass filtered backward correlation. Typically, voiced speech has higher correlation values than unvoiced speech. However, the correlation is done using negative pitch lags, and is defined mathematically as follows:
where τ represents the pitch period. The above equation shows this feature normalized with respect to low-pass energy with and without negative pitch lag. The performance of this feature is demonstrated in
A fifth robust feature for voicing decisions is a measure of correlation strength via low-pass filtered forward correlation using a positive pitch lag. Typically, the voiced speech has higher correlation values than unvoiced speech. It is defined mathematically as follows:
where τ represents the pitch period. The above equation shows this feature normalized with respect to low-pass energy with and without positive pitch lag. The performance of this feature is demonstrated in
Another feature is an inverse-filtered backward correlation, which is also a measure of correlation strength at the pitch period using backward pitch lag. The main difference from the two previous correlation measures is the use of inverse-filtered speech vi. Again, the voiced speech has higher correlation values than unvoiced speech. It is defined mathematically as follows:
where τ represents the pitch period. Normalization is done the same way as before with and without pitch lag. The performance of this feature is demonstrated in
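The three correlation features share the same underlying computation; the following sketch shows an assumed normalized correlation at the pitch lag (the patent's exact backward/forward normalizations are not reproduced, so a standard normalization by the energies with and without the lag is used):

```python
import math

def pitch_correlation(l, tau, n):
    # normalized correlation between the (low-pass or inverse-filtered)
    # signal and its copy delayed by the pitch lag tau, over n samples
    num = sum(l[i] * l[i - tau] for i in range(tau, tau + n))
    e0 = sum(l[i] ** 2 for i in range(tau, tau + n))         # without lag
    e1 = sum(l[i - tau] ** 2 for i in range(tau, tau + n))   # with lag
    return num / math.sqrt(e0 * e1) if e0 > 0 and e1 > 0 else 0.0
```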
Another feature found to have robustness for voicing decisions is the second-order pitch-prediction gain after inverse filtering, which is also a measure of speech periodicity. The pitch-prediction residual is given by the following equation:
where a1 and a2 are prediction coefficients. The optimum prediction coefficients can be found by differentiating δ with respect to both a1 and a2. Substituting these two optimum values into the above equation, the optimum prediction residual is expressed as follows:
where E represents the zeroth-order autocorrelation coefficient and R represents the normalized autocorrelation coefficients. The second term in the above equation is the prediction gain. The feature used for voicing decisions is slightly modified by rearranging the above equation as follows:
For voiced speech, g has larger values than for unvoiced speech. The performance of this feature is demonstrated in
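The pitch-prediction-gain feature can be sketched by fitting the two prediction coefficients by least squares and reporting the fraction of energy removed by the predictor. This follows the equations above in form; the patent's exact modified gain g is not reproduced, so the plain energy-reduction fraction is used:

```python
def pitch_prediction_gain(v, tau, n):
    # fit v_n ~ a1*v_{n-tau} + a2*v_{n-tau-1} by least squares over n samples
    s = range(tau + 1, tau + 1 + n)
    r11 = sum(v[i - tau] ** 2 for i in s)
    r22 = sum(v[i - tau - 1] ** 2 for i in s)
    r12 = sum(v[i - tau] * v[i - tau - 1] for i in s)
    p1 = sum(v[i] * v[i - tau] for i in s)
    p2 = sum(v[i] * v[i - tau - 1] for i in s)
    det = r11 * r22 - r12 * r12
    if det == 0:
        return 0.0
    a1 = (p1 * r22 - p2 * r12) / det    # optimum prediction coefficients
    a2 = (p2 * r11 - p1 * r12) / det
    e = sum(v[i] ** 2 for i in s)       # zeroth-order energy
    resid = e - a1 * p1 - a2 * p2       # optimum prediction residual
    return 1.0 - resid / e if e > 0 else 0.0
```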
All of the seven features discussed above are found to have good discriminations and robust distributions. Further information on the features can be found in the references, "Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm" by J. Campbell and T. Tremain, ICASSP'86, and "An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch" by S. Y. Kwon and A. J. Goldberg, ASSP-32, 1984. Other robust features may be found using the same criteria. The histogram plots show that there are always some overlaps between voiced bins and unvoiced bins for all features. Therefore, no single feature should be relied upon to make voicing decisions. To minimize potential error, a combination of the features is utilized, as depicted in
A frame is classified as voiced if w1 f1 + w2 f2 + . . . + wn fn > c, where fj represents the jth feature, wj represents the weight assigned to that feature, and c is a constant. A frame is classified as unvoiced if the reverse inequality holds. The optimum weights for the combination are determined using linear programming analysis of representative training patterns in which helicopter noise is mixed with clean speech. The correct voicing decisions are measured against LPC analysis of the clean speech. The linear programming analysis solves the inequality equations using the well-known simplex method of linear optimization by first converting them to equalities using slack and surplus variables:
The above equations are solved by maximizing a quantity h. A hyperplane is found separating the voiced region from the unvoiced region, and h is defined to be the average distance between the voiced region and the unvoiced region, given as follows:
The optimum weights are found when h is maximized for the training patterns.
The simplex method starts with an initial feasible solution. However, an initial solution is difficult to find if the number of equations becomes large. To simplify the initial solution, some artificial values are introduced, and the basic equations become as follows:
where the weights wj, j=n+m+k+1, . . . n+k+2m are artificial variables. All the artificial variables are also assigned the negative maximum weight. The quantity h is then given below:
where M is an arbitrarily large number. The solutions are then iterated until all artificial variables are removed and the quantity h can no longer be increased. For a further discussion of this type of linear programming analysis, reference is made to "A Procedure For Using Pattern Classification Techniques To Obtain A Voiced/Unvoiced Classifier", by L. Siegel, IEEE Trans., ASSP-27, February 1979, and Linear Programming, by G. Hadley, published by Addison Wesley, 1963.
Analyses performed by the above-described procedures showed that the five most useful features for the helicopter-noise patterns are low-band energy, zero-crossing rate, AMDF measure, low-pass filtered backward correlation, and inverse-filtered pitch-prediction gain. Therefore, these five features are combined in this example to make decisions as to when the input speech frames are voiced or unvoiced. Voicing smoothing may also be used to desensitize the voicing decisions to rapid transitions in values. Factors considered in smoothing include the discriminant magnitude of the voiced/unvoiced decisions, the onset of a rapid transition (between half frames), and continuity (which requires no instantaneous change of voicing). The voicing is determined every half frame of 11.25 msec. In order to facilitate the smoothing decisions, the final voicing decisions may be delayed two frames.
Referring again to
A background noise estimate is performed in two parts. One is the average noise amplitude Nai, and the other is the average noise reflection coefficients Baij, j=1, . . . , P, where i represents the current frame number, j represents the coefficient number, and P is the LPC order. To prevent using voiced or unvoiced speech in the computation, the noise estimate for frame i is only performed if two conditions are satisfied: frame i is decided to be unvoiced; and there must be an accumulation of more than a given number L of continuous unvoiced frames. To count continuous unvoiced frames, a counter n is reset on each voiced frame and incremented on each unvoiced frame. For n>L, the following noise estimates are computed:
The average noise reflection coefficients Ba are further converted to noise autocorrelation coefficients RN. To compute RN and Na at frame i, the values at frame i-15 are utilized. This greatly reduces the probability of including speech frames. The noise estimate parameters RN and Na are then used to add noise parameters to the codebook vectors.
The LSFs are converted to autocorrelation coefficients for each codeword in the clean codebook. As described previously, the higher-order LPC vector can enhance discrimination of the formants in noise, and the codebook is preferably designed using a 14th-order LPC analysis, i.e. P=14. Assuming there are N codewords in the codebook, and each codeword has P autocorrelation coefficients, and RCkj represents the jth coefficient of the kth codeword, then the noise autocorrelation coefficients are added to each codeword as follows:
where RC'kj represents the updated codeword coefficient and Qi represents the mixing ratio at the ith frame. The mixing ratio is determined from the noise amplitude Nai, as follows:
where f is a factor determined empirically, according to the level of noise amplitude, as follows:
The codebook update is performed only when the counter n is at a multiple of J frames, which is adjustable depending upon the processor speed. For a very fast processor, the codebook could be updated every frame. In this case, the mixing ratio Qi is determined empirically to depend upon the signal-to-noise ratio, as follows:
where Si represents the speech amplitude at frame i. This mixing ratio is used in the same way as described above to compute the updated codewords.
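The codebook update can be sketched as below. The additive mixing of the noise autocorrelation coefficients into each codeword is an assumed form consistent with the description above ("the noise autocorrelation coefficients are added to each codeword"); the empirical factor f and the exact formula for Qi are not reproduced, so the mixing ratio is passed in directly:

```python
def update_codebook(clean_codebook, rn, q):
    # RC'_kj = RC_kj + Q_i * RN_j for every codeword k and coefficient j
    # (assumed additive mixing form; q is the precomputed mixing ratio Q_i,
    # rn the noise autocorrelation coefficients RN)
    return [[rc + q * r for rc, r in zip(codeword, rn)]
            for codeword in clean_codebook]
```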
After computing the updated codebook of autocorrelation coefficients, each codeword is further converted to line-spectrum frequencies (LSFs) and formants. The input reflection coefficients are also converted to LSFs and formants. For the 14th-order LPC analysis, each vector for a voiced frame consists of 14 LSFs and two lowest frequency formants, and each vector for an unvoiced frame consists of 14 LSFs and one highest frequency formant. The N codewords of the codebook are then searched to find the codeword which has the best match with an input vector, and the corresponding index is transmitted to the receiver.
In the receiver, only the clean codebook of N codewords is stored. The received index is used to select the corresponding clean codeword for synthesis. Thus, even though an updated (noisy) codebook is used to produce better matching, a clean codebook is used for synthesis of output speech in which spectral distortion is greatly reduced.
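The match-noisy, transmit-clean scheme can be sketched as follows; a Euclidean distance on the spectral vectors is an assumed matching metric (the actual coder matches LSF/formant vectors as described above):

```python
def encode(input_vec, noisy_codebook):
    # match the input vector against the noise-updated codebook and
    # return only the index of the best-matching codeword
    dists = [sum((a - b) ** 2 for a, b in zip(input_vec, cw))
             for cw in noisy_codebook]
    return min(range(len(dists)), key=dists.__getitem__)

def decode(index, clean_codebook):
    # the receiver applies the same index to the clean codebook,
    # so synthesis uses the noise-free spectral vector
    return clean_codebook[index]
```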
The previous speech coder techniques as described in U.S. Pat. No. 4,975,956 could be implemented for 400-bps transmission using a 100 nsec DSP processor (equivalent to 10 Mips). The enhanced techniques can be implemented using two such DSPs, if tree searching for codeword matches and 32-frame codebook updates are used. Using the voicing decisions from LPC analysis of clean speech via the prior techniques as a reference, the performance of the new voicing decision techniques is illustrated in
Informal listening tests were also conducted both for speech samples in which noise was mixed with clean speech and those recorded in the actual helicopter noise environment. The listening tests showed none of the previous whispering LPC speech for either type of sample. The 400-bps speech in the noisy environment was reproduced as clearly audible but with some degradation in quality. To improve speech intelligibility, improved vector quantization can be applied.
The adaptive vector quantization was also tested using noisy speech samples of the same two types. The listening tests showed that there is always an intelligibility improvement using codebook adaptation. The degree of improvement depends upon three factors: signal-to-noise ratio; rate of codebook update; and the use of preemphasis. Tests on the effect of S/N ratio showed that the intelligibility improvement is quite significant at very low S/N, such as -10 dB. For higher S/N, the improvement is less audible, which is expected since there is less noise corruption. The intelligibility improvement depends only slightly on the rate of codebook update: updating every frame appeared only slightly better than updating every 32 frames. As to preemphasis, tests of mixed speech showed that the same factor as used in the clean codebook should be used, whereas for recorded speech a smaller preemphasis factor can significantly improve intelligibility.
The specific embodiments of the invention described herein are intended to be illustrative only, and many other variations and modifications may be made thereto in accordance with the principles of the invention. All such embodiments and variations and modifications thereof are considered to be within the scope of the invention, as defined in the following claims.
References Cited:

Patent | Priority | Assignee | Title |
4074069, | Jun 18 1975 | Nippon Telegraph & Telephone Corporation | Method and apparatus for judging voiced and unvoiced conditions of speech signal |
4091237, | Oct 06 1975 | Lockheed Missiles & Space Company, Inc. | Bi-Phase harmonic histogram pitch extractor |
4296279, | Jan 31 1980 | Speech Technology Corporation | Speech synthesizer |
4589131, | Sep 24 1981 | OMNISEC AG, TROCKENLOOSTRASSE 91, CH-8105 REGENSDORF, SWITZERLAND, A CO OF SWITZERLAND | Voiced/unvoiced decision using sequential decisions |
4630304, | Jul 01 1985 | Motorola, Inc. | Automatic background noise estimator for a noise suppression system |
4696038, | Apr 13 1983 | Texas Instruments Incorporated | Voice messaging system with unified pitch and voice tracking |
4720802, | Jul 26 1983 | Lear Siegler | Noise compensation arrangement |
4933973, | Feb 29 1988 | ITT Corporation | Apparatus and methods for the selective addition of noise to templates employed in automatic speech recognition systems |
4975956, | Jul 26 1989 | ITT Corporation; ITT CORPORATION, 320 PARK AVENUE, NEW YORK, N Y 10022 A CORP OF DE | Low-bit-rate speech coder using LPC data reduction processing |
5073940, | Nov 24 1989 | Ericsson Inc | Method for protecting multi-pulse coders from fading and random pattern bit errors |
5127053, | Dec 24 1990 | L-3 Communications Corporation | Low-complexity method for improving the performance of autocorrelation-based pitch detectors |
5459814, | Mar 26 1993 | U S BANK NATIONAL ASSOCIATION | Voice activity detector for speech signals in variable background noise |
5806024, | Dec 23 1995 | NEC Corporation | Coding of a speech or music signal with quantization of harmonics components specifically and then residue components |
6018707, | Sep 24 1996 | Sony Corporation | Vector quantization method, speech encoding method and apparatus |
6081776, | Jul 13 1998 | Lockheed Martin Corporation | Speech coding system and method including adaptive finite impulse response filter |