A speech coding apparatus comprises a repetition period pre-selecting unit for generating a plurality of candidates for the repetition period of a driving excitation source by multiplying the repetition period of an adaptive excitation source by a plurality of constant numbers, respectively, and for pre-selecting a predetermined number of candidates from all the candidates generated. A driving excitation source coding unit provides both excitation source location information and excitation source polarity information that minimize a coding distortion, for each of the predetermined number of candidates, and provides an evaluation value associated with the minimum coding distortion for each of the predetermined number of candidates. A repetition period coding unit compares the evaluation values provided for the predetermined number of candidates with one another, selects one candidate from the predetermined number of candidates according to the comparison result, and furnishes selection information indicating the selection result, excitation source location code, and polarity code.

Patent: RE43209
Priority: Nov 08 1999
Filed: Jan 28 2010
Issued: Feb 21 2012
Expiry: Nov 07 2020
0. 16. A speech coding method for coding an input speech on a frame-by-frame basis using an adaptive excitation source, which is generated from a past excitation source, and a driving excitation source, which includes a predetermined number of pulse locations and polarities being associated with the pulse locations, so as to generate speech code, said speech coding method comprising the steps of:
calculating by a cross-correlation calculating unit, a cross-correlation between a first impulse response generated when an impulse is placed at a first pulse location among a plurality of pulse locations and a second impulse response generated when an impulse is placed at a second pulse location among the plurality of pulse locations, for all combinations of the first and second pulse locations;
calculating a correlation between an impulse response generated when an impulse is placed at a pulse location among the plurality of pulse locations and a synthesized speech generated based on the adaptive excitation source, and modifying the cross-correlation calculated by said cross-correlation calculating step, using the correlation; and
searching for each location of the predetermined number of the pulse locations of the driving excitation source, using the cross-correlation modified by said cross-correlation modifying step.
0. 15. A speech coding apparatus for coding an input speech on a frame-by-frame basis using an adaptive excitation source, which is generated from a past excitation source, and a driving excitation source, which includes a predetermined number of pulse locations and polarities being associated with the pulse locations, so as to generate speech code, said speech coding apparatus comprising:
a cross-correlation calculating unit for calculating a cross-correlation between a first impulse response generated when an impulse is placed at a first pulse location among a plurality of pulse locations and a second impulse response generated when an impulse is placed at a second pulse location among the plurality of pulse locations, for all combinations of the first and second pulse locations;
a cross-correlation modifying unit for calculating a correlation between an impulse response generated when an impulse is placed at a pulse location among the plurality of pulse locations and a synthesized speech generated based on the adaptive excitation source, and for modifying the cross-correlation calculated by said cross-correlation calculating unit, using the correlation; and
a searching unit for searching for each location of the predetermined number of the pulse locations of the driving excitation source, using the cross-correlation modified by said cross-correlation modifying unit.
13. A speech coding apparatus for coding an input speech on a frame-by-frame basis using an adaptive excitation source, which is generated from a past excitation source, and a driving excitation source generated from the input speech and the adaptive excitation source, said driving excitation source being represented by locations and polarities of a plurality of excitation sources, so as to generate speech code, said speech coding apparatus comprising:
a pre-table calculating unit for calculating a correlation between a signal to be coded and each of a plurality of synthesized speeches each of which is generated based on a corresponding temporary driving excitation source that is a signal obtained by placing a predetermined excitation source at a corresponding one of all possible locations, and a cross-correlation between any two of the plurality of synthesized speeches, and for storing these calculated correlations and cross-correlations as a pre-table therein;
a pre-table modifying unit for calculating a correlation between the signal to be coded and a synthesized speech generated based on the adaptive excitation source, and a correlation between each of the plurality of synthesized speeches generated based on the corresponding temporary driving excitation source and the synthesized speech generated based on the adaptive excitation source, and for modifying said pre-table using these calculated correlations; and
a searching unit for determining the locations and polarities of the plurality of excitation sources using the pre-table corrected by said pre-table modifying unit so as to generate excitation source location code indicating the locations of the plurality of excitation sources and excitation source polarity code indicating the polarities of the plurality of excitation sources.
0. 1. A speech coding apparatus for coding an input speech on a frame-by-frame basis using an adaptive excitation source, which is generated from a past excitation source, and a driving excitation source, which is generated from the input speech and the adaptive excitation source, so as to generate speech code, said speech coding apparatus comprising:
a repetition period pre-selecting means for generating a plurality of candidates for a repetition period of the driving excitation source by multiplying a repetition period of the adaptive excitation source by a plurality of constant numbers, respectively, and for pre-selecting a predetermined number of candidates from all the candidates generated and furnishing the predetermined number of pre-selected candidates;
a driving excitation source coding means for providing both excitation source location information and excitation source polarity information that minimize a coding distortion, for each of the predetermined number of candidates for the repetition period of the driving excitation source, and for providing an evaluation value associated with the minimum coding distortion for each of the predetermined number of candidates; and
a repetition period coding means for comparing the evaluation values provided for the predetermined number of candidates for the repetition period of the driving excitation source from said driving excitation source coding means with one another, for selecting one candidate from the predetermined number of candidates according to a comparison result, and for furnishing selection information indicating a selection result, excitation source location code indicating excitation source location information associated with the selected candidate for the repetition period of the driving excitation source, and polarity code indicating excitation source polarity information associated with the selected candidate.
0. 2. The speech coding apparatus according to claim 1, wherein said repetition period pre-selecting means pre-selects two candidates from all the candidates generated, and said repetition period coding means encodes the selection result in one bit so as to generate 1-bit selection information.
0. 3. The speech coding apparatus according to claim 1, wherein said repetition period pre-selecting means includes a means for comparing the repetition period of the adaptive excitation source with a predetermined threshold value, and for pre-selecting the predetermined number of candidates from all the candidates generated according to a comparison result.
0. 4. The speech coding apparatus according to claim 1, wherein said repetition period pre-selecting means includes a means for generating a plurality of other adaptive excitation sources whose respective repetition periods equal to the plurality of candidates for the repetition period of the driving excitation source, respectively, and for pre-selecting the predetermined number of candidates from all the candidates generated according to a comparison between distances among the plurality of other adaptive excitation sources generated.
0. 5. The speech coding apparatus according to claim 1, wherein said plurality of constant numbers, by which the repetition period of the adaptive excitation source is multiplied, includes ½ and 1.
0. 6. A speech decoding apparatus for decoding input speech code on a frame-by-frame basis using an adaptive excitation source, which is generated from a past excitation source, and a driving excitation source, which is generated from the input speech code and the adaptive excitation source, so as to reconstruct original speech, said speech decoding apparatus comprising:
a repetition period pre-selecting means for providing a plurality of candidates for a repetition period of the driving excitation source by multiplying a repetition period of the adaptive excitation source by a plurality of constant numbers respectively, and for pre-selecting a predetermined number of candidates from all the candidates generated and furnishing the predetermined number of pre-selected candidates;
a repetition period decoding means for selecting one candidate from the predetermined number of pre-selected candidates for the repetition period of the driving excitation source from said repetition period pre-selecting means according to selection information included in said input coded speech and indicating the selection, and for furnishing the selected candidate as the repetition period of the driving excitation source; and
a driving excitation source decoding means for generating a time-series signal according to excitation source location code and excitation source polarity code included in the input speech code, and for generating a time-series vector that is a series of pitch-cycles, each of which includes the time-series signal, using the repetition period of the driving excitation source from said repetition period decoding means.
0. 7. The speech decoding apparatus according to claim 6, wherein said repetition period pre-selecting means pre-selects two candidates from all the candidates generated, and said repetition period decoding means decodes selection information coded in one bit, which is included in the input speech code and indicates a selection of a candidate for the repetition period of the adaptive excitation source made during coding.
0. 8. The speech decoding apparatus according to claim 6, wherein said repetition period pre-selecting means includes a means for comparing the repetition period of the adaptive excitation source with a predetermined threshold value, and for pre-selecting the predetermined number of candidates from all the candidates generated according to a comparison result.
0. 9. The speech decoding apparatus according to claim 6, wherein said repetition period pre-selecting means includes a means for generating a plurality of other adaptive excitation sources whose respective repetition periods equal to the plurality of candidates for the repetition period of the driving excitation source, respectively, and for pre-selecting the predetermined number of candidates from all the candidates generated according to a comparison between distances among the plurality of other adaptive excitation sources generated.
0. 10. The speech decoding apparatus according to claim 6, wherein the plurality of constant numbers, by which the repetition period of the adaptive excitation source is multiplied, includes ½ and 1.
0. 11. A speech coding apparatus for coding an input speech on a frame-by-frame basis using an adaptive excitation source, which is generated from a past excitation source, and a driving excitation source generated from the input speech and the adaptive excitation source, said driving excitation source being represented by locations and polarities of a plurality of excitation sources, so as to generate speech code, said speech coding apparatus comprising:
an excitation source location table including a plurality of selectable possible locations and a fixed magnitude determined based on the number of the plurality of possible locations for each of the plurality of excitation sources;
a driving excitation source coding means for placing the plurality of excitation sources at respective possible locations while multiplying each of the plurality of excitation sources by a corresponding fixed magnitude, with reference to said excitation source location table, for generating a driving excitation source by calculating a sum of the plurality of excitation sources each of which has been multiplied by the corresponding fixed magnitude and is thus placed at one corresponding possible location, for each of all combinations of possible locations of the plurality of excitation sources, and for selecting possible locations and polarities of the plurality of excitation sources which provide a driving excitation source having a smallest coding distortion between itself and the input speech so as to generate excitation source location code and polarity code.
0. 12. A speech decoding apparatus for decoding input speech code on a frame-by-frame basis using an adaptive excitation source, which is generated from a past excitation source, and a driving excitation source generated from the input speech code and the adaptive excitation source, said driving excitation source being represented by locations and polarities of a plurality of excitation sources, so as to reconstruct original speech, said speech decoding apparatus comprising:
an excitation source location table including a plurality of selectable possible locations and a fixed magnitude determined based on the number of the plurality of possible locations for each of the plurality of excitation sources;
a driving excitation source decoding means for selecting respective possible locations for the plurality of excitation sources with reference to said excitation source location table based on excitation source location code included in the input speech code, for placing the plurality of excitation sources at the respective selected possible locations while multiplying each of the plurality of excitation sources by a corresponding fixed magnitude, and for generating a driving excitation source by calculating a sum of the plurality of excitation sources each of which has been multiplied by the corresponding fixed magnitude and is thus placed at the corresponding possible location.
14. The speech coding apparatus of claim 13, wherein the signal to be coded is at least one of: the input speech, and a synthesized signal generated from the input speech.
0. 17. The speech coding method of claim 16, wherein said cross-correlation modifying step is performed by a cross-correlation modifying unit.
0. 18. The speech coding method of claim 17, wherein said searching step is performed by a searching unit.
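To illustrate the repetition-period pre-selection recited in the abstract and in claims 1 to 10, the following Python sketch generates candidates by multiplying the adaptive-excitation repetition period by fixed constants and pre-selects two of them. The threshold value and the concrete selection rule are hypothetical; the claims state only that multipliers such as ½ and 1 (claim 5), a threshold comparison (claim 3), and a two-candidate pre-selection coded in one bit (claim 2) are used.

```python
def preselect_repetition_periods(adaptive_period, threshold=70, multipliers=(0.5, 1.0)):
    """Generate candidate repetition periods for the driving excitation source and
    pre-select two of them.  The threshold and the branch below are hypothetical."""
    candidates = [max(1, int(round(adaptive_period * m))) for m in multipliers]
    if adaptive_period >= threshold:
        # Long adaptive periods may be a doubled pitch-period, so keep both T/2 and T.
        preselected = candidates
    else:
        # Short adaptive periods are kept as-is (only one distinct value).
        preselected = [candidates[-1], candidates[-1]]
    return preselected   # two candidates, so the final choice needs 1-bit selection info


# Example: an adaptive-excitation period of 80 samples yields the candidates [40, 80].
print(preselect_repetition_periods(80))
```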


where C and E, which appear in the evaluation value D of equation (1), are given by:

C = Σk g(k) d(mk)  (2)
E = Σk Σi g(k) g(i) φ(mk, mi)  (3)
where mk is the location of the kth pulse, g(k) is the magnitude of the kth pulse, d(x) is the correlation between an impulse response generated when an impulse is placed at the pulse position x and the signal to be coded, and φ(x,y) is the cross-correlation between an impulse response generated when an impulse is placed at the pulse location x and an impulse response generated when an impulse is placed at the pulse location y. The searching process is carried out by the calculation of the evaluation value D for all combinations of the possible locations of all excitation source pulses.

In addition, simplifying the above equations (2) and (3) by assuming that g(k) has the same sign as d(mk) and has an absolute value of 1 yields the following equations (4) and (5):

C = Σk d′(mk)  (4)
E = Σk Σi φ′(mk, mi)  (5)
where
d′(mk)=|d(mk)|  (6)
φ′(mk, mi)=sign[d(mk)]sign[d(mi)]φ(mk, mi)  (7)
It is thus sufficient to calculate d′(mk) and φ′(mk, mi) once, in advance of evaluating D for all combinations of the locations of all excitation source pulses; each evaluation then reduces to the simple summations according to equations (4) and (5), thereby reducing the amount of arithmetic operations.
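As a concrete sketch of this search (not code taken from the patent), the following Python fragment builds the pre-table of equations (6) and (7) and then evaluates every combination of candidate pulse locations. The evaluation value D is assumed here to be C²/E, the form usually maximized in this kind of algebraic-codebook search; equation (1) itself is not reproduced in this excerpt.

```python
import itertools
import numpy as np

def build_pretable(d, phi):
    """Equations (6) and (7): d'(m) = |d(m)|, phi'(m, i) = sign(d(m)) sign(d(i)) phi(m, i)."""
    s = np.sign(d)
    return np.abs(d), np.outer(s, s) * phi

def search_pulse_locations(d_prime, phi_prime, location_table):
    """Exhaustive search over all combinations of candidate pulse locations (one list of
    candidates per pulse) using the pre-table; C and E are the simple summations of
    equations (4) and (5), and D is assumed to be C**2 / E."""
    best_d, best_locs = -1.0, None
    for locs in itertools.product(*location_table):
        c = sum(d_prime[m] for m in locs)                           # equation (4)
        e = sum(phi_prime[mk, mi] for mk in locs for mi in locs)    # equation (5)
        if e > 0.0 and c * c / e > best_d:
            best_d, best_locs = c * c / e, locs
    return best_locs, best_d
```

The pulse polarities then follow directly from sign[d(mk)], consistent with the assumption above that g(k) has the same sign as d(mk).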

Japanese patent application publications (TOKKAIHEI) No. 10-232696 and No. 10-312198, and "Improvements in ACELP speech coding based on adaptive pulse locations", by Tsuchiya et al., in Proceedings of the 1999 Spring Meeting of the Acoustical Society of Japan (Nihon Onkyo Gakkai 1999 Shunki Kenkyuu Happyokai Kouen Ronbunshuu), vol. I, pp. 213–214, 1999, which will be referred to as Reference 2, disclose configurations for improving the quality of the algebraic excitation source mentioned above.

Japanese patent application publication No. 10-232696 discloses a method of providing a plurality of fixed waveforms and generating a driving excitation source by placing the plurality of fixed waveforms at a plurality of locations coded algebraically, respectively, thereby yielding an output speech with a high quality. Reference 2 studies an arrangement in which a pitch filter is contained in a generating unit for generating a driving excitation source (in Reference 2, an ACELP excitation source). Either the placement of the plurality of fixed waveforms or the pitch-filtering process used to generate a pitch-filtered driving excitation source can improve the quality of the output speech without increasing the amount of searching operations, provided it is carried out at the same time that the calculation of impulse responses is done.

Japanese patent application publication No. 10-312198 discloses an arrangement in which the locations of excitation source pulses are searched for while the driving excitation source is made to be orthogonal to the adaptive excitation source when the pitch gain is greater than or equal to a predetermined value.

Referring next to FIG. 17, there is illustrated a block diagram showing in detail the structure of a driving excitation source coding unit 5 of an improved CELP speech coding apparatus disclosed in Japanese patent application publication No. 10-232696 and Reference 2. In the figure, reference numeral 16 denotes a perceptual weighting filter coefficient calculating unit, numerals 17 and 19 denote perceptual weighting filters, numeral 18 denotes a basic response generating unit, numeral 20 denotes a pre-table calculating unit, numeral 21 denotes a searching unit, and numeral 22 denotes an excitation source location table.

Next, the operation of the driving excitation source coding unit 5 will be described. A quantized linear prediction coefficient from a linear prediction coefficient coding unit 3 disposed within the speech coding apparatus as shown in FIG. 14 is applied to the perceptual weighting filter coefficient calculating unit 16 and the basic response generating unit 18. An adaptive excitation source coding unit 4 furnishes, to the perceptual weighting filter 17, a signal to be coded that is either an input speech 1 or a signal obtained by subtracting a synthesized speech generated based on the adaptive excitation source from the input speech 1. The adaptive excitation source coding unit 4 also delivers the repetition period of the adaptive excitation source, converted from an adaptive excitation source code, to the basic response generating unit 18.

The perceptual weighting filter coefficient calculating unit 16 then calculates a perceptual weighting filter coefficient using the quantized linear prediction coefficient and sets the calculated perceptual weighting filter coefficient as a filter coefficient intended for the perceptual weighting filters 17 and 19. The perceptual weighting filter 17 performs a filtering process on the input signal to be coded using the filter coefficient set by the perceptual weighting filter coefficient calculating unit 16.

The basic response generating unit 18 performs pitch filtering on a unit impulse or a fixed waveform using the repetition period of the adaptive excitation source so as to generate a series of cycles each of which includes the unit impulse or the fixed waveform, the repetition period of the series of cycles being equal to that of the adaptive excitation source. The basic response generating unit 18 then allows the generated signal, as an excitation source, to pass through a synthesis filter formed using the quantized linear prediction coefficient to generate synthesized speech, and outputs the synthesized speech as a basic response. The perceptual weighting filter 19 performs a filtering process on the basic response using the filter coefficient set by the perceptual weighting filter coefficient calculating unit 16.
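A minimal sketch of the pitch filtering and synthesis performed by the basic response generating unit 18 is shown below; the all-pole synthesis-filter form 1/A(z), the coefficient sign convention, the frame length, and the function name are assumptions of this illustration rather than details taken from the text.

```python
import numpy as np
from scipy.signal import lfilter

def basic_response(waveform, repetition_period, quantized_lpc, frame_len=80):
    """Repeat the unit impulse or fixed waveform every `repetition_period` samples
    (pitch filtering), then pass the resulting excitation through a synthesis filter
    built from the quantized linear prediction coefficients."""
    excitation = np.zeros(frame_len)
    for start in range(0, frame_len, repetition_period):   # a series of pitch-cycles
        seg = waveform[:frame_len - start]
        excitation[start:start + len(seg)] += seg
    return lfilter([1.0], np.concatenate(([1.0], quantized_lpc)), excitation)

# A unit impulse repeated with period 40 in an 80-sample frame:
# response = basic_response(np.array([1.0]), 40, quantized_lpc)
```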

The pre-table calculating unit 20 calculates the correlation d(x) between the perceptual weighted signal to be coded and the perceptual weighted basic response when placing the impulse at the location x, and calculates the cross-correlation φ(x,y) between the perceptual weighted basic response when placing the impulse at the location x and the perceptual weighted basic response when placing the impulse at the location y. The pre-table calculating unit 20 then obtains d′(x) and φ′(x,y) according to equations (6) and (7) and stores them as a pre-table.

The excitation source location table 22 stores a plurality of candidates for the locations of excitation source pulses, which are similar to those shown in FIG. 16. The searching unit 21 sequentially reads each of all combinations of the possible locations of the excitation source pulses from the excitation source location table 22 and calculates an evaluation value D for each combination of the possible locations of the excitation source pulses using the pre-table calculated by the pre-table calculating unit 20 according to the above-mentioned equations (1), (4) and (5). The searching unit 21 also searches for one combination of the possible locations of the excitation source pulses which maximizes the evaluation value D and furnishes excitation source location code (i.e., indexes of the excitation source location table) indicating the combination of the possible locations of the excitation source pulses and polarity code indicating their polarities, as driving excitation source code, to a multiplexer 7 as shown in FIG. 14. The searching unit 21 further delivers one time-series vector associated with the driving excitation source code to a gain coding unit 6 as shown in FIG. 14.

In Japanese patent application publication No. 10-312198, the method of making the driving excitation source orthogonal to the adaptive excitation source is implemented by making the perceptual weighted signal to be coded which is input to the pre-table calculating unit 20 orthogonal to the adaptive excitation source, and contributions associated with the correlation between the adaptive excitation source and each driving excitation source pulse are subtracted from E given by equation (5) in the searching unit 21.

A problem encountered with prior art speech coding apparatuses and prior art speech decoding apparatuses constructed as above is that while the pitch-filtering process to generate a pitch-filtered driving excitation source can improve the coding performance without increasing the amount of searching operations, the use of the repetition period of an adaptive excitation source as the repetition period intended for the pitch-filtering process can degrade the quality of speech code generated when the pitch-period of an input speech is different from the repetition period of the adaptive excitation source.

FIG. 18 shows a relationship between a signal to be coded and the locations of pulses included in each pitch-cycle of a pitch-filtered driving excitation source, when the repetition period of the adaptive excitation source is two times the pitch-period of an input speech, in accordance with a prior art speech coding apparatus and a prior art speech decoding apparatus. FIG. 19 shows a relationship between a signal to be coded and the locations of pulses included in each pitch-cycle of a pitch-filtered driving excitation source, when the repetition period of the adaptive excitation source is one-half the pitch-period of an input speech, in accordance with a prior art speech coding apparatus and a prior art speech decoding apparatus.

The repetition period of the adaptive excitation source is determined such that the coding distortion between a synthesized speech generated based on the adaptive excitation source and the signal to be coded is minimized. Therefore the repetition period of the adaptive excitation source is frequently different from the pitch-period of the input speech that is the period of vibrations of the speaker's vocal cords. In this case, the repetition period of the adaptive excitation source is approximately an integral multiple or submultiple of the pitch-period of the input speech. In many cases, the repetition period of the adaptive excitation source is about two times or one-half the pitch-period.

In FIG. 18, since the speaker's vocal cords vibrate in the same way every other pitch-cycle, it is determined that the repetition period of the adaptive excitation source is about two times as large as the pitch-period of the input speech. When the driving excitation source is coded using the repetition period of the adaptive excitation source, most excitation source pulses are concentrated in the first half of each pitch-cycle. The pitch-filtered driving excitation source, that is, the series of pitch-cycles obtained in the current frame using the repetition period of the adaptive excitation source, is as shown in FIG. 18. The use of an excitation source pitch-filtered using a repetition period different from the pitch-period of the input speech can cause a change in the tone quality of the frame and hence instability in the synthesized speech. This disadvantage becomes non-negligible as the bit rate decreases and the amount of information about the driving excitation source therefore decreases. Frames in which the magnitude of the adaptive excitation source is less than that of the driving excitation source show noticeable degradation of the sound quality.

In FIG. 19, since there is a predominance of low-frequency components in the input speech signal and the waveform of the first half of each pitch-cycle of the input speech is similar to that of the second half of each pitch-cycle, it is determined that the repetition period of the adaptive excitation source is about one-half the pitch-period of the input speech. As in the case of FIG. 18, the use of an excitation source pitch-filtered using a repetition period different from the pitch-period of the input speech can cause a change in the tone quality of the frame and hence instability in the synthesized speech.

When the bit rate decreases and the amount of information about the driving excitation source therefore decreases, the driving excitation source determined such that the waveform distortion (or coding distortion) is minimized tends to have a large error in bands of low magnitude, and the synthesized speech therefore has a large spectral distortion. Such a spectral distortion is perceived as degradation of the sound quality. Although a perceptual weighting process is provided in order to eliminate degradation of the sound quality due to spectral distortions, strengthening the perceptual weighting process increases the waveform distortion and hence degrades the sound quality, which is perceived as a ragged sound. The strength of the perceptual weighting process is therefore controlled such that the adverse effect of the waveform distortion on the sound quality has the same level as that of the spectral distortion. However, the spectral distortion increases when the input speech is a female voice, and the perceptual weighting process cannot be controlled so that it is optimized for both male and female speech.

In prior art configurations, a constant magnitude is provided for a plurality of excitation sources, such as pulses, placed at respective locations within each pitch-cycle included in each frame. It is wasteful, however, to equalize the magnitudes of the plurality of excitation sources regardless of the difference in the number of candidates for the location of each excitation source. In the excitation source location table shown in FIG. 16, three bits are used for each of the excitation source locations numbered 1 to 3 and four bits are used for the remaining excitation source location numbered 4. Examining the maximum of the correlation between each excitation source placed at a possible location and the signal to be coded shows that excitation source number 4, which has the largest number of possible locations, has a higher probability of providing the largest correlation. In the extreme case where no bit is provided for an excitation source number, i.e., where one excitation source is fixed at a certain location, the correlation between that excitation source and the signal to be coded is small even though its polarity is still provided independently, so it is not appropriate to provide a larger magnitude for such an excitation source as compared with those provided for the other excitation sources. The problem with prior art configurations is thus that the magnitudes of the plurality of excitation sources are not optimized.

Although a prior art configuration is disclosed for providing an individual magnitude for each of the plurality of excitation sources through vector quantization during the gain quantization process, the amount of gain-quantized information increases and the gain quantization process increases in complexity.

The above-mentioned technique of making the driving excitation source orthogonal to the adaptive excitation source causes an increase in the amount of searching operations. Therefore, an increase in the number of combinations of algebraic excitation sources puts an enormous load on the coding or decoding process. In particular, when the technique of making the driving excitation source orthogonal to the adaptive excitation source is used in a prior art configuration that generates a driving excitation source by placing a plurality of fixed waveforms or that performs a pitch-filtering process to generate a pitch-filtered driving excitation source, the amount of arithmetic operations increases greatly.

The present invention is proposed to solve the above problems. It is therefore an object of the present invention to provide a speech coding apparatus capable of generating high-quality speech code and a speech decoding apparatus capable of reconstructing a high-quality speech.

It is another object of the present invention to provide a speech coding apparatus capable of generating high-quality speech code while keeping an increase in the amount of arithmetic operations to a minimum and a speech decoding apparatus capable of reconstructing a high-quality speech while keeping an increase in the amount of arithmetic operations to a minimum.


d″(mk)=akd′(mk)  (10)
φ″(mk, mi)=akaiφ′(mk, mi)  (11)
where ak is the magnitude of the kth pulse, which is equal to one of the magnitudes listed in the excitation source location table of FIG. 12. It is thus sufficient to calculate and store d″(mk) and φ″(mk, mi) as a pre-table once, in advance of evaluating D for all combinations of all pulse locations; each evaluation then reduces to the simple summations according to equations (8) and (9), thereby reducing the amount of arithmetic operations.
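The following sketch shows how the fixed per-pulse magnitudes enter the search. Equations (8) and (9) are not reproduced in this excerpt, so they are assumed here to be the magnitude-weighted counterparts of equations (4) and (5); the scaled terms in the loop correspond to d″ and φ″ of equations (10) and (11), and precomputing them once as a pre-table, as the text describes, would simply move the multiplications out of the inner loop.

```python
import itertools

def search_with_fixed_magnitudes(d_prime, phi_prime, location_table, magnitudes):
    """Exhaustive search in which each excitation source number k carries the fixed
    magnitude a_k = magnitudes[k] listed for it in the excitation source location table."""
    best_d, best_locs = -1.0, None
    for locs in itertools.product(*location_table):
        c = sum(magnitudes[k] * d_prime[m] for k, m in enumerate(locs))
        e = sum(magnitudes[k] * magnitudes[i] * phi_prime[mk, mi]
                for k, mk in enumerate(locs) for i, mi in enumerate(locs))
        if e > 0.0 and c * c / e > best_d:
            best_d, best_locs = c * c / e, locs
    return best_locs, best_d
```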

The decoding of the driving excitation source can be performed by selecting one excitation source location for each of the plurality of excitation source numbers stored in the excitation source location table of FIG. 12 based on the excitation source location code, and by placing an excitation source, multiplied by the fixed magnitude provided for that excitation source number, at the corresponding selected excitation source location. When each of the placed excitation sources is not a pulse, or when a series of pitch-cycles each of which includes the plurality of excitation sources is generated, elements of the placed excitation sources overlap, and all that is needed is to calculate the sum of the overlapping portions. In other words, the driving excitation source decoding process of the present embodiment adds, to the conventional algebraic excitation source decoding process, only the step of multiplying the excitation sources to be placed by the respective fixed magnitudes provided for the excitation source numbers.
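A minimal decoding sketch consistent with this description follows; the argument names, the default unit-pulse waveform, and the frame length are illustrative assumptions.

```python
import numpy as np

def decode_driving_excitation(locations, polarities, magnitudes,
                              frame_len=80, waveform=None, repetition_period=None):
    """Place each excitation source at its decoded location, scaled by its fixed
    magnitude and decoded polarity, repeat it every `repetition_period` samples when
    pitch filtering is in use, and sum any overlapping portions."""
    if waveform is None:
        waveform = np.array([1.0])          # unit pulse by default
    out = np.zeros(frame_len)
    for loc, pol, mag in zip(locations, polarities, magnitudes):
        start = loc
        while start < frame_len:
            seg = waveform[:frame_len - start]
            out[start:start + len(seg)] += pol * mag * seg
            if not repetition_period:
                break
            start += repetition_period
    return out
```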

In a prior art decoding process in which a fixed waveform is prepared for each of the plurality of excitation source numbers, a basic response has to be calculated for each of the plurality of excitation source numbers. In contrast, in accordance with the present embodiment, only a modification of the pre-table is added as previously mentioned. In any prior art decoding process, the magnitude of each of the plurality of excitation sources is maintained constant even though the amount of location information (i.e., the number of candidates for the excitation source location) varies from excitation source number to excitation source number.

As previously mentioned, in accordance with the fifth embodiment of the present invention, the speech coding apparatus provides a certain magnitude, depending on the number of candidates for the location of each of a plurality of excitation sources, for each of the plurality of excitation sources, and the driving excitation source coding unit 5 multiplies the plurality of excitation sources placed at respective possible locations by the plurality of fixed magnitudes, respectively. The driving excitation source coding unit 5 then generates a driving excitation source by calculating the sum of all the excitation sources placed at the respective possible locations, for each of all combinations of possible locations of the plurality of excitation sources, and searches for excitation source location code and polarity code associated with the driving excitation source exhibiting the smallest coding distortion between itself and the input speech, the excitation source location code indicating the locations of the placed excitation sources and the polarity code indicating their polarities. The speech coding apparatus can thereby avoid the waste involved in setting the magnitudes of the plurality of excitation sources to a single fixed value, and can generate high-quality speech code.

Similarly, in accordance with the fifth embodiment of the present invention, the speech decoding apparatus provides a certain magnitude, depending on the number of candidates for the location of each of a plurality of excitation sources, for each of the plurality of excitation sources. The driving excitation source decoding unit 12 then generates a driving excitation source by calculating the sum of all the excitation sources placed at respective possible locations defined by the excitation source location code included in the input speech code, while multiplying the plurality of excitation sources placed at the respective possible locations by the plurality of fixed magnitudes, respectively. The speech decoding apparatus can thereby avoid the waste involved in setting the magnitudes of the plurality of excitation sources to a single fixed value, and can reconstruct a high-quality speech.

Referring next to FIG. 13, there is illustrated a block diagram showing the structure of a driving excitation source coding unit 5 of a speech coding apparatus in accordance with a sixth embodiment of the present invention. The overall structure of the speech coding apparatus of this embodiment is the same as that of prior art speech coding apparatuses as shown in FIG. 14. In FIG. 13, reference numeral 42 denotes a pre-table modifying unit. The speech coding apparatus of the sixth embodiment can make a perceptual weighted signal to be coded orthogonal to an adaptive excitation source using only the additional pre-table modifying unit 42.

In operation, a linear prediction coefficient coding unit 3 delivers a quantized linear prediction coefficient to both a perceptual weighting filter coefficient calculating unit 16 disposed within the driving excitation source coding unit 5 and a basic response generating unit 18. An adaptive excitation source coding unit 4 converts an adaptive excitation source code into a repetition period of an adaptive excitation source and then furnishes the repetition period of the adaptive excitation source to the basic response generating unit 18 located within the driving excitation source coding unit 5. The adaptive excitation source coding unit 4 also delivers either an input speech 1 or a signal obtained by subtracting a synthesized speech generated based on the adaptive excitation source from the input speech 1, as a signal to be coded, to a perceptual weighting filter 17. The adaptive excitation source coding unit 4 further furnishes the adaptive excitation source to the pre-table modifying unit 42 located within the driving excitation source coding unit 5.

The perceptual weighting filter coefficient calculating unit 16 calculates a perceptual weighting filter coefficient using the quantized linear prediction coefficient and defines the calculated perceptual weighting filter coefficient as a filter coefficient for the perceptual weighting filter 17 and another perceptual weighting filter 19. The perceptual weighting filter 17 performs a filtering process on the input signal to be coded using the filter coefficient set by the perceptual weighting filter coefficient calculating unit 16.

The basic response generating unit 18 performs a pitch-filtering process on either a unit pulse or a fixed waveform using the input repetition period of the adaptive excitation source so as to generate a series of pitch-cycles each of which includes either the unit pulse or the fixed waveform. The basic response generating unit 18 then generates a synthesized speech by allowing the generated signal as an excitation source to pass through a synthesis filter constructed using the quantized linear prediction coefficient, and furnishes the synthesized speech as a basic response to the perceptual weighting filter 19. The perceptual weighting filter 19 performs a filtering process on the input basic response using the filter coefficient set by the perceptual weighting filter coefficient calculating unit 16.

The pre-table calculating unit 20 calculates a correlation d(x) between the perceptual weighted signal to be coded from the perceptual weighting filter 17 and each of the plurality of perceptual weighted basic responses from the perceptual weighting filter 19, i.e., each of a plurality of perceptual weighted synthesized speeches respectively generated based on a plurality of temporary driving excitation sources, which are signals obtained by placing a predetermined excitation source at all possible excitation source locations, respectively. The pre-table calculating unit 20 also calculates a cross-correlation φ(x,y) between any two of the plurality of perceptual weighted basic responses, i.e., any two of the plurality of synthesized speeches respectively generated based on the plurality of temporary driving excitation sources. d(x) and φ(x,y) are stored as a pre-table.

The pre-table modifying unit 42 accepts the adaptive excitation source and the pre-table stored in the pre-table calculating unit 20 and modifies the pre-table according to the following equation (12) and either the following equation (13) or its equivalent (13′). The pre-table modifying unit 42 then calculates d′(x) and φ′(x,y) according to the following equations (14) and (15) and stores these parameters as a new pre-table.

d̂(x) = d(x) - cx ctgt/pacb  (12)
φ̂(x, y) = φ(x, y) - cx cy/pacb  (13)

equation (13) being equivalent to the following equation:

φ̂(x, y) = (1/pacb)(pacb φ(x, y) - cx cy)  (13′)
d′(mk) = |d̂(mk)|  (14)
φ′(mk, mi) = sign[d̂(mk)] sign[d̂(mi)] φ̂(mk, mi)  (15)
where ctgt is a correlation between the perceptual weighted signal to be coded and a perceptual weighted adaptive excitation source response (i.e., synthesized speech), i.e., a correlation between the perceptual weighted signal to be coded and a synthesized speech generated based on the perceptual weighted adaptive excitation source, cx is a correlation between a signal created by placing the perceptual weighted basic response at the excitation source location x and the perceptual weighted adaptive excitation source response (i.e., synthesized speech), i.e., a correlation between each of the plurality of perceptual weighted synthesized speeches respectively generated based on the plurality of temporary driving excitation sources and the synthesized speech generated based on the adaptive excitation source, and pacb is the power of the perceptual weighted adaptive excitation source response (i.e., synthesized speech).
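The pre-table modification of equations (12) to (15) can be sketched as follows; the array layout (d and the cx values stored as vectors indexed by location) and the function name are assumptions of this illustration.

```python
import numpy as np

def modify_pretable(d, phi, c_tgt, c_x, p_acb):
    """Subtract the contribution of the adaptive excitation source from the pre-table so
    that the subsequent search effectively operates on a target orthogonal to it."""
    c_x = np.asarray(c_x, dtype=float)
    d_hat = d - c_x * c_tgt / p_acb                 # equation (12)
    phi_hat = phi - np.outer(c_x, c_x) / p_acb      # equation (13)
    s = np.sign(d_hat)
    return np.abs(d_hat), np.outer(s, s) * phi_hat  # equations (14) and (15)
```

The search in the searching unit 21 is then unchanged, which is why the orthogonalization adds no arithmetic operations to the search loop itself.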

The searching unit 21 sequentially reads the plurality of candidates for the excitation source location from the excitation source location table 22, and calculates the evaluation value D for each of all combinations of possible excitation source locations according to the equations (1), (4) and (5), using the pre-table stored in the pre-table modifying unit 42, i.e., d′(x) and φ′(x,y). The searching unit 21 then searches for one combination of excitation source locations that maximizes the evaluation value D and furnishes excitation source location code (i.e., indexes of the excitation source location table) indicating the plurality of excitation source locations searched for and polarity code indicating the polarities of the plurality of excitation sources, as driving excitation source code. The searching unit 21 generates a time-series vector associated with the driving excitation source code as a driving excitation source.

As previously mentioned, in accordance with the sixth embodiment of the present invention, the speech coding apparatus calculates a correlation ctgt between the perceptual weighted signal to be coded and a synthesized speech generated based on the perceptual weighted adaptive excitation source, and a correlation cx between each of a plurality of perceptual weighted synthesized speeches, respectively generated based on a plurality of temporary driving excitation sources associated with all possible excitation source locations, and the synthesized speech generated based on the adaptive excitation source, and then modifies the pre-table using these correlations. Accordingly, the speech coding apparatus can make the perceptual weighted signal to be coded orthogonal to the adaptive excitation source without an increase in the amount of arithmetic operations in the searching unit 21, thereby improving the coding performance and providing high-quality speech code.

Many widely different embodiments of the present invention may be constructed without departing from the spirit and scope of the present invention. It should be understood that the present invention is not limited to the specific embodiments described in the specification, except as defined in the appended claims.

Inventors: Yamaura, Tadashi; Tasaki, Hirohisa

References Cited (Patent; Priority; Assignee; Title):
US 5,732,389; Jun 07 1995; The Chase Manhattan Bank, as Collateral Agent; Voiced/unvoiced classification of speech for excitation codebook selection in CELP speech decoding during frame erasures
US 5,745,871; May 03 1993; The Chase Manhattan Bank, as Collateral Agent; Pitch period estimation for use with audio coders
US 5,781,880; Nov 21 1994; WIAV Solutions LLC; Pitch lag estimation using frequency-domain lowpass filtering of the linear predictive coding (LPC) residual
US 5,787,389; Jan 17 1995; Rakuten, Inc.; Speech encoder with features extracted from current and previous frames
US 6,202,046; Jan 23 1997; Kabushiki Kaisha Toshiba; Background noise/speech classification method
US 6,226,604; Aug 02 1996; III Holdings 12, LLC; Voice encoder, voice decoder, recording medium on which program for realizing voice encoding/decoding is recorded and mobile communication apparatus
US 6,408,268; Mar 12 1997; Mitsubishi Denki Kabushiki Kaisha; Voice encoder, voice decoder, voice encoder/decoder, voice encoding method, voice decoding method and voice encoding/decoding method
US 6,453,288; Nov 07 1996; Godo Kaisha IP Bridge 1; Method and apparatus for producing component of excitation vector
US 6,496,796; Sep 07 1999; Mitsubishi Denki Kabushiki Kaisha; Voice coding apparatus and voice decoding apparatus
US 6,507,814; Aug 24 1998; Samsung Electronics Co., Ltd.; Pitch determination using speech classification and prior pitch estimation
EP 0694907
EP 0743634
EP 0883107
JP 10069297
JP 10232696
JP 10293599
JP 10312198
JP 1200296
JP 208900
JP 519794
JP 519795
JP 61134000
JP 6396699
WO 9840877
Assignee: Mitsubishi Denki Kabushiki Kaisha (assignment on the face of the patent), Jan 28, 2010

