A method and apparatus to determine an encoding mode of an audio signal, and a method and apparatus to encode an audio signal according to the encoding mode. In the encoding mode determination method, a mode determination threshold for the current frame, that is, the frame subject to encoding mode determination, is adaptively adjusted according to a long-term feature of the audio signal, thereby improving the hit rate of encoding mode determination and signal classification, suppressing frequent oscillation of an encoding mode in frame units, improving noise tolerance, and improving smoothness of a reconstructed audio signal.
|
1. An apparatus to determine an encoding mode to encode an audio signal, comprising:
at least one processor; and
a determining unit to generate a short-term feature and a long-term feature for a current frame, to generate a parameter by using the long-term feature, to adaptively adjust a mode determination threshold of the short-term feature in a unit of frames by using the parameter and to determine one of an audio encoding mode and a music encoding mode of the current frame based on a comparison of the short-term feature and the adjusted mode determination threshold so that the current frame of the audio signal is encoded according to the determined encoding mode.
21. An apparatus to decode a signal of a bitstream, comprising:
at least one processor; and
a determining unit to determine one of an audio encoding mode and a music encoding mode from a bitstream having an encoded signal and information on the audio encoding mode or the music encoding mode of the encoded signal, so that the encoded signal of the bitstream is decoded according to the audio encoding mode or the music encoding mode,
wherein the audio encoding mode or the music encoding mode of the encoded signal has been determined using a mode determination threshold of the short-term feature for a current frame which is adaptively adjusted in a unit of frames by using a parameter generated from a long-term feature for the current frame.
23. An encoding apparatus for an audio signal, comprising:
at least one processor; and
a short-term feature generation unit to analyze the audio signal for a current frame and generate a short-term feature;
a long-term feature generation unit to generate a long-term feature for the current frame using the short-term feature;
a mode determination threshold adjustment unit to adaptively adjust a mode determination threshold for the current frame using a parameter obtained from the generated long-term feature;
an encoding mode determination unit to determine one of an audio encoding mode and a speech encoding mode for the current frame using the adaptively adjusted mode determination threshold; and
an encoding unit to perform frequency-domain encoding or time-domain encoding for the current frame in the determined encoding mode.
22. An apparatus to encode and/or decode an audio signal, comprising:
at least one processor;
a first determining unit to generate a short-term feature and a long-term feature for a current frame, to generate a parameter by using the long-term feature, to adaptively adjust a mode determination threshold of the short-term feature in a unit of frames by using the parameter and to determine one of an audio encoding mode and a music encoding mode of the current frame based on a comparison of the short-term feature and the adjusted mode determination threshold, so that the current frame of the audio signal is encoded according to the audio encoding mode or the music encoding mode; and
a second determining unit to determine one of the audio encoding mode and the music encoding mode from a bitstream having the encoded signal and information on the encoding mode, so that the encoded signal of the bitstream is decoded according to the audio encoding mode or the music encoding mode.
2. The apparatus of
a time-domain coding unit to encode the audio signal according to the encoding mode in a time-domain; and
a frequency-domain coding unit to encode the audio signal according to the encoding mode in a frequency-domain.
3. The apparatus of
a speech coding unit to encode the audio signal as a speech signal according to the encoding mode; and
a music coding unit to encode the audio signal as a music signal according to the encoding mode.
4. The apparatus of
a speech coding unit to receive the audio signal and the encoding mode from the determining unit to encode the audio signal when the encoding mode is the speech encoding mode; and
a music coding unit to receive the audio signal and the encoding mode from the determining unit to encode the audio signal when the encoding mode is music encoding mode.
5. The apparatus of
a coding unit to encode the audio signal according to the encoding mode; and
a bitstream generation unit to generate a bitstream according to the encoded audio signal and information on the encoding mode.
6. The apparatus of
a short term feature generation unit to generate the short-term feature from the current frame of the audio signal; and
a long-term feature generation unit to generate the long-term feature from the current frame and at least one previous frame.
7. The apparatus of
a mode determination threshold adjustment unit to adjust the mode determination threshold using the short term feature and the long-term feature; and
an encoding determination unit to determine the encoding mode according to the adjusted mode determination threshold and the short-term feature.
8. The apparatus of
9. The apparatus of
10. The apparatus of
a first long-term feature generation unit to generate a first long-term feature according to the short-term feature of the current frame and a short-term feature of the at least one previous frame; and
a second long-term feature generation unit to generate a second long-term feature as the long-term feature according to the first long-term feature and a variation feature of at least one of the current frame and the at least one previous frame.
11. The apparatus of
a mode determination threshold adjustment unit to adjust a mode determination threshold according to the short term feature and the second long-term feature; and
an encoding determination unit to determine the encoding mode according to the adjusted mode determination threshold and the short-term feature.
12. The apparatus of
13. The apparatus of
an LP-LTP gain generation unit to generate an LP-LTP gain as the short-term feature of the current frame; and
a long-term feature generation unit to generate the long-term feature according to the LP-LTP gain of the current frame and a second LP-LTP gain of at least one previous frame.
14. The apparatus of
a spectrum tilt generation unit to generate a spectrum tilt as the short-term feature of the current frame; and
a long-term feature generation unit to generate the long-term feature according to the spectrum tilt of the current frame and a second spectrum tilt of at least one previous frame.
15. The apparatus of
a zero crossing rate generation unit to generate a zero crossing rate as the short-term feature of the current frame; and
a long-term feature generation unit to generate the long-term feature according to the zero crossing rate of the current frame and a second zero crossing rate of at least one previous frame.
16. The apparatus of
a short-term feature generation unit having one or a combination of an LP-LTP gain generation unit to generate an LP-LTP gain as the short-term feature of the current frame, a spectrum tilt generation unit to generate a spectrum tilt as the short-term feature of the current frame, and a zero crossing rate generation unit to generate a zero crossing rate as the short-term feature of the current frame; and
a long-term feature generation unit to generate the long-term feature according to the short-term feature of the current frame and a second short-term feature of at least one frame.
17. The apparatus of
18. The apparatus of
the long-term feature is determined according to the short-term feature of the current frame and short-term features of a plurality of previous frames.
19. The apparatus of
the long-term feature is determined according to a variation feature between the current frame and a previous frame.
20. The apparatus of
the long-term feature is determined according to a variation feature of a second encoding mode of the previous frame.
|
This application claims the benefit of Korean Patent Application No. 10-2006-0127844, filed on Dec. 14, 2006, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Invention
The present general inventive concept relates to a method and apparatus to determine an encoding mode of an audio signal and a method and apparatus to encode and/or decode an audio signal using the encoding mode determination method and apparatus, and more particularly, to an encoding mode determination method and apparatus which can be used in an encoding apparatus to determine an encoding mode of an audio signal according to a domain and a coding method that are suitable for encoding the audio signal.
2. Description of the Related Art
Audio signals can be classified as various types, such as speech signals, music signals, or mixtures of speech signals and music signals, according to their characteristics, and different coding methods or compression methods are applied to the various types of the audio signal.
The compression methods for audio signals can be divided into an audio codec and a speech codec. The audio codec, such as Advanced Audio Coding Plus (aacPlus), is intended to compress music signals. The audio codec compresses a music signal in a frequency domain using a psychoacoustic model. However, when a speech signal is compressed using the audio codec, sound quality degrades, and the sound quality degradation becomes more serious when the speech signal includes an attack signal. The speech codec, such as Adaptive Multi-Rate Wideband (AMR-WB), is intended to compress speech signals. The speech codec compresses an audio signal in a time domain using an utterance model. However, when an audio signal is compressed using the speech codec, sound quality degrades.
In order to efficiently perform speech/music compression at the same time based on the above-described characteristics, AMR-WB+ (3GPP TS 26.290) has been suggested. AMR-WB+ is a speech compression method using algebraic code excited linear prediction (ACELP) for speech compression and transform coded excitation (TCX) for audio compression.
AMR-WB+ determines whether to apply ACELP or TCX for each frame on a time axis. Although AMR-WB+ works efficiently for a compression object that approximates a speech signal, it may cause degradation in sound quality or compression rate for a compression object that approximates a music signal. Thus, when different compression methods are applied according to the characteristics or modes of an audio signal, a method for determining an encoding mode has a great influence on the performance of encoding or compression with respect to the audio signal.
U.S. Pat. No. 6,134,518 discloses a conventional method for coding a digital audio signal using a CELP coder and a transform coder. Referring to
However, because of weak noise tolerance, the conventional method has a low hit rate of mode determination and signal classification under noisy conditions. That is, the mode determination and signal classification are inaccurately performed. Moreover, frequent mode oscillation in frame units cannot provide a smooth reconstructed audio signal.
The present general inventive concept provides a method and apparatus to determine an encoding mode to encode an audio signal.
The present general inventive concept provides a method and apparatus to improve a hit rate of mode determination and signal classification under noisy conditions when encoding an audio signal.
The present general inventive concept provides a method and apparatus to adaptively adjust a mode determination threshold to determine an encoding mode according to the adjusted mode determination threshold.
The present general inventive concept provides a method and apparatus to encode and/or decode an audio signal according to an adaptively determined encoding mode.
The present general inventive concept provides a computer readable medium to execute a method of determining an encoding mode to encode an audio signal.
Additional aspects and utilities of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
The foregoing and/or other aspects of the present general inventive concept may be achieved by providing an apparatus to determine an encoding mode to encode an audio signal, the apparatus including a determination unit to determine an encoding mode of a first frame of an audio signal according to a short-term feature of the first frame and a long-term feature between the first frame and a second frame so that the first frame of the audio signal is encoded according to the encoding mode.
The apparatus may further include a time-domain coding unit to encode the audio signal according to the encoding mode in a time domain, and a frequency-domain coding unit to encode the audio signal according to the encoding mode in a frequency domain.
The apparatus may further include a speech coding unit to encode the audio signal as a speech signal according to the encoding mode, and a music coding unit to encode the audio signal as a music signal according to the encoding mode.
The apparatus may further include a speech coding unit to receive the audio signal and the encoding mode from the determining unit to encode the audio signal when the encoding mode is a speech signal encoding mode, and a music coding unit to receive the audio signal and the encoding mode from the determining unit to encode the audio signal when the encoding mode is a music signal encoding mode.
The apparatus may further include a coding unit to encode the audio signal according to the encoding mode, and a bitstream generation unit to generate a bitstream according to the encoded audio signal and information on the encoding mode.
The determining unit may include a short term feature generation unit to generate the short-term feature from the first frame of the audio signal, and a long-term feature generation unit to generate the long-term feature from the first frame and the second frame.
The determining unit may further include a mode determination threshold adjustment unit to adjust a mode determination threshold according to the short term feature and the long-term feature, and an encoding determination unit to determine the encoding mode according to the adjusted mode determination threshold and the short-term feature.
The mode determination threshold adjustment unit may adjust the mode determination threshold according to the short term feature, the long-term feature, and a second encoding mode of the second frame.
The encoding determination unit may determine the encoding mode according to the adjusted mode determination threshold, the short-term feature, and a second encoding mode of the second frame.
The long-term feature generation unit may include a first long-term feature generation unit to generate a first long-term feature according to the short-term feature of the first frame and a second short-term feature of the second frame, and a second long-term feature generation unit to generate a second long-term feature as the long-term feature according to the first long-term feature and a variation feature of at least one of the first frame and the second frame.
The determination unit may further include a mode determination threshold adjustment unit to adjust a mode determination threshold according to the short term feature and the second long-term feature, and an encoding determination unit to determine the encoding mode according to the adjusted mode determination threshold and the short-term feature.
The determination unit may determine the encoding mode of the first frame of the audio signal according to the short-term feature of the first frame, the long-term feature between the first frame and the second frame, and a second encoding mode of the second frame.
The determination unit may include an LP-LTP gain generation unit to generate an LP-LTP gain as the short-term feature of the first frame, and a long-term feature generation unit to generate the long-term feature according to the LP-LTP gain of the first frame and a second LP-LTP gain of the second frame.
The determination unit may include a spectrum tilt generation unit to generate a spectrum tilt as the short-term feature of the first frame, and a long-term feature generation unit to generate the long-term feature according to the spectrum tilt of the first frame and a second spectrum tilt of the second frame.
The determination unit may include a zero crossing rate generation unit to generate a zero crossing rate as the short-term feature of the first frame, and a long-term feature generation unit to generate the long-term feature according to the zero crossing rate of the first frame and a second zero crossing rate of the second frame.
The determination unit may include a short-term feature generation unit having one or a combination of an LP-LTP gain generation unit to generate an LP-LTP gain as the short-term feature of the first frame, a spectrum tilt generation unit to generate a spectrum tilt as the short-term feature of the first frame, and a zero crossing rate generation unit to generate a zero crossing rate as the short-term feature of the first frame, and a long-term feature generation unit to generate the long-term feature according to the short-term feature of the first frame and a second short-term feature of the second frame.
The determination unit may include a memory to store the short-term and long-term features of the first and second frames.
The first frame may be a current frame; the second frame may include a plurality of previous frames, and the long-term feature may be determined according to the short-term feature of the first frame and second short-term features of the plurality of the previous frames.
The first frame may be a current frame, the second frame may be a previous frame, and the long-term feature may be determined according to a variation feature between the current frame and the previous frame.
The first frame may be a current frame, the second frame may include a previous frame, and the long-term feature may be determined according to a variation feature of a second encoding mode of the previous frame.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing an apparatus to encode an audio signal, the apparatus including a determination unit to determine an encoding mode of a first frame of an audio signal according to a short-term feature of the first frame, a long-term feature between the first frame and a second frame, and a second encoding mode of the second frame, so that the first frame of the audio signal is encoded according to the encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing an apparatus to encode an audio signal, the apparatus including a determining unit to determine one of a speech mode and a music mode as an encoding mode to encode an audio signal according to a unique characteristic of a frame of the audio signal and a relative characteristic of adjacent frames of the audio signal.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing an apparatus to decode a signal of a bitstream, the apparatus including a determining unit to determine an encoding mode from a bitstream having an encoded signal and information on the encoding mode of the encoded signal, so that the encoded signal of the bitstream is decoded according to the determined encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing an apparatus to encode and/or decode an audio signal, the apparatus including a first determining unit to determine an encoding mode of a first frame of an audio signal according to a short-term feature of the first frame and a long-term feature between the first frame and a second frame so that the first frame of the audio signal is encoded according to the encoding mode; and a second determining unit to determine the encoding mode from a bitstream having the encoded signal and information on the encoding mode, so that the encoded signal of the bitstream is decoded according to the determined encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing a method of an apparatus to determine an encoding mode to encode an audio signal, the method including determining an encoding mode of a first frame of an audio signal according to a short-term feature of the first frame and a long-term feature between the first frame and a second frame so that the first frame of the audio signal is encoded according to the encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing a method of an apparatus to decode a signal of a bitstream, the method including determining an encoding mode from a bitstream having an encoded signal and information on the encoding mode of the encoded signal, so that the encoded signal of the bitstream is decoded according to the determined encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing a method of an apparatus to encode and/or decode an audio signal, the method including determining an encoding mode of a first frame of an audio signal according to a short-term feature of the first frame and a long-term feature between the first frame and a second frame so that the first frame of the audio signal is encoded according to the encoding mode, and determining the encoding mode from a bitstream having the encoded signal and information on the encoding mode, so that the encoded signal of the bitstream is decoded according to the determined encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing a computer-readable medium containing computer readable codes as a program to execute a method of an apparatus to determine an encoding mode to encode an audio signal, the method including determining an encoding mode of a first frame of an audio signal according to a short-term feature of the first frame and a long-term feature between the first frame and a second frame so that the first frame of the audio signal is encoded according to the encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing a computer-readable medium containing computer readable codes as a program to execute a method of an apparatus to decode a signal of a bitstream, the method including determining an encoding mode from a bitstream having an encoded signal and information on the encoding mode of the encoded signal, so that the encoded signal of the bitstream is decoded according to the determined encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing a computer-readable medium containing computer readable codes as a program to execute a method of an apparatus to encode and/or decode an audio signal, the method including determining an encoding mode of a first frame of an audio signal according to a short-term feature of the first frame and a long-term feature between the first frame and a second frame so that the first frame of the audio signal is encoded according to the encoding mode, and determining the encoding mode from a bitstream having the encoded signal and information on the encoding mode, so that the encoded signal of the bitstream is decoded according to the determined encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing an apparatus to determine an encoding mode to encode an audio signal, the apparatus including a first generation unit to generate a short-term feature of a first frame, a second generation unit to adjust the short-term feature to a long-term feature according to a second short-term feature of a second frame, an encoding mode determination unit to determine an encoding mode of the first frame of an audio signal according to the short-term feature and the long-term feature, and an encoding unit to encode the first frame of the audio signal according to the encoding mode.
The foregoing and/or other aspects of the present general inventive concept may also be achieved by providing an apparatus to determine an encoding mode to encode an audio signal, the apparatus including a first generation unit to generate a short-term feature of a first frame, a second generation unit to adjust the short-term feature according to a variation feature of the first frame with respect to a second frame, and to generate a long-term feature, an encoding mode determination unit to determine an encoding mode of the first frame of an audio signal according to the short-term feature and the long-term feature, and an encoding unit to encode the first frame of the audio signal according to the encoding mode.
These and/or other aspects and utilities of the present general inventive concept will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to the embodiments of the present general inventive concept, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present general inventive concept by referring to the figures.
The encoding mode determination apparatus 100 may include a divider (not shown) to divide an input audio signal into frames based on an input time of the audio signal, and determines whether each of the frames is subject to frequency-domain coding or time-domain coding. The encoding mode determination apparatus 100 transmits mode information, indicating whether a current frame is subject to the frequency-domain coding or the time-domain coding, to the bitstream muxing unit 400 as additional information.
The encoding mode determination apparatus 100 may further include a time/frequency conversion unit (not shown) that converts an audio signal of a time domain into an audio signal of a frequency domain. In this case, the encoding mode determination apparatus 100 can determine an encoding mode for each of the frames of the audio signal in the frequency domain. The encoding mode determination apparatus 100 transmits the divided audio signal to either the time-domain coding unit 200 or the frequency-domain coding unit 300 according to the determined encoding mode. The detailed structure of the encoding mode determination apparatus 100 is illustrated in
The time-domain coding unit 200 encodes the audio signal corresponding to the current frame to be encoded in an encoding mode determined by the encoding mode determination apparatus 100 in the time domain and transmits the encoded audio signal to the bitstream muxing unit 400. In the present embodiment, the time-domain encoding may be a speech compression algorithm that performs compression in the time domain, such as code excited linear prediction (CELP).
The frequency-domain coding unit 300 encodes the audio signal corresponding to the current frame in the encoding mode determined by the encoding mode determination apparatus 100 in the frequency domain and transmits the encoded audio signal to the bitstream muxing unit 400. Since the input audio signal is a time-domain signal, a time/frequency conversion unit (not shown) may be further included to convert the input audio signal of the time domain to an audio signal of the frequency domain. In the present embodiment, the frequency-domain encoding is an audio compression algorithm that performs compression in the frequency domain, such as transform coded excitation (TCX), advanced audio coding (AAC), and the like.
The bitstream muxing unit 400 receives the encoded audio signal from the time-domain coding unit 200 or the frequency-domain coding unit 300 and the mode information from the encoding mode determination apparatus 100, and generates a bitstream using the received signal and mode information. In particular, the mode information can also be used to determine a decoding mode when signals corresponding to the bitstream are decoded to reconstruct the audio signal.
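The data flow described above (mode determination, routing of each frame to one of two coders, and muxing of the coded frame together with its mode information) can be sketched as follows. This is only an illustrative sketch: the flag values, the `EncodedFrame` type, the function names, and the byte layout (one mode byte followed by a two-byte payload length) are assumptions made for the example and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical flag values carried as mode information.
TIME_DOMAIN = 0       # frame routed to a time-domain coder (e.g. CELP-style)
FREQUENCY_DOMAIN = 1  # frame routed to a frequency-domain coder (e.g. TCX-style)


@dataclass
class EncodedFrame:
    mode: int       # mode information for this frame
    payload: bytes  # coded frame data


def encode_frame(frame: Sequence[float],
                 decide_mode: Callable[[Sequence[float]], int],
                 time_coder: Callable[[Sequence[float]], bytes],
                 freq_coder: Callable[[Sequence[float]], bytes]) -> EncodedFrame:
    """Determine the mode for one frame and route it to the matching coder."""
    mode = decide_mode(frame)
    coder = time_coder if mode == TIME_DOMAIN else freq_coder
    return EncodedFrame(mode, coder(frame))


def mux(frames: List[EncodedFrame]) -> bytes:
    """Pack frames as: 1 mode byte, 2-byte big-endian length, payload (assumed layout)."""
    out = bytearray()
    for f in frames:
        out.append(f.mode)
        out += len(f.payload).to_bytes(2, "big")
        out += f.payload
    return bytes(out)


def demux(stream: bytes) -> List[EncodedFrame]:
    """Recover the per-frame mode information and payloads from the assumed layout."""
    frames, pos = [], 0
    while pos < len(stream):
        mode = stream[pos]
        size = int.from_bytes(stream[pos + 1:pos + 3], "big")
        frames.append(EncodedFrame(mode, stream[pos + 3:pos + 3 + size]))
        pos += 3 + size
    return frames
```

Because the mode flag travels with each coded frame, a decoder can select the matching decoding path without repeating the mode determination.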
The encoding mode determination apparatus 100 may include a divider to divide an input audio signal into frames based on an input time of the audio signal, and determines whether each frame is subject to speech coding or music coding. The encoding mode determination apparatus 100 also transmits mode information, indicating whether the current frame is subject to speech coding or music coding, to the bitstream muxing unit 400 as additional information. The speech coding unit 200′, the music coding unit 300′, and the bitstream muxing unit 400 correspond to the time-domain coding unit 200, the frequency-domain coding unit 300, and the bitstream muxing unit 400 illustrated in
The audio signal division unit 110 divides an input audio signal into frames in the time domain and transmits the divided audio signal to the short-term feature generation unit 120.
The short-term feature generation unit 120 performs short-term analysis with respect to the divided audio signal to generate a short-term feature. In the present embodiment, the short-term feature is a unique feature of each frame to be used to determine whether a current frame is in a music mode or a speech mode and which one of time-domain coding and frequency-domain coding is efficient for the current frame.
The short-term feature may include a linear prediction-long-term prediction (LP-LTP) gain, a spectrum tilt, a zero crossing rate, a spectrum autocorrelation, and the like.
The short-term feature generation unit 120 may independently generate and output one short-term feature or a plurality of short-term features or may output a sum of a plurality of weighted short-term features as a representative short-term feature. The detailed structure of the short-term feature generation unit 120 is illustrated in
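As an illustration of per-frame short-term features and of the weighted sum used as a representative short-term feature, a minimal sketch follows. The definitions are assumptions chosen for the example: the zero crossing rate is taken as the fraction of adjacent sample pairs with differing signs, and the spectrum tilt is approximated by the first normalized autocorrelation coefficient r(1)/r(0). The function names and weights are hypothetical.

```python
from typing import Dict, Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectrum_tilt(frame: Sequence[float]) -> float:
    """First normalized autocorrelation coefficient r(1)/r(0) (assumed definition)."""
    r0 = sum(x * x for x in frame)
    if r0 == 0:
        return 0.0
    r1 = sum(a * b for a, b in zip(frame, frame[1:]))
    return r1 / r0


def representative_feature(features: Dict[str, float],
                           weights: Dict[str, float]) -> float:
    """Weighted sum of several short-term features; the weights are hypothetical."""
    return sum(weights[name] * value for name, value in features.items())
```

For example, `representative_feature({"zcr": zero_crossing_rate(frame), "tilt": spectrum_tilt(frame)}, {"zcr": 0.5, "tilt": 2.0})` combines the two features of one frame into a single value.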
The long-term feature generation unit 130 generates a long-term feature using the short-term feature generated by the short-term feature generation unit 120 and features that are stored in the short-term feature buffer 161 and the long-term feature buffer 162. The long-term feature generation unit 130 includes a first long-term feature generation unit 140 and a second long-term feature generation unit 150.
The first long-term feature generation unit 140 obtains, from the short-term feature buffer 161, the stored short-term features of a plurality of previous frames preceding the current frame, for example, five (5) consecutive previous frames, and calculates their average value. It then calculates a difference between the short-term feature of the current frame and the calculated average value to generate a variation feature.
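The calculation of the variation feature can be sketched as below. The class name, the signed difference (current value minus average), and the value returned when no previous frame is stored yet are assumptions made for this example; the embodiment only states that a difference from the average of previous frames is calculated.

```python
from collections import deque


class ShortTermFeatureBuffer:
    """Keeps the short-term features of the most recent frames (five by default)."""

    def __init__(self, history: int = 5):
        self._values = deque(maxlen=history)

    def variation(self, current: float) -> float:
        """Return current minus the average of the stored previous frames,
        then store the current value for use with later frames."""
        if self._values:
            average = sum(self._values) / len(self._values)
            result = current - average
        else:
            result = 0.0  # assumption: no history yet, so no variation
        self._values.append(current)
        return result
```

Using a bounded `deque` means the oldest short-term feature is discarded automatically once more than `history` frames have been seen.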
When the short-term feature is an LP-LTP gain, the average value is an average of LP-LTP gains of the previous frames preceding the current frame and the variation feature is information describing how much the LP-LTP gain of the current frame deviates from the average value corresponding to a predetermined term or period. As illustrated in
The second long-term feature generation unit 150 generates a long-term feature having a moving average that considers a per-frame change in the variation feature generated by the first long-term feature generation unit 140 under a predetermined constraint. Here, the predetermined constraint represents a condition and a method to apply a weight to the variation feature of a previous frame preceding the current frame.
In particular, the second long-term feature generation unit 150 distinguishes between a case where the variation feature of the current frame is greater than a predetermined threshold and a case where the variation feature of the current frame is less than the predetermined threshold and applies different weights to the variation feature of the previous frame and the variation feature of the current frame, thereby generating the long-term feature. Here, the predetermined threshold is a preset value for distinguishing between a speech mode and a music mode. The generation of the long-term feature will be described in more detail later.
As mentioned above, the buffer 160 includes the short-term feature buffer 161 and the long-term feature buffer 162. The short-term feature buffer 161 stores one or more short-term features generated by the short-term feature generation unit 120 for at least a predetermined period of time and the long-term feature buffer 162 stores one or more long-term features generated by the first long-term feature generation unit 140 and the second long-term feature generation unit 150 for at least a predetermined period of time.
The long-term feature comparison unit 170 compares the long-term feature generated by the second long-term feature generation unit 150 with a predetermined threshold to generate a comparison result. Here, the predetermined threshold is a long-term feature for the case where there is a high possibility that the current mode is a speech mode and is previously determined by statistical analysis with respect to speech signals and music signals. When a threshold SpThr for a long-term feature is set as illustrated in
When the long-term feature is less than the threshold, the encoding mode for the current frame can be determined by a process of adjusting a mode determination threshold and comparing the short-term feature with the adjusted mode determination threshold. The mode determination threshold can be adjusted based on a hit rate of mode determination, and as illustrated in
The mode determination threshold adjustment unit 180 adaptively adjusts the mode determination threshold that is referred to for determining the encoding mode for the current frame when the long-term feature generated by the second long-term feature generation unit 150 is less than the threshold, i.e., when it is difficult to determine the encoding mode for the current frame only with the long-term feature.
The mode determination threshold adjustment unit 180 receives mode information of a previous frame from the encoding mode determination unit 190 and adaptively adjusts the mode determination threshold according to whether the previous frame is in the speech mode or the music mode, the short-term feature received from the short-term feature generation unit 120, and the comparison result received from the long-term feature comparison unit 170. The mode determination threshold is used to determine whether the short-term feature of the current frame has a property of the speech mode or of the music mode. In the present embodiment, the mode determination threshold is adjusted according to the encoding mode of the previous frame preceding the current frame. The adjustment of the mode determination threshold will be described in detail later.
The encoding mode determination unit 190 compares the short-term feature of the current frame received from the short-term feature generation unit 120 with the mode determination threshold STF_THR adjusted by the mode determination threshold adjustment unit 180 in order to determine whether the encoding mode for the current frame is the speech mode or the music mode.
The LP-LTP gain generation unit 121 generates an LP-LTP gain of the current frame by short-term analysis with respect to each frame of the input audio signal as a short-term feature.
The LP analysis unit 121a calculates the values PrdErr and r[0] by performing linear prediction analysis with respect to an audio signal corresponding to the current frame and calculates an LPC gain using the calculated values as follows:
LPC gain = −10·log10(PrdErr/(r[0] + 0.0000001))   (1)
where PrdErr is the prediction error obtained by the Levinson-Durbin recursion, which is a process of obtaining LP filter coefficients, and r[0] is the first autocorrelation coefficient, i.e., the autocorrelation of the frame at lag zero.
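The LPC gain of Equation 1 can be sketched in Python as follows. The sketch assumes that r[0] is the lag-zero autocorrelation of the frame, which is the usual meaning of r[0] in the Levinson-Durbin recursion, and that PrdErr is the final prediction error of that recursion. The function names and the prediction order passed by the caller are hypothetical and are not taken from the embodiment.

```python
import math


def autocorrelation(frame, order):
    """Autocorrelation r[lag] for lag = 0..order (r[0] is the frame energy)."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Return (LP coefficients a[1..order], final prediction error)."""
    a = [0.0] * (order + 1)
    error = r[0]
    for m in range(1, order + 1):
        if error <= 0.0:
            break  # silent or degenerate frame: stop the recursion
        # Reflection coefficient of stage m.
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / error
        updated = a[:]
        updated[m] = k
        for j in range(1, m):
            updated[j] = a[j] - k * a[m - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


def lpc_gain(prd_err, r0):
    """Equation 1: the small constant guards against division by zero."""
    return -10.0 * math.log10(prd_err / (r0 + 0.0000001))
```

Because the prediction error never exceeds r[0], the gain is non-negative and grows as the frame becomes more predictable.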
The LP analysis unit 121a calculates a linear prediction coefficient (LPC) using autocorrelation with respect to the current frame. At this time, a short-term analysis filter is specified by the LPC and a signal passing through the specified filter is transmitted to the open-loop pitch analysis unit 121b.
The open-loop pitch analysis unit 121b calculates a pitch correlation by performing long-term analysis with respect to an audio signal that is filtered by the short-term analysis filter. The open-loop pitch analysis unit 121b calculates an open-loop pitch lag for the maximum cross correlation between an audio signal corresponding to a previous frame stored in the buffer 160 and an audio signal corresponding to the current frame and specifies a long-term analysis filter using the calculated lag. The open-loop pitch analysis unit 121b obtains a pitch using correlation between a previous audio signal and the current audio signal, which is obtained by the LP analysis unit 121a, and divides the correlation by the pitch, thereby calculating a normalized pitch correlation. The normalized pitch correlation rx can be calculated as follows:
where T is an estimation value of an open-loop pitch period and xi is a weighted input signal.
The LP-LTP synthesis unit 121c receives zero excitation as an input and performs LP-LTP synthesis.
The weighted SegSNR calculation unit 121d calculates an LP-LTP gain of a reconstructed signal that is output from the LP-LTP synthesis unit 121c. The LP-LTP gain, which is a short-term feature of the current frame, is transmitted to the LP_LTP gain moving average calculation unit 141.
The LP_LTP gain moving average calculation unit 141 calculates an average of LP-LTP gains of a predetermined number of previous frames preceding the current frame, which are stored in the short-term feature buffer 161.
The first variation feature comparison unit 151 receives a difference SNR_VAR between the moving average calculated by the LP_LTP gain moving average calculation unit 141 and the LP-LTP gain of the current frame and compares the received difference with a predetermined threshold SNR_THR.
The SNR_SP calculation unit 154 calculates a long-term feature SNR_SP by an ‘if’ conditional statement according to the comparison result obtained by the first variation feature comparison unit 151, as follows:
if (SNR_VAR > SNR_THR)
    SNR_SP = a1*SNR_SP + (1−a1)*SNR_VAR   (3),
else
    SNR_SP = SNR_SP − D1
where an initial value of SNR_SP is 0, a1 is a real number between 0 and 1 and is a weight for SNR_SP and SNR_VAR, and D1 is β1×(SNR_THR/LP-LTP gain) in which β1 is a constant indicating the degree of reduction.
In Equation 3, a1 is a constant that suppresses noise-induced mode changes between the speech mode and the music mode, and a larger a1 allows smoother reconstruction of an audio signal. According to the ‘if’ conditional statement expressed by Equation 3, the long-term feature SNR_SP increases when the variation feature SNR_VAR is greater than the threshold SNR_THR, and the long-term feature SNR_SP is reduced from the long-term feature SNR_SP of the previous frame by a predetermined value when the variation feature SNR_VAR is less than the threshold SNR_THR.
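The update described for Equation 3 can be sketched in Python as follows. The sketch follows the description that SNR_SP is reduced by D1 when the variation feature does not exceed the threshold. The function name, the parameter names, and the numeric values in the usage note are hypothetical, and the short-term feature (for example, the LP-LTP gain) is assumed to be non-zero.

```python
def update_speech_possibility(sp_prev, variation, threshold, feature, alpha, beta):
    """One per-frame update of a long-term feature such as SNR_SP.

    sp_prev   -- long-term feature of the previous frame (initially 0)
    variation -- variation feature of the current frame (e.g. SNR_VAR)
    threshold -- predetermined threshold (e.g. SNR_THR)
    feature   -- short-term feature of the current frame (e.g. LP-LTP gain)
    alpha     -- weight between 0 and 1 (a1 in Equation 3)
    beta      -- constant indicating the degree of reduction (beta1)
    """
    if variation > threshold:
        # Moving average of the previous long-term feature and the variation.
        return alpha * sp_prev + (1.0 - alpha) * variation
    # Otherwise reduce the previous value by D = beta * (threshold / feature).
    # Assumption: `feature` is non-zero.
    return sp_prev - beta * (threshold / feature)
```

For example, with alpha = 0.9 a single frame whose variation is 4.0 raises a long-term feature of 0 only to about 0.4, which illustrates how a large weight damps frame-to-frame fluctuation.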
The SNR_SP calculation unit 154 calculates the long-term feature SNR_SP by executing the ‘if’ conditional statement expressed by Equation 3. The variation feature SNR_VAR is also a kind of long-term feature, but is transformed into the long-term feature SNR_SP having a distribution illustrated in
Referring back to
etilt=El/Eh (4),
where Eh is an average energy in a high band and El is an average energy in a low band. The spectrum tilt average calculation unit 142 calculates an average of spectrum tilts of a predetermined number of frames preceding the current frame, which are stored in the short-term feature buffer 161, or calculates an average of spectrum tilts including the spectrum tilt of the current frame generated by the spectrum tilt generation unit 122.
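A minimal Python sketch of the spectrum tilt of Equation 4 is given below. The embodiment does not state how the low band and the high band are delimited, so the sketch assumes that the one-sided spectrum is split in the middle, that the average energy is the mean squared magnitude of the bins in each band, and that a small constant eps is added to the denominator. These choices, and the function name, are assumptions for illustration only.

```python
import cmath
import math


def spectrum_tilt(frame, eps=1e-12):
    """Ratio of average low-band energy to average high-band energy.

    Assumption: the frame holds at least two samples and the bands are
    the lower and upper halves of the one-sided spectrum.
    """
    n = len(frame)
    n_bins = n // 2 + 1
    energies = []
    for k in range(n_bins):
        # Naive DFT of bin k; adequate for a short illustrative frame.
        x_k = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        energies.append(abs(x_k) ** 2)
    split = n_bins // 2
    low, high = energies[:split], energies[split:]
    e_l = sum(low) / len(low)
    e_h = sum(high) / len(high)
    return e_l / (e_h + eps)
```

Under these assumptions a frame dominated by low-frequency content yields a tilt well above 1, and a frame dominated by high-frequency content yields a tilt below 1.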
The second variation feature comparison unit 152 receives a difference TILT_VAR between the average generated by the spectrum tilt average calculation unit 142 and the spectrum tilt of the current frame generated by the spectrum tilt generation unit 122 and compares the received difference with a predetermined threshold TILT_THR.
The TILT_SP calculation unit 155 calculates a tilt speech possibility TILT_SP that is a long-term feature by executing an ‘if’ conditional statement expressed by Equation 5 according to the comparison result obtained by the second variation feature comparison unit 152, as follows:
if (TILT_VAR > TILT_THR)
    TILT_SP = a2*TILT_SP + (1−a2)*TILT_VAR   (5),
else
    TILT_SP = TILT_SP − D2
where an initial value of TILT_SP is 0, a2 is a real number between 0 and 1 and is a weight for TILT_SP and TILT_VAR, and D2 is β2×(TILT_THR/SPECTRUM TILT) in which β2 is a constant indicating the degree of reduction. A detailed description that is common to TILT_SP and SNR_SP will not be given.
Referring back to
if(S(n)·S(n−1)<0)ZCR=ZCR+1 (6),
where S(n) is a variable for determining whether an audio signal corresponding to the current frame n is a positive value or a negative value and an initial value of ZCR is 0.
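The counting rule of Equation 6 can be sketched in Python as follows, assuming that S(n) denotes the n-th sample of the current frame and that the count is taken over consecutive sample pairs of one frame. The function name is hypothetical.

```python
def zero_crossing_rate(samples):
    """Count sign changes between consecutive samples (Equation 6)."""
    zcr = 0  # initial value of ZCR is 0
    for previous, current in zip(samples, samples[1:]):
        # A strictly negative product means the two samples differ in sign.
        if current * previous < 0:
            zcr += 1
    return zcr
```

Because the product test is strict, a sample that is exactly zero does not contribute a crossing in this sketch.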
The ZCR average calculation unit 143 calculates an average of zero crossing rates of a predetermined number of previous frames preceding the current frame, which are stored in the short-term feature buffer 161, or calculates an average of zero crossing rates including the zero crossing rate of the current frame, which is generated by the ZCR generation unit 123.
The third variation feature comparison unit 153 receives a difference ZC_VAR between the average generated by the ZCR average calculation unit 143 and the zero crossing rate of the current frame generated by the ZCR generation unit 123 and compares the received difference with a predetermined threshold ZC_THR.
The ZC_SP calculation unit 156 calculates ZC_SP that is a long-term feature by executing an ‘if’ conditional statement expressed by Equation 7 according to the comparison result obtained by the third variation feature comparison unit 153, as follows:
if (ZC_VAR > ZC_THR)
    ZC_SP = a3*ZC_SP + (1−a3)*ZC_VAR   (7),
else
    ZC_SP = ZC_SP − D3
where an initial value of ZC_SP is 0, a3 is a real number between 0 and 1 and is a weight for ZC_SP and ZC_VAR, D3 is β3×(ZC_THR/zero-crossing rate) in which β3 is a constant indicating the degree of reduction, and zero-crossing rate is a zero crossing rate of the current frame. A detailed description that is common to ZC_SP and SNR_SP will not be given.
The SPP generation unit 157 generates a speech presence possibility (SPP) using the long-term features calculated by the SNR_SP calculation unit 154, the TILT_SP calculation unit 155, and the ZC_SP calculation unit 156, as follows:
SPP = SNR_W·SNR_SP + TILT_W·TILT_SP + ZC_W·ZC_SP   (8),
where SNR_W is a weight for SNR_SP, TILT_W is a weight for TILT_SP, and ZC_W is a weight for ZC_SP.
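Equation 8 is a weighted sum of the three long-term features and can be sketched in Python as follows. The default weight values are placeholders chosen only for illustration; the embodiment does not state the actual weights.

```python
def speech_presence_possibility(snr_sp, tilt_sp, zc_sp,
                                snr_w=0.5, tilt_w=0.25, zc_w=0.25):
    """Equation 8: SPP as a weighted sum of SNR_SP, TILT_SP and ZC_SP.

    The default weights are hypothetical placeholders.
    """
    return snr_w * snr_sp + tilt_w * tilt_sp + zc_w * zc_sp
```

Setting a weight to zero removes the corresponding long-term feature from the sum, which is one way to obtain an SPP from a subset of the features.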
Referring to
Although the short-term feature generation unit 120 is described to include the LP-LTP gain generation unit 121, the spectrum tilt generation unit 122, and the zero crossing rate (ZCR) generation unit 123, it is possible that the short-term feature generation unit 120 includes one or a combination of the LP-LTP gain generation unit 121, the spectrum tilt generation unit 122, and the zero crossing rate (ZCR) generation unit 123.
Also, the long-term feature generation unit 130 may include one or a combination of a first processing unit including the LP_LTP gain moving average calculation unit 141, the first variation feature comparison unit 151, and the SNR_SP calculation unit 154, a second processing unit including the spectrum tilt average calculation unit 142, the second variation feature comparison unit 152, and the TILT_SP calculation unit 155, and a third processing unit including the ZCR average calculation unit 143, the third variation feature comparison unit 153, and the ZC_SP calculation unit 156, according to the one or combination of the LP-LTP gain generation unit 121, the spectrum tilt generation unit 122, and the zero crossing rate (ZCR) generation unit 123 of the short-term feature generation unit 120.
In this case, the SPP generation unit 157 may calculate the speech presence possibility (SPP) from one or a combination of the long-term features SNR_SP, TILT_SP, and ZC_SP.
Referring to
In operation 1200, the long-term feature generation unit 130 calculates long-term features SNR_SP, TILT_SP, and ZC_SP by performing long-term analysis with respect to the short-term features generated by the short-term feature generation unit 120 and applies weights to the long-term features, thereby calculating an SPP.
In operation 1100 and operation 1200, short-term features and long-term features of the current frame are calculated. However, in order to determine the encoding mode for the audio signal, it is also necessary to conduct training with respect to speech data and music data, i.e., to calculate short-term features and long-term features of such data by performing operation 1100 and operation 1200. Through the training, the distributions of the short-term features and the long-term features can be established, and the encoding mode for each frame of the audio signal can be determined as will be described below.
In operation 1300, the long-term feature comparison unit 170 compares SPP of the current frame calculated in operation 1200 with a preset long-term feature threshold SpThr. When SPP is greater than SpThr, the speech mode is determined as the encoding mode for the current frame. When SPP is less than SpThr, a mode determination threshold is adjusted and the adjusted mode determination threshold is compared with a short-term feature, thereby determining the encoding mode for the current frame.
In operation 1400, the mode determination threshold adjustment unit 180 receives mode information about the encoding mode of the previous frame from the long-term feature comparison unit 170 and determines whether the encoding mode of the previous frame is the speech mode or the music mode according to the received mode information.
In operation 1410, the mode determination threshold adjustment unit 180 outputs a value obtained by dividing a mode determination threshold STF_THR for determining a short-term feature of the current frame by a value Sx when the encoding mode of the previous frame is the speech mode. Sx is a value having an attribute of a cumulative probability of a speech signal and is intended to increase or reduce the mode determination threshold. Referring to
In operation 1420, the mode determination threshold adjustment unit 180 outputs a product of the mode determination threshold STF_THR for determining the short-term feature of the current frame and a value Mx when the encoding mode of the previous frame is the music mode. Mx is a value having an attribute of a cumulative probability of a music signal and is intended to increase or reduce the mode determination threshold. As illustrated in
In operation 1430, the mode determination threshold adjustment unit 180 compares a short-term feature of the current frame with the mode determination threshold that is adaptively adjusted in operation 1410 or operation 1420 and outputs the comparison result.
When the short-term feature of the current frame is less than the mode determination threshold in operation 1430, the encoding mode determination unit 190 determines the music mode as the encoding mode for the current frame and outputs the determination result as mode information in operation 1500.
When the short-term feature of the current frame is greater than the mode determination threshold in operation 1430, the encoding mode determination unit 190 determines the speech mode as the encoding mode for the current frame and outputs the determination result as mode information in operation 1600.
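Operations 1300 through 1600 can be sketched in Python as follows. The function name, the string constants, and the treatment of the case in which the short-term feature equals the adjusted threshold (here resolved to the speech mode) are assumptions for illustration. The values of SpThr, STF_THR, Sx, and Mx are supplied by the caller, and Sx is assumed to be non-zero.

```python
SPEECH = "speech"
MUSIC = "music"


def determine_encoding_mode(spp, sp_thr, short_term_feature, stf_thr,
                            previous_mode, sx, mx):
    """Return the encoding mode of the current frame."""
    # Operation 1300: a high speech presence possibility decides directly.
    if spp > sp_thr:
        return SPEECH
    # Operations 1400 to 1420: adjust the threshold by the previous mode.
    if previous_mode == SPEECH:
        adjusted_thr = stf_thr / sx   # operation 1410
    else:
        adjusted_thr = stf_thr * mx   # operation 1420
    # Operations 1430 to 1600: compare the short-term feature.
    if short_term_feature < adjusted_thr:
        return MUSIC                  # operation 1500
    return SPEECH                     # operation 1600 (equality: assumption)
```

With the same short-term feature, the decision can differ depending on the mode of the previous frame, which is how the adjusted threshold suppresses frequent mode switching between frames.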
Referring to
The frequency-domain decoding unit 2400 decodes the received bitstream in the frequency domain and the time-domain decoding unit 2500 decodes the received bitstream in the time domain. A mixing unit 2600 mixes decoded signals in order to reconstruct an audio signal.
The present general inventive concept can also be embodied as computer-readable code on a computer-readable medium. The computer-readable medium can include a computer-readable recording medium and a computer-readable transmission medium. The computer-readable recording medium is any data storage device that can store data which can be thereafter read by a computer system.
Examples of the computer-readable recording medium include read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, floppy disks, optical data storage devices, and so on. The computer-readable recording medium can also be distributed over network coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion. The computer-readable transmission medium can transmit carrier waves and signals (e.g., wired or wireless data transmission through the Internet). Also, functional programs, code, and code segments for implementing the present invention can be easily construed by programmers skilled in the art.
As described above, according to the present general inventive concept, an encoding mode for the current frame is determined by adaptively adjusting a mode determination threshold for the current frame according to a long-term feature of the audio signal, thereby improving a hit rate of encoding mode determination and signal classification, suppressing frequent mode switching per frame, improving noise tolerance, and providing smooth reconstruction of the audio signal.
Although a few embodiments of the present general inventive concept have been shown and described, it will be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the general inventive concept, the scope of which is defined in the appended claims and their equivalents.
Sung, Ho-sang, Lee, Kang-eun, Kim, Jung-hoe, Oh, Eun-mi, Choo, Ki-hyun, Son, Chang-yong
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Nov 01 2007 | SON, CHANG-YONG | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 020102 | /0506 | |
Nov 01 2007 | OH, EUN-MI | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 020102 | /0506 | |
Nov 01 2007 | CHOO, KI-HYUN | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 020102 | /0506 | |
Nov 01 2007 | KIM, JUNG-HOE | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 020102 | /0506 | |
Nov 01 2007 | SUNG, HO-SANG | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 020102 | /0506 | |
Nov 01 2007 | LEE, KANG-EUN | SAMSUNG ELECTRONICS CO , LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 020102 | /0506 | |
Nov 13 2007 | Samsung Electronics Co., Ltd | (assignment on the face of the patent) | / |