An echo cancellation system that detects and compensates for differences in sample rates between the echo cancellation system and a set of wireless speakers based on a frequency-domain analysis. The system generates Fourier transforms for a microphone signal and a reference signal and determines a series of angles for individual frames. For each tone in the Fourier transforms, the system determines the angles and uses linear regression to determine an individual frequency offset associated with the tone. Using the individual frequency offsets associated with the tones, the system uses linear regression to determine an overall frequency offset between the audio sent to the speakers and the audio received from a microphone. Based on the overall frequency offset, samples of the audio are added or dropped when echo cancellation is performed, compensating for the frequency offset.
|
5. A computer-implemented method, comprising:
receiving a first reference signal in a frequency domain, the first reference signal being a Discrete Fourier Transform (DFT) of a second reference signal in a time domain;
receiving a first input signal in the frequency domain, the first input signal being a DFT of an audio signal in the time domain;
determining a first summation for a first frame at a first tone index using the first input signal and a complex conjugate of the first reference signal;
determining a second summation for a second frame at the first tone index using the first input signal and the complex conjugate of the first reference signal, the second frame following the first frame;
determining a first angle associated with the first frame using the first summation;
determining a second angle associated with the second frame using the first summation and the second summation;
performing a first linear regression to determine a first linear fit based on the first angle and the second angle; and
determining a first frequency offset between the first reference signal and the first input signal based on the first linear fit, wherein the first frequency offset is a difference between a first sampling rate of the first reference signal and a second sampling rate of the first input signal.
13. A system, comprising:
at least one processor;
a memory device including instructions operable to be executed by the at least one processor to configure the system for:
receiving a first reference signal in a frequency domain, the first reference signal being a Discrete Fourier Transform (DFT) of a second reference signal in a time domain;
receiving a first input signal in the frequency domain, the first input signal being a DFT of an audio signal in the time domain;
determining a first summation for a first frame at a first tone index using the first input signal and a complex conjugate of the first reference signal;
determining a second summation for a second frame at the first tone index using the first input signal and the complex conjugate of the first reference signal, the second frame following the first frame;
determining a first angle associated with the first frame using the first summation;
determining a second angle associated with the second frame using the first summation and the second summation;
performing a first linear regression to determine a first linear fit based on the first angle and the second angle; and
determining a first frequency offset between the first reference signal and the first input signal based on the first linear fit, wherein the first frequency offset is a difference between a first sampling rate of the first reference signal and a second sampling rate of the first input signal.
1. A computer-implemented method for removing a frequency offset from a received audio signal, the method comprising:
transmitting a first reference signal to a first wireless speaker;
receiving a first signal from a first microphone, the first signal representing audible sound output by the first wireless speaker;
generating a second signal using the first signal, the second signal aligned to the first reference signal to remove a propagation delay between the first reference signal and the first signal;
applying a Fast Fourier Transform (FFT) to the second signal to determine a first microphone signal in a frequency domain;
applying the FFT to the first reference signal to determine a first reference signal in the frequency domain;
determining a first summation for a first frame at a first tone index of a plurality of tone indexes using the first microphone signal and a complex conjugate of the first reference signal;
determining a second summation for a second frame at the first tone index using the first microphone signal and the complex conjugate of the first reference signal, the second frame following the first frame;
determining a first angle associated with the first frame using the first summation, wherein the first angle is in radians and corresponds to a phase difference between the first reference signal and the first microphone signal;
determining a second angle associated with the second frame using the first summation and the second summation, wherein the second angle is in radians;
determining that the first angle is less than a threshold value;
determining that the second angle is less than the threshold value;
performing a first linear regression to determine a first linear fit based on the first angle and the second angle;
determining a first frequency offset between the first reference signal and the second signal based on the first linear fit, wherein the first frequency offset is a difference between a first sampling rate of the first reference signal and a second sampling rate of the second signal;
determining that the first frequency offset has a negative value; and
removing at least one sample of the first reference signal per cycle based on the first frequency offset.
2. The computer-implemented method of
multiplying a first complex value of the first microphone signal by a complex conjugate of a second complex value of the first reference signal to determine a first product, the first complex value and the second complex value associated with the first frequency and the first frame;
multiplying a third complex value of the first microphone signal by a complex conjugate of a fourth complex value of the first reference signal to determine a second product, the third complex value and the fourth complex value associated with the first frequency and the second frame; and
generating the first summation by summing the first product and the second product.
3. The computer-implemented method of
multiplying the second summation by a complex conjugate of the first summation to determine a first product;
determining a third angle of the first product;
multiplying the first tone index by 2π to determine a second product; and
determining the first angle by dividing the third angle by the second product.
4. The computer-implemented method of
determining a second frequency offset between a second reference signal and a third signal, wherein the second frequency offset is a difference between a third sampling rate of the second reference signal and a fourth sampling rate of the third signal;
determining that the second frequency offset is a positive value; and
adding a duplicate copy of at least one sample of the second reference signal to the second reference signal based on the second frequency offset.
6. The computer-implemented method of
determining that the first frequency offset has a negative value; and
removing at least one sample of the first reference signal from the first reference signal per cycle.
7. The computer-implemented method of
determining that the first frequency offset has a positive value; and
adding a duplicate copy of at least one sample of the first reference signal to the first reference signal per cycle.
8. The computer-implemented method of
determining, using the second summation, a third angle associated with the first frame;
determining that the third angle is above a threshold; and
performing the first linear regression to determine the first linear fit based on the first angle and the second angle.
9. The computer-implemented method of
multiplying a first complex value of the first input signal by a complex conjugate of a second complex value of the first reference signal to determine a first product, the first complex value and the second complex value associated with the first tone index and the first frame;
multiplying a third complex value of the first input signal by a complex conjugate of a fourth complex value of the first reference signal to determine a second product, the third complex value and the fourth complex value associated with the first tone index and the second frame; and
generating the first summation by summing the first product and the second product.
10. The computer-implemented method of
multiplying the second summation by a complex conjugate of the first summation to determine a first product;
determining a third angle of the first product;
multiplying the first tone index by 2π to determine a second product; and
determining the first angle by dividing the third angle by the second product.
11. The computer-implemented method of
transmitting the second reference signal to a first wireless speaker;
receiving the audio signal from a first microphone, the audio signal representing audible sound output by the first wireless speaker;
applying a Fast Fourier Transform (FFT) to the audio signal to determine the first input signal; and
applying the FFT to the second reference signal to determine the first reference signal.
12. The computer-implemented method of
determining a second frequency offset between the first reference signal and the first input signal associated with a second tone index;
performing a second linear regression to determine a second linear fit based on the first frequency offset and the second frequency offset; and
determining a third frequency offset between the first reference signal and the first input signal based on the second linear fit.
14. The system of
determining that the first frequency offset has a negative value; and
removing at least one sample of the first reference signal from the first reference signal per cycle.
15. The system of
determining that the first frequency offset has a positive value; and
adding a duplicate copy of at least one sample of the first reference signal to the first reference signal per cycle.
16. The system of
determining, using the second summation, a third angle associated with the first frame;
determining that the third angle is above a threshold; and
performing the first linear regression to determine the first linear fit based on the first angle and the second angle.
17. The system of
multiplying a first complex value of the first input signal by a complex conjugate of a second complex value of the first reference signal to determine a first product, the first complex value and the second complex value associated with the first tone index and the first frame;
multiplying a third complex value of the first input signal by a complex conjugate of a fourth complex value of the first reference signal to determine a second product, the third complex value and the fourth complex value associated with the first tone index and the second frame; and
generating the first summation by summing the first product and the second product.
18. The system of
multiplying the second summation by a complex conjugate of the first summation to determine a first product;
determining a third angle of the first product;
multiplying two by π by the first tone index to determine a second product; and
determining the first angle by dividing the third angle by the second product.
19. The system of
transmitting the second reference signal to a first wireless speaker;
receiving the audio signal from a first microphone, the audio signal representing audible sound output by the first wireless speaker;
applying a Fast Fourier Transform (FFT) to the audio signal to determine the first input signal; and
applying the FFT to the second reference signal to determine the first reference signal.
20. The system of
determining a second frequency offset between the first reference signal and the first input signal associated with a second tone index;
performing a second linear regression to determine a second linear fit based on the first frequency offset and the second frequency offset; and
determining a third frequency offset between the first reference signal and the first input signal based on the second linear fit.
|
In audio systems, automatic echo cancellation (AEC) refers to techniques that are used to recognize when a system has recaptured sound via a microphone after some delay that the system previously output via a speaker. Systems that provide AEC subtract a delayed version of the original audio signal from the captured audio, producing a version of the captured audio that ideally eliminates the “echo” of the original audio signal, leaving only new audio information. For example, if someone were singing karaoke into a microphone while prerecorded music is output by a loudspeaker, AEC can be used to remove any of the recorded music from the audio captured by the microphone, allowing the singer's voice to be amplified and output without also reproducing a delayed “echo” of the original music. As another example, a media player that accepts voice commands via a microphone can use AEC to remove reproduced sounds corresponding to output media that are captured by the microphone, making it easier to process input voice commands.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
Many electronic devices operate based on a timing “clock” signal produced by a crystal oscillator. For example, when a computer is described as operating at 2 GHz, the 2 GHz refers to the frequency of the computer's clock. This clock signal can be thought of as the basis for an electronic device's “perception” of time. Specifically, a synchronous electronic device may time its own operations based on cycles of its own clock. If there is a difference between otherwise identical devices' clocks, these differences can result in some devices operating faster or slower than others.
In stereo and multi-channel audio systems that include wireless or network-connected loudspeakers and/or microphones, a major cause of problems for conventional AEC is when there is a difference in clock synchronization between loudspeakers and microphones. For example, in a wireless “surround sound” 5.1 system comprising six wireless loudspeakers that each receive an audio signal from a surround-sound receiver, the receiver and each loudspeaker has its own crystal oscillator which provides the respective component with an independent “clock” signal.
Among other things, the clock signals are used for converting analog audio signals into digital audio signals (“A/D conversion”) and converting digital audio signals into analog audio signals (“D/A conversion”). Such conversions are commonplace in audio systems, such as when a surround-sound receiver performs A/D conversion prior to transmitting audio to a wireless loudspeaker, and when the loudspeaker performs D/A conversion on the received signal to recreate an analog signal. The loudspeaker produces audible sound by driving a “voice coil” with an amplified version of the analog signal.
An implicit premise in using an acoustic echo canceller (AEC) is that the clock for A/D conversion for a microphone and the clock for D/A conversion are generated from the same oscillator (i.e., there is no frequency offset between A/D conversion and D/A conversion). In modern complex devices (PCs, smartphones, smart TVs, etc.), this condition cannot be satisfied, because of the use of multiple audio devices, external devices connected by USB or wireless, and so on. The difference in sampling rate between the clocks degrades the AEC performance. That means that a standard AEC cannot be used if the clocks for A/D and D/A conversion are not derived from the same crystal.
A problem for an AEC system occurs when the audio that the surround-sound receiver transmits to a speaker is output at a subtly different “sampling” rate by the loudspeaker. When the AEC system attempts to remove the audio output by the loudspeaker from audio captured by the system's microphone(s) by subtracting a delayed version of the originally transmitted audio, the playback rate of the audio captured by the microphone is subtly different than the audio that had been sent to the loudspeaker.
For example, consider loudspeakers built for use in a surround-sound system that transfers audio data using a 48 kHz sampling rate (i.e., 48,000 digital samples per second of analog audio signal). An actual rate based on a first component's clock signal might be 48,000.001 samples per second, whereas another component might operate at an actual rate of 48,000.002 samples per second. This difference of 0.001 samples per second between actual frequencies is referred to as a frequency “offset.” The consequence of a frequency offset is an accumulated “drift” in the timing between the components over time. Uncorrected, after one thousand seconds, the accumulated drift is an entire sample of difference between components.
In practice, each loudspeaker in a multi-channel audio system may have a different frequency offset to the surround sound receiver, and the loudspeakers may have different frequency offsets relative to each other. If the microphone(s) are also wireless or network-connected to the AEC system (e.g., a microphone on a wireless headset), they may also contribute to the accumulated drift between the captured reproduced audio signal(s) and the captured audio signal(s).
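By way of a non-limiting illustration, the drift arithmetic in the preceding example may be sketched in Python as follows. The function name `accumulated_drift_samples` is hypothetical and is used only for this sketch.

```python
def accumulated_drift_samples(rate_a_hz, rate_b_hz, seconds):
    """Number of samples by which two free-running clocks diverge.

    The drift is the frequency offset (samples per second) multiplied
    by the elapsed time in seconds.
    """
    offset_hz = abs(rate_a_hz - rate_b_hz)
    return offset_hz * seconds


# Rates from the example: 48,000.001 and 48,000.002 samples per second.
drift = accumulated_drift_samples(48000.001, 48000.002, 1000.0)
print(f"drift after 1000 s: {drift:.6f} samples")
```

With the rates given above, the result is approximately one sample of drift after one thousand seconds.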
The portion of the sounds output by each of the loudspeakers that reaches each of the microphones 118a/118b can be characterized based on transfer functions.
The transfer functions (e.g., 116a, 116b) characterize the acoustic “impulse response” of the room 104 relative to the individual components. The impulse response, or impulse response function, of the room 104 characterizes the signal from a microphone when presented with a brief input signal (e.g., an audible noise), called an impulse. The impulse response describes the reaction of the system as a function of time. If the impulse response between each of the loudspeakers 114a/114b and the microphone 118a is known, and the content of the reference signals x1(n) 112a and x2(n) 112b output by the loudspeakers is known, then the transfer functions 116a and 116b can be used to estimate the actual loudspeaker-reproduced sounds that will be received by a microphone (in this case, microphone 118a). The microphone 118a converts the captured sounds into a signal y1(n) 120a. A second set of transfer functions is associated with the other microphone 118b, which converts captured sounds into a signal y2(n) 120b.
The “echo” signal y1(n) 120a contains some of the reproduced sounds from the reference signals x1(n) 112a and x2(n) 112b, in addition to any additional sounds picked up in the room 104. The echo signal y1(n) 120a can be expressed as:
y1(n)=h1(n)*x1(n)+h2(n)*x2(n) [1]
where h1(n) 116a and h2(n) 116b are the loudspeaker-to-microphone impulse responses in the receiving room 104, x1(n) 112a and x2(n) 112b are the loudspeaker reference signals, * denotes a mathematical convolution, and “n” is an audio sample.
The acoustic echo canceller 102a calculates estimated transfer functions ĥ1 (n) 122a and ĥ2 (n) 122b. These estimated transfer functions produce an estimated echo signal ŷ1(n) 124a corresponding to an estimate of the echo component in the echo signal y1(n) 120a. The estimated echo signal can be expressed as:
ŷ1(n)=ĥ1(n)*x1(n)+ĥ2(n)*x2(n) [2]
where * again denotes convolution. Subtracting the estimated echo signal 124a from the echo signal 120a produces the error signal e1(n) 126a, which together with the error signal e2(n) 126b for the other channel, serves as the output (i.e., audio output 128). Specifically:
e1(n)=y1(n)−ŷ1(n) [3]
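As a minimal sketch of Equations 1 through 3, and not a definitive implementation, the two-loudspeaker echo model and the error signal may be expressed in Python. The helper names `convolve`, `echo_signal`, and `error_signal`, and the short impulse responses used in the example, are assumptions made for illustration only.

```python
def convolve(h, x):
    """Causal FIR filtering: out[n] = sum_i h[i] * x[n - i], for n < len(x)."""
    out = []
    for n in range(len(x)):
        acc = 0.0
        for i, tap in enumerate(h):
            if n - i >= 0:
                acc += tap * x[n - i]
        out.append(acc)
    return out


def echo_signal(h1, x1, h2, x2):
    """Equation 1 (and, with estimated responses, Equation 2)."""
    return [a + b for a, b in zip(convolve(h1, x1), convolve(h2, x2))]


def error_signal(y, y_hat):
    """Equation 3: captured signal minus the estimated echo."""
    return [a - b for a, b in zip(y, y_hat)]


# Illustrative values only: unit impulses on each reference channel.
x1 = [1.0, 0.0, 0.0, 0.0]
x2 = [0.0, 1.0, 0.0, 0.0]
y1 = echo_signal([0.5, 0.25], x1, [0.1], x2)
e1 = error_signal(y1, echo_signal([0.5, 0.25], x1, [0.1], x2))
```

When the estimated impulse responses equal the actual impulse responses, as in this example, the error signal is zero at every sample.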
The acoustic echo canceller 102a calculates frequency domain versions of the estimated transfer functions ĥ1(n) 122a and ĥ2(n) 122b using short term adaptive filter coefficients W(k,r). In conventional AEC systems operating in the time domain, the adaptive filter coefficients are derived using least mean squares (LMS) or stochastic gradient algorithms, which use an instantaneous estimate of a gradient to update an adaptive weight vector at each time step. With this notation, the LMS algorithm can be iteratively expressed in the usual form:
h_new=h_old+μ*e*x [4]
where h_new is an updated transfer function, h_old is a transfer function from a prior iteration, μ is the step size between samples, e is an error signal, and x is a reference signal.
Applying such adaptation over time (i.e., over a series of samples), it follows that the error signal “e” should eventually converge to zero for a suitable choice of the step size μ (assuming that the sounds captured by the microphone 118a correspond to sound entirely based on the reference signals 112a and 112b rather than additional ambient noises, such that the estimated echo signal ŷ1(n) 124a cancels out the echo signal y1(n) 120a). However, e→0 does not always imply that h−ĥ→0, where the estimated transfer function ĥ cancelling the corresponding actual transfer function h is the goal of the adaptive filter. For example, the estimated transfer function ĥ may cancel a particular string of samples, but be unable to cancel all signals, e.g., if the string of samples has no energy at one or more frequencies. As a result, effective cancellation may be intermittent or transitory. Having the estimated transfer function ĥ approximate the actual transfer function h is the goal of single-channel echo cancellation, and becomes even more critical in the case of multichannel echo cancellers that require estimation of multiple transfer functions.
While drift accumulates over time, the need for multiple estimated transfer functions ĥ in multichannel echo cancellers accelerates the mismatch between the echo signal y from a microphone and the estimated echo signal ŷ from the echo canceller. To mitigate and eliminate drift, it is therefore necessary to estimate the frequency offset for each channel, so that each estimated transfer function ĥ can compensate for differences in component clocks.
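The LMS update of Equation 4 may be illustrated with a minimal single-channel sketch. The function name `lms_identify`, the two-tap "room" response, the step size of 0.1, and the seeded pseudo-random reference signal are all assumptions chosen so that the example is small and repeatable; they are not taken from the disclosure.

```python
import random


def lms_identify(x, y, num_taps, mu):
    """Adapt an FIR estimate h so that h convolved with x approximates y."""
    h = [0.0] * num_taps
    errors = []
    for n in range(len(x)):
        # Most recent reference samples, newest first (zero before start).
        window = [x[n - i] if n - i >= 0 else 0.0 for i in range(num_taps)]
        y_hat = sum(hi * xi for hi, xi in zip(h, window))
        e = y[n] - y_hat
        # Equation 4 applied per tap: h_new = h_old + mu * e * x.
        h = [hi + mu * e * xi for hi, xi in zip(h, window)]
        errors.append(e)
    return h, errors


rng = random.Random(0)  # fixed seed so the run is repeatable
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
h_true = [0.6, -0.3]    # hypothetical impulse response
y = [
    sum(h_true[i] * (x[n - i] if n - i >= 0 else 0.0) for i in range(2))
    for n in range(len(x))
]
h_est, errors = lms_identify(x, y, num_taps=2, mu=0.1)
```

Because the reference signal in this sketch has energy at all frequencies and no ambient noise is present, both the error and the difference between the estimated and actual responses shrink toward zero, which is the favorable case described above.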
The relative frequency offset can be defined in terms of “ppm” (parts-per-million) error between components. The normalized sampling clock frequency offset (error) is defined as:
PPM error=Ftx/Frx−1 [5]
For example, if a loudspeaker (transmitter) sampling frequency Ftx is 48,000 Hz and a microphone (receiver) sampling frequency Frx is 48,001 Hz, then the frequency offset between Ftx and Frx is −20.833 ppm. During 1 second, the transmitter and receiver are creating 48,000 and 48,001 samples respectively. Hence, there will be 1 additional sample created at the receiver side during every second.
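The calculation in Equation 5 may be checked with a short Python sketch. The function name `ppm_error` is hypothetical; the result is scaled by one million so that it is expressed in parts per million.

```python
def ppm_error(f_tx_hz, f_rx_hz):
    """Equation 5, scaled to parts per million: (Ftx / Frx - 1) * 1e6."""
    return (f_tx_hz / f_rx_hz - 1.0) * 1e6


offset_ppm = ppm_error(48000.0, 48001.0)
print(f"{offset_ppm:.3f} ppm")
```

For the sampling frequencies given in the example, the computed offset is approximately −20.833 ppm.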
The time domain input signal y(n) 120 and the time domain reference signal x(n) 112 are input to a propagation delay estimator 160 that determines the propagation delay and aligns the input signal y(n) 120 with the reference signal x(n) 112, generating aligned input signal y′(n) 150. The propagation delay estimator 160 may determine the propagation delay using techniques known to one of skill in the art and the aligned input signal y′(n) 150 is assumed to be determined for the purposes of this disclosure. For example, the propagation delay estimator 160 may identify a peak value in the reference signal x(n) 112, identify the peak value in the input signal y(n) 120 and may determine a propagation delay based on the peak values.
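One simple way to picture the peak-based alignment mentioned above is sketched below. This is an illustration only, under the assumption that each signal contains a single dominant peak; the function names `estimate_delay` and `align` are hypothetical, and a practical propagation delay estimator may use other techniques.

```python
def estimate_delay(x, y):
    """Delay in samples between the largest peak of x and that of y."""
    peak_x = max(range(len(x)), key=lambda i: abs(x[i]))
    peak_y = max(range(len(y)), key=lambda i: abs(y[i]))
    return peak_y - peak_x


def align(y, delay):
    """Shift y earlier by `delay` samples (or later if delay is negative)."""
    if delay >= 0:
        return y[delay:]
    return [0.0] * (-delay) + y


x = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # reference with a peak at n = 1
y = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0]   # captured copy, peak at n = 4
delay = estimate_delay(x, y)
y_aligned = align(y, delay)
```

After alignment, the peak of the captured signal falls at the same sample index as the peak of the reference signal.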
The AEC 102 applies a short-time Fourier transform (STFT) 162 to the aligned time domain signal y′(n) 150, producing the frequency-domain input values Y(k,r) 154, where the tone index “k” is 0 to N−1 and “r” is a frame index. The AEC 102 also applies an STFT 164 to the time-domain reference signal x(n) 112, producing the frequency-domain reference values X(k,r) 152.
The frequency-domain input values Y(k,r) 154 and the frequency-domain reference values X(k,r) 152 are input to block 166 to determine individual frequency offsets for each tone index “k,” generating individual frequency offsets PPM(k) 156. For example, the AEC 102 may perform the steps of
The individual frequency offsets PPM(k) 156 may be input to block 168 and the AEC 102 may determine an overall frequency offset PPM 158, as described in greater detail above with regard to
As illustrated in
S_m(k)=Σ_{m=1}^{M} X_m(k)*conj(Y_m(k)) [6]
where m is a current frame index, M is a number of previous frame indices, X_m(k) corresponds to X(k,r) 152 and Y_m(k) corresponds to Y(k,r) 154. The AEC 102 may determine a series of correlation matrix S_m(k) values for Q consecutive frame indices:
SS(k)=[S_m(k), S_{m+1}(k), S_{m+2}(k), . . . , S_{m+Q-1}(k)] [7]
The AEC 102 may determine (134) angles (αm) representing a rotation (e.g. phase difference) of Xm(k) relative to Ym(k) for each frame index (m) and each tone index (k) for the series of Q consecutive frames. For example, the AEC 102 may calculate the angles using:
A(k)=[α_1, α_2, . . . , α_{Q-1}] [8.1]
where
α_j=angle(P(k))/(2*pi*k) [8.2]
and
P(k)=S_{m+j}(k)*conj(S_{m+j-1}(k)) [8.3]
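Equations 6 through 8.3 may be sketched in Python for a single tone index k. The sketch assumes that each summation covers the M most recent frames (a sliding window), which is one reading of "M previous frame indices"; the function names `correlation_sums` and `frame_angles` and the synthetic test values are hypothetical.

```python
import cmath
import math


def correlation_sums(X_k, Y_k, M):
    """Equation 6 for every frame with M frames of history available.

    X_k and Y_k are lists of complex values for one tone index k,
    ordered by frame index.
    """
    products = [x * y.conjugate() for x, y in zip(X_k, Y_k)]
    return [sum(products[m - M + 1:m + 1]) for m in range(M - 1, len(products))]


def frame_angles(S_k, k):
    """Equations 8.1 to 8.3: rotation between consecutive summations."""
    return [
        cmath.phase(S_k[j] * S_k[j - 1].conjugate()) / (2 * math.pi * k)
        for j in range(1, len(S_k))
    ]


# Synthetic data: the input rotates by theta radians per frame relative
# to the reference at tone index k = 4.
k, theta = 4, 0.01
X_k = [1 + 0j] * 6
Y_k = [cmath.exp(-1j * theta * r) for r in range(6)]
S_k = correlation_sums(X_k, Y_k, M=2)
A_k = frame_angles(S_k, k)
```

For this synthetic input, every entry of `A_k` is theta divided by 2πk, reflecting the constant frame-to-frame rotation.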
After determining the angles A(k), the AEC 102 may remove (136) angles above a threshold. As the rate of rotation is relatively constant between adjacent frame indices, the angles should be within a range. Therefore, the AEC 102 may remove angles that exceed the range using the threshold (e.g., 40-100 ppm) to improve an estimate of the frequency offset.
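The removal of out-of-range angles may be sketched as a simple filter. The sketch assumes that the comparison is made on the magnitude of each angle and that the threshold is expressed in the same normalized units as the angles (for example, 100e-6 for 100 ppm); both are assumptions, and the function name `remove_large_angles` is hypothetical.

```python
def remove_large_angles(angles, threshold):
    """Keep only angles whose magnitude does not exceed the threshold."""
    return [a for a in angles if abs(a) <= threshold]


# Illustrative values: the third entry exceeds a 100 ppm threshold.
kept = remove_large_angles([10e-6, -20e-6, 500e-6], threshold=100e-6)
```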
The AEC 102 may determine (138) individual frequency offsets PPM(k) for each tone index k within a frequency range (K1 to K2) (e.g., 1 kHz to 4 kHz). For example, the AEC 102 may use linear regression and equation (9):
PPM(k)=b0/(2*pi*k0) [9]
After determining the individual frequency offsets PPM(k) for each tone index k, the AEC 102 may determine (140) an overall frequency offset PPM. For example, the AEC 102 may apply linear regression to the PPM(k) data set to determine the overall frequency offset PPM within the tone index range of K1 to K2 (e.g., 1 kHz to 4 kHz). The AEC 102 may compress/add/drop (142) samples to eliminate the frequency offset. For example, the AEC 102 may compress, add or remove samples from the reference values X(k,r) 152 and/or input values Y(k,r) 154 to compensate for a difference between a sampling rate of the loudspeaker 114 and a sampling rate of the microphone 118.
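The linear regression steps described above rely on an ordinary least-squares line fit, which may be sketched as follows. The function name `linear_fit` is hypothetical. The sketch shows only the fit itself; how the fitted coefficients map to PPM(k) or to the overall PPM is governed by the description above and is not reproduced here.

```python
def linear_fit(xs, ys):
    """Least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic points lying exactly on y = 2x + 1.
slope, intercept = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The same helper could be applied first to the per-frame angles of a single tone index and then to the per-tone offsets across the tone index range K1 to K2.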
The performance of AEC is measured in ERLE (echo-return loss enhancement).
As illustrated in
For normal audio playback, such differences in frequency offset are usually imperceptible to a human being. However, the frequency offset between the crystal oscillators of the AEC system, the microphones, and the loudspeaker will create major problems for multi-channel AEC convergence (i.e., the error e does not converge to zero). Specifically, the predictive accuracy of the estimated transfer functions (e.g., ĥ1(n) and ĥ2(n)) will rapidly degrade as a predictor of the actual transfer functions (e.g., h1(n) and h2(n)).
A communications protocol-specific solution to this problem has been to embed a sinusoidal pilot signal when transmitting reference signals “x” and receiving echo signals “y.” Using a phase-locked loop (PLL) circuit, components can synchronize their clocks to the pilot signal, and/or estimate the frequency error. However, that requires that the communications protocol between components supports use of a pilot, and that each component supports clock synchronization.
Another alternative is to transmit an audible sinusoidal signal with the reference signals x. Such a solution does not require a specialized communications protocol, nor any particular support from components such as the loudspeakers and microphones. However, the audible signal will be heard by users, which might be acceptable during a startup or calibration cycle, but is undesirable during normal operations. Further, if limited to startup or calibration, any information gleaned as to frequency offsets will be static, such that the system will be unable to detect if the frequency offset changes over time (e.g., due to thermal changes within a component altering frequency of the component's clock).
Another alternative is to transmit an ultrasonic sinusoidal signal with the reference signals x at a frequency that is outside the range of frequencies that human beings can perceive. A first shortcoming of this approach is that it requires loudspeakers and microphones capable of operating at the ultrasonic frequency. Another shortcoming is that the ultrasonic signal will create a constant sound “pressure” on the microphones, potentially reducing the microphones' sensitivity in the audible parts of the spectrum.
To address these shortcomings of the conventional solutions, the acoustic echo cancellers 102a and 102b in
From the definition of the PPM error in Equation 5, if the frequency offset is “A” ppm, then in 1/A samples, one additional sample will be added. This may be performed, for example, by adding a duplicate of the last sample every 1/A samples. Hence, if the difference is 1 ppm, then one additional sample will be created in 1/1e-6=10^6 samples; if the difference is 20.833 ppm, then one additional sample will be added for every 48,000 samples; and so on. Likewise, if the frequency offset is “−A” ppm, then in 1/A samples, one sample will be dropped. This may be performed, for example, by dropping/skipping/removing the last sample every 1/A samples.
For the purposes of discussion, an example of system 100 includes “Q” loudspeakers 114 (Q>1) and a separate microphone array system (microphones 118) for hands-free near-end/far-end multichannel AEC applications. The frequency offsets for each loudspeaker and the microphone array can be characterized as df1, df2, . . . , dfQ. Existing and well known solutions for frequency offset correction for LTE (Long Term Evolution cellular telephony) and WiFi (free running oscillators) are based on Fractional Delayed Interpolator methods. Fractional delay interpolator methods provide accurate correction with additional computational cost. Accurate correction is required for high speed communication systems. However, audio applications are not high speed and a relatively simple frequency correction algorithm could be applied, such as a sample add/drop method. Hence, if playback of reference signals x1 112a (corresponding to loudspeaker 114a) is signal 1, and the frequency offset between signal 1 and the microphone output signal y1 120a is dfk, then frequency correction may be performed by dropping/adding one sample every 1/dfk samples.
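The sample add/drop correction may be sketched in Python as follows. The function names `samples_per_correction` and `add_drop` are hypothetical, and rounding the correction period to the nearest whole sample is an assumption of this sketch.

```python
def samples_per_correction(offset_ppm):
    """Number of samples between corrections: 1 / A, with A in ppm."""
    return round(1.0 / (abs(offset_ppm) * 1e-6))


def add_drop(samples, offset_ppm):
    """Duplicate (positive offset) or drop (negative offset) the last
    sample of every block of 1 / A samples."""
    if offset_ppm == 0:
        return list(samples)
    period = samples_per_correction(offset_ppm)
    out = []
    for count, sample in enumerate(samples, start=1):
        if count % period == 0:
            if offset_ppm > 0:
                out.extend([sample, sample])  # add a duplicate
            # negative offset: skip the sample entirely
        else:
            out.append(sample)
    return out


# An exaggerated offset (250,000 ppm, i.e. one correction every four
# samples) keeps the demonstration short.
stretched = add_drop(list(range(8)), 250000.0)
shortened = add_drop(list(range(8)), -250000.0)
```

With the exaggerated offset, `stretched` gains one sample per block of four and `shortened` loses one; at a realistic 1 ppm offset, the period is one million samples.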
The acoustic echo canceller(s) 102 uses short time Fourier transform-based frequency-domain multi-tap acoustic echo cancellation (STFT AEC) to estimate frequency offset. The following high level description of STFT AEC refers to echo signal y (120) which is a time-domain signal comprising an echo from at least one loudspeaker (114) and is the output of a microphone 118. The reference signal x (112) is a time-domain audio signal that is sent to and output by a loudspeaker (114). The variables X and Y correspond to a Short Time Fourier Transform of x and y respectively, and thus represent frequency-domain signals. A short-time Fourier transform (STFT) is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time.
Using a Fourier transform, a sound wave such as music or human speech can be broken down into its component “tones” of different frequencies, each tone represented by a sine wave of a different amplitude and phase. Whereas a time-domain sound wave (e.g., a sinusoid) would ordinarily be represented by the amplitude of the wave over time, a frequency domain representation of that same waveform comprises a plurality of discrete amplitude values, where each amplitude value is for a different tone or “bin.” So, for example, if the sound wave consisted solely of a pure sinusoidal 1 kHz tone, then the frequency domain representation would consist of a discrete amplitude spike in the bin containing 1 kHz, with the other bins at zero. In other words, each tone “k” is a frequency index. The response of a Fourier-transformed system, as a function of frequency, can also be described by a complex function.
In addition, the AEC 102 may determine the frequency offset using only a portion of the overall FFT (corresponding to a portion of the time-domain signal). For example,
If the STFT is an “N” point Fast Fourier Transform (FFT), then the frequency-domain variables would be X(k,r) and Y(k,r), where the tone “k” is 0 to N−1 and “r” is a frame index. The STFT AEC uses a “multi-tap” process. That means for each tone “k” there are M taps, where each tap corresponds to a sample of the signal at a different time. Each tone “k” is a frequency point produced by the transform from time domain to frequency domain, and the history of the values across iterations is provided by the frame index “r.” The STFT taps would be W(k,m), where k is 0 to N−1 and m is 0 to M−1. The tap parameter M is defined based on the tail length of the AEC. The “tail length,” in the context of AEC, is a parameter that is a delay offset estimation. For example, if the STFT processes tones in 8 ms samples and the tail length is defined to be 240 ms, then M=240/8, which would correspond to M=30.
Given a signal z[n], the STFT Z(k,r) of z[n] is defined by
$$Z(k,r)=\sum_{n=0}^{N-1}\mathrm{Win}(n)\,z(n+rR)\,e^{-j2\pi kn/N} \qquad [10.1]$$
where Win(n) is a window function for analysis, k is a frequency index, r is a frame index, R is a frame step, and N is an FFT size. Hence, for each block (at frame index r) of N samples, the STFT is performed, which produces N complex tones Z(k,r) corresponding to frequency index k and frame index r.
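Equation 10.1 can be evaluated directly for one frame, as in the sketch below. It assumes the standard complex DFT kernel exp(−j·2π·k·n/N) and uses a direct O(N²) sum for readability where a real system would use an FFT. The helper names `hann` and `stft_frame` are hypothetical, and the Hann window is only one possible choice for Win(n), since the description does not name a specific window.

```python
import cmath
import math


def hann(N):
    """One possible analysis window Win(n); the choice of window is an assumption."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]


def stft_frame(z, r, R, N, win):
    """Return [Z(0, r), ..., Z(N-1, r)] for the block of N samples starting at r*R."""
    block = [win[n] * z[n + r * R] for n in range(N)]
    return [
        sum(block[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]
```

Calling `stft_frame(x, r, R, N, win)` and `stft_frame(y, r, R, N, win)` with the same parameters yields the per-frame tones X(k,r) and Y(k,r) used in the rest of the description.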
Referring to the Acoustic Echo Cancellation using STFT operations in
$$Y(k,r)=\sum_{n=0}^{N-1}\mathrm{Win}(n)\,y(n+rR)\,e^{-j2\pi kn/N} \qquad [10.2]$$
The reference signal x(n) 112 to the loudspeaker 114 has a frequency domain STFT representation:
$$X(k,r)=\sum_{n=0}^{N-1}\mathrm{Win}(n)\,x(n+rR)\,e^{-j2\pi kn/N} \qquad [10.3]$$
As noted above, each tone “k” can be represented by a sine wave of a different amplitude and phase, such that each tone may be represented as a complex number. A complex number is a number that can be expressed in the form a+bj, where a and b are real numbers and j is the imaginary unit, which satisfies the equation j^2=−1. A complex number whose real part is zero is said to be purely imaginary, whereas a complex number whose imaginary part is zero is a real number. For a sine wave of a given frequency, the magnitude of the complex number corresponds to the amplitude of the wave, while the angle (argument) of the complex number corresponds to the phase. In addition, the complex conjugate of a complex number is the number with an equal real part and an imaginary part equal in magnitude but opposite in sign. For example, the complex conjugate of 3+4j is 3−4j.
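A short illustration of these properties using Python's standard `cmath` module follows. The variable names are illustrative only. The last lines show why the conjugate is useful here: multiplying one tone by the conjugate of another leaves the phase difference between them.

```python
import cmath

tone = 3 + 4j
amplitude = abs(tone)            # magnitude of the complex number: 5.0
phase = cmath.phase(tone)        # angle in radians, atan2(4, 3)
conjugate = tone.conjugate()     # (3-4j)

# The product of a tone and the conjugate of another tone has a phase equal
# to the difference of the two phases.
x = cmath.rect(1.0, 0.9)         # amplitude 1.0, phase 0.9 rad
y = cmath.rect(2.0, 0.4)         # amplitude 2.0, phase 0.4 rad
phase_difference = cmath.phase(x * y.conjugate())   # approximately 0.5 rad
```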
As mentioned above, in order to determine a frequency offset between the loudspeaker 114 and the microphone 118, the AEC 102 may determine a propagation delay and generate an aligned input y′(n) 150 from the input y(n) 120.
To determine the propagation delay, the AEC 102 may determine a coherence between individual index frames in x(n) 112 and y(n) 120. Coherence means that a frame (xi) in x(n) 112 corresponds to a frame (yj) in y(n) 120, and the propagation delay (D) is determined based on the difference between the two (e.g., D=j−i). Thus, the AEC 102 may determine that xi (e.g., x1) corresponds to yj (e.g., y7) and may determine the propagation delay accordingly (e.g., D=7−1=6 frames).
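One way to search for the frame delay D is sketched below. The description does not spell out the coherence measure, so this sketch substitutes a plain normalized time-domain correlation between frames as a stand-in. The function names `frame_correlation` and `estimate_delay_frames` and the `max_delay` search bound are hypothetical.

```python
import math


def frame_correlation(xf, yf):
    """Normalized correlation of two equal-length frames (0.0 if either is silent)."""
    num = sum(a * b for a, b in zip(xf, yf))
    den = math.sqrt(sum(a * a for a in xf) * sum(b * b for b in yf))
    return num / den if den > 0 else 0.0


def estimate_delay_frames(x_frames, y_frames, max_delay):
    """Return the D in [0, max_delay] for which x_frames[i] best matches y_frames[i + D]."""
    best_d, best_score = 0, float("-inf")
    for d in range(max_delay + 1):
        pairs = list(zip(x_frames, y_frames[d:]))
        if not pairs:
            break  # y has no frames left at this delay
        score = sum(frame_correlation(xf, yf) for xf, yf in pairs) / len(pairs)
        if score > best_score:
            best_d, best_score = d, score
    return best_d
```

With D in hand, the aligned input corresponds to `y_frames[D:]`, so that frame i of the reference lines up with frame i of the aligned input.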
Using the propagation delay, the AEC 102 may shift y(n) 120 by D frames (e.g., 6 frames), illustrated in
After the propagation offset is removed and x(n) 112 is aligned with y′(n) 150, the AEC 102 may generate a Fourier transform of x(n) 112 to generate X(k,r) 152 and may generate a Fourier transform of y′(n) 150 to generate Y(k,r) 154. Therefore, the propagation delay (D) is accounted for, X(k,r) 152 extends from X1 to XU, and Y(k,r) 154 extends from Y1 to YU. Thus, X1 corresponds to Y1, X2 corresponds to Y2, and so on.
To provide clarity for subsequent equations and explanations,
As the representation of each tone k is a complex value, each entry in the matrices X(k,m) and Y(k,m) may likewise be a complex number.
If there is no frequency offset between the microphone echo signal y(n) 120 and the loudspeaker reference signal x(n) 112, then X(k,m) will have a zero mean phase rotation relative to Y(k,m) (e.g., equal in amplitude and phase). In the alternative, if there is a frequency offset (equal to A PPM) between y(n) 120 and x(n) 112, then the frequency offset will create a continuous delay (i.e., will result in the adding/dropping of samples in the time domain). Such a delay will correspond to a phase “rotation” in the frequency domain (e.g., equal in amplitude, different in phase). For example, the frequency offset may result in a rotation in the frequency domain between X(k,m) and Y(k,m) for an index value m. If the frequency offset is positive, the rotation will be clockwise. If the frequency offset is negative, the rotation will be counterclockwise. The rotation may be determined by taking a correlation matrix between X(k,m) and Y(k,m) for a series of frames and comparing the correlation matrices between frames. The speed of the rotation of the angle from frame to frame corresponds to the size of the offset, with a larger offset producing a faster rotation than a smaller offset.
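As a rough sketch of why the rotation speed tracks the offset, assume the standard DFT shift property and a small fractional offset δ (a symbol introduced here only for illustration, e.g., δ = 10⁻⁶ for 1 PPM). After r frames with frame step R, the accumulated delay is approximately δ·r·R samples, so the relative phase at tone k is approximately

$$\phi(k,r)\approx-\frac{2\pi k}{N}\,\delta\,r\,R$$

where the sign depends on which signal is taken as the reference. Under these assumptions the phase grows linearly with the frame index r, with a slope proportional to both the tone index k and the offset δ. This is consistent with normalizing the measured angle by 2πk and with using a linear fit across frames to recover the offset.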
To determine the frequency offset and corresponding rotation 622, the AEC 102 may determine a rotation between a first correlation matrix and a second correlation matrix. For example,
The AEC 102 may select (726) a frame index (m) and may determine (728) an angle αm for the frame index (m) using Equations 8.2-8.3. The AEC 102 may determine (730) if the frame index (m) is equal to a maximum frame index (Q) and if not, may increment (732) the frame index (m) and repeat step 728. If the frame index (m) is equal to a maximum frame index (Q), the AEC 102 may determine (734) a set of angles A(k) using Equation 8.1. The AEC 102 may determine (736) if the tone index (k) is equal to a maximum tone index (K2) and if not, may increment (738) the tone index (k) and repeat steps 716-736. If the tone index (k) is equal to a maximum tone index (K2), the process may end. Thus, the AEC 102 may determine a set of angles A(k) using a series of Q frames for each tone index (k) between K1 and K2 (e.g., 1 kHz and 4 kHz).
$$S_m(k)=\sum_{m=1}^{M}X_m(k)\,\mathrm{conj}(Y_m(k)) \qquad [6]$$
where m is a current frame index, M is a number of previous frame indices, Xm(k) corresponds to X(k,r) 152 and Ym(k) corresponds to Y(k,r) 154. As illustrated in
$$A(k)=[\alpha_1\ \alpha_2\ \ldots\ \alpha_{Q-1}] \qquad [8.1]$$
Where,
$$\alpha_j=\frac{\mathrm{angle}(P(k))}{2\pi k} \qquad [8.2]$$
and
$$P(k)=S_{m+j}(k)\,\mathrm{conj}(S_{m+j-i}(k)) \qquad [8.3]$$
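Equations 6 and 8.1 through 8.3 can be sketched for a single tone index k as follows. Two readings are assumptions rather than statements of the disclosed method: Equation 6 is taken as a sliding sum over the M most recent frames, and the index i in Equation 8.3 is exposed as a `lag` parameter (defaulting to 1), since its value is not defined in this passage. The function names `sliding_sums` and `angles` are hypothetical.

```python
import cmath
import math


def sliding_sums(X, Y, M):
    """S values for one tone: sums of X[r]*conj(Y[r]) over windows of M consecutive frames."""
    prods = [x * y.conjugate() for x, y in zip(X, Y)]
    return [sum(prods[m - M + 1 : m + 1]) for m in range(M - 1, len(prods))]


def angles(S, k, lag=1):
    """A(k): angle of S[j]*conj(S[j - lag]) normalized by 2*pi*k, for each valid j."""
    if k == 0:
        raise ValueError("tone index k must be nonzero for the 2*pi*k normalization")
    return [
        cmath.phase(S[j] * S[j - lag].conjugate()) / (2 * math.pi * k)
        for j in range(lag, len(S))
    ]
```

Here `X` and `Y` are lists of complex values X(k,r) and Y(k,r) for a fixed k and consecutive frame indices r. Restricting k to the band between K1 and K2 avoids the k = 0 case.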
As illustrated in
The AEC 102 may determine (1016) if the tone index (k) corresponds to an ending (e.g., K2) of the desired range and if not, may increment (1018) the tone index (k) and repeat step 1012. If the tone index (k) corresponds to the ending (e.g., K2), the AEC 102 may determine (1020) an overall frequency offset (PPM) value using linear regression and the individual frequency offsets (PPM(k)). The AEC 102 may then correct (1022) a sampling frequency of an input using the overall frequency offset (PPM) value.
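The linear regression step can be illustrated with an ordinary least-squares line fit, sketched below. The helper name `linear_fit` is hypothetical, and the sketch only shows the fit itself: the choice of regressors for the overall PPM value and the scaling from slope to PPM are not given in this passage and are left to the caller.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys against xs; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

For a single tone, `xs` could be the frame indices and `ys` the corresponding angles, so that the slope reflects how quickly the angle rotates from frame to frame.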
For example, the AEC 102 may compress, add or remove samples from the reference values X(k,r) 152 and/or input values Y(k,r) 154 to compensate for a difference between a sampling rate of the loudspeaker 114 and a sampling rate of the microphone 118. The value of the frequency offset is used to determine how many samples to add or subtract from the reference signals x(n) 112 and/or input signals y(n) 120 input into the AEC 102. If the PPM value is positive, samples are added (i.e., repeated) to x(n) 112/y(n) 120. If the PPM value is negative, samples are dropped from x(n) 112/y(n) 120. For example, if the frequency offset indicates that there is a difference of 1 ppm between the reference signal x(n) 112 and the input signal y(n) 120, the AEC 102 may add or drop one sample for every million samples to correct the offset. The AEC 102 may add/drop samples from the reference signal x(n) 112 or the input signal y(n) 120 depending on a system configuration. For example, if the AEC 102 receives a single reference signal and a single input signal, the AEC 102 may add/drop samples from the signal having a higher frequency, as the higher frequency will be able to add/drop samples more quickly to align the signals. However, if the AEC 102 receives a single reference signal and ten input signals, the AEC 102 may add/drop samples from the reference signal regardless of frequency if the ten input signals have the same frequency offset. In some examples, the AEC 102 may add/drop samples from the ten input signals individually if the frequency offsets differ between the input signals.
Adding and/or dropping samples may be performed, among other ways, by storing the reference signal x(n) 112 received by the AEC 102 in a circular buffer (e.g., 162a, 162b), and then by modifying read and write pointers for the buffer, skipping or adding samples. In a system including multiple microphones 118, each with a corresponding AEC 102, the AEC 102 may share circular buffer(s) 162 to store the reference signals x(n) 112, but each AEC 102 may independently set its own pointers so that the number of samples skipped or added is specific to that AEC 102.
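The shared-buffer arrangement can be sketched as one ring buffer with a separate read pointer per AEC. This is an illustrative sketch only: the class names `SharedRing` and `Reader` and their methods are hypothetical, and for brevity the sketch does not detect a reader that falls more than one buffer length behind the writer.

```python
class SharedRing:
    """Circular buffer holding the reference samples; written once, read by many."""

    def __init__(self, capacity):
        self.buf = [0.0] * capacity
        self.write = 0  # total number of samples written so far

    def push(self, sample):
        self.buf[self.write % len(self.buf)] = sample
        self.write += 1


class Reader:
    """Per-AEC read pointer into a shared ring."""

    def __init__(self, ring):
        self.ring = ring
        self.read = 0  # index of the next sample this reader will return

    def pop(self):
        if self.read >= self.ring.write:
            raise IndexError("no unread samples")
        sample = self.ring.buf[self.read % len(self.ring.buf)]
        self.read += 1
        return sample

    def drop_one(self):
        """Skip the next sample by advancing the read pointer."""
        if self.read < self.ring.write:
            self.read += 1

    def repeat_one(self):
        """Add a sample by stepping back so the previous sample is read again."""
        if self.read > 0:
            self.read -= 1
```

Because each `Reader` keeps its own pointer, one AEC can drop a sample while another repeats one, without copying the stored reference signal.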
As an additional feature, AEC systems generally do not handle large signal propagation delays “D” well between the reference signals x(n) 112 and the echo signals y(n) 120. While the PPM for a system may change over time (e.g., due to thermal changes, etc.), the propagation delay time D remains relatively constant. The STFT AEC “taps” as described above may be used to accurately measure the propagation delay time D for each channel, which may then be used to set the delay provided by each of the buffers 162.
For example, assume that the microphone echo signal y(n) 120 and reference signal x(n) 112 are not properly aligned. Then, there would be a constant delay D (in samples) between the transmitted reference signals x(n) 112 and the received echo signals y(n) 120. This delay in the time domain creates a rotation in frequency domain.
If x(t) is the time domain signal and X(f) is the corresponding Fourier transform of x(t), then the Fourier transform of x(t−D) would be X(f)*exp(−j*2*pi*f*D).
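The discrete analogue of this shift property can be checked numerically: circularly delaying an N-sample block by D samples multiplies bin k of its DFT by exp(−j·2π·k·D/N). The sketch below uses a direct DFT from the standard library. The helper names `dft` and `circular_delay` and the sample values are illustrative assumptions.

```python
import cmath
import math


def dft(x):
    """Direct N-point DFT (O(N^2)); adequate for a small numerical check."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def circular_delay(x, D):
    """Delay x by D samples with wrap-around: out[n] = x[(n - D) mod N]."""
    N = len(x)
    return [x[(n - D) % N] for n in range(N)]


x = [0.0, 1.0, 0.5, -0.25, 0.75, -1.0, 0.3, 0.1]
D = 3
N = len(x)
X = dft(x)
X_delayed = dft(circular_delay(x, D))

# Largest deviation between the delayed spectrum and the rotated original spectrum.
max_error = max(
    abs(X_delayed[k] - X[k] * cmath.exp(-2j * math.pi * k * D / N))
    for k in range(N)
)
```

The magnitudes of the bins are unchanged by the delay; only their angles rotate, by an amount proportional to both k and D.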
If the echo cancellation algorithm is designed with a long tail length (i.e., the number of taps of the AEC finite impulse response (FIR) filter is long enough), then the AEC will converge with the initial D taps close to zero. Put simply, the AEC will lose the first D taps. If D is large (e.g., D could be 100 ms or larger), then the impact on AEC performance will be large. Hence, the delay D should be measured and compensated.
The system 100 may include one or more audio capture device(s), such as a microphone or an array of microphones 118. The audio capture device(s) may be integrated into the device 1501 or may be separate.
The system 100 may also include an audio output device for producing sound, such as speaker(s) 116. The audio output device may be integrated into the device 1501 or may be separate.
The device 1501 may include an address/data bus 1524 for conveying data among components of the device 1501. Each component within the device 1501 may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 1524.
The device 1501 may include one or more controllers/processors 1504, that may each include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory 1506 for storing data and instructions. The memory 1506 may include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive (MRAM) and/or other types of memory. The device 1501 may also include a data storage component 1508, for storing data and controller/processor-executable instructions (e.g., instructions to perform the algorithms illustrated in
Computer instructions for operating the device 1501 and its various components may be executed by the controller(s)/processor(s) 1504, using the memory 1506 as temporary “working” storage at runtime. The computer instructions may be stored in a non-transitory manner in non-volatile memory 1506, storage 1508, or an external device. Alternatively, some or all of the executable instructions may be embedded in hardware or firmware in addition to or instead of software.
The device 1501 includes input/output device interfaces 1502. A variety of components may be connected through the input/output device interfaces 1502, such as the speaker(s) 116, the microphones 118, and a media source such as a digital media player (not illustrated). The input/output interfaces 1502 may include A/D converters 119 for converting the output of microphone 118 into signals y 120, if the microphones 118 are integrated with or hardwired directly to device 1501. If the microphones 118 are independent, the A/D converters 119 will be included with the microphones, and may be clocked independent of the clocking of the device 1501. Likewise, the input/output interfaces 1502 may include D/A converters 115 for converting the reference signals x 112 into an analog current to drive the speakers 114, if the speakers 114 are integrated with or hardwired to the device 1501. However, if the speakers are independent, the D/A converters 115 will be included with the speakers, and may be clocked independent of the clocking of the device 1501 (e.g., conventional Bluetooth speakers).
The input/output device interfaces 1502 may also include an interface for an external peripheral device connection such as universal serial bus (USB), FireWire, Thunderbolt or other connection protocol. The input/output device interfaces 1502 may also include a connection to one or more networks 1599 via an Ethernet port, a wireless local area network (WLAN) (such as WiFi) radio, Bluetooth, and/or wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, etc. Through the network 1599, the system 100 may be distributed across a networked environment.
The device 1501 further includes an STFT module 1530 that includes the individual AEC 102, where there is an AEC 102 for each microphone 118.
Multiple devices 1501 may be employed in a single system 100. In such a multi-device system, each of the devices 1501 may include different components for performing different aspects of the STFT AEC process. The multiple devices may include overlapping components. The components of device 1501 as illustrated in
The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, multimedia set-top boxes, televisions, stereos, radios, server-client computing systems, telephone computing systems, laptop computers, cellular phones, personal digital assistants (PDAs), tablet computers, wearable computing devices (watches, glasses, etc.), other mobile devices, etc.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of digital signal processing and echo cancellation should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art, that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk and/or other media. Some or all of the STFT AEC module 1530 may be implemented by a digital signal processor (DSP).
As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.
Hilmes, Philip Ryan, Ayrapetian, Robert
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Dec 02 2015 | AYRAPETIAN, ROBERT | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 037191 | /0564 | |
Dec 02 2015 | HILMES, PHILIP RYAN | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 037191 | /0564 | |
Dec 02 2015 | Amazon Technologies, Inc. | (assignment on the face of the patent) | / |