An audio signal is processed in the frequency domain to convert an input signal format to an output signal format. That is, a multichannel audio signal intended for playback over a predefined speaker layout can be formatted to achieve spatial reproduction over a different layout comprising a different number of speakers.
6. A method of upmixing or downmixing an input signal to an output signal format, the method comprising:
converting the input signal to an intermediate signal having the same number of channels as the output signal format;
spatially analyzing the input signal to identify spatial cues that are independent of the input signal format wherein the spatial analyzing localizes a sound event by determining a first associated parameter that describes the event's sound in the range from an omnidirectional source to a point-source and a second parameter that describes an angular position for the sound event; and
processing those spatial cues to generate an output signal reflecting the spatial cues.
1. A method for multichannel surround format conversion of an audio recording from an input signal format to an output signal format, comprising:
converting an input signal to one of a frequency-domain or subband representation comprising a plurality of time-frequency tiles;
deriving a direction for each time-frequency tile in the plurality; and
for each time-frequency tile, deriving a scaling factor for each output channel of the output signal format, according to the direction; wherein the input signal is a multichannel signal and is downmixed to a single-channel intermediate signal and wherein each output signal channel is obtained by receiving the intermediate signal and applying the scaling factor for the respective output channel for each time-frequency tile.
3. A method for multichannel surround format conversion of an audio recording from an input signal format to an output signal format, comprising:
converting an input signal to one of a frequency-domain or subband representation comprising a plurality of time-frequency tiles;
deriving a direction for each time-frequency tile in the plurality;
for each time-frequency tile, deriving a scaling factor for each output channel of the output signal format, according to the direction; and performing a passive format conversion wherein each output signal channel in the output signal is derived by linear combination of the input signal channels nearest to it in the layouts corresponding to the respective input and output signal formats and applying the scaling factor for the respective output signal channel for each time-frequency tile.
14. An audio format conversion system configured for multichannel surround format conversion of an audio recording from an input signal format to an output signal format, the processor comprising:
an input port for receiving an input audio signal;
a frequency domain converter for converting an input signal to one of a frequency-domain or subband representation comprising a plurality of time-frequency tiles; and
a processor configured for deriving a direction for each time-frequency tile in the plurality; for each time-frequency tile, deriving a scaling factor for each output channel of the output signal format, according to the direction; and performing a passive format conversion wherein each output signal channel in the output signal is derived by linear combination of the input signal channels nearest to it in the layouts corresponding to the respective input and output signal formats and applying the scaling factor for the respective output signal channel for each time-frequency tile.
2. The method as recited in
4. The method as recited in
5. The method as recited in
7. The method as recited in
8. The method as recited in
9. The method as recited in
10. The method as recited in
11. The method as recited in
12. The method as recited in
13. The method as recited in
This application is a continuation-in-part of U.S. patent application Ser. No. 11/750,300, entitled Spatial Audio Coding Based on Universal Spatial Cues and filed on May 17, 2007, which claims priority to and the benefit of the disclosure of U.S. Provisional Patent Application Ser. No. 60/747,532, filed on May 17, 2006, and entitled Spatial Audio Coding Based on Universal Spatial Cues, the specifications of which are incorporated herein by reference in their entirety. Further, this application claims priority to and the benefit of the disclosure of U.S. Provisional Patent Application Ser. No. 60/894,622, filed on Mar. 13, 2007, and entitled Multichannel Surround Format Conversion and Generalized Upmix, which is incorporated herein by reference in its entirety.
1. Field of the Invention
The present invention relates to signal processing techniques. More particularly, the present invention relates to methods for processing audio signals based on spatial audio cues.
2. Description of the Related Art
A common limitation of existing time-domain approaches to multichannel audio format conversion is that the reproduction causes spatial spreading or “leakage” of a given directional sound event into loudspeakers other than those nearest the due direction of the event. This affects the perceived “sharpness” of the spatial image of the sound event and the robustness of the spatial image with respect to listener position.
What is desired is an improved format conversion technique.
Provided is a frequency-domain method for format conversion of a multichannel audio signal, intended for playback over a pre-defined loudspeaker layout, in order to achieve accurate spatial reproduction over a different layout potentially comprising a different number of loudspeakers.
In accordance with one embodiment, a format conversion method for multichannel surround sound such as contained in an audio recording is provided. In order to convert from the input format to an output format, an initial operation involves converting the signals to a frequency-domain or subband representation. For each time and frequency in the time-frequency signal representation, a spatial localization vector is derived by a spatial analysis algorithm. Further, for each time and frequency, a scaling factor associated with each output channel is determined, according to the derived localization. In one embodiment, the scaling factor is applied to a single-channel downmix of the input signals to derive the output channel signals. In another embodiment, the scaling factor is applied to output channel signals derived by an initial format conversion so as to improve the spatial fidelity of the initial conversion.
These and other features and advantages of the present invention are described below with reference to the drawings.
Reference will now be made in detail to preferred embodiments of the invention. Examples of the preferred embodiments are illustrated in the accompanying drawings. While the invention will be described in conjunction with these preferred embodiments, it will be understood that it is not intended to limit the invention to such preferred embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well known mechanisms have not been described in detail in order not to unnecessarily obscure the present invention.
It should be noted herein that throughout the various drawings like numerals refer to like parts. The various drawings illustrated and described herein are used to illustrate various features of the invention. To the extent that a particular feature is illustrated in one drawing and not another, except where otherwise indicated or where the structure inherently prohibits incorporation of the feature, it is to be understood that those features may be adapted to be included in the embodiments represented in the other figures, as if they were fully illustrated in those figures. Unless otherwise indicated, the drawings are not necessarily to scale. Any dimensions provided on the drawings are not intended to be limiting as to the scope of the invention but merely illustrative.
In accordance with several embodiments, provided is a frequency-domain method for format conversion of a multichannel audio signal intended for playback over a pre-defined loudspeaker layout, in order to achieve accurate spatial reproduction over a different layout potentially comprising a different number of loudspeakers. Embodiments of the present invention overcome spatial spreading or leakage limitations by using the frequency-domain spatial analysis/synthesis techniques described in pending U.S. patent application Ser. No. 11/750,300. This specification incorporates by reference in its entirety the disclosure of U.S. patent application Ser. No. 11/750,300, filed on May 17, 2007, and entitled Spatial Audio Coding Based on Universal Spatial Cues. In one embodiment of the present invention, the single-channel (or “mono”) downmix step included in the spatial audio coding scheme is incorporated in the format conversion system. In another and preferred embodiment of the present invention, an alternative to the mono downmix step included in the spatial audio coding scheme described generally in U.S. patent application Ser. No. 11/750,300 is provided. This alternative, a general “passive upmix” technique, reduces or avoids signal leakage across channels.
The current invention overcomes the spatial limitations of prior methods by incorporating a spatial analysis process.
In
The operation of the format conversion system 200 in
Input and Output Formats
The angle $\theta_n$ is defined to be within the range $[-180^\circ, 180^\circ]$ and is measured clockwise from the vertical axis such that channel position 401 corresponds to a positive angle and channel position 409 to a negative angle. An entire N-channel format or reproduction layout can thus be described equivalently as a set of angles $\{\theta_1, \theta_2, \theta_3, \ldots, \theta_N\}$, a set of format vectors $\{\vec{p}_1, \vec{p}_2, \vec{p}_3, \ldots, \vec{p}_N\}$, or as a “format matrix” whose columns are the format vectors:
$$P = [\vec{p}_1 \;\; \vec{p}_2 \;\; \vec{p}_3 \;\; \cdots \;\; \vec{p}_N].$$
Those skilled in the art will recognize that although for the purposes of illustration and specification the formats are depicted as two-dimensional (planar) and the format vectors are analogously comprised of two dimensions, the channel format vector description and the full current invention can be extended to three-dimensional layouts without limitation. In one non-limiting example, an embodiment of the invention applicable to a three-dimensional layout is achieved by adding an elevation angle for each channel and adding a third dimension to the format vectors.
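By way of illustration only, the planar format description above might be sketched in Python as follows. The coordinate convention (x toward the listener's right, y toward the front), the function names, and the example five-channel angles are assumptions made for this sketch and are not taken from the disclosure.

```python
import math

def format_vector(angle_deg):
    """Unit vector for a channel angle measured clockwise from the front axis.

    Assumption: x points to the listener's right and y points to the front,
    so a positive angle lands on the right-hand side.
    """
    a = math.radians(angle_deg)
    return (math.sin(a), math.cos(a))

def format_matrix(angles_deg):
    """2 x N format matrix P whose columns are the channel format vectors."""
    vectors = [format_vector(a) for a in angles_deg]
    return [[v[0] for v in vectors], [v[1] for v in vectors]]

# Illustrative five-channel layout; these angles are hypothetical.
LAYOUT_5 = [-110.0, -30.0, 0.0, 30.0, 110.0]
P = format_matrix(LAYOUT_5)
```

A three-dimensional variant would, under the same assumptions, add an elevation angle per channel and a third row to P.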
Passive Upmix
This section describes the implementation of passive format conversion or “passive upmix” in accordance with several embodiments of the present invention. Several methods suitable for use in block 203 of
At each time t, the input sample vector (of length M) is converted to an output sample vector (of length N) by matrix multiplication. This format conversion is referred to as “passive” in that the coefficients cnm of the conversion matrix C depend only on the input and output formats and not on the content of the input signals. Those of skill in the art will recognize that passive format conversion by matrix multiplication could be carried out on time-domain signals as shown in the above equation, on frequency-domain signals, or on other signal representations and still be in keeping with the scope of the present invention.
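As a non-limiting sketch, the matrix multiplication described above can be written in a few lines of Python. The function name and the 3 x 2 example matrix are hypothetical; the only property carried over from the text is that the coefficients are fixed by the formats and do not depend on the signal content.

```python
def passive_convert(C, x):
    """Apply an N x M conversion matrix C to one length-M input sample vector x."""
    return [sum(c_nm * x_m for c_nm, x_m in zip(row, x)) for row in C]

# Hypothetical 3 x 2 matrix: left, phantom center, right from a stereo pair.
C = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
y = passive_convert(C, [0.2, -0.4])
```

The same function could be applied per time-frequency tile instead of per time-domain sample, since the operation is identical in either representation.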
In one embodiment, the coefficients cnm of the conversion matrix are all selected to be equal. With this choice, the output signals of the passive format conversion are all identical. This choice corresponds to providing a single-channel downmix of the input signals to each of the output channels. In a preferred embodiment, the downmix signal is energy-normalized such that its energy is equal to the total energy in the input signals as taught in U.S. patent application Ser. No. 11/750,300. Energy normalization is preferred in that it compensates for potential cancellation of out-of-phase components in the downmix signal. In one embodiment of the invention, as taught in U.S. patent application Ser. No. 11/750,300, an energy-normalized downmix signal is computed as the sum of the input signals multiplied by a factor equal to the square root of the sum of the energies of the input signals divided by the square root of the energy of their sum.
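The energy normalization described above can be sketched for a single time-frequency tile as follows. This is an illustration under stated assumptions, not the implementation of the referenced application: the function name is hypothetical, and the small constant `eps` is an added guard against division by zero that the text does not mention.

```python
import math

def energy_normalized_downmix(channels, eps=1e-12):
    """Mono downmix of one tile, rescaled to carry the total input energy.

    channels: complex (or real) values, one per input channel.
    eps: hypothetical guard for the case where the channels cancel in the sum.
    """
    total = sum(channels)
    energy_of_inputs = sum(abs(x) ** 2 for x in channels)
    energy_of_sum = abs(total) ** 2
    # Factor from the text: sqrt(sum of energies) / sqrt(energy of the sum).
    gain = math.sqrt(energy_of_inputs) / math.sqrt(energy_of_sum + eps)
    return gain * total

tile = [1.0 + 0.0j, 0.5j, -0.25 + 0.0j]
downmix = energy_normalized_downmix(tile)
```

In this example the partially out-of-phase third channel reduces the energy of the plain sum, and the gain restores it to the total input energy.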
In another embodiment, the coefficients cnm of the conversion matrix are selected according to the following procedure. Each input channel is considered in turn. For input channel m with channel angle φm, the procedure first identifies the output channels i and j whose channel angles ψi and ψj are the closest output channel angles on either side of the input channel angle φm. Then, pairwise-panning coefficients cim and cjm are determined for panning input channel m into output channels i and j. These coefficients are entered into the conversion matrix C in the (i,m) and (j,m) positions, respectively, and the other entries in the m-th column of C are set to zero. That is, each input channel is pairwise-panned into the nearest adjacent output channels. The pairwise panning coefficients cim and cjm are determined by an appropriate panning scheme such as vector-base amplitude panning (VBAP) or others known by those skilled in the art.
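The column-by-column procedure above might be sketched as follows. Several choices here are assumptions rather than requirements of the text: the gains are obtained by solving a two-by-two vector-base system, negative gains are clamped to zero, each pair is normalized to unit energy, and adjacent output channels are assumed to be less than 180 degrees apart. All function names are hypothetical.

```python
import math

def unit(angle_deg):
    a = math.radians(angle_deg)
    return (math.sin(a), math.cos(a))   # clockwise from the front axis

def adjacent_pair(angle_deg, layout_deg):
    """Indices (i, j) of the layout channels on either side of angle_deg."""
    order = sorted(range(len(layout_deg)), key=lambda n: layout_deg[n])
    for k, i in enumerate(order):
        j = order[(k + 1) % len(order)]
        span = (layout_deg[j] - layout_deg[i]) % 360.0
        if (angle_deg - layout_deg[i]) % 360.0 <= span:
            return i, j
    raise ValueError("no enclosing channel pair found")

def pairwise_gains(angle_deg, ang_i, ang_j):
    """Solve [q_i q_j][g_i, g_j]^T = u(angle) by Cramer's rule."""
    (xi, yi), (xj, yj), (x, y) = unit(ang_i), unit(ang_j), unit(angle_deg)
    det = xi * yj - xj * yi
    if abs(det) < 1e-9:                 # collinear pair: use the first channel
        return 1.0, 0.0
    gi = max((x * yj - xj * y) / det, 0.0)
    gj = max((xi * y - x * yi) / det, 0.0)
    norm = math.hypot(gi, gj)           # assumption: unit-energy normalization
    if norm == 0.0:
        return math.sqrt(0.5), math.sqrt(0.5)
    return gi / norm, gj / norm

def conversion_matrix(in_deg, out_deg):
    """N x M matrix C: column m pans input channel m into its two nearest outputs."""
    C = [[0.0] * len(in_deg) for _ in out_deg]
    for m, phi in enumerate(in_deg):
        i, j = adjacent_pair(phi, out_deg)
        C[i][m], C[j][m] = pairwise_gains(phi, out_deg[i], out_deg[j])
    return C

# Three front channels converted to a stereo pair (illustrative angles).
C = conversion_matrix([-30.0, 0.0, 30.0], [-30.0, 30.0])
```

In this example the center input is split equally between the two outputs, while each outer input maps to the output at the same angle.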
In a preferred embodiment, the passive format conversion matrix is configured according to the procedure depicted in
Those of skill in the art will understand that other methods of passive format conversion could be used in the present invention. The invention is not limited in this regard, and other methods of passive format conversion are within its scope. Those of skill in the art will also recognize that passive format conversion methods which provide output signals that are spatially consistent with the input signals are preferred in the current invention. Furthermore, those of skill in the art will further recognize that speaker-filling passive format conversion is preferable in the current invention to methods which leave some of the available output channels permanently silent.
Spatial Analysis
In a preferred embodiment, the spatial analysis in block 211 of
In a preferred embodiment, the sound events for which the spatial analysis determines localization vectors correspond to time-frequency components of the sound scene. In other words, at each time and frequency, the spatial analysis determines an aggregate localization of the time-frequency content of the channel signals. According to the teachings of U.S. patent application Ser. No. 11/750,300, the localization vector d is determined for each time and frequency as follows.
As a first step in the spatial analysis to determine the spatial localization vector $\vec{d}[k,l]$, the input channel format is described using unit-length format vectors $\vec{p}_m$ corresponding to each channel position as described above. A normalized weight for each channel signal is then computed. In a preferred embodiment, the normalized coefficient for channel m is determined according to
where this normalization is preferred due to energy-preserving considerations. In an alternate embodiment, the normalized coefficient for channel m is determined according to
Those skilled in the art will recognize that other methods for computing such coefficients could be incorporated. The invention is not limited in this regard. In preferred embodiments, the coefficients $\alpha_m$ are normalized such that
$$\sum_{m} \alpha_m[k,l] = 1$$
and furthermore satisfy the condition $0 \le \alpha_m \le 1$. Using the format vectors and channel weights, an initial direction vector is computed according to
$$\vec{g}[k,l] = \sum_{m} \alpha_m[k,l]\,\vec{p}_m.$$
Note that all of the terms in the above equations are functions of frequency k and time l; in the remainder of the description, the notation will be simplified by dropping the [k,l] indices on some variables that are indeed time and frequency dependent. In the remainder of the description, the sum vector $\vec{g}[k,l]$ will be referred to as the Gerzon vector, as it is known as such to those of skill in the relevant arts.
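For illustration, the weight computation and vector sum for one time-frequency tile might be sketched as below. The weight formula itself is not reproduced in this text, so the sketch assumes that each weight is the channel's share of the total energy in the tile, which satisfies the stated normalization (the weights sum to one and lie between zero and one). The function names and the stereo example are hypothetical.

```python
import math

def unit(angle_deg):
    a = math.radians(angle_deg)
    return (math.sin(a), math.cos(a))   # clockwise from the front axis

def energy_weights(tile):
    """Normalized channel weights for one tile (assumed energy ratio)."""
    energies = [abs(x) ** 2 for x in tile]
    total = sum(energies)
    if total == 0.0:
        return [1.0 / len(tile)] * len(tile)   # silent tile: arbitrary even split
    return [e / total for e in energies]

def gerzon_vector(weights, format_vectors):
    """Weighted sum of the unit format vectors."""
    gx = sum(a * p[0] for a, p in zip(weights, format_vectors))
    gy = sum(a * p[1] for a, p in zip(weights, format_vectors))
    return (gx, gy)

# Equal-level signal in a stereo pair at -30 and +30 degrees.
vectors = [unit(-30.0), unit(30.0)]
alpha = energy_weights([1.0 + 0.0j, 1.0 + 0.0j])
g = gerzon_vector(alpha, vectors)
```

In this example the vector points to the front with a magnitude equal to the cosine of 30 degrees, so its length is less than one even though the source is a discrete pairwise-panned event.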
The Gerzon vector $\vec{g}[k,l]$ formed by vector addition to yield an overall perceived spatial location for the combination of channel signals may in some cases need to be corrected. In particular, the Gerzon vector has a significant shortcoming in that its magnitude does not faithfully describe the radial location of sound events. As taught in U.S. patent application Ser. No. 11/750,300, the Gerzon vector is bounded by the inscribed polygon whose vertices correspond to the input format vector endpoints. Thus, the radial location of a sound event is generally underestimated by the Gerzon vector (except when the sound event is active in only one channel) such that rendering based on the Gerzon vector magnitude will introduce errors in the spatial reproduction.
In one embodiment of the present invention, the Gerzon vector $\vec{g}[k,l]$ is used as specified. In preferred embodiments, a modified localization vector is derived from the Gerzon vector so as to correct the radial localization error described above and thereby improve the spatial rendering. In one embodiment, an improved localization vector is derived by decomposing $\vec{g}[k,l]$ into a directional component and a non-directional component. The decomposition is based on matrix mathematics. First, note that the vector $\vec{g}[k,l]$ can be expressed as
$$\vec{g}[k,l] = P\,\vec{\alpha}[k,l]$$
where P is the input format matrix whose m-th column is the format vector $\vec{p}_m$ and where the m-th element of the column vector $\vec{\alpha}[k,l]$ is the coefficient $\alpha_m[k,l]$. Since the format matrix P is rank-deficient (when the number of channels is sufficiently large as in typical multichannel scenarios), the direction vector $\vec{g}[k,l]$ can be decomposed as
$$\vec{g}[k,l] = P\,\vec{\alpha}[k,l] = P\,\vec{\rho}[k,l] + P\,\vec{\varepsilon}[k,l]$$
where $\vec{\alpha}[k,l] = \vec{\rho}[k,l] + \vec{\varepsilon}[k,l]$ and where the vector $\vec{\varepsilon}[k,l]$ is in the null space of P, i.e. $P\,\vec{\varepsilon}[k,l] = 0$ with $\|\vec{\varepsilon}[k,l]\|_2 > 0$. Of the infinite number of possible decompositions of this form, there is a uniquely specifiable decomposition of particular value for the current application: if the coefficient vector $\vec{\rho}[k,l]$ is chosen to only have nonzero elements for the channels whose format vectors are adjacent (on either side) to the vector $\vec{g}[k,l]$, the resulting decomposition gives a pairwise-panned component with the same direction as $\vec{g}[k,l]$ and a non-directional component (whose Gerzon vector sum is zero). Denoting the channel vectors adjacent to $\vec{g}[k,l]$ as $\vec{p}_i$ and $\vec{p}_j$, we can write:
$$\vec{g}[k,l] = \rho_i\,\vec{p}_i + \rho_j\,\vec{p}_j$$
where $\rho_i$ and $\rho_j$ are the nonzero coefficients in $\vec{\rho}$, which correspond to the i-th and j-th channels. Here, we are finding the unique expansion of $\vec{g}$ in the basis defined by the adjacent channel vectors; the remainder $\vec{\varepsilon} = \vec{\alpha} - \vec{\rho}$ is in the null space of P by construction. The i-th and j-th channels identified as adjacent to $\vec{g}[k,l]$ are dependent on the frequency k and time l although this dependency is not explicitly included in the notation.
Given the decomposition into pairwise and non-directional components specified above, the norm of the pairwise coefficient vector $\vec{\rho}[k,l]$ can be used to determine a robust localization vector according to:
$$\vec{d}[k,l] = \|\vec{\rho}[k,l]\|_1 \, \frac{\vec{g}[k,l]}{\|\vec{g}[k,l]\|_2}$$
where the subscript “1” denotes the 1-norm of the vector, namely the sum of the magnitudes of the vector elements, and where the subscript “2” denotes the 2-norm of the vector, namely the square root of the sum of the squared magnitudes of the vector elements. In this formulation, the magnitude of $\vec{\rho}[k,l]$ indicates the radial sound position at frequency k and time l. Note that in the above we are assuming that the weights in $\vec{\rho}[k,l]$ are energy weights, such that $\|\vec{\rho}[k,l]\|_1 = 1$ for a discrete pairwise-panned source as in standard panning methods.
The angle and magnitude of the localization vector $\vec{d}[k,l]$ are computed for each time and frequency in the signal representation.
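A minimal sketch of the pairwise decomposition and the resulting localization vector is given below for illustration. It assumes a planar layout in which adjacent channels are less than 180 degrees apart, expands the Gerzon vector in the basis of its two adjacent format vectors, and rescales the unit direction by the 1-norm of the pairwise coefficients. The function names and the coordinate convention (x right, y front) are hypothetical.

```python
import math

def unit(angle_deg):
    a = math.radians(angle_deg)
    return (math.sin(a), math.cos(a))   # clockwise from the front axis

def adjacent_pair(angle_deg, layout_deg):
    """Indices (i, j) of the layout channels on either side of angle_deg."""
    order = sorted(range(len(layout_deg)), key=lambda n: layout_deg[n])
    for k, i in enumerate(order):
        j = order[(k + 1) % len(order)]
        span = (layout_deg[j] - layout_deg[i]) % 360.0
        if (angle_deg - layout_deg[i]) % 360.0 <= span:
            return i, j
    raise ValueError("no enclosing channel pair found")

def localization_vector(g, layout_deg):
    """Direction of g with its length replaced by the pairwise 1-norm."""
    norm_g = math.hypot(g[0], g[1])
    if norm_g < 1e-12:
        return (0.0, 0.0)               # purely non-directional tile
    theta = math.degrees(math.atan2(g[0], g[1]))
    i, j = adjacent_pair(theta, layout_deg)
    (xi, yi), (xj, yj) = unit(layout_deg[i]), unit(layout_deg[j])
    det = xi * yj - xj * yi             # nonzero for spans under 180 degrees
    rho_i = (g[0] * yj - xj * g[1]) / det
    rho_j = (xi * g[1] - g[0] * yi) / det
    radius = abs(rho_i) + abs(rho_j)    # 1-norm of the pairwise coefficients
    return (radius * g[0] / norm_g, radius * g[1] / norm_g)

# Gerzon vector of an equal-energy source in a stereo pair at -30/+30 degrees.
g = (0.0, math.cos(math.radians(30.0)))
d = localization_vector(g, [-30.0, 30.0])
```

For this pairwise-panned example the corrected vector has unit length, whereas the Gerzon vector it was derived from is shorter.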
Those skilled in the arts will recognize that alternate methods for estimating the localization of sound events could be incorporated in the current invention. Thus, the particular use of the spatial analysis taught in U.S. patent application Ser. No. 11/750,300 is not a restriction as to the scope of the current invention.
Spatial Synthesis
In a preferred embodiment, the spatial synthesis in block 215 of
As a first step in the spatial synthesis, in a preferred embodiment the signals generated by the passive upmix are normalized to all have the same energy. Those of skill in the arts will understand that this normalization can be implemented as a separate process or that the normalization scaling can be incorporated into the weights derived subsequently by the spatial synthesis; either approach is within the scope of the invention.
The spatial synthesis derives a set of weights for the output channels based on the output format and the spatial cues provided by the spatial analysis. In a preferred embodiment, the weights are derived for each time and frequency in the following manner. First, the localization vector $\vec{d}[k,l]$ is identified as comprising an angular cue $\theta[k,l]$ and a radial cue $r[k,l]$. The output channels adjacent to $\theta[k,l]$ (on either side) are identified. The corresponding channel format vectors $\vec{q}_i$ and $\vec{q}_j$, namely the unit vectors in the directions of the i-th and j-th output channels, are then used in a vector-based panning method to derive pairwise panning coefficients $\sigma_i$ and $\sigma_j$ according to
These coefficients are used to construct a panning vector $\vec{\sigma}$ which consists of all zero values except for $\sigma_i$ in the i-th position and $\sigma_j$ in the j-th position. The panning vector so constructed is then scaled such that $\|\vec{\sigma}\|_1 = 1$. The pairwise panning coefficients $\sigma_i$ and $\sigma_j$ capture the angle cue $\theta[k,l]$; they represent an on-the-circle point in the listening scenario of
To correctly render the radial position of the source as represented by the radial cue $r[k,l]$, a second panning is carried out between the pairwise weights $\vec{\sigma}$ and a non-directional set of panning weights, i.e. a set of weights which render a non-directional sound event over the given output configuration. An appropriate set of non-directional weights can be derived according to the procedure taught in U.S. patent application Ser. No. 11/750,300, which uses a Lagrange multiplier optimization to determine such a set of weights for a given (arbitrary) output format. Those of skill in the art will understand that alternate methods for deriving the set of non-directional weights may be employed in the present invention; the use of such alternate methods is within the scope of the invention. Denoting the non-directional set by $\vec{\delta}$, the overall weights resulting from a linear pan between the pairwise weights and the non-directional weights are given by
$$\vec{\beta}[k,l] = r[k,l]\,\vec{\sigma}[k,l] + (1 - r[k,l])\,\vec{\delta}$$
where it should be noted that the non-directional set $\vec{\delta}$ is not dependent on time or frequency and need only be computed at initialization or when the output format changes. This panning approach preserves the sum of the panning weights as taught in U.S. patent application Ser. No. 11/750,300. Under the assumption that these are energy panning weights, this linear panning is energy-preserving. Those of skill in the art will understand that other panning methods could be used at this stage; other panning methods, such as quadratic panning, are within the scope of the invention.
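The two-stage panning described above might be sketched as follows. The pairwise coefficients are obtained here by solving a two-by-two vector-base system for the unit vector at the angular cue and rescaling to a unit sum; that specific form is an assumption, since the corresponding equation is not reproduced in this text. The non-directional set is passed in as an argument rather than derived by the Lagrange multiplier procedure of the referenced application; the uniform set used in the example is an assumption that is only appropriate for a regular layout such as the square shown. All names are hypothetical.

```python
import math

def unit(angle_deg):
    a = math.radians(angle_deg)
    return (math.sin(a), math.cos(a))   # clockwise from the front axis

def adjacent_pair(angle_deg, layout_deg):
    """Indices (i, j) of the layout channels on either side of angle_deg."""
    order = sorted(range(len(layout_deg)), key=lambda n: layout_deg[n])
    for k, i in enumerate(order):
        j = order[(k + 1) % len(order)]
        span = (layout_deg[j] - layout_deg[i]) % 360.0
        if (angle_deg - layout_deg[i]) % 360.0 <= span:
            return i, j
    raise ValueError("no enclosing channel pair found")

def synthesis_weights(theta_deg, r, out_deg, delta):
    """Linear pan between pairwise weights (sum 1) and non-directional weights."""
    i, j = adjacent_pair(theta_deg, out_deg)
    (xi, yi), (xj, yj), (x, y) = unit(out_deg[i]), unit(out_deg[j]), unit(theta_deg)
    det = xi * yj - xj * yi             # assumes the pair spans under 180 degrees
    s_i = max((x * yj - xj * y) / det, 0.0)
    s_j = max((xi * y - x * yi) / det, 0.0)
    sigma = [0.0] * len(out_deg)
    sigma[i], sigma[j] = s_i / (s_i + s_j), s_j / (s_i + s_j)
    return [r * s + (1.0 - r) * d for s, d in zip(sigma, delta)]

SQUARE = [-135.0, -45.0, 45.0, 135.0]   # illustrative output layout
UNIFORM = [0.25] * 4                    # assumed non-directional set for it
beta = synthesis_weights(0.0, 0.5, SQUARE, UNIFORM)
```

Because both weight sets sum to one, the blended weights also sum to one for any radial cue between zero and one.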
The weights $\vec{\beta}[k,l]$ computed by the spatial synthesis procedure are then applied to the signals provided by the passive upmix to generate the final output signals to be used for rendering over the output format. The application of the weights to the channel signals is done in accordance with the channel index and the element index in the vector $\vec{\beta}[k,l]$. The i-th element of the vector $\vec{\beta}[k,l]$ determines the gain applied to the i-th output channel. In a preferred embodiment, the weights in the vector $\vec{\beta}[k,l]$ correspond to energy weights, and a square root is applied to the i-th element prior to deriving the scale factor for the i-th output channel. In one embodiment, the normalization of the intermediate channel signals is incorporated in the output scale factors as explained earlier.
In some embodiments, it may be desirable for the sake of reducing artifacts or to achieve a desired spatial effect to apply the weights determined by $\vec{\beta}[k,l]$ only partially to determine the output channel signals from the intermediate channel signals. In such embodiments, a gain is introduced which controls the degree to which the weights $\vec{\beta}[k,l]$ are applied and the degree to which the intermediate channel signals are provided directly to the output. This gain provides a cross-fade between the signals provided by the passive format conversion and those provided by a full application of the spatial synthesis weights. Those of skill in the art will understand that this cross-fade corresponds to the derivation of a new scale factor to be applied to the intermediate channel signals, where the scale factor is a weighted combination of a set of unit weights (corresponding to providing the passive upmix as the final output) and the set of weights determined by $\vec{\beta}[k,l]$ (corresponding to applying the spatial synthesis fully).
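For illustration, the scale-factor derivation and cross-fade described above can be sketched for one time-frequency tile as follows. The sketch assumes energy weights (hence the square root), a linear blend controlled by a hypothetical parameter `amount` between zero and one, and it omits the equal-energy normalization of the intermediate signals. The function names are not part of the disclosure.

```python
import math

def output_scale_factors(beta, amount=1.0):
    """Blend unit gains (passive upmix only) with sqrt(beta) gains.

    amount = 0.0 passes the intermediate signals through unchanged;
    amount = 1.0 applies the spatial synthesis weights fully.
    """
    return [(1.0 - amount) + amount * math.sqrt(b) for b in beta]

def apply_scale_factors(intermediate_tile, scale):
    """Scale the i-th intermediate channel by the i-th scale factor."""
    return [s * y for s, y in zip(scale, intermediate_tile)]

beta = [0.25, 0.75, 0.0]
half = output_scale_factors(beta, amount=0.5)
out = apply_scale_factors([1.0 + 1.0j, 2.0 + 0.0j, 0.5j], half)
```

With `amount` set to one half, a channel whose weight is zero is attenuated to half its passive-upmix level rather than silenced.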
In some embodiments, it may be desirable for the sake of reducing artifacts to smooth the set of scale factors derived by the spatial synthesis to generate a set of smoothed scale factors to use for generating the output signals, where such smoothing may be applied in any or all of the temporal dimension (in time), the spectral dimension (across frequency bands), and the spatial dimension (across channels) without limitation. Such smoothing procedures are within the scope of the present invention.
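The text does not specify a smoothing method. As one hedged example, temporal smoothing of the scale factors could be realized with a first-order recursive average, sketched below. The smoothing constant `alpha` and the function name are hypothetical, and smoothing across frequency bands or across channels would follow the same pattern along a different dimension.

```python
def smooth_over_time(frames, alpha=0.8):
    """First-order recursive smoothing of scale factors along time.

    frames: list over time of lists of scale factors (one per channel or band).
    alpha: assumed smoothing constant in [0, 1); larger values smooth more.
    """
    smoothed, state = [], None
    for frame in frames:
        if state is None:
            state = list(frame)          # initialize with the first frame
        else:
            state = [alpha * s + (1.0 - alpha) * f for s, f in zip(state, frame)]
        smoothed.append(list(state))
    return smoothed

result = smooth_over_time([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], alpha=0.5)
```

In this example an abrupt swap of two gains is spread over several frames instead of occurring in a single step.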
Primary-Ambient Decomposition
It is often advantageous to separate primary and ambient components in the representation and synthesis of an audio scene.
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
Goodwin, Michael M., Jot, Jean-Marc
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Mar 13 2008 | CREATIVE TECHNOLOGY LTD | (assignment on the face of the patent) | / | |||
Jun 09 2008 | GOODWIN, MICHAEL M | CREATIVE TECHNOLOGY LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 021067 | /0621 | |
Jun 09 2008 | JOT, JEAN-MARC | CREATIVE TECHNOLOGY LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 021067 | /0621 |