A method is provided for encoding multiple microphone signals into a composite source-separable audio (SSA) signal suitable for transmission over a voice network. The embodiments enable source separation of the target voice signal from its ambient sound to be performed at any point in the voice communication network, including the internet cloud. Multiple kinds of processing are possible on the SSA signal, depending on the intended voice application. In one embodiment, the level of processing is adapted to the processing power available at the chosen processing node in the network. An apparatus for separating out the target source voice from its ambient sound is also provided. The apparatus includes a directed source separation (DSS) unit, which processes the two virtual microphone signals in the SSA representation to generate a new SSA signal including the enhanced target voice and the enhanced ambient noise.
|
1. A method for network transmission of voice captured through a plurality of microphones spatially disposed in a first group and a second group, comprising:
combining two digital audio signals into a composite source separable audio (SSA) signal, each digital audio signal of the two digital audio signals representing an independent mixture of a target source voice and an ambient noise, wherein outputs of the plurality of microphones within the first group are summed together as a first output digital audio signal of the two digital audio signals and the outputs of the plurality of microphones within the second group are summed together as a second output digital audio signal of the two digital audio signals, thereby defining a first virtual microphone and a second virtual microphone, respectively, and wherein the combining process comprises interleaving the two digital audio signals to generate the composite SSA signal; and
separating the two digital audio signals within the composite SSA signal into two mono audio signals by performing a first instance of directed source separation (DSS) on the composite SSA signal, the DSS comprising:
generating one or more control signals indicating an instantaneous signal-to-noise ratio, in the composite SSA signal, between the target source voice and the ambient noise;
under direction of the one or more control signals, separating the target source voice of the composite SSA signal into a first mono audio signal; and
under direction of the one or more control signals, separating the ambient noise of the composite SSA signal into a second mono audio signal.
2. A method of
separating the two audio signals within the composite SSA signal into two mono audio signals by performing directed source separation (DSS).
3. A method of
4. A method of
5. A method of
performing a first ambient sound separation process for human listening intelligibility; and
performing a second ambient sound separation process for a machine voice application.
6. A method of
7. A method of
8. The method of claim 1, comprising:
separating the two digital audio signals within the composite SSA signal into another two mono audio signals by performing a second instance of DSS on the composite SSA signal;
wherein the two mono audio signals include a first level of the ambient noise of the composite SSA signal;
wherein the other two mono audio signals include a second level of the ambient noise of the composite SSA signal;
wherein the first level is different than the second level.
9. The method of claim 1, wherein the separating is performed in an intermediate server in a network cloud.
|
This application claims priority to U.S. patent application Ser. No. 61/477,573, filed Apr. 20, 2011, and entitled “METHOD FOR ENCODING MULTIPLE MICROPHONE SIGNALS INTO A SOURCE-SEPARABLE AUDIO SIGNAL FOR NETWORK TRANSMISSION AND AN APPARATUS FOR DIRECTED SOURCE SEPARATION OF TARGET SOURCE VOICE FROM AMBIENT SOUND”; and U.S. Application No. 61/486,088, filed on May 13, 2011, and entitled “MULTI-MICROPHONE NOISE SUPPRESSION OVER SINGLE AUDIO CHANNEL,” which are incorporated herein by reference.
Recent developments in manufacturing have brought significant reductions in the cost and form factor of mobile consumer devices such as tablets, Bluetooth headsets, netbooks, and net TVs. As a result, there is explosive growth in the consumption of these devices. Besides communication applications such as voice and video telephony, voice driven machine applications are becoming increasingly popular as well. Voice based machine applications include voice driven automated attendants, command recognition, speech recognition, voice based search engines, networked games and the like. Video conferencing and other display oriented applications require the user to watch the screen from a hand-held distance. In the hand-held mode, the signal to noise ratio of the desired voice signal at the microphone is severely degraded, due both to the exposure to ambient noise and to the loud acoustic echo feedback from the loudspeakers in close proximity. This is further exacerbated by the fact that voice driven applications and improved voice communications require wide band voice.
A few examples of the devices which benefit from this invention are shown in
The voice sensing problem caused by the reduced signal to noise ratio can be addressed by employing multiple microphones. As shown in
An alternative method called blind source separation (BSS) has been discussed in academia. Given two microphones placed in strategic locations with respect to two sources of sound, it is possible to separate out the two sources without any distortion. As shown in
It is within this context that the embodiments arise.
The embodiments provide a technique for transforming the outputs of multiple microphones into a source separable audio signal, whose format is independent of the number of microphones. The signal may flow from end to end in the network and processing functions may be performed at any point in the network, including the cloud. The value functions attainable with multi-microphone processing include but are not limited to:
In the present embodiments, an arbitrary number of microphones is bifurcated into two groups. The outputs of the microphones in each group are summed together, forming two microphone arrays. Because the processing operation is kept computationally simple, i.e., summing, these arrays by themselves provide very little improvement of signal to noise ratio in the desired look direction. However, the microphones are arranged such that the characteristics of the ambient noise arriving from directions orthogonal to the look direction are substantially different between the outputs of the two microphone arrays. The embodiments employ a source separation adaptive filtering process between these two outputs to generate the desired signal with substantially improved signal to noise ratio. The separation process also provides the ambient noise with significantly reduced voice, which is of use in some applications. The outputs of a multiplicity of microphones are thus reduced, or encoded, into two signals, i.e., the virtual microphones. With the reduced bandwidth and fixed signal dimension, it is easier to perform the processing through existing hardware and software systems, such that the processing of interest may be performed either on the end hosts or in the network cloud.
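The grouping and summing step described above can be illustrated with a minimal sketch. It assumes the microphone outputs are already digitized, equal-length sample lists, and it splits the groups by list position; in the embodiments the grouping follows the spatial placement of the microphones. The names `sum_group`, `to_virtual_microphones` and `first_group_size` are hypothetical and do not come from the specification.

```python
from typing import List, Sequence, Tuple


def sum_group(signals: Sequence[Sequence[float]]) -> List[float]:
    """Sum the sample streams of one microphone group, sample by sample."""
    if not signals:
        raise ValueError("a group needs at least one microphone")
    length = len(signals[0])
    if any(len(s) != length for s in signals):
        raise ValueError("all microphone streams must have the same length")
    return [sum(samples) for samples in zip(*signals)]


def to_virtual_microphones(
    mics: Sequence[Sequence[float]], first_group_size: int
) -> Tuple[List[float], List[float]]:
    """Reduce any number of microphones to two virtual microphone signals.

    Assumption: the first `first_group_size` entries form the first group
    and the remaining entries form the second group.
    """
    first = mics[:first_group_size]
    second = mics[first_group_size:]
    return sum_group(first), sum_group(second)


# Four hypothetical microphones with three samples each.
mics = [
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    [2.0, 0.0, -2.0],
    [1.0, 1.0, 1.0],
]
virtual_a, virtual_b = to_virtual_microphones(mics, first_group_size=2)
print(virtual_a)  # [1.5, 2.5, 3.5]
print(virtual_b)  # [3.0, 1.0, -1.0]
```

Whatever the number of physical microphones, the result is always two sample streams, which is what fixes the signal dimension for later processing.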
The above summary does not include all aspects of the present invention. The invention includes all systems and methods disclosed in the Detailed Description below and particularly pointed out in the claims.
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings.
While several details are set forth, it is understood that some embodiments of the invention may be practiced without these details. In some instances, well-known circuits and techniques have not been shown in detail so as not to obscure the understanding of this description.
As mentioned above, two microphones in a beam forming array may provide some mitigation; however, it is possible to do much better with more than two microphones. Increasing the number of microphones brings several scaling hurdles with it, such as:
With advances in server technology, the processing hurdle may be overcome by moving processing to the cloud, making the consumer clients thinner and lighter. With the advent of personal WiFi routers connected to the internet via a 3G/4G cellular network, it is becoming more and more feasible to defer voice processing to the cloud.
To overcome the hardware and bandwidth hurdles, it is desirable to reduce the outputs of multiple microphones into a signal whose required bandwidth does not increase with the number of microphones. This reduction or encoding should be achievable using hardware circuitry, such as a summer. The encoding needs to preserve the information from the multiple microphones that is useful to the applications mentioned herein.
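The claims describe combining the two virtual microphone signals by interleaving them into the composite SSA signal. The following is a minimal sketch of one way to do that, assuming sample-by-sample interleaving of two equal-length digital signals; the interleaving granularity is not stated in this text, and the function names `interleave` and `deinterleave` are hypothetical.

```python
from typing import List, Sequence, Tuple


def interleave(channel_a: Sequence[float], channel_b: Sequence[float]) -> List[float]:
    """Interleave two channels as A0, B0, A1, B1, ... into one composite stream."""
    if len(channel_a) != len(channel_b):
        raise ValueError("both channels must have the same number of samples")
    composite: List[float] = []
    for a, b in zip(channel_a, channel_b):
        composite.append(a)
        composite.append(b)
    return composite


def deinterleave(composite: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Recover the two channels from an interleaved composite stream."""
    if len(composite) % 2 != 0:
        raise ValueError("an interleaved composite must hold an even number of samples")
    return list(composite[0::2]), list(composite[1::2])


# Two hypothetical virtual microphone signals.
channel_a = [0.1, 0.2, 0.3]
channel_b = [0.9, 0.8, 0.7]
ssa = interleave(channel_a, channel_b)
print(ssa)  # [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
```

The composite always holds two samples per sampling instant, so its size depends only on the signal duration and not on how many physical microphones were summed into each channel.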
In the embodiments described above, a plurality of microphones is bifurcated into two groups.
In all the above cases, the impact of target voice from the desired look direction is similar on both virtual microphones, while the impact of ambient noise is relatively dissimilar on the two virtual microphones. As shown in
In another embodiment, the acoustic feedback from loudspeakers is treated as another source of ambient noise. The plurality of microphones is placed and grouped in such a fashion that the acoustic feedback has maximally disparate impact on the two virtual microphones. In one embodiment, as shown in pre-processing module 82 in
One aspect of the embodiments is the ability to simplify the hardware requirement for grouping multiple microphones into a virtual microphone. One embodiment is to passively gang or wire-sum the outputs of analog microphones, 091, as shown in
Logically, SSA is a composite or a bundle of two audio streams, Channel A and Channel B. As shown in
In another embodiment, the SSA signal may be transmitted end to end, i.e., from the plurality of microphones on the transmit end to the receiving end, through the voice communication network. Along the way, the SSA signal may be transmitted using the two channel stereo format or the mono audio format. The SSA format is such that the intermediate processing is optional. In other words, the SSA signal degenerates gracefully to a voice signal (with ambient noise) in the absence of any DSS processing. The SSA composite is agnostic to the existing voice communication network, requiring no change at the system level. The SSA composite works with any existing voice communication standard, including Bluetooth and voice over Internet Protocol (VoIP). When the DSS signal processing needs to be performed, it can be done at any point in the network shown in
In another embodiment, where the inputs from the two virtual microphones are analog, an analog SSA signal is generated as shown in
In another embodiment, it is possible for the receiving end to recover the ambient noise, while suppressing the primary source voice. For example, it may be socially interesting for the receiving listener to experience the party ambience around the transmitting talker. The ambient noise may be used by an application to determine the proximity of two talkers in one embodiment. In another example, an internal map of a shopping mall may be annotated with the ambient noise in several critical spots such as shops, to guide a phone user in reaching their target destination.
In another embodiment, the SSA representation enables effective processing required for audio conferencing, as illustrated in
In another embodiment, the signal processing on a primary call is enhanced by taking advantage of the reference ambient sound present in another secondary call, when the two transmit parties are located in proximity. For example, if two parties are transmitting voice from the same social gathering, they are sharing the ambient noise environment. In fact, one party's target voice may be the other party's ambient noise. If the call server is aware of the situation, the server can take advantage of one call's SSA to perform better enhancement in the other call. In today's consumer devices, one can use the Global Positioning System (GPS) to determine whether the two transmit hosts are in physical proximity. In the example of
The DSS signal processing requirement is different for different applications. While speech recognition is better off with silence insertion between speech segments, the discontinuity caused by the silence insertion is extremely annoying to a human listener. Also, the quality of the residual ambient noise is extremely important for human listening. Unlike speech recognition or voice search, voice command recognition is typically much more robust in the presence of ambient noise; hence it does not require as much processing. In another embodiment, as shown in
In another embodiment, a slowly varying (voice-band compatible) non-voice signal 161 is mixed into Channel A 162 of the SSA composite, and its inversion 164 is mixed into Channel B 163, to generate a new SSA (166, 167) to be carried end-to-end. It is best to modulate these signals into the higher bands of the wide-band voice, so that they interfere least with the voice. The slowly varying signal is not audible to the listener, since it is suppressed by the DSS process for voice enhancement. The slow non-voice sensor signal may carry GPS, gyroscope, temperature, barometer, accelerometer, illumination, or gaming controller data, among others.
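The effect of mixing a signal into one channel and its inversion into the other can be seen in a small arithmetic sketch. It is idealized: both channels are assumed to carry an identical voice signal with no ambient noise, the sensor signal is added directly without the higher-band modulation described above, and a plain channel average stands in for the voice enhancement path rather than the DSS process itself. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def embed_sensor(
    channel_a: Sequence[float], channel_b: Sequence[float], sensor: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Mix the sensor signal into Channel A and its inversion into Channel B."""
    if not (len(channel_a) == len(channel_b) == len(sensor)):
        raise ValueError("channels and sensor signal must have the same length")
    new_a = [a + s for a, s in zip(channel_a, sensor)]
    new_b = [b - s for b, s in zip(channel_b, sensor)]
    return new_a, new_b


def channel_average(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Average of the two channels: the opposite-sign sensor terms cancel."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def channel_half_difference(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Half the channel difference: content common to both channels cancels."""
    return [(x - y) / 2 for x, y in zip(a, b)]


# Idealized example: identical voice on both channels, no ambient noise.
voice = [0.5, -0.25, 1.0]
sensor = [0.125, 0.125, 0.25]
new_a, new_b = embed_sensor(voice, voice, sensor)

voice_path = channel_average(new_a, new_b)
sensor_path = channel_half_difference(new_a, new_b)
print(voice_path)   # [0.5, -0.25, 1.0]
print(sensor_path)  # [0.125, 0.125, 0.25]
```

With real signals the ambient noise differs between the two channels, so the half difference would also contain a noise residue, and the sensor signal would need to be separated from it, for example by the band placement mentioned above.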
With the above embodiments in mind, it should be understood that the embodiments might employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms such as producing, identifying, determining, or comparing. Any of the operations described herein that form part of the invention are useful machine operations. The embodiments also relate to a device or an apparatus for performing these operations. The apparatus can be specially constructed for the required purpose, or the apparatus can be a general-purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general-purpose machines can be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
The invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data, which can be thereafter read by a computer system. Examples of the computer readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion. Embodiments of the present invention may be practiced with various computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a wire-based or wireless network.
Although the method operations were described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times or the described operations may be distributed in a system which allows the occurrence of the processing operations at various intervals associated with the processing.
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications can be practiced within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jul 08 2014 | MUKUND, SHRIDHAR K | AURENTA, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 042170 | /0130 | |
Jul 08 2014 | AURENTA, INC | Plantronics, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 042170 | /0153 | |
Mar 17 2015 | Plantronics, Inc. | (assignment on the face of the patent) | / | |||
Jul 02 2018 | Plantronics, Inc | Wells Fargo Bank, National Association | SECURITY AGREEMENT | 046491 | /0915 | |
Jul 02 2018 | Polycom, Inc | Wells Fargo Bank, National Association | SECURITY AGREEMENT | 046491 | /0915 | |
Aug 29 2022 | Wells Fargo Bank, National Association | Plantronics, Inc | RELEASE OF PATENT SECURITY INTERESTS | 061356 | /0366 | |
Aug 29 2022 | Wells Fargo Bank, National Association | Polycom, Inc | RELEASE OF PATENT SECURITY INTERESTS | 061356 | /0366 | |
Oct 09 2023 | Plantronics, Inc | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | NUNC PRO TUNC ASSIGNMENT SEE DOCUMENT FOR DETAILS | 065549 | /0065 |