Techniques are disclosed for manipulating a media player based on the environment in which content is consumed. For example, a user listening to a radio broadcast or some other ambient sound hears a song begin to play. Recognizing the song, the user wishes to watch an associated music video. A smartphone is used to record a portion of the ambient sound using an application configured according to certain disclosed embodiments. The observed audio is compared with one or more archived audio segments, each of which is associated with corresponding video content. If a match is found between the observed audio segment and an archived audio segment, video content corresponding to the matched archived audio segment is played back via a media player installed on the device. The playback is synchronized with the ambient sound. This allows the user to enjoy both the ambient audio and corresponding video content.
17. A non-transitory computer readable medium with instructions that, when executed by one or more processors, cause a process for synchronizing observed audio with archived content to be carried out, the process comprising:
receiving an observed audio segment from a client computing device, wherein the observed audio segment corresponds to ambient audio recorded by the client computing device;
identifying an archived audio segment that includes at least a portion of the observed audio segment, wherein the archived audio segment is stored in a data repository that is separate from the client computing device;
determining a time lag corresponding to a relative time offset between the observed audio segment and the archived audio segment;
transmitting archived content to the client computing device, wherein the archived content is associated with the archived audio segment, and wherein the archived content is transmitted from a time point that is at least partially based on the time lag.
1. A method for synchronizing observed audio with archived video content, the method comprising:
receiving an observed audio segment from a client computing device, wherein the observed audio segment corresponds to ambient audio recorded by the client computing device;
generating a plurality of hash values corresponding to the observed audio segment;
performing a comparison of each of the plurality of hash values to a plurality of archived hash values, wherein each of the plurality of archived hash values (a) is associated with one of a plurality of archived audio segments, and (b) is stored in a data repository that is separate from the client computing device;
identifying a selected archived audio segment and a time lag based on the comparison, wherein a portion of the selected archived audio segment corresponds to the observed audio segment;
identifying video content corresponding to the selected archived audio segment, wherein the identified video content is stored in a video content repository that is separate from the client computing device; and
streaming the identified video content to the client computing device from a time point based on the time lag.
11. A system for video synchronization that comprises a memory device and a processor that is operatively coupled to the memory device, wherein the processor is configured to execute instructions stored in the memory device that, when executed, cause the processor to carry out a process for synchronizing observed audio with archived video content, the process comprising:
receiving multimedia content that includes audio content and video content;
generating archived unique hash data based on the audio content;
storing the archived unique hash data in a data repository;
receiving an observed audio segment from a client computing device that is separate from the data repository;
generating observed unique hash data based on the observed audio segment;
storing, in the memory device, a comprehensive time lag data map that correlates a plurality of archived audio segments with a list of (time lag, count) data pairs, wherein the time lag is based on a comparison of the archived unique hash data and the observed unique hash data, and wherein the count is based on a frequency of the paired time lag;
identifying a matching archived audio segment that corresponds to the observed audio segment based on a maximum count identified from the comprehensive time lag data map; and
transmitting video content that was received with the matching archived audio segment to the client computing device.
2. The method of
a plurality of time lags are identified based on the comparison; and the method further comprises selecting one of the plurality of time lags based on receipt of an additional audio segment from the client computing device.
3. The method of
generating a synchronization map that includes a matching hash value that is found in both the plurality of hash values corresponding to the observed audio segment and the plurality of archived hash values, wherein:
the matching hash value is keyed to one or more (observed, archived) time pairs,
the observed time corresponds to a time of the observed audio segment at which the matching hash value was found, and
the archived time corresponds to a time of a potentially matching archived audio segment at which the matching hash value was found; and generating a time lag data map that includes (a) a listing of one or more time lags derived from the synchronization map, wherein each of the one or more time lags is defined as a difference between the observed time and the archived time, and (b) a frequency count corresponding to each of the one or more time lags.
4. The method of
each of the plurality of hash values corresponding to the observed audio segment is paired with a time of the observed audio segment at which the hash value was generated; and
each of the plurality of archived hash values is paired with a time of the associated archived audio segment at which the archived hash value was generated.
5. The method of
the matching hash value is keyed to one or more (observed, archived) time pairs;
the observed time corresponds to a time of the observed audio segment at which the matching hash value was found; and
the archived time corresponds to a time of a potentially matching archived audio segment at which the matching hash value was found.
6. The method of
receiving the plurality of archived audio segments before receiving the observed audio segment from the client computing device; and
generating the plurality of archived hash values.
7. The method of
receiving a multimedia content item before receiving the observed audio segment from the client computing device, wherein the multimedia content item includes one of the plurality of archived audio segments and corresponding video content;
generating the plurality of archived hash values; and
storing the corresponding video content in the video content repository.
8. The method of
dividing a frequency spectrum of the observed audio segment into a plurality of frequency bands;
dividing each of the plurality of frequency bands into a plurality of bin subsets;
identifying a bin index corresponding to a maximum power in each of the plurality of bin subsets; and
generating a plurality of hash values over a duration of the observed audio segment based on the bin indices associated with each of the plurality of frequency bands, wherein each of the plurality of hash values is defined by a powered sum of the bin indices.
9. The method of
dividing a frequency spectrum of the observed audio segment into a plurality of frequency bands;
dividing each of the plurality of frequency bands into a plurality of bin subsets;
identifying a bin index corresponding to a maximum power in each of the plurality of bin subsets; and
generating a plurality of hash values over a duration of the observed audio segment based on the bin indices associated with each of the plurality of frequency bands.
10. The method of
dividing a frequency spectrum of the observed audio segment into a plurality of frequency bands;
dividing each of the plurality of frequency bands into a plurality of bin subsets;
identifying a bin index corresponding to a maximum power in each of the plurality of bin subsets; and
generating a plurality of hash values over a duration of the observed audio segment based on the bin indices associated with each of the plurality of frequency bands,
wherein the frequency spectrum is divided into 5, 6, 7, 8, 9 or 10 frequency bands, and each of the frequency bands is divided into 3, 4, 5, 6, 7 or 8 bin subsets.
12. The system of
13. The system of
14. The system of
15. The system of
the video content repository is separate from the client computing device;
and
the process for synchronizing observed audio with archived video content further comprises retrieving the video content from the video content repository.
16. The system of
18. The non-transitory computer readable medium of
19. The non-transitory computer readable medium of
20. The non-transitory computer readable medium of
generating a plurality of hash values corresponding to the observed audio segment; and
performing a comparison of each of the plurality of hash values to a plurality of archived hash values, wherein each of the plurality of archived hash values is associated with one of a plurality of archived audio segments.
This disclosure relates generally to signal processing techniques, and more specifically, to methods for synchronizing an observed audio signal with archived video content having an audio track that matches the observed audio signal.
As portable computing devices such as smartphones and tablet computers have become increasingly ubiquitous, consumers have come to expect such devices to provide a wide range of functionality. This functionality is provided by both hardware and software components. For example, in terms of hardware, these devices often include components such as a touch sensitive display, one or more speakers, a microphone, a gyroscope, one or more antennae for wireless communication, a compass, and an accelerometer. In terms of software, these devices are capable of executing an ever-growing number of applications which are specifically configured to take advantage of the aforementioned hardware. Among the more popular software applications used with portable computing devices are media players which are capable of playing music, video, animation, and other such multimedia content. In particular, a wide range of commercially and freely available media players can be used to play both locally saved and remotely streamed multimedia content on a portable device. In the case of remotely streamed content, such content can be prerecorded and archived at a server that is configured to stream the content in response to a client request. Content can also be streamed “live”, such that a client can view the content nearly instantaneously with its initial recording. Regardless of how the content is streamed to the client, media players not only allow consumers to enjoy a wide range of multimedia content on their portable devices, but they also provide a valuable way for advertisers to reach a target audience.
Existing media players allow a user to consume a wide range of multimedia content, including both locally saved and remotely streamed content. Such players also provide a user with substantial control over how such content is consumed. For instance, a user can manipulate when playback of a media stream starts and stops, which can be particularly useful where a user does not wish to consume an entire media stream. To provide a specific example, in the case of a media stream that comprises a recorded baseball game, the user may wish to watch only the last three innings of the game. Existing media players also allow users to create customized playlists or to randomize playback of a collection of content items, both of which can be particularly useful in the context of audio content playback. In other applications, a media player can be configured to play primary and secondary content items which are acquired from different sources, such as where playback of a television program that is streamed from a first source is occasionally interrupted by playback of an advertisement that is streamed from a second source. While these features are useful in certain applications, the fact that existing media players function without regard to their operational environment is problematic. In particular, the inability to adapt media playback to a particular use context represents a substantial limitation on the functionality provided by existing media players.
Thus, and in accordance with certain of the embodiments disclosed herein, techniques are disclosed for manipulating the operation of a media player based on the environment in which content is consumed. For example, a user listening to a radio broadcast, a music performance, or some other source of ambient sound hears a popular song begin to play. Recognizing the song, the user wishes to watch an associated music video. A device such as a smartphone is used to record a portion of the observed ambient sound using an application configured according to certain of the embodiments disclosed herein. The observed audio segment is analyzed and compared with one or more archived audio segments, wherein each of the archived audio segments is associated with corresponding video content. If a match is found between the observed audio segment and an archived audio segment, video content corresponding to the matched archived audio segment is played back via a media player installed on the device. The playback is synchronized with the ongoing radio broadcast, music performance, or other ambient sound. This allows the user to enjoy both the ambient audio and corresponding video content.
Such embodiments provide media playback that is responsive to the environment in which the media is to be consumed. In particular, this allows users to consume video content that corresponds to observed audio, wherein the video content is also synchronized with the observed audio. As a result, a user can enjoy audiovisual content where only audio content, such as received via a radio broadcast, might otherwise be available. Not only does this enhance user experience, but it also provides a valuable way for advertisers to convert an audio impression, such as a radio advertisement, into an audiovisual impression. For instance, certain embodiments can be configured to detect an audio advertisement and play a synchronized visual segment in response to such detection. In addition to enhancing the advertiser's impression, this also provides the advertiser with a better understanding of parameters such as audience size and geolocation. In another example application, a content creator such as a radio show producer can invite listeners to synchronize their computing devices by simply recording a portion of the radio show. Once synchronized, dynamic content can be streamed to the participating listeners' devices, which can also be used to display video content associated with advertisements played during the course of the radio show. In this example application, the producer of the radio show can derive advertiser revenue based on the number of listeners subscribing to a synchronized video stream.
Certain embodiments can be understood as operating in a client-server computing environment, and include both client-side and server-side functionality. For example, a client-side device can be configured to execute an application that is capable of recording an observed audio segment, uploading the observed audio segment to a server, receiving synchronized video content from the server, and playing the received content. Several of the disclosed embodiments are specifically configured for, and described in the context of, use with a portable computing device capable of observing ambient audio via a microphone and playing back video content via a display screen. However, it will be appreciated that other embodiments can be implemented using a wide range of other computing devices, including desktop computers and smart television sets. Thus the present disclosure is not intended to be limited to implementation using any specific type of client computing device.
On the other hand, a server-side device can include a multimedia content archive that is configured in a way that facilitates subsequent matching of an observed audio segment with an archived audio segment. For example, in one embodiment such an archive is based on unique hash data that represents the various bands that comprise an audible frequency spectrum, thereby increasing the likelihood that a portion of the spectrum having peak power will be hashed at some point. Audio segments can be compared and matched based on this unique hash data. Once an archived audio segment is identified as a positive match with an observed audio segment, server-side techniques for determining a time gap between the observed and archived audio segments are provided. This enables video content corresponding to the matching archived audio segment to be streamed to the client device such that the video content is synchronized with the ambient audio.
As used herein, the term “data structure” refers, in addition to its ordinary meaning, to a way of storing and organizing data in a computer accessible memory so that data can be used by an application or software module. A data structure in its simplest form can be, for example, a set of one or more memory locations. In some cases, a data structure may be implemented as a so-called record, sometimes referred to as a struct or tuple, and may have any appropriate number of fields, elements or storage locations. As will be further appreciated, a data structure may include data of interest or a pointer that refers to a memory location where the data of interest can be found. A data structure may have any appropriate format such as, for example, a look-up table or index format; an array format; a hash table format; a graph, tree or hierarchal format having a number of nodes; an object format that includes data fields, for instance similar to a record; or a combination of the foregoing. A data structure may also include executable code for accessing and modifying the underlying structure and format. In a more general sense, the data structure may be implemented as a data set that can store specific values without being constrained to any particular order or format. In one embodiment, a data structure comprises a synchronization map, wherein matching audio hash values are keyed to time pairs associated with observed and archived audio segments. In another embodiment a data structure comprises a time lag data map for a particular archived audio segment, wherein a particular time lag is keyed to (a) a listing of time pairs associated with observed and archived audio segments, as well as (b) a count of such time pairs. In yet another embodiment a data structure comprises a comprehensive time lag data map for a plurality of archived audio segments, wherein an archived audio segment is keyed to a listing of (time lag, count) data pairs that are sorted by count in decreasing order. Numerous other data structure formats and applications will be apparent in light of this disclosure.
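By way of illustration only, the three map structures described above can be sketched as plain Python dictionaries. The field names, key types, and example values below are assumptions made for clarity and do not reflect the literal layout used by any particular embodiment; the lag values follow the convention, described later, of observed time minus archived time.

```python
# Synchronization map: a matching hash value is keyed to the (observed, archived)
# time pairs at which that hash value was found.
synchronization_map = {
    982301450: [(2.4, 41.6), (7.1, 46.3)],   # hash value -> [(t_observed, t_archived), ...]
    113554020: [(3.0, 42.2)],
}

# Time lag data map for one archived audio segment: each time lag is keyed to
# (a) a count of supporting time pairs and (b) a listing of those time pairs.
time_lag_data_map = {
    -39.2: {"count": 3, "pairs": [(2.4, 41.6), (3.0, 42.2), (7.1, 46.3)]},
    -12.5: {"count": 1, "pairs": [(5.5, 18.0)]},
}

# Comprehensive time lag data map: each archived audio segment is keyed to its
# (time lag, count) data pairs, sorted by count in decreasing order.
comprehensive_time_lag_data_map = {
    "AudioID_17": [(-39.2, 3), (-12.5, 1)],
    "AudioID_42": [(-81.0, 1)],
}
```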
As used herein, the term “multimedia content” refers, in addition to its ordinary meaning, to audio, visual, or audiovisual information intended for consumption by a user, organization, or other human- or computer-controlled entity. Examples of multimedia content include an audible recording played via speakers or headphones, a visual presentation that includes one or more visual assets which may or may not change with the progression of time, and a combination of both audible and visual assets. Multimedia content can therefore be understood as including both audio content and video content in certain applications, and in such case the audio and video components can be separated and subjected to different processing techniques. Multimedia content can be stored in a compressed digital format and may be created and manipulated using any suitable editing application. For example, multimedia content can be stored in any suitable file format defined by the Moving Picture Experts Group (MPEG), including MPEG-4, can be stored as a sequence of frames defined in a color space such as red-green-blue (RGB) or luma-chrominance (YUV), or can be stored in any other suitable compressed or uncompressed file format, including file formats generated in real-time by animation engines, compositing engines, or other video generation applications. Multimedia content may also include information that is not specifically intended for display, and thus also encompasses items such as embedded executable instructions, scripts, hyperlinks, metadata, encoding information, and formatting information.
System Architecture
In general, content server 200 can be understood as receiving one or more items of multimedia content 500 as “archived input”. Multimedia content 500 preferably includes audiovisual content which corresponds to audio segments which may be observed by client computing device 100. Thus, as illustrated in
Client computing device 100 may comprise, for example, one or more devices selected from a desktop computer, a laptop computer, a workstation, a tablet computer, a smartphone, a set-top box, a server, or any other such computing device. A combination of different devices may be used in certain embodiments. In the example embodiment illustrated in
Processor 110 can be any suitable processor, and may include one or more coprocessors or controllers, such as a graphics processing unit or an audio processor, to assist in control and processing operations associated with client computing device 100. Memory 120 can be implemented using any suitable type of digital storage, such as one or more of a disk drive, a universal serial bus (USB) drive, flash memory, random access memory, or any suitable combination of the foregoing. Operating system 140 may comprise any suitable operating system, such as Google Android (Google, Inc., Mountain View, Calif.), Microsoft Windows (Microsoft Corp., Redmond, Wash.), or Apple OS X (Apple Inc., Cupertino, Calif.). As will be appreciated in light of this disclosure, the techniques provided herein can be implemented without regard to the particular operating system provided in conjunction with client computing device 100, and therefore may also be implemented using any suitable existing or subsequently-developed platform. Communications module 150 can be any appropriate network chip or chipset which allows for wired or wireless connection to network 300 and other computing devices and resources. Network 300 may be a local area network (such as a home-based or office network), a wide area network (such as the Internet), or a combination of such networks, whether public, private, or both. In some cases access to resources on a given network or computing system may require credentials such as usernames, passwords, or any other suitable security mechanism.
Still referring to the example embodiment illustrated in
In certain embodiments audio recorder 160 is configured to record and compress a predetermined duration of audio signal. For example, in one implementation any observed audio segment having sufficient duration to identify a matching archived audio segment can be used. To provide a more specific example, in one embodiment the observed audio segment is between about 5 seconds and about 60 seconds in duration, in another embodiment the observed audio segment is between about 10 seconds and about 30 seconds in duration, and in yet another embodiment the observed audio segment is between about 15 seconds and about 25 seconds in duration. In one specific embodiment the observed audio segment is 20 seconds in duration. In a modified embodiment audio recorder 160 is configured to record, compress, and stream an audio signal to content server 200 until such time as a valid return signal is received from content server 200.
In certain embodiments multimedia player 170 comprises a software application capable of rendering multimedia content. To this end, multimedia player 170 can be implemented or used in conjunction with a variety of suitable hardware components that can be coupled to or that otherwise form part of client computing device 100. Examples of such hardware components include a speaker 172 and a display 174. Examples of existing multimedia players which can be adapted for use with certain of the disclosed embodiments include Windows Media Player (Microsoft Corp., Redmond, Wash.), QuickTime (Apple Inc., Cupertino, Calif.), and RealPlayer (RealNetworks, Inc., Seattle, Wash.). While multimedia players such as these are capable of playing audiovisual content, in certain embodiments multimedia player 170 can be configured to play only video content, such as video content 520 received from content server 200. In such embodiments speaker 172 may be considered optional. In certain embodiments operating system 140 is configured to automatically invoke multimedia player 170 upon receipt of video content 520. In embodiments where client computing device 100 is implemented in a client-server arrangement, such as illustrated in
Audio recorder 160 or multimedia player 170 can be configured to require a user to login before accessing the functionality described herein. Imposing such a requirement advantageously helps content providers collect additional information with respect to the audience receiving the audio and video content, thereby allowing content providers to target particular market segments with the streamed video content 520. This can be especially useful, for example, in the context of a radio advertiser that wishes to profile its audience and develop video content that is specifically intended for such audience.
Turning to
Archived content processing module 240 and observed content processing module 250 also each include hashing sub-module 246, 256. Hashing sub-modules 246, 256 are configured to generate unique hash data based on the archived or observed FFT data 244, 254, respectively. Additional details regarding calculation of the unique hash data will be provided in turn. The resulting archived unique hash (AUH) data can be stored in an AUH repository 248, while the resulting observed unique hash (OUH) data can be stored in an OUH repository 258. The archived input processed by archived content processing module 240 also includes video content 520, as distinguished from observed content processing module 250 which may only receive compressed audio signal 410. Consequently, archived content processing module 240 can further be configured to separate video content 520 from audio content 510 and to store the separated video content 520 in a video content repository 249, as illustrated in
Still referring to the example embodiment illustrated in
The embodiments disclosed herein can be implemented in various forms of hardware, software, firmware, or special purpose processors. For example, in one embodiment a non-transitory computer readable medium has instructions encoded therein that, when executed by one or more processors, cause one or more of the digital signal processing methodologies disclosed herein to be implemented. The instructions can be encoded using one or more suitable programming languages, such as C, C++, object-oriented C, JavaScript, Visual Basic .NET, BASIC, or alternatively, using custom or proprietary instruction sets. Such instructions can be provided in the form of one or more computer software applications or applets that are tangibly embodied on a memory device, and that can be executed by a computer having any suitable architecture. In one embodiment the system can be hosted on a given website and implemented using JavaScript or another suitable browser-based technology.
The functionalities disclosed herein can optionally be incorporated into a variety of different software applications, such as multimedia players, web browsers, and content editing applications. For example, a multimedia player installed on a smartphone can be configured to observe ambient audio and play corresponding video content based on the server-side audio matching techniques disclosed herein. The computer software applications disclosed herein may include a number of different modules, sub-modules, or other components of distinct functionality, and can provide information to, or receive information from, still other components and services. These modules can be used, for example, to communicate with peripheral hardware components, networked storage resources, or other external components. Other components and functionality not reflected in the illustrations will be apparent in light of this disclosure, and it will be appreciated that the present disclosure is not intended to be limited to any particular hardware or software configuration. Thus in other embodiments the components illustrated in
The aforementioned non-transitory computer readable medium may be any suitable medium for storing digital information, such as a hard drive, a server, a flash memory, or random access memory. In alternative embodiments, the computer and modules disclosed herein can be implemented with hardware, including gate level logic such as a field-programmable gate array (FPGA), or alternatively, a purpose-built semiconductor such as an application-specific integrated circuit (ASIC). Still other embodiments may be implemented with a microcontroller having a number of input/output ports for receiving and outputting data, and a number of embedded routines for carrying out the various functionalities disclosed herein. It will be apparent that any suitable combination of hardware, software, and firmware can be used, and that the present disclosure is not intended to be limited to any particular system architecture.
Methodology: Audio Hashing
Still referring to
As illustrated in
The first frequency band can be understood as ranging from 300 Hz to 3 kHz, the second frequency band can be understood as ranging from 3 kHz to 6 kHz, the third frequency band can be understood as ranging from 6 kHz to 9 kHz, and so forth, as illustrated in
Each of the frequency bands is, in turn, divided into nbs bin subsets per frequency band. See reference numeral 1120 in
Thus the first bin subset can be understood as ranging from 3.0 kHz to 3.6 kHz, the second bin subset can be understood as ranging from 3.6 kHz to 4.2 kHz, the third bin subset can be understood as ranging from 4.2 kHz to 4.8 kHz, and so forth, as illustrated in
The FFT techniques applied by FFT calculation sub-modules 242, 252 are based on a given sampling rate SR and window size WS. For example, in one embodiment FFT calculation sub-modules 242, 252 use a sampling rate of 44.1 kHz, although sampling rates ranging from 8 kHz to 5.64 MHz can be used in other embodiments, depending on the nature of the audio signal being analyzed. Likewise, in one embodiment FFT calculation sub-modules 242, 252 use an FFT window size having 4096 bins, although window sizes ranging from 1024 bins to 16384 bins can be used in other embodiments, depending on the nature of the audio signal being analyzed and the processing capacity of content server 200. The ratio of the sampling rate to the window size defines the frequency resolution FR of the resulting FFT analysis. For instance, in the example embodiment illustrated in
Thus where the first bin subset ranges from 3.0 kHz to 3.6 kHz, this spectral range can be understood as corresponding to bins ranging from 3.0 kHz ÷ 10.77 Hz per bin = the 279th bin to 3.6 kHz ÷ 10.77 Hz per bin = the 334th bin. In other words, the first bin subset illustrated in
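The band and bin-subset bookkeeping described above can be sketched in a few lines of Python. The sketch below assumes the example parameters from this section (a 44.1 kHz sampling rate, a 4096-bin FFT window, a first band from 300 Hz to 3 kHz followed by 3 kHz-wide bands, and five bin subsets per band); the function names are illustrative only.

```python
import numpy as np

SAMPLING_RATE = 44_100                            # SR, in Hz (example value from the text)
WINDOW_SIZE = 4_096                               # WS, FFT bins per window
FREQ_RESOLUTION = SAMPLING_RATE / WINDOW_SIZE     # FR, roughly 10.77 Hz per bin

def bin_index(freq_hz):
    """Map a frequency to the corresponding FFT bin index at resolution FR."""
    return int(round(freq_hz / FREQ_RESOLUTION))

def make_bands(n_bands):
    """Band edges following the example above: 300 Hz-3 kHz, then 3 kHz-wide bands."""
    edges = [300.0] + [3_000.0 * i for i in range(1, n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))

def make_bin_subsets(band_low, band_high, n_subsets):
    """Divide one frequency band into n_subsets contiguous ranges of FFT bins."""
    freq_edges = np.linspace(band_low, band_high, n_subsets + 1)
    return [(bin_index(freq_edges[s]), bin_index(freq_edges[s + 1]))
            for s in range(n_subsets)]

# The first subset of the second band (3.0-3.6 kHz) spans roughly bins 279-334,
# matching the worked example above.
second_band_low, second_band_high = make_bands(3)[1]                          # (3000.0, 6000.0)
print(make_bin_subsets(second_band_low, second_band_high, n_subsets=5)[0])    # -> (279, 334)
```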
Each bin comprising the audible spectrum illustrated in
As illustrated in
A sequence of unique hash values {h0, h1, h2, . . . hd} is calculated over the duration td of the audio segment being analyzed for each of the nba frequency bands. See reference numeral 1220 in
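A minimal sketch of how the per-band hash sequence {h0, h1, h2, . . . hd} might be produced is shown below. It reuses the bin-subset tuples from the previous sketch, assumes non-overlapping FFT windows (the amount of window overlap is not specified here), and defers the actual hash computation to a hash_fn callable such as the one sketched after Equation (5).

```python
import numpy as np

def max_power_bin_indices(power_spectrum, subsets):
    """Return, for one FFT window, the bin index of maximum power in each bin subset.

    power_spectrum : squared FFT magnitudes for one window
    subsets        : list of (first_bin, last_bin) tuples for one frequency band
    """
    return [first + int(np.argmax(power_spectrum[first:last + 1]))
            for first, last in subsets]

def band_hash_sequence(samples, subsets, hash_fn,
                       window_size=4_096, sampling_rate=44_100):
    """Produce the (time, hash value) sequence for one frequency band of one segment."""
    sequence = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        frame = samples[start:start + window_size]
        power = np.abs(np.fft.rfft(frame)) ** 2        # power in each FFT bin
        bins = max_power_bin_indices(power, subsets)   # one index per bin subset
        sequence.append((start / sampling_rate, hash_fn(bins)))
    return sequence
```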
Any of a variety of suitable hashing functions can be used to generate a hash value from the nbs maximum power bin indices. For example in one embodiment a unique hash value h can be defined by a powered sum of the bin indices associated with the maximum power for each of the nbs bin subsets, such as:
wherein the expression (logical) ? a:b evaluates to a if the logical expression is true, and evaluates to b if the logical expression is false. Equation (4) produces a unique hash value based on the set of bin indices {b1, b2, b3, b4, b5} associated with the maximum power for each of the five bin subsets at a given time. Bin indices bp, bp+1, and bp+2 are treated the same to introduce a degree of tolerance into the hashing process. This degree of tolerance can be increased, decreased, or wholly omitted in other embodiments. The hashing calculation provided by Equation (4) can be modified in alternative embodiments, and thus it will be appreciated that other calculations can be used in such embodiments. For example, in an alternative embodiment the hash value is calculated based on a subset of the nbs maximum power bin indices without any degree of tolerance. One example of such a hashing function is provided by:
h(b1, b2, b3, b4) = [b4 − (b4 % 3)]·10^8 + [b3 − (b3 % 3)]·10^5 + [b2 − (b2 % 3)]·10^2 + [b1 − (b1 % 3)].  (5)
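Reading the flattened exponents in Equation (5) as 10^8, 10^5, and 10^2 (an assumption made here for illustration), a direct implementation might look as follows. The comments describe only what the code does: four of the maximum-power bin indices are each rounded down to a multiple of 3 and combined by a powered sum.

```python
def unique_hash(bins):
    """Hash the maximum-power bin indices in the manner of Equation (5).

    Only the first four bin indices are used, and each index is rounded down to
    a multiple of 3 before being combined by a powered sum, so indices that
    differ by only 1 or 2 contribute the same term.
    """
    b1, b2, b3, b4 = bins[:4]
    return ((b4 - (b4 % 3)) * 10**8
            + (b3 - (b3 % 3)) * 10**5
            + (b2 - (b2 % 3)) * 10**2
            + (b1 - (b1 % 3)))

# Example usage with the helpers sketched above:
#   hashes = band_hash_sequence(samples, subsets, unique_hash)
```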
Once generated, the nba unique hashes are stored in an appropriate hash repository. See reference numeral 1230 in
In certain embodiments archived content processing module 240 can be used to apply hashing methodology 1000 to a large quantity of multimedia content 500 before any attempt is made to synchronize an observed audio signal with archived video content. In particular, processing a large quantity of multimedia content 500 increases the likelihood that an appropriate match will be found for a subsequently-observed audio segment. In such embodiments archiving multimedia content 500 comprises (a) receiving multimedia content 500 that comprises audio content 510 and video content 520 which are synchronized; (b) separating audio content 510 from video content 520; (c) generating AUH data based on audio content 510; and (d) storing video content 520 in video content repository 249. Video content 520 can be indexed by the same AudioID_q parameter used in AUH repository 248, such that once a particular AudioID_q parameter is identified as matching an observed audio segment, the corresponding video content can be retrieved. Compilation of AUH data enables such data to be used in a subsequent matching process, as will be described in turn. While certain embodiments involve compilation of a large quantity of AUH data before the matching and synchronization processes are attempted, it will be appreciated that in other embodiments multimedia content 500 can continue to be received and processed even after matching and synchronization commences.
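The archiving steps (a) through (d) described above might be sketched as follows. The sketch reuses the helpers from the earlier code fragments, separate_audio_and_video is a hypothetical placeholder for whatever demultiplexing step a given implementation uses (an ffmpeg-based step, for example), and the repositories are modeled as in-memory dictionaries purely for illustration.

```python
def archive_multimedia_content(audio_id, multimedia_item,
                               auh_repository, video_repository, band_subsets):
    """Illustrative archiving pass for one item of multimedia content 500.

    audio_id        : the AudioID_q value under which hash data and video are indexed
    multimedia_item : the received audiovisual content
    band_subsets    : per-band lists of (first_bin, last_bin) tuples
    """
    # (b) separate the audio track from the video track (demultiplexing assumed)
    audio_samples, video_track = separate_audio_and_video(multimedia_item)

    # (c) generate AUH data for every frequency band and store it in AUH repository 248
    for band_index, subsets in enumerate(band_subsets):
        auh_repository[(audio_id, band_index)] = band_hash_sequence(
            audio_samples, subsets, unique_hash)

    # (d) store the separated video content in video content repository 249,
    #     indexed by the same AudioID_q parameter
    video_repository[audio_id] = video_track
```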
Methodology: Audio Matching and Video Synchronization
In certain embodiments the example synchronization and matching method 2000 commences once observed content processing module 250 generates OUH data based on an observed audio signal 400. Because significant portions of method 2000 are applied individually to the nba frequency bands comprising the audible spectrum, the processing associated with method 2000 can be expedited through the use of parallel processing techniques. Therefore in certain embodiments hash matching module 270 is configured to create nba parallel processing threads for each of the nba frequency bands. See reference numeral 2110 in
Parallel processing over nba frequency bands increases the likelihood that frequencies will be hashed where a particular audio signal has strong frequency power. For example, a first archived audio segment may have strong frequency power in a first frequency band, while a second archived audio segment may have strong frequency power in a second frequency band. By hashing an observed audio segment in both frequency bands, this ensures that AUH data from a strong frequency power spectrum of both the first and second archived audio segments is compared with OUH data from the same frequency spectrum of the observed audio segment.
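Because each of the nba frequency bands can be matched independently, the fan-out into parallel workers might be sketched as below. Here match_band stands in for the per-band comparison routine developed in the following paragraphs, and a process pool could be substituted for the thread pool if the per-band work proves CPU-bound.

```python
from concurrent.futures import ThreadPoolExecutor

def match_all_bands(match_band, observed_per_band, archive_per_band):
    """Run the per-band matching routine in parallel, one worker per frequency band.

    match_band        : callable implementing the per-band comparison (sketched below)
    observed_per_band : per-band (time, hash) sequences for the observed audio segment
    archive_per_band  : per-band archived hash data covering all archived segments
    """
    with ThreadPoolExecutor(max_workers=len(observed_per_band)) as pool:
        futures = [pool.submit(match_band, observed, archived)
                   for observed, archived in zip(observed_per_band, archive_per_band)]
        return [future.result() for future in futures]
```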
Hash matching module 270 is configured to receive an observed unique hash for the ith frequency band of an observed audio segment. See reference numeral 2120 in
Because the observed and archived audio segments are not necessarily the same duration, the observed and archived hashes may have different quantities of (time, hash value) data pairs. For instance,
Once the counting parameters j and k are set, the jth hash value of the observed unique hash (hj) is compared to the kth hash value of the archived unique hash that is associated with the A′th archived audio segment (hk). See reference numeral 2210 in
Regardless of whether or not hj=hk, the archived unique hash value counting parameter k is incremented by one. See reference numeral 2220 in
However, if the incremented archived unique hash counting parameter k is greater than the total quantity of archived unique hash values associated with the A′th archived audio segment |AUH(A′)|, this indicates that all of the archived unique hash values for audio segment A′ have been compared to the jth hash value of the observed unique hash. In this case, the observed unique hash value counting parameter j is incremented by one. See reference numeral 2240 in
If all of the archived unique hash values for audio segment A′ have been compared to all of the observed unique hash values, it is determined whether or not synchronization map 275a is empty. See reference numeral 2260 in
Referring again to reference numeral 2260 in
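The double loop over the counting parameters j and k described above can be sketched as follows; in practice an index over the archived hash values would avoid the quadratic scan, but the quadratic form mirrors the description most directly, and the names are illustrative only.

```python
from collections import defaultdict

def build_synchronization_map(observed_hashes, archived_hashes):
    """Build synchronization map 275a for one archived segment and one frequency band.

    observed_hashes : list of (t_observed, hash value) pairs for the observed segment
    archived_hashes : list of (t_archived, hash value) pairs for the archived segment

    Returns a dict keyed by matching hash value, whose values are lists of
    (t_observed, t_archived) time pairs.
    """
    sync_map = defaultdict(list)
    for t_obs, h_obs in observed_hashes:          # counting parameter j
        for t_arc, h_arc in archived_hashes:      # counting parameter k
            if h_obs == h_arc:
                sync_map[h_obs].append((t_obs, t_arc))
    return dict(sync_map)
```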
Once the counting parameters M′ and L′M′ are set, the time lag TL for the L′M′th time pair in the list keyed to the M′th keyed matching hash value is evaluated. See reference numeral 2450 in
TL1→c1,{(tj,tk),(tj,tk), . . . , (tj,tk)}
TL2→c2,{(tj,tk),(tj,tk), . . . , (tj,tk)}
TL3→c3,{(tj,tk),(tj,tk), . . . , (tj,tk)} (6)
It will therefore be appreciated that because multiple time pairs may evaluate to the same time lag TL, a given time lag TL may be keyed to a plurality of time pairs.
If the evaluated time lag TL does not already exist in time lag data map for A′th audio segment 275b, a time lag data map element that corresponds to TL and that has a counter c=1 and a one-element list {(tj, tk)} is created. See reference numeral 2512 in
Regardless of whether or not the evaluated time lag TL already exists in time lag data map for A′th audio segment 275b, the time pair counting parameter L′M′ is incremented by one. See reference numeral 2520 in
However, if the incremented time pair counting parameter L′M′ is greater than the total number of time pairs associated with the M′th keyed matching hash value LM′, this indicates that all of the time pairs associated with the M′th keyed matching hash value have been correlated with a time lag TL indexed in time lag data map for A′th audio segment 275b. In this case, the matching hash value counting parameter M′ is incremented by one. See reference numeral 2540 in
On the other hand, if the incremented matching hash value counting parameter M′ is greater than the total number of keyed matching hash values contained in synchronization map 275a, this indicates that all of the time pairs contained in synchronization map 275a have been correlated with a time lag TL indexed in time lag data map for A′th audio segment 275b. In this case time lag data map for A′th audio segment 275b is sorted by decreasing count c, such that the maximum count cA′1 is listed first. See reference numeral 2610 in
In certain embodiments the sorted time lag data map for the A′th audio segment is added to a comprehensive time lag data map 275c. See reference numeral 2620 in
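Deriving time lag data map 275b from synchronization map 275a, and folding the sorted result into comprehensive time lag data map 275c, might look as follows. Rounding the lag before using it as a key is an assumption introduced here so that floating-point lags that should coincide actually group together.

```python
from collections import defaultdict

def build_time_lag_data_map(sync_map):
    """Build time lag data map 275b for one archived audio segment.

    Each (t_observed, t_archived) pair keyed in the synchronization map contributes
    the lag t_observed - t_archived; equal lags share one counter and one listing
    of time pairs, as in the format of Equation (6).
    """
    lag_map = defaultdict(lambda: {"count": 0, "pairs": []})
    for time_pairs in sync_map.values():            # over keyed matching hash values (M')
        for t_obs, t_arc in time_pairs:             # over that value's time pairs (L')
            lag = round(t_obs - t_arc, 3)           # grouping of float lags assumed
            lag_map[lag]["count"] += 1
            lag_map[lag]["pairs"].append((t_obs, t_arc))
    return dict(lag_map)

def add_to_comprehensive_map(comprehensive_map, audio_id, lag_map):
    """Sort one segment's (time lag, count) pairs by decreasing count and record them
    in comprehensive time lag data map 275c under that segment's identifier."""
    comprehensive_map[audio_id] = sorted(
        ((lag, entry["count"]) for lag, entry in lag_map.items()),
        key=lambda pair: pair[1], reverse=True)
```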
On the other hand, if the incremented audio segment counting parameter A′ is greater than the total quantity of archived audio segments A, this indicates that the ith band of all A archived audio segments has been compared to the ith band of the observed audio segment. The results of these comparisons are provided in comprehensive time lag data map 275c. Waveform manager 290 can be configured to determine whether comprehensive time lag data map 275c is empty. See reference numeral 2340 in
However, if comprehensive time lag data map 275c is not empty and contains (time lag, count) data pairs for each of the archived audio segments B having matching hash values, then waveform manager 290 is configured to end parallel processing of the nba bands. See reference numeral 2346 in
In certain embodiments the audio segment associated with the maximum count cmax present in a given comprehensive time lag data map 275c is identified. See reference numeral 2720 in
Where different bands identify different audio segments as being most common, it may not be possible to match the observed audio segment with an archived audio segment with a threshold confidence level. See reference numeral 2732. In this case, the analysis ends without identifying a matching archived audio segment, although a user may wish to repeat the analysis with a longer observed audio segment. Thus in some cases content server 200 is configured to request client computing device 100 to send additional observed audio data in response to a detected failure to identify a matching archived audio segment. On the other hand, where all of the bands identify the same archived audio segment as being most common, or in alternative embodiments where a majority or a threshold plurality of the bands identify a particular audio segment as being most common, the identified most common audio segment can be considered a positive match with the observed audio segment. See reference numeral 2734 in
Once an archived audio segment is identified as a positive match to the observed audio segment, it is determined whether the identified match is sufficiently precise to begin streaming video content 520 to client computing device 100 such that the streamed video content 520 is synchronized with observed audio signal 400. For example, even where a positive match is identified, ambiguity may exist with respect to the appropriate time differential between the observed and archived audio segments. To provide a specific example, this ambiguity may arise where a repeating refrain is present in the observed audio segment, in which case it may be unclear which repetition of the refrain was actually observed. Whether such ambiguity exists may be established by determining whether different time lag values are associated with the maximum observed count cmax. In particular, where the maximum observed count cmax is associated with multiple time lags, this suggests that the observed audio segment matches more than one portion of the archived audio segment. This may occur, for instance, in the example embodiment illustrated in
Thus, in certain embodiments it is determined whether multiple time lag values are associated with the maximum observed count cmax. See reference numeral 2740 in
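Putting these last two determinations together, a sketch of how waveform manager 290 might pick the matching segment and detect the multiple-time-lag ambiguity is shown below. The cross-band agreement rule used here is the strict unanimity variant, and the return conventions are purely illustrative.

```python
def select_match(comprehensive_maps_per_band):
    """Identify the matching archived segment, or report ambiguity / no match.

    comprehensive_maps_per_band : one comprehensive time lag data map per frequency
                                  band, each keyed by segment identifier to a
                                  count-sorted list of (time lag, count) pairs
    """
    picks = []
    for band_map in comprehensive_maps_per_band:
        if not band_map:
            return None                              # a band produced no matches at all
        # Segment whose best (first-listed) count is largest within this band.
        audio_id = max(band_map, key=lambda aid: band_map[aid][0][1])
        max_count = band_map[audio_id][0][1]
        tied_lags = [lag for lag, count in band_map[audio_id] if count == max_count]
        picks.append((audio_id, tied_lags, max_count))

    if len({audio_id for audio_id, _, _ in picks}) != 1:
        return None                                  # bands disagree: no confident match

    audio_id, tied_lags, _ = max(picks, key=lambda pick: pick[2])
    if len(tied_lags) > 1:
        return ("ambiguous", audio_id)               # request additional observed audio
    return (audio_id, tied_lags[0])                  # lag fixes the streaming start point
```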
In an alternative embodiment the retrieved video content 520 is not streamed to client computing device 100, but rather is sent as a bulk data transfer. In such case client computing device 100 can be configured to compute which portion of the received video content 520 to display at a given time. Such a configuration may be particularly advantageous in applications where a limited quantity of visual assets are to be displayed at certain points of an audio segment. For example, a 60-second radio advertisement may call for three still slides to be displayed at certain points in time. Once the audio associated with the radio advertisement is recognized, the three still slides can be downloaded to client computing device 100 and displayed at the appropriate time points. Such embodiments reduce bandwidth associated with ongoing data streaming between client computing device 100 and content server 200.
The various embodiments disclosed herein advantageously provide media playback that is responsive to the environment in which the media is to be consumed. This allows users to consume video content that corresponds to observed audio, wherein the video content is synchronized with the observed audio. The methodologies disclosed herein enable a user to enjoy audiovisual content where only audio content, such as received via a radio broadcast, might otherwise be available. Not only does this enhance user experience, but it also provides a valuable way for advertisers to convert an audio impression, such as a radio advertisement, into an audiovisual impression. It also allows video content to be streamed to content consumers on an “on-demand” basis, thereby addressing the difficulty of streaming content to different consumers who receive content at different times, as in the case of consumers located in different time zones. As described herein, in certain embodiments the audio/video synchronization functionality is provided by processing modules executing at content server 200, such that any applications executing on client computing device 100 do not require significant processing resources. Thus, from a user's perspective, the functionality described herein can be achieved using, for example, portable computing devices such as smartphones and tablet computers.
For instance,
Numerous variations and configurations will be apparent in light of this disclosure. For instance, one example embodiment provides a method for synchronizing observed audio with archived video content. The method comprises receiving an observed audio segment from a client computing device. The method further comprises generating a plurality of hash values corresponding to the observed audio segment. The method further comprises performing a comparison of each of the plurality of hash values to a plurality of archived hash values. Each of the plurality of archived hash values is associated with one of a plurality of archived audio segments. The method further comprises identifying a selected archived audio segment and a time lag based on the comparison. A portion of the selected archived audio segment corresponds to the observed audio segment. The method further comprises identifying video content corresponding to the selected archived audio segment. The method further comprises streaming the video content to the client computing device. The video content is streamed from a time point based on the time lag. In some cases (a) a plurality of time lags are identified based on the comparison; and (b) the method further comprises selecting one of the plurality of time lags based on receipt of an additional audio segment from the client computing device. In some cases the method further comprises (a) generating a synchronization map that includes a matching hash value that is found in both the plurality of hash values corresponding to the observed audio segment and the plurality of archived hash values, wherein (i) the matching hash value is keyed to one or more (observed, archived) time pairs, (ii) the observed time corresponds to a time of the observed audio segment at which the matching hash value was found, and (iii) the archived time corresponds to a time of a potentially matching archived audio segment at which the matching hash value was found; and (b) generating a time lag data map that includes (i) a listing of one or more time lags derived from the synchronization map, wherein each of the one or more time lags is defined as a difference between the observed time and the archived time, and (ii) a frequency count corresponding to each of the one or more time lags. In some cases (a) each of the plurality of hash values corresponding to the observed audio segment is paired with a time of the observed audio segment at which the hash value was generated; and (b) each of the plurality of archived hash values is paired with a time of the associated archived audio segment at which the archived hash value was generated. In some cases the method further comprises generating a synchronization map that includes a matching hash value that is found in both the plurality of hash values corresponding to the observed audio segment and the plurality of archived hash values, wherein (a) the matching hash value is keyed to one or more (observed, archived) time pairs; (b) the observed time corresponds to a time of the observed audio segment at which the matching hash value was found; and (c) the archived time corresponds to a time of a potentially matching archived audio segment at which the matching hash value was found. In some cases the method further comprises (a) receiving the plurality of archived audio segments before receiving the observed audio segment from the client computing device; and (b) generating the plurality of archived hash values.
In some cases the method further comprises (a) receiving a multimedia content item before receiving the observed audio segment from the client computing device, wherein the multimedia content item includes one of the plurality of archived audio segments and corresponding video content; (b) generating the plurality of archived hash values; and (c) storing the corresponding video content in a video content repository. In some cases generating the plurality of hash values corresponding to the observed audio segment further comprises (a) dividing a frequency spectrum of the observed audio segment into a plurality of frequency bands; (b) dividing each of the plurality of frequency bands into a plurality of bin subsets; (c) identifying a bin index corresponding to a maximum power in each of the plurality of bin subsets; and (d) generating a plurality of hash values over a duration of the observed audio segment based on the bin indices associated with each of the plurality of frequency bands, wherein each of the plurality of hash values are defined by a powered sum of the bin indices. In some cases generating the plurality of hash values corresponding to the observed audio segment further comprises (a) dividing a frequency spectrum of the observed audio segment into a plurality of frequency bands; (b) dividing each of the plurality of frequency bands into a plurality of bin subsets; (c) identifying a bin index corresponding to a maximum power in each of the plurality of bin subsets; and (d) generating a plurality of hash values over a duration of the observed audio segment based on the bin indices associated with each of the plurality of frequency bands. In some cases generating the plurality of hash values corresponding to the observed audio segment further comprises (a) dividing a frequency spectrum of the observed audio segment into a plurality of frequency bands; (b) dividing each of the plurality of frequency bands into a plurality of bin subsets; (c) identifying a bin index corresponding to a maximum power in each of the plurality of bin subsets; and (d) generating a plurality of hash values over a duration of the observed audio segment based on the bin indices associated with each of the plurality of frequency bands, wherein the frequency spectrum is divided into 5, 6, 7, 8, 9 or 10 frequency bands, and each of the frequency bands is divided into 3, 4, 5, 6, 7 or 8 bin subsets.
Another example embodiment provides a system for video synchronization that comprises an archived content processing module that is configured to receive multimedia content that includes audio content and video content. The archived content processing module further includes an archived content hashing sub-module configured to generate archived unique hash data based on the audio content. The system further comprises an observed content processing module that is configured to receive an observed audio segment from a client computing device. The observed content processing module includes an observed content hashing sub-module configured to generate observed unique hash data based on the observed audio segment. The system further comprises a memory configured to store a comprehensive time lag data map that correlates a plurality of archived audio segments with a list of (time lag, count) data pairs. The time lag is based on a comparison of the archived unique hash data and the observed unique hash data. The count is based on a frequency of the paired time lag. The system further comprises a waveform manager that is configured to (a) identify a matching archived audio segment that corresponds to the observed audio segment based on a maximum count identified from the comprehensive time lag data map, and (b) transmit video content that was received with the matching archived audio segment to the client computing device. In some cases the observed content processing module is configured to send the client computing device an instruction to terminate transmission of the observed audio segment in response to receipt of a predetermined duration of the observed audio segment. In some cases the observed content processing module is configured to receive a second observed audio segment from the client computing device in response to the waveform manager detecting that the maximum count identified from the comprehensive time lag data map is associated with a plurality of time lags. In some cases the video content that was received with the matching archived audio segment is streamed to the client computing device. In some cases the system further comprises (a) a video content repository configured to store the video content included in the received multimedia content; and (b) a content manager configured to retrieve the video content from the video content repository and to provide the retrieved video content to the waveform manager. In some cases the system further comprises a client computing device configured to record the observed audio segment and send the observed audio segment to the observed content processing module.
Another example embodiment provides a computer program product encoded with instructions that, when executed by one or more processors, causes a process for synchronizing observed audio with archived video content to be carried out. The process comprises receiving an observed audio segment from a client computing device. The process further comprises identifying an archived audio segment that includes at least a portion of the observed audio segment. The process further comprises determining a time lag corresponding to a relative time offset between the observed audio segment and the archived audio segment. The process further comprises transmitting video content to the client computing device. The video content is associated with the archived audio segment. The video content is transmitted from a time point that is at least partially based on the time lag. In some cases the observed audio segment is streamed from the client computing device for a predetermined recording period. In some cases the process further comprises receiving the archived audio segment before receiving the observed audio segment, wherein the archived audio segment is not received from the client computing device. In some cases identifying the archived audio segment further comprises (a) generating a plurality of hash values corresponding to the observed audio segment; and (b) performing a comparison of each of the plurality of hash values to a plurality of archived hash values, wherein each of the plurality of archived hash values is associated with one of a plurality of archived audio segments.
The foregoing detailed description has been presented for illustration. It is not intended to be exhaustive or to limit the disclosure to the precise form described. Many modifications and variations are possible in light of this disclosure. Therefore it is intended that the scope of the disclosure be limited not by this detailed description, but rather by the claims appended hereto. Subsequently filed applications claiming priority to this application may claim the disclosed subject matter in a different manner, and may generally include any set of one or more features as variously disclosed or otherwise demonstrated herein.
Biswas, Sanjeev Kumar, Chatterjee, Arijit, Luthra, Gaurav, Munshi, Kausar
Patent | Priority | Assignee | Title |
10922720, | Jan 11 2017 | Adobe Inc | Managing content delivery via audio cues |
11410196, | Jan 11 2017 | Adobe Inc. | Managing content delivery via audio cues |
Patent | Priority | Assignee | Title |
7503488, | Oct 17 2003 | Idemia Identity & Security USA LLC | Fraud prevention in issuance of identification credentials |
20050190928, | |||
20110076942, | |||
20110191823, | |||
20120194737, | |||
20130272672, | |||
20140028914, | |||
20140106710, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jun 20 2014 | BISWAS, SANJEEV KUMAR | Adobe Systems Incorporated | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 033159 | /0876 | |
Jun 20 2014 | CHATTERJEE, ARIJIT | Adobe Systems Incorporated | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 033159 | /0876 | |
Jun 20 2014 | LUTHRA, GAURAV | Adobe Systems Incorporated | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 033159 | /0876 | |
Jun 20 2014 | MUNSHI, KAUSAR | Adobe Systems Incorporated | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 033159 | /0876 | |
Jun 23 2014 | Adobe Systems Incorporated | (assignment on the face of the patent) | / | |||
Oct 08 2018 | Adobe Systems Incorporated | Adobe Inc | CHANGE OF NAME SEE DOCUMENT FOR DETAILS | 048867 | /0882 |
Date | Maintenance Fee Events |
Jan 13 2020 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Jan 12 2024 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
Date | Maintenance Schedule |
Jul 12 2019 | 4 years fee payment window open |
Jan 12 2020 | 6 months grace period start (w surcharge) |
Jul 12 2020 | patent expiry (for year 4) |
Jul 12 2022 | 2 years to revive unintentionally abandoned end. (for year 4) |
Jul 12 2023 | 8 years fee payment window open |
Jan 12 2024 | 6 months grace period start (w surcharge) |
Jul 12 2024 | patent expiry (for year 8) |
Jul 12 2026 | 2 years to revive unintentionally abandoned end. (for year 8) |
Jul 12 2027 | 12 years fee payment window open |
Jan 12 2028 | 6 months grace period start (w surcharge) |
Jul 12 2028 | patent expiry (for year 12) |
Jul 12 2030 | 2 years to revive unintentionally abandoned end. (for year 12) |