An end-pointer determines a beginning and an end of a speech segment. The end-pointer includes a voice triggering module that identifies a portion of an audio stream that has an audio speech segment. A rule module communicates with the voice triggering module. The rule module includes a plurality of rules used to analyze a part of the audio stream to detect a beginning and an end of the audio speech segment. A consonant detector detects occurrences of a high frequency consonant in the portion of the audio stream.
34. A system that determines a beginning and an end of an audio speech segment in an audio stream, comprising:
an /s/ detector that converts a difference between a signal-to-noise ratio in a high frequency band of the audio stream and a signal-to-noise ratio in a low frequency band of the audio stream into a probability value that predicts a likelihood of an /s/ sound in the audio stream; and
an end-pointer comprising a processor that varies an amount of an audio input sent to a recognition device based on a plurality of rules and an output of the /s/ detector;
where the end-pointer identifies a beginning of the audio input or an end of the audio input based on the output of the /s/ detector, and where the beginning of the audio input and the end of the audio input represent boundaries between speech and non-speech portions of the audio stream.
27. A system that identifies a beginning and an end of a speech segment comprising:
an end-pointer comprising a processor that analyzes a dynamic aspect of an audio stream to determine the beginning and the end of the speech segment; and
a high frequency consonant detector that marks the end of the speech segment, where the high frequency consonant detector calculates a difference between a signal-to-noise ratio in a high frequency band of the audio stream and a signal-to-noise ratio in a low frequency band of the audio stream, and where the high frequency consonant detector converts the difference between the signal-to-noise ratio in the high frequency band and the signal-to-noise ratio in the low frequency band into a probability value that predicts a likelihood that a high frequency consonant exists in a frame of the audio stream;
where the beginning of the speech segment and the end of the speech segment represent boundaries between speech and non-speech portions of the audio stream, and where the end-pointer identifies the beginning of the audio speech segment or the end of the audio speech segment based on an output of the high frequency consonant detector.
1. An end-pointer that determines a beginning and an end of a speech segment comprising:
a voice triggering module that identifies a portion of an audio stream comprising an audio speech segment;
a rule module in communication with the voice triggering module, the rule module comprising a plurality of rules used by a processor to analyze a part of the audio stream to detect a beginning and an end of the audio speech segment; and
a consonant detector that calculates a difference between a signal-to-noise ratio in a high frequency band and a signal-to-noise ratio in a low frequency band, where the consonant detector converts the difference between the signal-to-noise ratio in the high frequency band and the signal-to-noise ratio in the low frequency band into a probability value that predicts a likelihood of a high frequency consonant in the portion of the audio stream;
where the beginning of the audio speech segment and the end of the audio speech segment represent boundaries between speech and non-speech portions of the audio stream, and where the rule module identifies the beginning of the audio speech segment or the end of the audio speech segment based on an output of the consonant detector.
16. A method that identifies a beginning and an end of a speech segment using an end-pointer comprising:
receiving a portion of an audio stream;
determining whether the portion of the audio stream includes a triggering characteristic;
calculating a difference between a signal-to-noise ratio in a high frequency band of the portion of the audio stream and a signal-to-noise ratio in a low frequency band of the portion of the audio stream;
converting, by a consonant detector implemented in hardware or embodied in a computer-readable storage medium, the difference between the signal-to-noise ratio in the high frequency band and the signal-to-noise ratio in the low frequency band into a probability value that predicts a likelihood of a high frequency consonant in the portion of the audio stream; and
applying a rule that passes only a portion of the audio stream to a device when the triggering characteristic identifies a beginning of a voiced segment and an end of a voiced segment;
where the identification of the end of the voiced segment is based on an output of the consonant detector, where the end of the voiced segment represents a boundary between speech and non-speech portions of the audio stream.
36. A non-transitory computer readable medium that stores software that determines at least one of a beginning and end of an audio speech segment comprising:
a detector that converts sound waves into operational signals;
a triggering logic that analyzes a periodicity of the operational signals;
a signal analysis logic that analyzes a variable portion of the sound waves that are associated with the audio speech segment to determine a beginning and end of the audio speech segment, and
a consonant detector that calculates a difference between a signal-to-noise ratio in a high frequency band and a signal-to-noise ratio in a low frequency band, where the consonant detector converts the difference between the signal-to-noise ratio in the high frequency band and the signal-to-noise ratio in the low frequency band into a probability value that predicts a likelihood of an /s/ sound in the sound waves, where the consonant detector provides an input to the signal analysis logic when the /s/ is detected;
where the beginning of the audio speech segment and the end of the audio speech segment represent boundaries between speech and non-speech portions of the sound waves, and where the signal analysis module identifies the beginning of the audio speech segment or the end of the audio speech segment based on an output of the consonant detector.
5. The end-pointer of
6. The end-pointer of
7. The end-pointer of
8. The end-pointer of
10. The end-pointer of
11. The end-pointer of
12. The end-pointer of
13. The end-pointer of
14. The end-pointer of
where the consonant detector adds to the current probability value a portion of the difference between the current probability value and the probability value associated with the previous frame, upon determination that the current probability value is less than the probability value associated with the previous frame, where the consonant detector generates the portion of the difference by multiplying the difference by a percentage; and
where the smoothing factor is different than the percentage.
15. The end-pointer of
17. The method of
18. The method of
21. The method of
23. The method of
24. The method of
25. The method of
26. The method of
28. The system of
30. The system of
31. The system of
33. The system of
35. The system of
37. The non-transitory computer readable medium of
38. The non-transitory computer readable medium of
39. The non-transitory computer readable medium of
40. The non-transitory computer readable medium of
41. The non-transitory computer readable medium of
42. The non-transitory computer readable medium of
43. The non-transitory computer readable medium of
This application is a continuation-in-part of U.S. application Ser. No. 11/152,922 filed Jun. 15, 2005. The entire content of the application is incorporated herein by reference, except that in the event of any inconsistent disclosure from the present application, the disclosure herein shall be deemed to prevail.
1. Technical Field
These inventions relate to automatic speech recognition, and more particularly, to systems that distinguish speech from non-speech.
2. Related Art
Automatic speech recognition (ASR) systems convert recorded voice into commands that may be used to carry out tasks. Command recognition may be challenging in high-noise environments such as automobiles. One technique attempts to improve ASR performance by submitting only relevant data to an ASR system. Unfortunately, some techniques fail in non-stationary noise environments, where transient noises such as clicks, bumps, pops, and coughs trigger recognition errors. Therefore, a need exists for a system that identifies speech in noisy conditions.
An end-pointer determines a beginning and an end of a speech segment. The end-pointer includes a voice triggering module that identifies a portion of an audio stream that has an audio speech segment. A rule module communicates with the voice triggering module. The rule module includes a plurality of rules used to analyze a part of the audio stream to detect a beginning and end of an audio speech segment. A consonant detector detects occurrences of a high frequency consonant in the portion of the audio stream.
Other systems, methods, features and advantages of the invention will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the following claims.
The inventions can be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like referenced numerals designate corresponding parts throughout the different views.
ASR systems are tasked with recognizing spoken commands. These tasks may be facilitated by sending voice segments to an ASR engine. A voice segment may be identified through end-pointing logic. Some end-pointing logic applies rules that identify the duration of consonants and pauses before and/or after a vowel. The rules may monitor a maximum duration of non-voiced energy, a maximum duration of continuous silence before a vowel, a maximum duration of continuous silence after a vowel, a maximum time before a vowel, a maximum time after a vowel, a maximum number of isolated non-voiced energy events before a vowel, and/or a maximum number of isolated non-voiced energy events after a vowel. When a vowel is detected, the end-pointing logic may follow a signal-to-noise (SNR) contour forward and backward in time. The limits of the end-pointing logic may occur when the amplitude reaches a predetermined level which may be zero or near zero. While searching, the logic identifies voiced and unvoiced intervals to be processed by an ASR engine.
Some end-pointers examine one or more characteristics of an audio stream for a triggering characteristic. A triggering characteristic may identify a speech interval that includes voiced or unvoiced segments. Voiced segments may have a near periodic structure in the time-domain like vowels. Non-voiced segments may have a noise-like structure (nonperiodic) in the time domain like a fricative. The end-pointers analyze one or more dynamic aspects of an audio stream. The dynamic aspects may include: (1) characteristics that reflect a speaker's pace (e.g., rate of speech), pitch, etc.; (2) a speaker's expected response (such as a “yes” or “no” response); and/or (3) environmental characteristics, such as a background noise level, echo, etc.
The local or remote memory 106 may buffer audio data received before or during an end-pointing process. The processor 104 may communicate through an input/output (I/O) interface 110 that receives input from devices that convert sound waves into electrical, optical, or operational signals 114. The I/O 110 may transmit these signals to devices 112 that convert signals into sound. The controller 104 and/or processor 104 may execute the software or code that implements each of the processes described herein, including those shown in the figures.
Initially, the process designates some or all of the initial frames as not speech at 304. When energy is detected, voicing analysis of the current frame, designated frame n, occurs at 306. The voicing analysis described in U.S. Ser. No. 11/131,150, filed May 17, 2005, which is incorporated herein by reference, may be used. The voicing analysis monitors triggering characteristics that may be present in frame n. The voicing analysis may detect higher frequency consonants such as an “s” or “x” in frame n. Alternatively, the voicing analysis may detect vowels. To explain the process further, a vowel triggering characteristic is described below.
Voicing analysis detects vowels in individual frames of the audio stream.
When the voicing analysis detects a vowel in frame n, frame n is marked as speech at 310. The system then processes one or more previous frames. A previous frame may be an immediately preceding frame, frame n−1, at 312. The system may determine whether the previous frame was previously marked as speech at 314. If the previous frame was marked as speech (e.g., answer of “Yes” to block 314), the system analyzes a new audio frame at 304. If the previous frame was not marked as speech (e.g., answer of “No” to 314), the process applies one or more rules to determine whether the frame should be marked as speech.
Decision block 316, “Outside EndPoint,” applies one or more rules to determine whether the frame should be marked as speech. The rules may be applied to any part of the audio segment, such as a frame or a group of frames. The rules may determine whether the current frame or frames contain speech. If speech is detected, the frame is designated within an end-point. If not, the frame is designated outside of the end-point.
If frame n−1 is outside of the end-point (e.g., no speech is present), a new audio frame, frame n+1, may be processed. It may be initially designated as non-speech at block 304. If the decision at 316 indicates that frame n−1 is within the end-point (e.g., speech is present), then frame n−1 is designated or marked as speech at 318. The previous audio stream is then analyzed until the last frame is read from a local or remote memory at 320.
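A brief sketch of this frame-marking flow is shown below. It is an illustrative assumption rather than the patented implementation: detect_vowel and outside_endpoint are hypothetical helpers standing in for the voicing analysis and the rule-based decision, and the block numbers in the comments refer to the process steps described above.

```python
def mark_speech_frames(frames, detect_vowel, outside_endpoint):
    """Illustrative end-pointing loop: frames default to non-speech; a detected
    vowel marks the frame as speech and earlier frames are then re-examined."""
    is_speech = [False] * len(frames)           # block 304: default to non-speech
    for n, frame in enumerate(frames):
        if not detect_vowel(frame):             # block 306: voicing analysis
            continue
        is_speech[n] = True                     # block 310: mark frame n as speech
        k = n - 1                               # block 312: look at frame n-1
        while k >= 0 and not is_speech[k]:      # block 314: stop at frames already marked
            if outside_endpoint(frames[k]):     # block 316: apply the rules
                break                           # outside the end-point: no speech present
            is_speech[k] = True                 # block 318: mark as speech
            k -= 1                              # block 320: continue with earlier frames
    return is_speech
```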
The rules may examine transitions into energy events from periods of silence or from periods of silence into energy events. A rule may analyze the number of transitions before a vowel is detected; for example, a rule may determine that speech includes no more than one transition between an unvoiced event or silence and a vowel. Other rules may analyze the number of transitions after a vowel is detected; for example, a rule may determine that speech includes no more than two transitions from an unvoiced event or silence after a vowel is detected.
One or more rules may be based on the occurrence of one or multiple events (e.g. voiced energy, un-voiced energy, an absence/presence of silence, etc.). A rule may analyze the time preceding an event. Some rules may be triggered by the lapse of time before a vowel is detected. A rule may expect a vowel to occur within a variable range such as about a 300 ms to 400 ms interval or a rule may expect a vowel to be detected within a predetermined time period (e.g., about 350 ms in some processes). Some rules determine a portion of speech intervals based on the time following an event. When a vowel is detected a rule may extend a speech interval by a fixed or variable length. In some processes the time period may comprise a range (e.g., about 400 ms to 800 ms in some processes) or a predetermined time limit (e.g., about 600 ms in some processes).
Some rules may examine the duration of an event. The rules may examine the duration of a detected energy (e.g., voiced or unvoiced) or the lack of energy. A rule may analyze the duration of continuous unvoiced energy. A rule may establish that continuous unvoiced energy may occur within a variable range (e.g., about 150 ms to about 300 ms in some processes), or may occur within a predetermined limit (e.g., about 200 ms in some processes). A rule may analyze the duration of continuous silence before a vowel is detected. A rule may establish that speech may include a period of continuous silence before a vowel is detected within a variable range (e.g., about 50 ms to about 80 ms in some processes) or at a predetermined limit (e.g., about 70 ms in some processes). A rule may analyze the time duration of continuous silence after a vowel is detected. Such a rule may establish that speech may include a duration of continuous silence after a vowel is detected within a variable range (e.g., about 200 ms to about 300 ms in some processes) or a rule may establish that silence occurs across a predetermined time limit (e.g., about 250 ms in some processes).
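As a rough illustration, these duration rules could be grouped into a set of configurable thresholds. The sketch below is only an assumption about how such parameters might be organized; the field names are invented here, and the default values are taken from the example ranges above rather than from any stated implementation.

```python
from dataclasses import dataclass

@dataclass
class EndpointRules:
    """Illustrative duration thresholds for the end-pointing rules (milliseconds)."""
    max_time_before_vowel_ms: int = 350      # vowel expected within ~300-400 ms
    extend_after_vowel_ms: int = 600         # extend interval ~400-800 ms after a vowel
    max_unvoiced_energy_ms: int = 200        # continuous unvoiced energy (~150-300 ms)
    max_silence_before_vowel_ms: int = 70    # continuous silence before a vowel (~50-80 ms)
    max_silence_after_vowel_ms: int = 250    # continuous silence after a vowel (~200-300 ms)
    max_transitions_before_vowel: int = 1    # unvoiced/silence-to-vowel transitions allowed
    max_transitions_after_vowel: int = 2     # transitions allowed after a vowel
```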
At 402, the process determines if a frame or group of frames has an energy level above a background noise level. A frame or group of frames having more energy than a background noise level may be analyzed based on its duration or its relationship to an event. If the frame or group of frames does not have more energy than a background noise level, then the frame or group of frames may be analyzed based on its duration or relationship to one or more events. In some systems the events may comprise a transition into energy events from periods of silence or a transition from periods of silence into energy events.
When energy is present in the frame or a group of frames, an “energy” counter is incremented at block 404. The “energy” counter tracks time intervals. It may be incremented by a frame length. If the frame size is about 32 ms, then block 404 may increment the “energy” counter by about 32 ms. At 406, the “energy” counter is compared to a threshold. The threshold may correspond to the continuous unvoiced energy rule, which may be used to determine the presence and/or absence of speech. If decision 406 determines that the threshold was exceeded, then the frame or group of frames is designated outside the end-point (e.g., no speech is present) at 408, at which point the system jumps back to block 304.
If the time threshold is not exceeded by the “energy” counter at 406, then the process determines if the “noenergy” counter exceeds an isolation threshold at 410. The “noenergy” counter 418 may track time and is incremented by the frame length when a frame or group of frames does not possess energy above a noise level. The isolation threshold may comprise a threshold of time between two plosive events. A plosive is a speech sound produced by a closure of the oral cavity and a subsequent release accompanied by a burst of air. Plosives may include the sounds /p/ in “pit” or /d/ in “dog.” An isolation threshold may vary within a range (e.g., about 10 ms to about 50 ms) or may be a predetermined value such as about 25 ms. If the isolation threshold is exceeded, an isolated unvoiced energy event (e.g., a plosive followed by silence) was identified, and the “isolatedevents” counter 412 is incremented. The “isolatedevents” counter 412 is incremented in integer values. After incrementing the “isolatedevents” counter 412, the “noenergy” counter 418 is reset at block 414. The “noenergy” counter may be reset due to the energy found within the frame or group of frames analyzed. If the “noenergy” counter 418 does not exceed the isolation threshold, the “noenergy” counter 418 is reset at block 414 without incrementing the “isolatedevents” counter 412. The “noenergy” counter 418 is reset because energy was found within the frame or group of frames analyzed. When the “noenergy” counter 418 is reset, the outside end-point analysis designates the frame or group of frames analyzed within the end-point (e.g., speech is present) by returning a “NO” value at 416. As a result, the system marks the analyzed frame(s) as speech at 318 or 322.
Alternatively, if the process determines that there is no energy above the noise level at 402, then the frame or group of frames analyzed contains silence or background noise. In this condition, the “noenergy” counter 418 is incremented. At 420, the process determines if the value of the “noenergy” counter exceeds a predetermined time threshold. The predetermined time threshold may correspond to the continuous non-voiced energy rule threshold, which may be used to determine the presence and/or absence of speech. At 420, the process evaluates the duration of continuous silence. If the process determines that the threshold is exceeded by the value of the “noenergy” counter at 420, then the frame or group of frames is designated outside the end-point (e.g., no speech is present) at block 408. The process then proceeds to block 304.
If no time threshold is exceeded by the value of the “noenergy” counter 418, then the process determines if the maximum number of allowed isolated events has occurred at 422. The maximum number of allowed isolated events is a configurable or programmed parameter. If a grammar is expected (e.g., a “yes” or “no” answer), the maximum number of allowed isolated events may be programmed to “tighten” the end-pointer's interval or band. If the maximum number of allowed isolated events is exceeded, then the frame or frames analyzed are designated as being outside the end-point (e.g., no speech is present) at block 408. The system then jumps back to block 304, where a new frame, frame n+1, is processed and marked as non-speech.
If the maximum number of allowed isolated events is not reached, the “energy” counter 404 is reset at block 424. The “energy” counter 404 may be reset when a frame with no energy is identified. When the “energy” counter 404 is reset, the outside end-point analysis designates the frame or frames analyzed inside the end-point (e.g., speech is present) by returning a “NO” value at block 416. The process then marks the analyzed frame as speech at 318 or 322.
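A compact sketch of the counter logic described above follows. It is a simplified assumption of how the “energy,” “noenergy,” and “isolatedevents” counters might interact; the thresholds mirror the example values given earlier, and the block numbers in the comments refer to the steps above.

```python
class OutsideEndpointCheck:
    """Illustrative 'Outside EndPoint' decision based on three counters."""

    def __init__(self, frame_ms=32, max_energy_ms=200, max_noenergy_ms=250,
                 isolation_ms=25, max_isolated_events=2):
        self.frame_ms = frame_ms
        self.max_energy_ms = max_energy_ms        # continuous unvoiced energy rule
        self.max_noenergy_ms = max_noenergy_ms    # continuous silence rule
        self.isolation_ms = isolation_ms          # gap marking an isolated plosive event
        self.max_isolated_events = max_isolated_events
        self.energy = 0            # "energy" counter, in ms
        self.noenergy = 0          # "noenergy" counter, in ms
        self.isolated_events = 0   # "isolatedevents" counter

    def is_outside(self, has_energy):
        """Return True when the frame is outside the end-point (no speech present)."""
        if has_energy:                                  # block 402: energy above noise level
            self.energy += self.frame_ms                # block 404: advance the energy counter
            if self.energy > self.max_energy_ms:        # block 406: unvoiced energy too long
                return True                             # block 408: outside the end-point
            if self.noenergy > self.isolation_ms:       # block 410: isolated unvoiced event found
                self.isolated_events += 1               # block 412
            self.noenergy = 0                           # block 414: energy found, reset counter
            return False                                # block 416: inside the end-point
        self.noenergy += self.frame_ms                  # silence or background noise only
        if self.noenergy > self.max_noenergy_ms:        # block 420: continuous silence too long
            return True
        if self.isolated_events > self.max_isolated_events:  # block 422: allowed events exceeded
            return True
        self.energy = 0                                 # block 424: reset the energy counter
        return False
```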
Block 512 illustrates how the end-pointer may respond to an input audio stream.
Some end-pointers determine the beginning and/or end of a speech segment by analyzing a dynamic aspect of an audio stream.
The global and local initializations may occur at various times throughout system operation. The background noise estimations (local aspect initialization) may occur during nonspeech intervals or when certain events occur such as when the system is powered up. The pace of a speaker's speech or pitch (global initialization) and monitoring of certain responses (local aspect initialization) may be initialized less frequently. Initialization may occur when an ASR engine communicates to an end-pointer or at other times.
During initialization periods 1002 and 1004, the end-pointer may operate at programmable default thresholds. If a threshold or timer needs to be changed, the system may dynamically change the thresholds or timing values. In some systems, thresholds, times, and other variables may be loaded into an end-pointer by reading specific or general user profiles from the system's local memory or a remote memory. These values and settings may also be changed in real time or near real time. If the system determines that a user speaks at a fast pace, the duration of certain rules may be changed and retained within the local or remote profiles. If the system uses a training mode, these parameters may also be programmed or set during a training session.
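One way to picture this profile-driven configuration is a simple lookup that overlays per-user values on the programmable defaults. The structure, function name, and keys below are illustrative assumptions only, not part of the described system.

```python
def load_endpointer_settings(profiles, user_id, defaults):
    """Illustrative sketch: start from default thresholds and overlay any stored
    per-user values (e.g., a faster speaking pace may shorten duration rules)."""
    settings = dict(defaults)                    # programmable default thresholds
    settings.update(profiles.get(user_id, {}))   # user-specific overrides, if any
    return settings

# Hypothetical usage with assumed keys and values:
defaults = {"max_silence_after_vowel_ms": 250, "max_unvoiced_energy_ms": 200}
profiles = {"fast_talker": {"max_silence_after_vowel_ms": 180}}
print(load_endpointer_settings(profiles, "fast_talker", defaults))
```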
The operation of some dynamic end-pointer processes may be similar to the processes described above.
An alternative end-pointer system includes a high frequency consonant detector or s-detector that detects high-frequency consonants. The high frequency consonant detector calculates the likelihood of a high-frequency consonant by comparing a temporally smoothed SNR in a high-frequency band to an SNR in one or more low frequency bands. Some systems select the low frequency bands from a predetermined plurality of lower frequency bands (e.g., two, three, four, five, etc. of the lower frequency bands). The difference between these SNR measurements is converted into a temporally smoothed probability through probability logic that generates a ratio between about zero and one hundred that predicts the likelihood of a consonant.
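The conversion from an SNR difference to a smoothed likelihood could look roughly like the sketch below. The logistic mapping, its constants, and the smoothing factor are assumptions; the description states only that the high-band versus low-band SNR difference is converted into a temporally smoothed value between about zero and one hundred.

```python
import math

def s_likelihood(snr_high_db, snr_low_db, prev_value=0.0,
                 smoothing=0.9, midpoint_db=6.0, slope=0.5):
    """Illustrative /s/ detector output: maps the high-band minus low-band SNR
    difference (in dB) to a 0-100 value and smooths it over successive frames."""
    diff_db = snr_high_db - snr_low_db
    raw = 100.0 / (1.0 + math.exp(-slope * (diff_db - midpoint_db)))  # 0..100 mapping
    return smoothing * prev_value + (1.0 - smoothing) * raw           # temporal smoothing
```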
One process may adjust the voice thresholds based on the detection of unvoiced speech, plosives, or a consonant such as an /s/.
In some processes the programmed number of audio frames comprises the difference between the originally stored frame number and the current frame number. In an alternative process, the programmed frame number comprises the number of frames occurring within a predetermined time period (e.g., a very short period, such as about 100 ms). In these processes the voice threshold is raised back to the previously stored voice threshold across that time period. In an alternative process, a counter tracks the number of frames processed, and the voice threshold is raised across a count of successive frames.
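A minimal sketch of this threshold recovery, assuming the threshold was lowered when a consonant was detected and assuming a linear ramp (the ramp shape is not specified above), might look like the following.

```python
def ramp_voice_threshold(stored_threshold, lowered_threshold,
                         frames_elapsed, ramp_frames):
    """Illustrative recovery of the voice threshold: after a high frequency
    consonant lowers it, raise it back to the previously stored value over a
    programmed number of frames."""
    if frames_elapsed >= ramp_frames:
        return stored_threshold                  # recovery complete
    step = (stored_threshold - lowered_threshold) / ramp_frames
    return lowered_threshold + step * frames_elapsed
```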
The methods described above may be encoded in software stored on a computer-readable medium.
A “computer-readable medium,” “machine-readable medium,” “propagated-signal” medium, and/or “signal-bearing medium” may comprise any means that contains, stores, communicates, propagates, or transports software for use by or in connection with an instruction executable system, apparatus, or device. The machine-readable medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. A non-exhaustive list of examples of a machine-readable medium would include: an electrical connection “electronic” having one or more wires, a portable magnetic or optical disk, a volatile memory such as a Random Access Memory “RAM” (electronic), a Read-Only Memory “ROM” (electronic), an Erasable Programmable Read-Only Memory (EPROM or Flash memory) (electronic), or an optical fiber (optical). A machine-readable medium may also include a tangible medium upon which software is printed, as the software may be electronically stored as an image or in another format (e.g., through an optical scan), then compiled, and/or interpreted or otherwise processed. The processed medium may then be stored in a computer and/or machine memory.
While various embodiments of the inventions have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the inventions. Accordingly, the inventions are not to be restricted except in light of the attached claims and their equivalents.
Hetherington, Phillip A., Fallat, Mark