A method and apparatus are disclosed for providing motion estimation (ME) for large-size blocks of image data during image processing using small-size block processing logic. An embodiment method includes obtaining a large-size block for ME processing and dividing the large-size block into a plurality of small-size blocks. The large-size block comprises an integer multiple of the small-size blocks. The small-size blocks are then processed in parallel using a small-size block ME processing algorithm. An embodiment apparatus includes a processor configured to implement the method for large-size block ME processing using small-size block ME processing logic, and a shared memory register for storing the 16×16 blocks at different times.
9. An apparatus for implementing motion estimation (ME) for a large-size block of image data, the apparatus comprising:
a processor configured to:
obtain a 64×64 block of bytes of image data for ME processing;
divide the 64×64 block into a plurality of 16×16 blocks of data bytes; and
process the 16×16 blocks in parallel using an ME processing algorithm for 16×16 blocks,
wherein the processor is configured to process each of the 16×16 blocks using 16 clock cycles for 16 line motion searches and process a total number of 16 of the 16×16 blocks using 256 clock cycles.
1. A method for motion estimation (ME) for a large-size block of image data, the method comprising:
obtaining a large-size block for ME processing;
dividing the large-size block into a plurality of small-size blocks, wherein the small-size blocks comprise m×m blocks of data bytes, wherein m is an integer;
processing each of the small-size blocks in parallel using a small-size block ME processing algorithm using m clock cycles for m line motion searches; and
processing a total number of m of the m×m blocks using m×m clock cycles,
wherein the large-size block comprises an integer multiple of the small-size blocks.
15. A network component for video coding, the network component comprising:
a processor configured to:
obtain a large-size block of bytes of image data for motion estimation (ME);
divide the large-size block into a plurality of small-size blocks of bytes that comprise the same data, wherein the small-size blocks comprise m×m blocks of data bytes, wherein m is an integer;
process each of the small-size blocks for ME individually and in parallel using a small-size block ME processing algorithm using m clock cycles for m line motion searches;
process a total number of m of the m×m blocks using m×m clock cycles; and
a single shared register for storing the small-size blocks at different times.
The present invention relates to a system and method for image processing and, in particular embodiments, to a system and method for motion estimation for large-size blocks.
Video coding deals with the representation of video data for storage and/or transmission, for example for digital video. Video coding can be applied to captured video as well as computer-generated video and graphics. The goals of video coding are to represent the video data accurately and compactly, to provide navigation of the video (e.g., searching forwards and backwards, random access), and to offer additional authoring and content benefits, such as text (subtitles), meta-information for searching/browsing, and digital rights management. Video data is typically processed in blocks of data bytes or bits, where multiple blocks form an image frame. Video coding can be performed by a processor on the transmitting end (also referred to as an encoder) to compress original video into a format suitable for transmission. Video coding can also be performed by a trans-coder that converts digital data from one encoding format to another. The encoder and trans-coder may include software components implemented via a processor or firmware. Video coding functions include motion estimation, which is the process of determining motion vectors that describe the transformation from one two-dimensional (2D) image to another.
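Motion estimation as described above, i.e., finding the motion vector that best maps a block of the current frame onto the reference frame, can be illustrated with a minimal full-search block-matching sketch. The sum-of-absolute-differences (SAD) cost and the function names here are illustrative assumptions for explanation, not part of any particular codec or of the disclosed scheme:

```python
def sad(block_a, block_b):
    # Sum of absolute differences between two equally sized 2-D blocks
    # given as lists of rows of integer samples.
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search_mv(ref, cur_block, top, left, search_range):
    # Exhaustively test every displacement (dy, dx) within the search range
    # and return the motion vector with the lowest SAD cost.
    n = len(cur_block)
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue  # candidate window falls outside the reference frame
            candidate = [row[x:x + n] for row in ref[y:y + n]]
            cost = sad(candidate, cur_block)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best[1], best[0]
```

A real encoder would use a bounded search window around a predicted vector and hardware-friendly cost functions, but the structure is the same: evaluate candidate displacements and keep the lowest-cost one.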
High-Efficiency Video Coding (HEVC) is a recent video coding standard being developed by the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC. The HEVC standard is incorporated herein by reference. In HEVC, the size of the processed blocks (for an image frame) is relatively large, such as 64×64 blocks of data units. The processing of large-size blocks for ME is a computationally intensive operation, which can substantially reduce computation performance and/or increase hardware or chip cost and complexity.
In one embodiment, a method for motion estimation (ME) for a large-size block of image data is disclosed. The method includes obtaining a large-size block for ME processing and dividing the large-size block into a plurality of small-size blocks. The method also includes processing the small-size blocks in parallel using a small-size block ME processing algorithm. The large-size block comprises an integer multiple of the small-size blocks. In an example, the small-size blocks are 16×16 blocks of data bytes.
In another embodiment, an apparatus for implementing ME for a large-size block of image data is disclosed. The apparatus comprises a processor configured to obtain a 64×64 block of bytes of image data for ME processing and divide the 64×64 block into a plurality of 16×16 blocks of data bytes. The processor is also configured to process the 16×16 blocks in parallel using a ME processing algorithm for 16×16 blocks.
In yet another embodiment, a network component for video coding is disclosed. The network component comprises a processor configured to obtain a large-size block of bytes of image data for motion estimation (ME), divide the large-size block into a plurality of small-size blocks of bytes that comprise the same data, and process the small-size blocks for ME individually and in parallel using a corresponding small-size block ME processing algorithm. The network component further comprises a single shared register for storing the small-size blocks at different times.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
In the recent video compression standard HEVC, large-size blocks of image data that belong to image frames, such as 64×64, 64×32, 32×64, 32×32, 32×16, and 16×32 blocks, are used in ME. The blocks comprise bytes of data and may be represented in the form of matrices. Compared to small-size blocks (e.g., 16×16 blocks or smaller), the large-size blocks require more overhead for ME, such as in the number of processor cycles (i.e., clock cycles). For example, processing a 16×16 block may take 16 cycles before starting the actual motion search calculation. Using the same ME architecture in video encoder chips, a 64×64 block typically requires 64 cycles to start the actual motion search calculation. Generally, ME is performed for a plurality of lines of the same block, for example for multiples of 16 lines. Thus, the ME overhead (in number of cycles) is proportional to both the block size and the number of lines for motion search. For instance, when there are 64 lines to be processed for a 64×64 block, the number of cycles needed for ME is equal to 64×64 or 4096 cycles.
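The proportionality stated above can be captured in a one-line model. The function name is illustrative; it simply encodes the stated rule that one line motion search of a block N samples wide costs N cycles:

```python
def line_search_overhead(block_width, num_lines):
    # ME line-search overhead in clock cycles: one line motion search of a
    # block that is `block_width` samples wide costs `block_width` cycles,
    # and `num_lines` such searches are performed per block.
    return block_width * num_lines
```

For a 16×16 block with 16 line searches this gives 256 cycles; for a 64×64 block with 64 line searches it gives the 4096 cycles cited above.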
Using a typical video processing chip and logic, such as one based on the 1080p60 HD format, each 64×64 block may have only 6,400 cycles available for overall ME computing. Thus, the actual computing time for ME, e.g., for the actual motion search calculation, is reduced significantly after 4096 of those cycles are spent on line motion searches (for 64 lines per block). The cycles that remain for the actual motion search calculation may be limited, reducing ME performance in comparison to the case of small-size blocks (e.g., 16×16 blocks). To compensate for this overhead, more complex hardware or chip logic may be used, which increases chip cost and resource (e.g., power) consumption. Thus, improving ME efficiency and simplifying chip logic for large-size blocks can significantly improve performance and reduce chip cost for video coding and processing.
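Under the 6,400-cycle budget per 64×64 block cited above, the cycles left for the actual motion search can be sketched as follows (the budget figure is this document's 1080p60 example, not a general constant):

```python
def remaining_search_cycles(cycle_budget, block_width, num_lines):
    # Cycles left for the actual motion search calculation after the
    # line-search overhead (block_width cycles per line) is spent.
    overhead = block_width * num_lines
    return cycle_budget - overhead
```

With 64 line searches over a 64-wide block, only 6400 − 4096 = 2304 cycles remain for the search proper, i.e., roughly a third of the budget.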
To decrease the time for line motion searches, lower chip cost, and improve ME performance for large-size blocks, embodiments are disclosed herein that use fewer cycles than the current approach to process large-size blocks efficiently. An embodiment method may be implemented by an apparatus, a processor (e.g., an encoder), or a network component and includes dividing a large-size block into multiple equivalent 16×16 blocks, and then processing the individual 16×16 blocks using a standard or current ME processing method for such small-size blocks. For example, a 64×64 block may be divided into 16 small-size 16×16 blocks that represent the same data, where each 16×16 block needs 16 cycles of overhead for ME. As such, the resulting total number of cycles for processing the data of the 64×64 block becomes equal to 16×64 or 1024 cycles, instead of the 64×64 or 4096 cycles required by standard large-size block ME processing. Using this method, the ME overhead in number of cycles may be reduced by a ratio of about ¾ (i.e., an overhead reduction of about 75%). The freed-up cycles may be used for the actual motion search calculation, improving ME efficiency and performance. Additionally or alternatively, the reduced overhead may reduce chip complexity and logic, cost, and power consumption.
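The division step can be sketched as a simple tiling of the 64×64 block into raster-ordered 16×16 sub-blocks; the helper name is illustrative, not part of the disclosure:

```python
def tile_block(block, tile=16):
    # Split a square block (a list of rows) into tile x tile sub-blocks,
    # left to right, top to bottom.
    n = len(block)
    assert n % tile == 0, "block size must be an integer multiple of the tile size"
    return [[row[tx:tx + tile] for row in block[ty:ty + tile]]
            for ty in range(0, n, tile)
            for tx in range(0, n, tile)]
```

A 64×64 block yields 16 tiles, and with 16 cycles of overhead per line over the block's 64 lines of data the total overhead is 16 × 64 = 1024 cycles, a 75% reduction from 4096.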
The ME processing scheme 100 typically uses 64 processor cycles to perform one line motion search for a 64×64 block. The number of lines that are considered for ME may correspond to the number of data rows of the image frame portion 110, i.e., V. Thus, the total number of cycles for line motion searches is equal to V×64 cycles. The number of data rows, V, may be a multiple of 16. For example, when V is equal to 16, the total number of cycles for line motion searches is equal to 16×64 or 1024 cycles, and when V is equal to 64, the total number of cycles is equal to 64×64 or 4096 cycles. Thus, the overhead for ME may substantially increase as the block size increases and as the number of line motion searches or V increases. Additionally, the scheme 100 uses a 64×64 8-bit register, i.e., a total of 64×64×8 or 32K bits, to store the 64×64 block data for processing. Due to the requirements above, it is more feasible to implement the scheme 100 via hardware, e.g., using a HEVC standard chip, with or without software, such as in the case of real-time processing/communications applications.
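The cycle and register figures for the scheme 100 can be checked with a short sketch; the function only encodes the numbers stated above:

```python
def scheme_100_overhead(v_lines, block_width=64):
    # Scheme 100: one line motion search over a 64-sample-wide block costs
    # 64 cycles, so V line searches cost V * 64 cycles in total.
    return v_lines * block_width

# Register needed to hold the whole 64x64 block of 8-bit samples.
scheme_100_register_bits = 64 * 64 * 8
```

With V = 16 this reproduces the 1024 cycles above, with V = 64 the 4096 cycles, and the register works out to the stated 32K bits.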
The ME processing scheme 200 may first divide the large-size block into a plurality of equivalent small-size blocks, for instance a plurality of 16×16 blocks, and then process the equivalent 16×16 blocks in parallel using a current small-size block ME scheme from existing video coding standards, which is referred to as 16×16 micro-block ME. For example, a 64×64 block may be processed by dividing the block into 16 small-size 16×16 blocks and then processing the individual 16×16 blocks in parallel, e.g., at about the same time using time division multiplexing. Each 16×16 block may be processed using an efficient existing or standard ME processing scheme for small-size 16×16 blocks. Each 16×16 block may need 16 line motion searches, where one line motion search requires 16 processor cycles for ME. Since the resulting 16 small-size 16×16 blocks are processed in parallel, the 16 line motion searches can be implemented at about the same time. As such, the total number of cycles for all the blocks is equal to 16×16 (or 256) cycles, and the ME overhead may be substantially reduced (by about 75%) in comparison to the ME processing scheme 100. The savings in overhead (i.e., in number of cycles) may be used for the actual motion search calculation to improve processing efficiency and performance. The savings in overhead may also translate into savings in chip cost and power consumption, for example while maintaining the same level of performance as the current scheme 100.
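The divide-and-dispatch structure of the scheme 200 can be sketched as follows. This is an illustrative software model, not the chip logic itself: a Python thread pool stands in for the hardware's time division multiplexing, and `me_16x16` is a stub (an assumed name) for the existing 16×16 micro-block ME routine:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tiles(block, tile=16):
    # Divide the large block (a list of rows) into tile x tile sub-blocks.
    n = len(block)
    return [[row[tx:tx + tile] for row in block[ty:ty + tile]]
            for ty in range(0, n, tile)
            for tx in range(0, n, tile)]

def me_16x16(tile):
    # Stub for the existing 16x16 micro-block ME routine; it reports the
    # tile's line-search overhead: 16 line searches x 16 cycles each.
    return 16 * 16

def scheme_200_overhead(block):
    tiles = split_into_tiles(block)
    with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
        overheads = list(pool.map(me_16x16, tiles))
    # The tiles run concurrently, so the block's overhead is one tile's
    # overhead (256 cycles), not the sum over all 16 tiles.
    return max(overheads)
```

Against the scheme 100's 1024 cycles for 16 line searches, 256 cycles is the roughly 75% reduction described above.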
Additionally, the scheme 200 may use a 16×16 8-bit register, i.e., a total of 16×16×8 (or 2K) bits, to store the 16×16 block data for processing. Since the 16 small-size 16×16 blocks are processed in parallel, e.g., via time division multiplexing, a single 2K-bit register can be shared to store all the blocks at different times. This corresponds to a ratio of 15/16 in register size savings in comparison to the scheme 100. The savings in register size or memory further reduces cost and power consumption and simplifies chip logic.
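The register sizes and the 15/16 savings follow directly from the block dimensions; this arithmetic sketch only restates the figures above:

```python
def register_bits(side, bits_per_sample=8):
    # Bits needed to hold a side x side block of 8-bit samples.
    return side * side * bits_per_sample

full_register = register_bits(64)    # scheme 100: 32,768 bits (32K)
shared_register = register_bits(16)  # scheme 200: 2,048 bits (2K), time-shared
```

The shared register is 1/16 the size of the full one, hence the 15/16 savings.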
Program code, e.g., the code implementing the algorithms disclosed above, and data can be stored in a memory 420. The memory 420 can be read-only memory (ROM), a local memory such as DRAM, or mass storage such as a hard drive, optical drive, or other storage (which may be local or remote). While the memory 420 is illustrated functionally as a single block, it is understood that one or more hardware blocks can be used to implement this function. The memory 420 may comprise the shared register that is used to process the small-size blocks in the scheme 200 and the method 300.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
Assignment: executed Oct 01, 2012, by assignor ZHOU, FENG, to assignee FUTUREWEI TECHNOLOGIES, INC.; conveyance: assignment of assignors interest (Reel 029088, Frame 0460). Oct 02, 2012: Futurewei Technologies, Inc. (assignment on the face of the patent).
Maintenance fee events: Aug 08, 2019, M1551: payment of maintenance fee, 4th year, large entity; Aug 09, 2023, M1552: payment of maintenance fee, 8th year, large entity.