An apparatus and method for performing a shuffle operation on packed data using computer-implemented steps is described. In one embodiment, a first packed data operand having at least two data elements is accessed. A second packed data operand having at least two data elements is accessed. One of the data elements in the first packed data operand is shuffled into a lower destination field of a destination register, and one of the data elements in the second packed data operand is shuffled into an upper destination field of the destination register.
23. A processor for performing a shuffle operation in response to a shuffle instruction comprising:
a decoder which decodes a single instruction specifying first and second source operands and a field of control bits; and
an execution unit which, responsive to the field of control bits, generates a resultant packed data operand comprised of packed data elements from the first and second source operands,
wherein the control bits are limited to specifying for the upper and lower halves of the resultant packed data operand, data elements from the first and second source operands, respectively.
19. A processor-implemented method for reducing the number of control bits required to shuffle packed data elements from first and second source operands, comprising the steps of:
decoding a single instruction specifying first and second source operands and a field of control bits; and
responsive to the field of control bits, generating a resultant packed data operand comprised of packed data elements from the first and second source operands,
wherein the control bits are limited to specifying for the upper and lower halves of the resultant packed data operand, data elements from the first and second source operands, respectively.
0. 36. An apparatus comprising:
a decode unit to decode a shuffle instruction into control signals, said shuffle instruction to include a first operand, a second operand, and a third operand wherein said third operand comprises of an 8-bit immediate value;
said first operand to identify a first register to hold at least two packed data elements;
said second operand to identify a memory location to hold at least two packed data elements;
said third operand is to provide selection bits to indicate which of said packed data elements in said first operand and said second operand to select and copy to a resultant register; and
an execution unit coupled to said decode unit, said execution unit responsive to said control signals and said selection bits to select a first set of data elements from said first register and to copy said first set of data elements to one or more lower destination fields of said resultant register, said execution unit further responsive to said control signals and said selection bits to select a second set of data elements from said memory location and to copy said second set of data elements to one or more upper destination fields of said resultant register.
0. 43. An apparatus comprising:
an instruction decoder to receive and decode a shuffle instruction, said shuffle instruction to include an immediate operand comprising two or more sets of control bits;
a first source register to hold a first packed data, said first packed data comprising of a first data element and a second data element;
a second source register to hold a second packed data, said second packed data comprising of a third data element and a fourth data element;
a destination register to hold a third packed data;
an execution unit coupled to said first source register to receive said first packed data, and to said second source register to receive said second packed data; and
wherein said execution unit is further coupled to said instruction decoder to receive said two or more sets of control bits, said execution unit to select from said first source register at least one of said first and second data elements in response to a first one of said two or more sets of control bits and to copy said selected data element from said first source register to a first data field in a lower half of said destination register, and said execution unit to select from said second source register at least one of said third and fourth data elements in response to a second one of said two or more sets of control bits and to copy said selected data element from said second source register to a second data field in an upper half of said destination register.
0. 1. A computer system comprising:
a hardware unit to transmit data representing graphics to another computer or a display;
a processor coupled to the hardware unit; and
a storage device coupled to the processor and having stored therein an instruction, which when executed by the processor, causes the processor to at least,
access a first packed data operand having at least two data elements;
access a second packed data operand having at least two data elements;
select a first set of data elements from the first packed data operand;
copy each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
select a second set of data elements from the second packed data operand; and
copy each of the data elements in the second set to specified data fields located in the upper half of the destination operand.
0. 2. The computer system of
0. 3. The computer system of
0. 4. A system as claimed in
0. 5. A method comprising the computer-implemented steps of:
decoding a single instruction;
in response to the step of decoding the single instruction,
accessing a first packed data operand having at least two data elements;
accessing a second packed data operand having at least two data elements;
selecting a first set of data elements from the first packed data operand;
copying each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
selecting a second set of data elements from the second packed data operand; and
copying each of the data elements in the second set to specified data fields located in the upper half of the destination operand.
0. 6. The method of
0. 7. The method of
0. 8. A method as claimed in
0. 9. A method comprising the computer implemented steps of:
accessing data representative of a first three-dimensional image;
altering the data using three-dimensional geometry to generate a second
three-dimensional image, the step of altering at least including,
accessing a first packed data operand having at least two data elements;
accessing a second packed data operand having at least two data elements;
selecting a first set of data elements from the first packed data operand;
copying each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
selecting a second set of data elements from the second packed data operand;
copying each of the data elements in the second set to specified data fields located in the upper half of the destination operand; and
displaying the second three-dimensional image.
0. 10. The method of
0. 11. The method of
0. 12. The method of
0. 13. A method as claimed in
0. 14. A method comprising the computer implemented steps of:
accessing data representative of a first three-dimensional image;
altering the data using three-dimensional geometry to generate a second three-dimensional image, the step of altering at least including,
accessing a first packed data operand having at least two data elements;
accessing a second packed data operand having at least two data elements;
selecting a first set of data elements from the first packed data operand;
copying each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
selecting a second set of data elements from the second packed data operand;
copying each of the data elements in the second set to specified data fields located in the upper half of the destination operand; and
displaying the second three-dimensional image.
0. 15. The method of
0. 16. The method of
0. 17. The method of
0. 18. A method as claimed in
20. The method as claimed in
21. The method as claimed in
22. The method as claimed in
24. The processor as claimed in
0. 25. The method as claimed in claim 19 wherein the first and second packed data source operands and the resultant packed data operand are each comprised of at least two packed data elements.
0. 26. The method as claimed in claim 19 wherein the field of control bits is an 8-bit field.
0. 27. The method as claimed in claim 26 wherein an 8-bit immediate to fill the field of control bits is decoded with the single instruction.
0. 28. The processor of claim 23 wherein said field of control bits comprises of an 8-bit immediate value.
0. 29. The processor of claim 23 wherein said field of control bits comprises of an 8-bits.
0. 30. The processor of claim 29 wherein said first and second source operands comprise of double-precision floating-point values.
0. 31. The processor of claim 29 wherein said first and second source operands comprise single-precision floating-point values.
0. 32. The processor of claim 29 wherein said packed data elements comprise of packed double words.
0. 33. The processor of claim 29 wherein said packed data elements comprise of packed words.
0. 34. The processor of claim 29 wherein said packed data elements comprise of packed bytes.
0. 35. The processor of claim 29 wherein said first and said second operands comprise of 128-bits of packed data.
0. 37. The apparatus of claim 36 wherein said data elements of said first register and said second register comprise double-precision floating-point values.
0. 38. The apparatus of claim 36 wherein said data elements of said first register and said second register comprise of single-precision floating-point values.
0. 39. The apparatus of claim 36 wherein said packed data elements comprise of packed double words.
0. 40. The apparatus of claim 36 wherein said packed data elements comprise of packed words.
0. 41. The apparatus of claim 36 wherein said packed data elements comprise of packed bytes.
0. 42. The apparatus of claim 36 wherein said first register is also said resultant register.
0. 44. The apparatus of claim 43 wherein said immediate operand is an 8-bit immediate operand.
0. 45. The apparatus of claim 43 wherein said data elements of said first source register and said second source register comprise of double-precision floating-point values.
0. 46. The apparatus of claim 43 wherein said data elements of said first source register and said second source register comprise of single-precision floating-point values.
0. 47. The apparatus of claim 43 wherein said packed data comprise of packed double words.
0. 48. The apparatus of claim 43 wherein said packed data comprise of packed words.
0. 49. The apparatus of claim 43 wherein said packed data comprise of packed bytes.
0. 50. The apparatus of claim 43 wherein said apparatus is defined by machine readable data on a machine readable medium.
0. 51. The apparatus of claim 43 wherein said first source register is also said destination register.
0. 52. The apparatus of claim 43 wherein said first source register is the same as said second source register.
0. 53. The apparatus of claim 43 wherein said two or more sets of control bits comprise bits 0 and 1 of the immediate operand.
0. 54. The apparatus of claim 44 wherein said 8-bit immediate operand comprises bits 0 and 1 to select from said first source register which data element is copied into the lowest data field in the lower half of the destination register, and bits 4 and 5 to select from said second source register which data element is copied into the lowest data field in the upper half of the destination register.
0. 55. The apparatus of claim 44 wherein said 8-bit immediate operand comprises bits 0 through 3 to select from said first source register which data elements are copied into the lower half of the destination register, and bits 4 through 7 to select from said second source register which data elements are copied into the upper half of the destination register.
0. 56. The apparatus of claim 55 wherein said 8-bit immediate operand comprises bits 2 and 3 to select from said first source register which data element is copied into the highest data field in the lower half of the destination register, and bits 6 and 7 to select from said second source register which data element is copied into the highest data field in the upper half of the destination register.
|X3|X2|X1|X0|
The process S500 then proceeds to process step S520, where numbers Y0, Y1, Y2 and Y3 are stored as data elements in a packed data item 525. For present discussion purposes, each data element is 16 bits wide and is contained in register X1, in the following order:
|Y3|Y2|Y1|Y0|
The process S500 then advances to process step S530, where a shuffle instruction is performed on the contents of register X0 (data item 515) and register X1 (data item 525) to shuffle any one of the four data elements from the first data item 515 to the lower two fields of a destination register 535, and to shuffle any one of the four data elements from the second data item 525 to the upper two fields of the destination register 535. The resulting data item 535 is as follows:
|{Y3, Y2, Y1, Y0}|{Y3, Y2, Y1, Y0}|{X3, X2, X1, X0}|{X3, X2, X1, X0}|
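Process step S530 can be illustrated with a minimal sketch, which is not the patented implementation. It assumes four 16-bit data elements packed into a 64-bit integer with element 0 in the least significant bits. The helper names `pack16`, `unpack16` and `shuffle_packed`, the sample element values, and the chosen selector tuple are hypothetical and do not appear in the description above.

```python
MASK16 = 0xFFFF  # each data element is assumed to be 16 bits wide


def pack16(elements):
    """Pack four 16-bit values (lowest element first) into one integer."""
    value = 0
    for i, element in enumerate(elements):
        value |= (element & MASK16) << (16 * i)
    return value


def unpack16(value):
    """Split a packed integer into four 16-bit values (lowest element first)."""
    return [(value >> (16 * i)) & MASK16 for i in range(4)]


def shuffle_packed(x0, x1, selectors):
    """Shuffle two packed data items into one packed result.

    selectors holds four indices in the range 0..3, lowest destination
    field first. The lower two destination fields take elements from x0
    and the upper two destination fields take elements from x1.
    """
    a = unpack16(x0)
    b = unpack16(x1)
    fields = [a[selectors[0]], a[selectors[1]],
              b[selectors[2]], b[selectors[3]]]
    return pack16(fields)


# Hypothetical sample values standing in for X0..X3 and Y0..Y3.
x0 = pack16([0x1000, 0x1001, 0x1002, 0x1003])  # register X0, data item 515
x1 = pack16([0x2000, 0x2001, 0x2002, 0x2003])  # register X1, data item 525

# Select X3 and X0 for the lower half, Y1 and Y2 for the upper half.
result = shuffle_packed(x0, x1, (3, 0, 1, 2))
print([hex(e) for e in unpack16(result)])
```

Under these assumptions the destination holds, from highest field to lowest, |Y2|Y1|X0|X3|, which is one of the 256 combinations permitted by the result layout shown above.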
Accordingly, a shuffle operation is performed. Although
An 8-bit immediate value is used as a control word to indicate how data elements should be shuffled. Bits 0,1 of the control word indicate which of the four data elements in the first operand are shuffled into the first or lowest data element of the destination register. Bits 2,3 of the control word indicate which of the four data elements in the first operand are shuffled into the second data element of the destination register. Bits 4,5 of the control word indicate which of the four data elements in the second operand are shuffled into the third data element of the destination register. Bits 6,7 of the control word indicate which of the four data elements in the second operand are shuffled into the fourth data element of the destination register. For example, given a first data operand with four data elements contained in the following order:
|D|C|B|A|
and also given a second data operand with four data elements contained in the following order:
|H|G|F|E|
and also given a shuffle control word of 10001111, the result of the shuffle is as follows:
|G|E|D|D|
It will be recognized by one of ordinary skill in the art that the size of the shuffle control word may vary, without loss of compatibility with the present invention, depending on the number of data elements in the source data operand and the number of fields in the destination register.
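The decoding of the 8-bit control word in the example above can be sketched as follows. This is an illustrative model only. The function name `shuffle_imm8` is hypothetical, and the operands are represented as Python lists with the lowest data element first, so the operand written |D|C|B|A| becomes `["A", "B", "C", "D"]`.

```python
def shuffle_imm8(first, second, imm8):
    """Model the shuffle for four-element operands and an 8-bit control word.

    Bits 0,1 and bits 2,3 select elements of the first operand for the
    first and second destination elements. Bits 4,5 and bits 6,7 select
    elements of the second operand for the third and fourth destination
    elements. All sequences are ordered lowest element first.
    """
    selectors = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    return [first[selectors[0]], first[selectors[1]],
            second[selectors[2]], second[selectors[3]]]


first = ["A", "B", "C", "D"]   # written |D|C|B|A| in the text
second = ["E", "F", "G", "H"]  # written |H|G|F|E| in the text

result = shuffle_imm8(first, second, 0b10001111)

# Print highest element first to match the notation used in the text.
print("|" + "|".join(reversed(result)) + "|")  # prints |G|E|D|D|
```

Reading the control word 10001111 from its low bits upward gives the selectors 3, 3, 0 and 2, which pick D, D, E and G, in agreement with the result |G|E|D|D| stated above.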
Accordingly, a shuffle operation is performed. Although
The shuffle instruction of the present invention may be used as part of many different applications. For example,
In one embodiment, the computer system 100 shown in
In this embodiment, the digital filter unit 718 is implemented using the processor 105 and the software 136 to perform a digital filter. In this embodiment, the processor 105, executing the software 136, performs the digital filter using shuffle operations, and stores the filtered data 718 in storage device 110. In this manner, the digital filter is performed by the host processor of the computer system, rather than the TV broadcast signal receiver 131. As a result, the complexity of the TV broadcast signal receiver 131 is reduced. In this embodiment, the video decoder 721 may be implemented in any number of different combinations of hardware, software, and/or firmware. The audio and video data 724 can then be stored, and/or displayed on the display 125 and the sound unit 134, respectively.
In one embodiment, the computer system 100 shown in
While several example uses of shuffle operations have been described, it will be understood by one of ordinary skill in the art that the invention is not limited to these uses. In addition, while the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The method and apparatus of the invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting on the invention.
Roussel, Patrice, Chennupaty, Srinivas, Coke, James, Kong, Katherine, Cranford, Micheal D., Abdallah, Mohammed A.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Mar 21 2002 | Intel Corporation | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Feb 11 2015 | ASPN: Payor Number Assigned. |
Date | Maintenance Schedule |
Apr 07 2018 | 4 years fee payment window open |
Oct 07 2018 | 6 months grace period start (w surcharge) |
Apr 07 2019 | patent expiry (for year 4) |
Apr 07 2021 | 2 years to revive unintentionally abandoned end. (for year 4) |
Apr 07 2022 | 8 years fee payment window open |
Oct 07 2022 | 6 months grace period start (w surcharge) |
Apr 07 2023 | patent expiry (for year 8) |
Apr 07 2025 | 2 years to revive unintentionally abandoned end. (for year 8) |
Apr 07 2026 | 12 years fee payment window open |
Oct 07 2026 | 6 months grace period start (w surcharge) |
Apr 07 2027 | patent expiry (for year 12) |
Apr 07 2029 | 2 years to revive unintentionally abandoned end. (for year 12) |