In one embodiment, a processor includes a multiply-accumulate (mac) unit having a first path to handle execution of an instruction if a difference between at least a portion of first and second operands and a third operand is less than a threshold value, and a second path to handle the instruction execution if the difference is greater than the threshold value. Based on the difference, at least part of the third operand is to be provided to a multiplier of the mac unit or to a compressor of the second path. Other embodiments are described and claimed.
|
16. A method comprising:
receiving first, second, and third operands in a multiply accumulate (mac) unit;
determining a difference based on exponents of the first, second, and third operands;
providing at least a portion of the third operand to a multiplier datapath of the mac unit for accumulation with intermediate results of a multiplication operation on at least a portion of the first and second operands if the difference is within a threshold range; and
otherwise providing the third operand portion to a compressor for accumulation with a product output by the multiplier datapath.
8. A processor comprising:
a front end unit to fetch and decode a multiply-accumulate instruction having first, second and third operands associated therewith;
a renamer coupled to the front end unit to allocate at least one of the first, second and third operands to a register of a register file;
a multiply-accumulate (mac) unit coupled to the renamer and having a first path including a multiplier, the first path to handle execution of the multiply-accumulate instruction if a difference between at least a portion of the first and second operands and the third operand is less than a threshold value, and a second path including a compressor, the second path to handle execution of the multiply-accumulate instruction if the difference is greater than the threshold value, wherein a detector is to receive the difference and to cause a portion of the third operand to be provided to the multiplier of the first path if the difference is less than the threshold value, and otherwise the third operand is to be provided to the compressor of the second path.
1. An apparatus comprising:
a multiply accumulate (mac) unit to perform a multiply and accumulate operation on first, second and third operands, the mac unit including:
an exponent compute datapath to determine a difference based on an exponent portion of the first, second, and third operands, the exponent compute datapath having a first compressor to receive the first, second and third operands and to output a first difference having a first portion and a second portion, an adder coupled to the first compressor to generate the difference using the first and second portions of the first difference, a first shifter to shift a mantissa of the third operand by a first amount if the difference is within a threshold range, a second shifter to shift the third operand mantissa by a second amount, and a third shifter to shift the third operand mantissa by a third amount, if the difference is outside the threshold range;
a multiplier including a multiplication tree, wherein the exponent compute datapath is to provide the shifted third operand mantissa to the multiplier if the difference is within the threshold range;
a second compressor to compress an output of the multiplier and the shifted third operand mantissa if the difference is outside the threshold range, wherein the multiplier output is not to be compressed in the second compressor if the difference is within the threshold range;
a first normalizer to normalize the multiplier output if the difference is within the threshold range;
a second normalizer to normalize the second compressor output if the difference is outside the threshold range;
a computation unit to receive the first and second normalizer outputs and to generate a final value for the multiply and accumulate operation therefrom.
2. The apparatus of
3. The apparatus of
4. The apparatus of
6. The apparatus of
7. The apparatus of
9. The processor of
10. The processor of
11. The processor of
12. The processor of
13. The processor of
15. The processor of
17. The method of
18. The method of
19. The method of
20. The method of
|
Modern processors include various circuitry for performing operations on data. Typically, a processor is designed according to a given instruction set architecture (ISA). Many processors have a pipelined design that can be implemented as an in-order or out-of-order processor.
In either event, instructions are obtained via front end units, which process the instructions and place them in a form to be recognized by further components of the pipeline. Typically, so-called macro-instructions are broken up into one or more micro-instructions or uops. These uops may then be executed in different execution units of a processor. That is, many processors include multiple execution units including arithmetic logic units, address generation units, floating-point units and so forth.
One common execution unit is a multiply-accumulate (MAC) unit, which may be in the form of a fused floating-point multiply-accumulate (FPMAC) unit. In general, a MAC unit can perform an operation on three incoming operands to first multiply two of the operands and then accumulate the product with the third operand. Some processors use such a unit to perform simpler mathematical operations such as additions, subtractions and multiplications by appropriate selection of the third operand. Accordingly, in many processors a MAC unit may form the backbone of the execution units and may be a key circuit in determining the frequency, power and area of the processor. In addition, MAC units can be heavily used in certain applications such as graphics and many scientific and engineering applications. Thus these units should be made as efficient in area, power consumption, and processing speed as possible.
In various embodiments, a split path fused floating-point multiply accumulate (FPMAC) unit may be provided. Specifically, the split path may provide multiple datapaths for handling operations based on the operands. More specifically, a so-called near path and a so-called far path may be provided. The near path may be used to handle critical cases, namely those cases where a difference between exponents of the operands is within a threshold range, while the far path may be used to handle non-critical cases, namely those cases where the difference between the exponents is outside this threshold range. In this way, a performance optimal design may be realized, with optimizations in computing speed, chip area and power consumption, as will be discussed further herein.
While the scope of the present invention is not limited in this regard, in many implementations the MAC unit may be compliant for operands of a given format, e.g., a given Institute of Electrical and Electronics Engineers (IEEE) standard such as a floating point (FP) representation for performing floating-point multiply accumulate operations. Furthermore, a given implementation may be used to handle various types of incoming data, including operands that can be of single and double precision floating point format.
In various embodiments, an ISA may provide multiple user-level fused multiply-accumulate (FMA) instructions. Such FMA instructions can be used to perform fused multiply-add operations (including fused multiply-subtract and other varieties) on packed (e.g., vector) and/or scalar data elements of the instruction operands. Different FMA instructions may provide separate instructions to handle different types of arithmetic operations on the three source operands.
In one embodiment, FMA instruction syntax can be defined using three source operands, where the first source operand is updated based on the result of the arithmetic operations of the data elements. As such, the first source operand may also be the destination operand. For example, an instruction format of: opcode, x1, x2, x3 may be present, where the opcode corresponds to one of multiple user-level FMA instructions to perform a given arithmetic operation, and x1-x3 correspond to operands to be processed in the operation.
The arithmetic FMA operation performed in an FMA instruction can take one of several forms, e.g.:
r=(x*y)+z;
r=(x*y)−z;
r=−(x*y)+z; or
r=−(x*y)−z.
In an embodiment, packed FMA instructions can perform eight single-precision FMA operations or four double-precision FMA operations with 256-bit vectors. Scalar FMA instructions may only perform one arithmetic operation on a low order data element, when implemented using vector registers. The content of the rest of the data elements in the lower 128-bits of the destination operand is preserved, while the upper 128 bits of the destination operand may be filled with zero.
In an embodiment, an arithmetic FMA operation of the form, r=(x*y)+z, takes two IEEE-754-2008 single (double) precision values and multiplies them to form an infinite precision intermediate value. This intermediate value is added to a third single (double) precision value (also at infinite precision) and rounded to produce a single (double) precision result. Of course, different rounding modes and precisions may be implemented in different embodiments.
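As an illustrative, non-limiting sketch of the single-rounding behavior described above, the following Python fragment models r=(x*y)+z with exact rational arithmetic followed by one rounding to double precision. The function names (fma_exact, fmsub, fnmadd, fnmsub) are hypothetical and chosen only for this illustration; the sketch assumes finite inputs and round-to-nearest-even, and does not model signed zeros, infinities or NaNs.

```python
from fractions import Fraction


def fma_exact(x: float, y: float, z: float) -> float:
    """r = (x*y)+z with an exact intermediate and a single final rounding.

    Fraction holds the product and sum exactly; float() then rounds once
    to double precision (round-to-nearest-even). Finite inputs only.
    """
    return float(Fraction(x) * Fraction(y) + Fraction(z))


def fmsub(x: float, y: float, z: float) -> float:
    """r = (x*y)-z"""
    return fma_exact(x, y, -z)


def fnmadd(x: float, y: float, z: float) -> float:
    """r = -(x*y)+z"""
    return fma_exact(-x, y, z)


def fnmsub(x: float, y: float, z: float) -> float:
    """r = -(x*y)-z"""
    return fma_exact(-x, y, -z)


# The exact product is 1 - 2**-60, which a separate multiply rounds to 1.0.
x = 1.0 + 2.0**-30
y = 1.0 - 2.0**-30
z = -1.0
fused = fma_exact(x, y, z)   # keeps the low-order term of the product
unfused = x * y + z          # product is rounded before the addition
```

In this example the fused form retains the small residual of the product, while the separately rounded multiply followed by an add loses it.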
Execution units of a processor may include logic to perform integer and floating point operations. Microcode (ucode) read only memory (ROM) can store microcode for certain macro-instructions, including vector multiply-add instructions, which may be part of a packed instruction set. By including packed instructions in an instruction set of a general-purpose processor, along with associated circuitry to execute the instructions, the operations used by many multimedia applications may be performed using packed data in a general-purpose processor. Thus, many multimedia applications can be accelerated and executed more efficiently by using the full width of a processor's data bus for performing operations on packed data. This can eliminate the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time. In some embodiments, the multiply-accumulate instruction can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc., as well as datatypes, such as single and double precision integer and floating point datatypes.
Some single instruction multiple data (SIMD) and other multimedia types of instructions are considered complex instructions. Most floating-point related instructions are also complex instructions. As such, when an instruction decoder encounters a complex macro-instruction, the microcode ROM is accessed at the appropriate location to retrieve the microcode sequence for that macro-instruction. The various micro-ops for performing that macro-instruction are communicated to, e.g., an out-of-order execution logic, which may have buffers to smooth out and re-order the flow of micro-instructions to optimize performance as they flow through the pipeline and are scheduled for execution. Allocator logic allocates buffers and resources that each uop needs in order to execute. Renaming logic may rename logical registers onto entries in a register file (e.g., physical registers).
In one embodiment, vector instructions can be executed on various packed data type representations. These data types may include a packed byte, a packed word, and a packed doubleword (dword) for 128-bit wide operands. As an example, a packed byte format can be 128 bits long and contain sixteen packed byte data elements. A byte is defined here as 8 bits of data. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15.
Generally, a data element is an individual piece of data that is stored in a single register or memory location with other data elements of the same length. In some packed data sequences, the number of data elements stored in a register can be 128 bits divided by the length in bits of an individual data element. Although the data types can be 128 bits long, embodiments of the present invention can also operate with 64-bit wide or other sized operands.
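The element count and bit positions described above may be illustrated by the following minimal Python sketch, in which a register is modeled as a non-negative integer. The helper names (lane_count, lane_bit_range, extract_lane) are hypothetical and are not part of any instruction set.

```python
def lane_count(register_bits: int, element_bits: int) -> int:
    """Number of packed data elements: register width / element width."""
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of element width")
    return register_bits // element_bits


def lane_bit_range(index: int, element_bits: int) -> tuple[int, int]:
    """Return (most significant bit, least significant bit) of one element."""
    low = index * element_bits
    return low + element_bits - 1, low


def extract_lane(register_value: int, index: int, element_bits: int) -> int:
    """Read one packed element out of a register modeled as an integer."""
    mask = (1 << element_bits) - 1
    return (register_value >> (index * element_bits)) & mask
```

For instance, lane_count(128, 8) gives sixteen byte elements, and lane_bit_range(15, 8) gives bit 127 through bit 120 for byte 15.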
It will be appreciated that packed data formats may be further extended to other register lengths, for example, to 96-bits, 160-bits, 192-bits, 224-bits, 256-bits or more. In addition, various signed and unsigned packed data type representations can be handled in multimedia registers according to one embodiment of the present invention.
In various embodiments, efficiency may be realized by providing a split datapath within a MAC unit, e.g., a FPMAC unit. A near path may provide for insertion of a third operand, namely the so-called accumulate operand, into an early portion of multiplier hardware, via an early near path accumulate injection into a carry save adder (CSA) tree, removing a 3:2 compression from the critical path. Also, normalization operations performed on various intermediate results can be individually handled in the near and far paths. Still further, post-normalization shifting may be implemented, and a completion adder for performing a wide addition on the near path values (e.g., carry and save values) can be postponed until an end of the FPMAC unit. In addition, in various embodiments certain logic of the unit can be clock/power gated based on the exponent difference to reduce power consumption when such logic is not needed.
In general, the FPMAC may be used to perform a multiply accumulate operation that includes mantissa multiplication of two input operands (Mx, My), followed by the accumulation or addition of the third operand, Mz. In various embodiments, the operands can be represented as standard IEEE floating point normalized numbers (S, F, E), where S depicts the sign (1-bit), F is the fraction (1.F is the m-bit normalized mantissa M) and E is the biased exponent (actual exponent e plus bias, to make the representation of E positive). Multiplication of the two operands involves mantissa multiplication (Mx×My), e.g., using a carry-save reduction compressor tree-based design, and the output exponent of the product is Exy=Ex+Ey−bias. The accumulation involves alignment of the accumulate mantissa, Mz, and the multiply result, Mxy, by shifting Mz by a shift amount corresponding to an exponent difference, d=Exy−Ez. To improve performance, the exponent difference computation and alignment shift of Mz may be performed in parallel to mantissa multiplication.
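The field extraction and exponent arithmetic described above can be sketched as follows for single precision values. This is an illustrative software model only, assuming normalized inputs and a bias of 127; the function names are hypothetical and the sketch does not reflect the parallel hardware organization.

```python
import struct

BIAS = 127  # single precision exponent bias


def decode_single(value: float) -> tuple[int, int, int]:
    """Split a single precision value into (S, F, E) bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased exponent E
    fraction = bits & 0x7FFFFF       # 23-bit fraction F
    return sign, fraction, exponent


def mantissa(fraction: int) -> int:
    """24-bit normalized mantissa M = 1.F (normalized operands assumed)."""
    return (1 << 23) | fraction


def exponent_difference(x: float, y: float, z: float) -> int:
    """d = Exy - Ez, where Exy = Ex + Ey - bias."""
    _, _, ex = decode_single(x)
    _, _, ey = decode_single(y)
    _, _, ez = decode_single(z)
    exy = ex + ey - BIAS
    return exy - ez
```

As a usage example, for x=2.0, y=4.0 and z=8.0 the biased exponents are 128, 129 and 130, so Exy=130 and d=0.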
Note that the operations involved in various stages of the FPMAC pipeline can differ based on the exponent difference, d. More specifically, in cases with d>1 or d<−2 (so-called far path cases), a large right or left alignment shift (done in parallel to mantissa multiplication) may be performed, followed by a 3:2 compression to reduce the aligned accumulate (Mz), carry (C) and sum (S) terms coming out of the multiplier. In a completion addition only the most significant ‘m-bit’ sum is required, while the remaining bits are used for computing the carry (C), guard (G), round (R), and sticky (T) bits, which may be used for rounding according to the IEEE standard. This is followed by a normalization right shift of worst case ‘m+3’ (when d=−(m+3)). A rounding unit may use the C, G, R and T bits to compute the rounded result.
Instead, in cases with d={0, 1, −1, −2} (so-called near path cases), a smaller alignment shift is performed. This shift is followed by a 3:2 compression similar to the earlier case. However, these cases may generate a large number of leading 0s or 1s based on a positive or negative value of the result, respectively, which requires a worst case ‘2m-bit’ normalization left shift. In a conventional operation, this would necessitate the computation of the whole ‘2m-bit’ sum (for the completion addition of C and S terms). Instead, in various embodiments to improve performance, a leading zero anticipator (LZA) may be used in parallel with normalization for purposes of sign detection for these near path cases. The normalized result is then used for rounding and the completion add.
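The role of the guard, round and sticky bits in a right shift may be illustrated by the following Python sketch on unsigned integers. It is a simplified model under stated assumptions: the carry bit C of the completion addition is not modeled, round-to-nearest-even is the only rounding mode shown, and the function names are hypothetical.

```python
def shift_right_grs(value: int, shift: int) -> tuple[int, int, int, int]:
    """Right shift that returns (kept, guard, round_bit, sticky).

    guard is the first bit shifted out, round_bit the second, and sticky
    is the OR of every remaining shifted-out bit.
    """
    if shift <= 0:
        return value, 0, 0, 0
    kept = value >> shift
    dropped = value & ((1 << shift) - 1)
    guard = (dropped >> (shift - 1)) & 1
    round_bit = (dropped >> (shift - 2)) & 1 if shift >= 2 else 0
    low_bits = dropped & ((1 << max(shift - 2, 0)) - 1)
    sticky = int(low_bits != 0)
    return kept, guard, round_bit, sticky


def round_nearest_even(kept: int, guard: int, round_bit: int, sticky: int) -> int:
    """Round up when above the halfway point, or at halfway if kept is odd."""
    if guard and (round_bit or sticky or (kept & 1)):
        return kept + 1
    return kept
```

For example, shifting 0b10110110 right by four positions keeps 0b1011 and yields guard=0, round=1 and sticky=1, so the kept value is left unchanged by rounding.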
The near path clearly forms the critical path and dominates hardware requirements, due to the presence of ‘2m-bit’ sum and ‘2m-bit’ normalize unit along with the LZA. As used herein, the term “critical path” refers to a timing critical path, meaning that this datapath flow has more operations to be performed and thus requires more time to perform these operations. Conventional implementations that perform unified handling of all cases further increase this critical path due to unnecessary inclusion of operations required in unified handling.
Embodiments thus may provide a FPMAC that performs split handling of near and far paths, and may use optimal hardware and logic stages for each of the cases, performing the bare minimum operations required, particularly in the near path. That is, various delay and area optimizations can be present in the near path. As examples, and discussed further below, the near path may provide for early injection of the near path accumulate operand Mz into the multiplication CSA tree, thus removing a 3:2 compression stage from the critical path. Second, completion addition may be performed after the normalization shift for both the near and far paths, combined with a rounding unit, thus eliminating an accumulate adder from the critical path, which may provide an area savings, e.g., of a 2m-bit adder. Still further, to reduce the near path delay, normalization shifting for the near path can be performed in parallel with the LZA on the (C, S) outputs of the CSA tree, which masks the shifting delay with the LZA computation. Yet further, sign detection of the result for conditional 2's complementing can be performed using the existing LZA components for the near path cases, thus completely eliminating a sum computation or a sign detection unit from the critical path and the hardware associated with them.
The far path is non-critical and thus may be designed based on the minimum required operations. Apart from performing a minimal number of operations, the split path handling may provide significant power benefits due to the ease of clock/power gating of the near or far paths. That is, when it is determined that a near path operation is to be performed, the far path can be power/clock gated, and vice versa.
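The path selection and gating decision described above can be modeled, purely for illustration, by the following Python sketch. The names PathControl and select_path are hypothetical, and the sketch assumes the near path set d={0, 1, −1, −2} given above; an actual implementation would derive the enables in logic rather than software.

```python
from dataclasses import dataclass

# Exponent differences handled by the near path, per the description above.
NEAR_PATH_DIFFERENCES = frozenset({-2, -1, 0, 1})


@dataclass(frozen=True)
class PathControl:
    near_path: bool          # near path flag
    near_clock_enable: bool  # near path logic active
    far_clock_enable: bool   # far path logic active


def select_path(d: int) -> PathControl:
    """Pick the datapath for exponent difference d and gate the other one."""
    near = d in NEAR_PATH_DIFFERENCES
    return PathControl(
        near_path=near,
        near_clock_enable=near,
        far_clock_enable=not near,
    )
```

In this model exactly one of the two enables is asserted for any given operation, reflecting that the not-taken path may be gated.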
As discussed, a FPMAC datapath in accordance with an embodiment of the present invention is split into two different datapaths to separate the critical near path and the non-critical far path. Detailed explanation of a design of an embodiment is described below. Further, understand that while the implementation details are discussed in terms of a single-precision FPMAC unit, embodiments are applicable to other data types such as double precision values.
Referring now to
The determined exponent difference may be provided to an alignment shift unit 125 to control performance of a variable right/left shift on the mantissa of the third operand. As seen, for the case where it is determined that a near path operation is to be performed (i.e., when the exponent difference Ed is within a predetermined range), a near path injection of the third operand, namely the mantissa of the third operand, can occur directly into compressor tree 118 of multiplier 110.
Still referring to
As seen further, the far path may receive the product from the compressor tree of multiplier 110 in a compression unit 130, which may be a 3:2 compression unit. As further seen, compression unit 130 may further receive the variable shift alignment output corresponding to the aligned third operand mantissa. After compression in compression unit 130, the resulting intermediate value corresponding to a carry-save output is provided to a right shifter 150, which in one embodiment may perform a normalization shift using a d-bit right shifter to perform a maximum bit shift of m+3. As seen, the least significant shifted out bits can be provided to a computation unit 155 for calculating carry and sticky bits. While not shown for ease of illustration in
More specifically, the resulting shifted intermediate values both from left shift unit 145 and right shift unit 150 may be provided to multiple levels of a selector, namely a first multiplexer 160 and a second multiplexer 165. The resulting selected output is provided to a combination unit 170, which may perform a combined addition/rounding, as well as a post-round normalization to thus generate a final result. As seen, in addition to the incoming intermediate results, unit 170 may receive sign, carry and sticky bits to control performance of its addition/rounding operations. While shown with this particular implementation in the embodiment of
As seen in
With regard to near path detector logic 250, as shown such logic may include multiple zero bit detectors, namely a first zero detector 252 and a second zero detector 254. The detector outputs may be at a logic high when a zero value is detected, e.g., when the most significant bits of the detector input are all zero. As an example, the first “m−1” significant bits of the exponent path (for a single precision 8-bit exponent example, the seven MSBs) being all 0s or all 1s (detected using the two zero detectors) may generate a near path flag (Near Path), which may be used to drive injection into the multiplier, and to prevent output to the far path, when enabled. As shown in
By providing for separate right and left shifters for the far path large shift values (d<−2 and d>1), and at least one other shifter for the near path (e.g., to provide small 1 or 2 bit shifts), improved efficiency may be realized. That is, the near path and far path cases can be handled separately, thereby enabling early availability of the near path shifted accumulate value, to be inserted into the multiplication CSA tree.
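As a hedged illustration of how a pair of zero detectors can recognize the near path cases, the following Python sketch treats the exponent difference as an 8-bit two's complement value and checks whether its seven most significant bits are all 0s or all 1s. The 8-bit width, the two's complement encoding and the function name near_path_flag are assumptions made for this sketch only.

```python
def near_path_flag(d: int, width: int = 8) -> bool:
    """Near path detection from the upper bits of the exponent difference.

    d is encoded as a width-bit two's complement value. The first zero
    detector sees the upper width-1 bits; the second sees their inverse,
    so it fires when those bits are all 1s.
    """
    mask = (1 << width) - 1
    bits = d & mask
    upper = bits >> 1                    # the width-1 most significant bits
    upper_inverted = (~bits & mask) >> 1
    first_zero_detect = upper == 0            # d is 0 or 1
    second_zero_detect = upper_inverted == 0  # d is -1 or -2
    return first_zero_detect or second_zero_detect
```

Under these assumptions the flag is asserted exactly for d={0, 1, −1, −2}, since only the least significant bit is left unconstrained.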
Thus the near path accumulate mantissa, Mz, has a small shift applied to it to be aligned with the multiplication result, Mxy. The early availability of the aligned mantissa provides an opportunity to compress the near path mantissa along with the multiplication CSA tree. Referring now to
With regard to multiplier 110, encoder 115 may be a Booth-2 encoder, the output of which is provided to a CSA tree including a plurality of stages 118a-118d, each of which may be implemented via a 4:2 compressor. As further seen, a near path insertion of an accumulate value 117 may be provided into this second compression stage 118b. At the end of the compression tree, carry and sum values may be available for the near path at block 119 (and which may be in double precision format in some embodiments). Instead for far path cases, the results from the compression tree may be provided to a compressor 130.
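The effect of injecting the accumulate value into the compression tree can be sketched in Python as follows. This is a simplified model and not the described circuit: plain shift-and-add partial products stand in for Booth-2 encoded rows, the rows are reduced with a chain of 3:2 cells rather than staged 4:2 compressors, operands are non-negative integers, and all function names are hypothetical.

```python
def compress_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save 3:2 compression: a + b + c == sum_bits + carry_bits."""
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1
    return sum_bits, carry_bits


def compress_4to2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """4:2 compression modeled as two chained 3:2 cells."""
    s1, c1 = compress_3to2(a, b, c)
    return compress_3to2(s1, c1, d)


def partial_products(mx: int, my: int) -> list[int]:
    """One shifted copy of mx per set bit of my (no Booth encoding)."""
    return [mx << i for i in range(my.bit_length()) if (my >> i) & 1]


def reduce_rows(rows: list[int]) -> tuple[int, int]:
    """Reduce any number of rows to a (sum, carry) pair without a full add."""
    rows = list(rows)
    while len(rows) > 2:
        s, c = compress_3to2(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


# Near path style injection: the aligned accumulate is just one more row.
mx, my = 0xC00000, 0xA00000
mz_aligned = 0x800000 << 23
sum_out, carry_out = reduce_rows(partial_products(mx, my) + [mz_aligned])
```

Because the accumulate row is reduced together with the partial products, the (sum, carry) pair already represents Mx×My+Mz and no separate compression of the product with the accumulate is needed in this model.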
The sparse nature of both the double and single precision floating point multiplication trees enables the near path (C, S) results from the CSA tree to be computed without any additional delay penalty in the critical path. In other words, the multiplier may be configured as a sparse tree configuration, enabling computing efficiency. Referring now to
Thus as seen, an accumulate value (shown as ACC 23) is provided through a logic 325 that further receives the Near Path flag. As seen, in one embodiment logic 325 may include a NAND gate and an inverter, which thus is used to provide the accumulate value when active. This compressor 330 provides an output as part of the carry output as well as provides an input to a third level block 340, which in one embodiment may include a full adder followed by a half adder. As further seen, another logic 335 may provide a far path accumulate portion, which is input along with the output of adder 340 to a final level compressor 350, which in the embodiment shown may be a 3:2 compressor. Note that this compressor may be of the far path, and in one embodiment corresponds to compressor 130 of
Thus as shown in
Embodiments may further provide for split handling of normalization shift operations. In one embodiment, the completion addition is performed post-normalization shifting of the (C, S) terms, combined with the rounding. Normalization before the completion addition enables computation of only the required ‘m’-bit sum and makes the design performance and hardware optimized. That is, for both the near and far paths, only an m-bit sum needs to be calculated for the completion addition, thus avoiding the need for a further 2m-bit adder for the near path.
In the separate near path normalization, an effective subtraction may at worst lead to ‘2m’ leading zeroes (or ones) when the ‘m’ bit accumulate value is equal to a ‘2m’ bit multiplication result in the near path. To determine the left shift amount in such cases with leading zeroes, a LZA may be used.
This anticipation string may thus represent the number of leading zeros and/or ones for the multiplier outputs. This string may be binary encoded using a leading digit counter (LDC), an embodiment of which is shown in
The shift amount generated by the LZA may be used by the normalization shifter to perform the left shift on the C, S terms for obtaining a normalized result. The skewed arrival times of binary encoded shift amount from LSB to MSB can be used to mask the normalization shift delay by performing the shifts upon the immediate arrival of the bits in that order. By performing the completion addition along with rounding and sign detection using the LZA, a ‘2m’-bit summation or sign detection unit can be avoided.
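The normalization left shift and the bit-by-bit use of the shift amount may be illustrated by the following behavioral Python sketch. It is not a leading zero anticipator: the leading zero count is computed directly from a known value rather than anticipated from the C and S terms, and negative results (leading ones) are not modeled. The function names are hypothetical.

```python
def leading_zero_count(value: int, width: int) -> int:
    """Number of leading zeros of value within a width-bit field."""
    if value < 0 or value.bit_length() > width:
        raise ValueError("value must fit in an unsigned width-bit field")
    return width - value.bit_length()


def staged_left_shift(value: int, shift: int, width: int) -> int:
    """Apply the shift one binary digit at a time, LSB of the amount first.

    Stage k shifts by 2**k when bit k of the shift amount is set, so each
    stage can start as soon as its bit of the shift amount is available.
    """
    mask = (1 << width) - 1
    stage = 0
    while shift >> stage:
        if (shift >> stage) & 1:
            value = (value << (1 << stage)) & mask
        stage += 1
    return value


def normalize(value: int, width: int) -> tuple[int, int]:
    """Left shift value until its most significant bit is set."""
    shift = leading_zero_count(value, width)
    return staged_left_shift(value, shift, width), shift
```

For example, normalize(0b00000101, 8) uses a shift amount of 5 (binary 101), applied as a shift by one followed by a shift by four, giving 0b10100000.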
The other parallel path in the normalization unit deals with the far path cases where d>1 and d<−2. As seen with reference back to
With reference back to
Referring now to
Still referring to
Still referring to
If the difference determination at diamond 825 indicates a difference within the threshold range, control passes to block 830, where the third operand, and more particularly the mantissa of the third operand, may be provided to the multiplier datapath. Then at block 835, the product of the first and second operands may be accumulated with the third operand in the multiplier datapath. In this way, the need for a compression operation between the multiplication results and the third operand can be avoided.
Next, control passes to block 840 where a first normalization may be performed and a sign for the final result may be determined based on leading zeros. In this near path operation, the normalization operation may correspond to a left shift operation. Furthermore, the sign determination may be based on a leading zero analysis performed in a leading zero anticipator circuit, which may also receive the output of the multiplication datapath. Finally, at block 890 combined sum and rounding operations can be performed based on the sign value. That is, based on the sign, a 2's complement may be performed, if needed. Also note that in various embodiments this sum operation, corresponding to a completion addition, may be of m-bit width, avoiding the need for a 2m-bit width addition.
If instead at diamond 825 it is determined that the difference is outside the threshold range, control passes to block 860 where the multiplication product may be provided along with the third operand mantissa to a compressor for accumulation. Note that here however this accumulation is performed separately from the multiplier datapath. Further, a second normalization operation may be performed based on the difference determined above (block 870). Here, the normalization may be a right shift operation. The normalized result may then be provided to block 890, discussed above for the final result to be determined based on completion addition and rounding operations. While shown with this particular implementation in the embodiment of
Thus in various embodiments, split handling of near and far paths may enable performance of the bare minimal operations required on the critical path and may thus provide a performance optimal solution. Still further, by providing for split handling, the not-taken path may be clock gated, e.g., based on a near path flag as described above. This enables turning off all the power consuming normalization shifters and logic blocks and keeping only the required blocks of computation switching, enabling a power optimal design. In addition, total logic levels in terms of basic gates may be reduced, while significantly reducing hardware complexity.
Embodiments can be implemented in many different systems. For example, embodiments can be realized in a processor such as a multicore processor. Referring now to
As shown in
Coupled between front end units 710 and execution units 720 is an out-of-order (OOO) engine 715 that may be used to receive the micro-instructions and prepare them for execution. More specifically OOO engine 715 may include various buffers to re-order micro-instruction flow and allocate various resources needed for execution, as well as to provide renaming of logical registers onto storage locations within various register files such as register file 730 and extended register file 735. Register file 730 may include separate register files for integer and floating point operations. Extended register file 735 may provide storage for vector-sized units, e.g., 256 or 512 bits per register.
Various resources may be present in execution units 720, including, for example, various integer, floating point, and single instruction multiple data (SIMD) logic units, among other specialized hardware. For example, such execution units may include one or more arithmetic logic units (ALUs) 722. In addition, a FPMAC unit 724 may be present to generate a final result of a MAC or other instruction scheduled to the unit. In various embodiments, the unit may have a split path as described above.
When operations are performed on data within the execution units, results may be provided to retirement logic, namely a reorder buffer (ROB) 740. More specifically, ROB 740 may include various arrays and logic to receive information associated with instructions that are executed. This information is then examined by ROB 740 to determine whether the instructions can be validly retired and result data committed to the architectural state of the processor, or whether one or more exceptions occurred that prevent a proper retirement of the instructions. Of course, ROB 740 may handle other operations associated with retirement.
As shown in
Embodiments may be implemented in many different system types. Referring now to
Still referring to
Furthermore, chipset 890 includes an interface 892 to couple chipset 890 with a high performance graphics engine 838, by a P-P interconnect 839. In turn, chipset 890 may be coupled to a first bus 816 via an interface 896. As shown in
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of non-transitory storage medium such as disks including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.
Krishnamurthy, Ram K., Mathew, Sanu K., Srinivasan, Suresh, Ramanarayanan, Rajaraman, Erraguntla, Vasantha K.
Patent | Priority | Assignee | Title |
10019229, | Jul 02 2014 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Calculation control indicator cache |
10019230, | Jul 02 2014 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Calculation control indicator cache |
10078512, | Oct 03 2016 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Processing denormal numbers in FMA hardware |
10481869, | Nov 10 2017 | Apple Inc. | Multi-path fused multiply-add with power control |
10754582, | Mar 31 2016 | Hewlett Packard Enterprise Development LP | Assigning data to a resistive memory array based on a significance level |
11061672, | Oct 02 2015 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Chained split execution of fused compound arithmetic operations |
9519458, | Apr 08 2014 | Cadence Design Systems, Inc. | Optimized fused-multiply-add method and system |
9645792, | Aug 18 2014 | Qualcomm Incorporated | Emulation of fused multiply-add operations |
9778907, | Jul 02 2014 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Non-atomic split-path fused multiply-accumulate |
9778908, | Jul 02 2014 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Temporally split fused multiply-accumulate operation |
9798519, | Jul 02 2014 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Standard format intermediate result |
9891886, | Jul 02 2014 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Split-path heuristic for performing a fused FMA operation |
9891887, | Jul 02 2014 | VIA ALLIANCE SEMICONDUCTOR CO., LTD. | Subdivision of a fused compound arithmetic operation |
Patent | Priority | Assignee | Title |
8069200, | Apr 27 2005 | QSIGMA INC | Apparatus and method for implementing floating point additive and shift operations |
8078660, | Apr 10 2007 | The Board of Regents, University of Texas System | Bridge fused multiply-adder circuit |
20050228844, | |||
20080071851, | |||
20080091758, | |||
20080183791, | |||
20080256150, | |||
20090077152, | |||
20110040815, | |||
20120072703, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 16 2010 | SRINIVASAN, SURESH | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025322 | /0049 | |
Sep 16 2010 | ERRAGUNTLA, VASANTHA K | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025322 | /0049 | |
Sep 20 2010 | Intel Corporation | (assignment on the face of the patent) | / | |||
Sep 20 2010 | RAMANARAYANAN, RAJARAMAN | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025322 | /0049 | |
Nov 05 2010 | MATTHEW, SANU K | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025322 | /0049 | |
Nov 05 2010 | KRISHNAMURTHY, RAM K | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025322 | /0049 |
Date | Maintenance Fee Events |
Oct 09 2013 | ASPN: Payor Number Assigned. |
Apr 20 2017 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Jun 28 2021 | REM: Maintenance Fee Reminder Mailed. |
Dec 13 2021 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |
Date | Maintenance Schedule |
Nov 05 2016 | 4 years fee payment window open |
May 05 2017 | 6 months grace period start (w surcharge) |
Nov 05 2017 | patent expiry (for year 4) |
Nov 05 2019 | 2 years to revive unintentionally abandoned end. (for year 4) |
Nov 05 2020 | 8 years fee payment window open |
May 05 2021 | 6 months grace period start (w surcharge) |
Nov 05 2021 | patent expiry (for year 8) |
Nov 05 2023 | 2 years to revive unintentionally abandoned end. (for year 8) |
Nov 05 2024 | 12 years fee payment window open |
May 05 2025 | 6 months grace period start (w surcharge) |
Nov 05 2025 | patent expiry (for year 12) |
Nov 05 2027 | 2 years to revive unintentionally abandoned end. (for year 12) |