A semiconductor chip is described having a functional unit that can execute a first instruction and a second instruction. The first instruction multiplies two operands. The second instruction approximates a function according to C0 + C1·X2 + C2·X2^2. The functional unit has a multiplier circuit. The multiplier circuit has: i) a first input to receive bits of a first operand of the first instruction and receive bits of a C1 term of the second instruction; ii) a second input to receive bits of a second operand of the first instruction and receive bits of an X2 term of the second instruction.
12. A method comprising:
fetching a first instruction;
decoding said first instruction;
executing said first instruction by multiplying a multiplier term and a multiplicand term with multiplier circuitry in an execution stage;
fetching a second instruction, said second instruction to approximate a function by executing an equation of the form C0 + C1·X2 + C2·X2^2;
decoding said second instruction;
executing said second instruction by multiplying a first term composed of said X2 and X2^2 terms with a second term composed of said C1 and C2 terms with said multiplier.
an instruction execution pipeline comprising:
a) instruction fetch stage circuitry;
b) instruction decode stage circuitry;
c) execution stage circuitry comprising a functional unit to execute a first instruction and execute a second instruction, said first instruction being an instruction that multiplies two operands, said second instruction being distinct from said first instruction and being an instruction that approximates a function according to C0 + C1·X2 + C2·X2^2, said functional unit having a multiplier circuit, said multiplier circuit having a first alignment of partial product terms for said first instruction and a second alignment of partial product terms for said second instruction, said second alignment having shifted partial product terms relative to said first alignment.
1. A semiconductor chip comprising:
an instruction execution pipeline comprising:
a) instruction fetch stage circuitry;
b) instruction decode stage circuitry;
c) execution stage circuitry comprising a functional unit to execute a first instruction and execute a second instruction, said first instruction being an instruction that multiplies two operands, said second instruction being distinct from said first instruction and being an instruction that approximates a function according to C0 + C1·X2 + C2·X2^2, said functional unit having a multiplier circuit, said multiplier circuit having:
i) a first input to receive bits of a first operand of said first instruction and receive bits of a C1 term of said second instruction, wherein a first datapath exists downstream from said first input for said first instruction and a second datapath exists downstream from said first input for said second instruction, wherein said first and second datapaths include different formatting logic;
ii) a second input to receive bits of a second operand of said first instruction and receive bits of an X2 term of said second instruction.
2. The semiconductor chip of
3. The semiconductor chip of
4. The semiconductor chip of
i) a third input to receive other bits of said first operand of said first instruction and receive bits of a C2 term of said second instruction;
ii) a fourth input to receive bits of said second operand of said first instruction and receive bits of an X2^2 term of said second instruction.
5. The semiconductor chip of
6. The semiconductor chip of
8. The computing system of
9. The computing system of
10. The computing system of
i) a third input to receive other bits of said first operand of said first instruction and receive bits of a C2 term of said second instruction;
ii) a fourth input to receive bits of said second operand of said first instruction and receive bits of an X2^2 term of said second instruction.
11. The computing system of
16. The method of
The field of invention relates generally to electronic computing and more specifically, to a functional unit capable of executing approximations of functions.
The formulation of
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
That is, if logic 311 and 312 were located in the execution stage 306, they would increase the propagation delay through the execution stage 306 for the calculation of the approximation. By contrast, moving logic 311 and 312 “higher up” in the pipeline to a pipeline stage that precedes the execution stage 306 permits the operation of logic 311 and/or 312 to take place in parallel (partially or completely) with the operations that are typically performed within another of the higher stage(s), so as to effectively hide the time cost of their operation. For example, if logic 311 and 312 are embedded in a scheduler, then while the scheduler is scheduling an instruction that executes the approximation calculation, logic 311 may format the operand X and/or logic 312 may calculate the square of the X2 term.
It is worthwhile to note that a data fetch operation is typically associated with any processor architecture. Logic 311 and logic 312 may be placed anywhere downstream from the actual data fetch operation (since the operand X is fetched by the pipeline and logic units 311 and 312 subsequently act on information from the operand X). Depending on designer perspective, the data fetch operation may be viewed as its own pipeline stage that precedes the execution stage 306 or part of another stage (such as scheduler stage).
Within the execution stage 306 of
As described in more detail immediately below, the manner in which the multiplier 401 operates depends on whether a MULT/MADD instruction is being executed or an instruction that makes use of an approximation is being executed. Execution of the MADD and MULT instructions will first be described. Functional unit 400 includes a multiplier 401 which multiplies an integer multiplier (A) with an integer multiplicand (B). In the case of integer MULT and integer MADD instructions, the multiplier and multiplicand are input operands to the functional unit 400. In the case of floating point MULT and floating point MADD instructions, the multiplier and multiplicand are the mantissa values of floating point A and B input operands that are presented to the functional unit 400. Separate exponent calculation logic 402 is used to calculate the exponent value for the floating point MULT and MADD instructions. For MADD instructions, the addend C is also presented as an input operand to the functional unit 400. In an embodiment, for floating point MULT and MADD instructions, the A and B terms are normalized as presented to the multiplier 401 and are therefore not shifted by any shift logic.
MULT instructions are essentially MADD instructions with the addend C being forced to a value of 0. As such, in an embodiment, the functional unit 400 is said to be in “normal mode” when executing a MULT instruction or a MADD instruction, and, is said to be in “extended mode” when executing an instruction that calculates an approximation of a function.
The most significant bits of the multiplier (A[msb]) are provided at input 504 and the least significant bits of the multiplier (A[lsb]) are provided at input 505. The multiplicand (B) is provided at input 506 and is divided into a most significant bits portion (B[msb]) and a least significant bits portion (B[lsb]) by normal mode formatting logic 507. Here, extended mode formatting logic 508, 509 is not used. Multiplexers 515, 516, whose channel select inputs are controlled by whether the functional unit is in normal or extended mode, enable application of formatting logic 507 and disable application of formatting logic 508, 509 when the functional unit is in normal mode. With the most significant portions of the multiplier and multiplicand provided to selector section 502a, and the least significant portions of the multiplier and multiplicand provided to selector section 502b, selector section 502a will determine higher ordered partial product terms (that is, partial product terms involving only the more significant bits of A and B) and selector section 502b will determine lower ordered partial product terms (that is, partial product terms involving only the less significant bits of A and B).
As observed in
In the case of integer MADD instructions, an addend term C is entered at input 513 and injected into Wallace tree section 503b for summation. Here, a multiplexer 514 whose channel select is determined by whether the functional unit is acting in normal mode or extended mode selects input 513 (integer addend C) for normal mode. The lowest ordered partial product is not selected by multiplexer 514 and is effectively ignored. In the case of integer MULT instructions, a value of 0 is forced on the integer addend C term. In the case of integer MADD instructions, the C integer addend term is its operand value. In an embodiment, for floating point MADD instructions, the C addend term is not provided to the multiplier 501 but instead added to the multiplier output by an adder (not shown) that follows the multiplier. In an alternative embodiment, a floating point C addend term (mantissa) may be presented to input 513 for floating point MADD instructions where the mantissa is shifted by shift logic (not shown) prior to its presentation at input 513, in view of comparisons made by the exponent logic 402 of the exponent of the C term and the exponent of the AB term. Specifically, the C addend may be shifted to the left by the difference between C.exp and AB.exp if C.exp>AB.exp. Alternatively, the C addend may be shifted to the right by the difference between AB.exp and C.exp if AB.exp>C.exp. Another multiplexer 518 whose channel select is determined by whether the functional unit is in normal mode or extended mode selects the highest ordered partial product term for inclusion in the Wallace tree and ignores whatever is present at input 519.
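The exponent-driven alignment of the floating point C addend described above can be sketched as follows. This is a minimal illustration rather than the circuit itself: the helper name `align_addend` and the use of plain non-negative integers for mantissas are assumptions, and bits shifted out on the right are simply truncated, which the text does not specify.

```python
def align_addend(c_mant: int, c_exp: int, ab_exp: int) -> int:
    """Shift the C mantissa so it lines up with the A*B product.

    Hypothetical helper: mantissas are non-negative integers, and bits
    shifted out on the right are truncated (an assumption, not from the text).
    """
    if c_exp > ab_exp:
        # C has the larger exponent: shift left by C.exp - AB.exp.
        return c_mant << (c_exp - ab_exp)
    if ab_exp > c_exp:
        # A*B has the larger exponent: shift right by AB.exp - C.exp.
        return c_mant >> (ab_exp - c_exp)
    # Equal exponents: no shift is needed.
    return c_mant
```

For instance, with C.exp = 5 and AB.exp = 3 the mantissa moves two positions to the left before it would be presented at input 513.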
Each of the Wallace tree sections 503a, 503b includes trees of carry sum adders (“CSA”). The number of carry sum adders may vary from embodiment to embodiment with the number of partial products and the size of the A and B operands. Each of the Wallace tree sections 503a, 503b calculates final respective sum and carry terms that are provided to an output carry sum adder 520. The sum and carry terms produced by adder 520 are added by adder 521 to produce a final result for integer MULT and integer MADD instructions, and a final mantissa result for floating point MULT and floating point MADD instructions. Referring back to
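The reduction of many partial products to one sum term and one carry term, followed by a single conventional addition, can be illustrated with a short sketch. It assumes the carry sum adders behave like the 3:2 compressors commonly known as carry-save adders, and it reduces the terms serially rather than in the parallel tree layout hardware would use. The function names are hypothetical.

```python
from typing import List, Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """3:2 compressor: turn three addends into a sum word and a carry word."""
    sum_word = a ^ b ^ c                              # per-bit sum, no carry propagation
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carry, moved up one position
    return sum_word, carry_word


def reduce_to_sum_and_carry(terms: List[int]) -> Tuple[int, int]:
    """Compress non-negative terms until only two words remain."""
    pending = list(terms)
    while len(pending) < 2:
        pending.append(0)
    while len(pending) > 2:
        a, b, c = pending[:3]
        sum_word, carry_word = carry_save_add(a, b, c)
        # Each step replaces three words with two, preserving the total.
        pending = pending[3:] + [sum_word, carry_word]
    return pending[0], pending[1]


def final_add(sum_word: int, carry_word: int) -> int:
    """The one carry-propagating addition at the end (the role of adder 521)."""
    return sum_word + carry_word
```

The point of the structure is that carry propagation, the slow part of addition, happens only once in `final_add`, regardless of how many partial products were compressed.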
Referring to
Considering normal operation again briefly, the partial products are aligned to represent typical multiplication in which increasingly larger partial product terms are stacked upon one another and added. Here, as in standard multiplication, the product values of the partial products move out to the left going down the stack as more and more 0s are added to their right (noting that the addition of zeros may be accomplished in hardware simply by aligning/shifting the partial product to the left). By contrast, in extended mode, the partial products are shifted so that they are correctly aligned for the summation of the C0 + C1·X2 + C2·(X2^2) approximation.
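A small sketch may help contrast the two alignments. The first two functions show normal-mode stacking, where each partial product is shifted left by its bit position. The third shows one way the three terms of the approximation could be brought to a common fixed-point scale before summation. The bit widths, the shared `frac_bits` format, and all function names are illustrative assumptions and are not taken from the figures.

```python
from typing import List


def partial_products(a: int, b: int, width: int) -> List[int]:
    """Normal mode: row i is b when bit i of a is set, shifted left by i."""
    return [(b << i) if (a >> i) & 1 else 0 for i in range(width)]


def multiply(a: int, b: int, width: int) -> int:
    """Sum the aligned rows; requires 0 <= a < 2**width."""
    return sum(partial_products(a, b, width))


def quadratic_fixed_point(c0: int, c1: int, c2: int, x2: int, frac_bits: int) -> int:
    """Sum C0 + C1*X2 + C2*X2^2 with every input holding frac_bits fraction bits.

    C2*X2^2 carries 3*frac_bits fraction bits, so the other two terms are
    shifted left until all three share that scale (assumed scheme).
    """
    term2 = c2 * x2 * x2                 # already at 3*frac_bits
    term1 = (c1 * x2) << frac_bits       # 2*frac_bits -> 3*frac_bits
    term0 = c0 << (2 * frac_bits)        # frac_bits   -> 3*frac_bits
    return term0 + term1 + term2
```

With `frac_bits = 8`, the inputs 1.0, 0.5, 0.25 and X2 = 0.5 are encoded as 256, 128, 64 and 128, and the result divided by 2**24 is 1.3125.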
By contrast, in extended mode, the partial product terms produced by the two selector sections 502a, 502b are closer in order. Thus, to effect the correct alignment, the partial products produced by selector 502b are increased (relative to normal mode) by shifting them further to the left, and the partial products produced by selector 502a need to be decreased (relative to normal mode) by shifting them to the right. Recall that, in normal mode, formatting logic 507 is utilized, while, in extended mode, formatting logic 508, 509 is utilized. In an embodiment, differences in formatting between the two modes include the shifting of the multiplicand in extended mode to effect the alignment described just above. Specifically, in an embodiment, as depicted in
Referring back to
As discussed above, a number of different functions can be approximated with the C0 + C1·X2 + C2·(X2^2) approximation. As such, a number of different instructions that make use of different approximated functions can be executed from the extended mode. For instance, according to one embodiment, separate instructions are supported for each of the following calculations: i) 1/X; ii) 1/(X^(1/2)); iii) 2^X; and iv) log2(X). For each of these individual instructions, separate tables of coefficient values may exist in the look-up table 313/403. Processing for each of these instructions is as described above with the exception of the following instruction specific operations (note that in floating point form, X can be expressed as [X.sgn][X.mant][X.exp] where X.sgn is the sign of X, X.mant is the mantissa of X and X.exp is the exponent of X).
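The idea of one evaluation path served by separate coefficient tables can be sketched as below. Several details are assumptions made for illustration only: the leading bits of the mantissa are taken to select a table entry, X2 is taken to be the remaining offset within that interval, and the coefficients are second-order Taylor terms at the left edge of each interval. The text does not say how the stored coefficients are derived, and `TABLE_BITS`, `build_table` and `approximate` are hypothetical names.

```python
TABLE_BITS = 6  # hypothetical: 64 intervals across the mantissa range [1, 2)


def build_table(f, df, d2f, table_bits=TABLE_BITS):
    """Return one (C0, C1, C2) triple per interval of [1, 2)."""
    step = 1.0 / (1 << table_bits)
    table = []
    for i in range(1 << table_bits):
        x0 = 1.0 + i * step  # left edge of interval i
        # Assumed coefficient choice: Taylor terms of f around x0.
        table.append((f(x0), df(x0), 0.5 * d2f(x0)))
    return table


def approximate(table, mant, table_bits=TABLE_BITS):
    """Evaluate C0 + C1*X2 + C2*X2^2 for a mantissa in [1, 2)."""
    step = 1.0 / (1 << table_bits)
    index = min(int((mant - 1.0) / step), len(table) - 1)
    x2 = mant - (1.0 + index * step)  # offset inside the selected interval
    c0, c1, c2 = table[index]
    return c0 + c1 * x2 + c2 * x2 * x2


# One table per supported function; here only the reciprocal is built.
RECIPROCAL_TABLE = build_table(
    lambda x: 1.0 / x,
    lambda x: -1.0 / (x * x),
    lambda x: 2.0 / (x * x * x),
)
```

Supporting another function under these assumptions only means building another table; `approximate` itself is unchanged, which mirrors the way a single multiplier serves every approximation instruction.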
In an embodiment of a 1/X instruction, 1/X = X^(−1) which, when written in floating point form, corresponds to (X.sgn)((X.mant)(2^X.exp))^(−1) = (X.sgn)(2^(−X.exp))(approx. of 1/(X.mant)). Here, coefficients for f(x) = 1/x are stored in look-up table 403 and used to calculate (approx. of 1/(X.mant)). Exponent logic 402 simply presents −X.exp as the exponent of the result.
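The sign and exponent handling for 1/X can be checked with a brief sketch. It assumes a nonzero finite input and a mantissa normalized to [1, 2), and it uses an exact division as a stand-in for the table-based approximation of 1/(X.mant). The names `decompose` and `reciprocal` are hypothetical.

```python
import math


def decompose(x: float):
    """Return (sign, mant, exp) with mant in [1, 2) and |x| == mant * 2**exp."""
    m, e = math.frexp(abs(x))  # frexp yields m in [0.5, 1)
    sign = -1.0 if x < 0 else 1.0
    return sign, m * 2.0, e - 1


def reciprocal(x: float, approx_recip_mant=lambda mant: 1.0 / mant) -> float:
    """Compute (X.sgn) * 2**(-X.exp) * approx(1 / X.mant) for nonzero finite x."""
    sign, mant, exp = decompose(x)
    # The exponent path only negates X.exp; the mantissa path does the real work.
    return sign * math.ldexp(approx_recip_mant(mant), -exp)
```

For x = −6.0 the decomposition is sign −1, mantissa 1.5 and exponent 2, so the result is −(2^−2)(1/1.5).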
In an embodiment of a 1/(X^(1/2)) instruction, where X.exp is understood to be unbiased and noting that X.sgn must be positive, 1/(X^(1/2)) = X^(−1/2) = ((X.mant)(2^X.exp))^(−1/2) = (2^(−X.exp/2))(approx. of 1/(X.mant^(1/2))). Here, coefficients for f(x) = 1/(x^(1/2)) are stored in look-up table 403 and used to calculate (approx. of 1/(X.mant^(1/2))). Exponent logic 402 simply presents −X.exp/2 in the case where X.exp is even and −(X.exp−1)/2 in the case where X.exp is odd (which effectively corresponds to presenting the rounded down version of 2^(−X.exp/2)).
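The even and odd exponent cases can be worked through in a short sketch. When X.exp is odd, presenting −(X.exp−1)/2 leaves a factor of 2^(−1/2) unaccounted for. The text does not spell out where that factor goes, so this sketch assumes it is folded into the mantissa term by evaluating 1/sqrt(2·X.mant). Exact square roots stand in for the table-based approximation, and the function name is hypothetical.

```python
import math


def reciprocal_sqrt(x: float) -> float:
    """Compute x**-0.5 for positive finite x using the exponent rule from the text."""
    if x <= 0.0:
        raise ValueError("X.sgn must be positive")
    m, e = math.frexp(x)
    mant, exp = m * 2.0, e - 1  # mant in [1, 2), unbiased exponent
    if exp % 2 == 0:
        result_exp = -(exp // 2)            # even: present -X.exp / 2
        mant_term = 1.0 / math.sqrt(mant)
    else:
        result_exp = -((exp - 1) // 2)      # odd: present -(X.exp - 1) / 2
        # Assumption: the leftover 2**-0.5 is absorbed by the mantissa path.
        mant_term = 1.0 / math.sqrt(2.0 * mant)
    return math.ldexp(mant_term, result_exp)
```

For x = 8.0 the exponent is 3 (odd) and the mantissa is 1.0, so the result is 2^(−1) · 1/sqrt(2).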
In an embodiment, in order to calculate 2^X, a first instruction is executed that converts X in floating point form to a two's complement signed fixed-point number. In an embodiment, the signed fixed-point number has an 8 bit integer part I_X and a 24 bit fractional part F_X. 2^X can be expressed as 2^I
In an embodiment of a log2(X) instruction, log2(X) = log2((2^X.exp)(X.mant)) = X.exp + [approx(log2(X.mant))]. Coefficients for f(x) = log2(x) are looked up from the look-up table and presented to the multiplier. The multiplier calculates [approx(log2(X.mant))] and an adder (not shown) that is coupled to the output of the multiplier adds the multiplier result to the X.exp term.
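The split into a multiplier result and a trailing addition of X.exp can be illustrated end to end. The table size, the use of Taylor terms at the left edge of each interval, and the names `INTERVALS`, `LOG2_TABLE` and `approx_log2` are assumptions for this sketch, and floating point arithmetic stands in for the fixed-point datapath.

```python
import math

INTERVALS = 64  # hypothetical table size covering mantissas in [1, 2)

# Assumed coefficient choice: second-order Taylor terms of log2 at the left
# edge x0 of each interval (the text does not say how entries are derived).
LOG2_TABLE = [
    (math.log2(x0), 1.0 / (x0 * math.log(2.0)), -0.5 / (x0 * x0 * math.log(2.0)))
    for x0 in (1.0 + i / INTERVALS for i in range(INTERVALS))
]


def approx_log2(x: float) -> float:
    """Approximate log2(x) for positive finite x as X.exp + approx(log2(X.mant))."""
    if x <= 0.0:
        raise ValueError("log2 requires a positive input")
    m, e = math.frexp(x)
    mant, exp = m * 2.0, e - 1              # mant in [1, 2)
    index = int((mant - 1.0) * INTERVALS)   # leading mantissa bits pick the entry
    x2 = mant - (1.0 + index / INTERVALS)   # offset inside the interval
    c0, c1, c2 = LOG2_TABLE[index]
    mantissa_term = c0 + c1 * x2 + c2 * x2 * x2  # the multiplier's result
    return exp + mantissa_term                   # the adder after the multiplier
```

With 64 intervals this sketch stays within roughly 2e-6 of the true value over the mantissa range, and exact powers of two return their exponent unchanged.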
A processor having the functionality described above can be implemented into various computing systems as well.
The one or more processors 801 execute instructions in order to perform whatever software routines the computing system implements. The instructions frequently involve some sort of operation performed upon data. Both data and instructions are stored in system memory 803 and cache 804. Cache 804 is typically designed to have shorter latency times than system memory 803. For example, cache 804 might be integrated onto the same silicon chip(s) as the processor(s) and/or constructed with faster SRAM cells whilst system memory 803 might be constructed with slower DRAM cells. By tending to store more frequently used instructions and data in the cache 804 as opposed to the system memory 803, the overall performance efficiency of the computing system improves.
System memory 803 is deliberately made available to other components within the computing system. For example, the data received from various interfaces to the computing system (e.g., keyboard and mouse, printer port, LAN port, modem port, etc.) or retrieved from an internal storage element of the computing system (e.g., hard disk drive) are often temporarily queued into system memory 803 prior to their being operated upon by the one or more processor(s) 801 in the implementation of a software program. Similarly, data that a software program determines should be sent from the computing system to an outside entity through one of the computing system interfaces, or stored into an internal storage element, is often temporarily queued in system memory 803 prior to its being transmitted or stored.
The ICH 805 is responsible for ensuring that such data is properly passed between the system memory 803 and its appropriate corresponding computing system interface (and internal storage device if the computing system is so designed). The MCH 802 is responsible for managing the various contending requests for system memory 803 access amongst the processor(s) 801, interfaces and internal storage elements that may proximately arise in time with respect to one another.
One or more I/O devices 808 are also implemented in a typical computing system. I/O devices generally are responsible for transferring data to and/or from the computing system (e.g., a networking adapter); or, for large scale non-volatile storage within the computing system (e.g., hard disk drive). ICH 805 has bi-directional point-to-point links between itself and the observed I/O devices 808.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Fletcher, Thomas D., Hickmann, Brian J., Pineiro, Alex
Patent | Priority | Assignee | Title |
9588765, | Sep 26 2014 | Intel Corporation | Instruction and logic for multiplier selectors for merging math functions |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 24 2010 | Intel Corporation | (assignment on the face of the patent) | / | |||
Nov 24 2010 | HICKMANN, BRIAN J | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 026671 | /0708 | |
Dec 10 2010 | FLETCHER, THOMAS | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 026671 | /0708 | |
Dec 20 2010 | PINEIRO, ALEX | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 026671 | /0708 |
Date | Maintenance Fee Events |
Mar 05 2014 | ASPN: Payor Number Assigned. |
Sep 07 2017 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Nov 08 2021 | REM: Maintenance Fee Reminder Mailed. |
Apr 25 2022 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |
Date | Maintenance Schedule |
Mar 18 2017 | 4 years fee payment window open |
Sep 18 2017 | 6 months grace period start (w surcharge) |
Mar 18 2018 | patent expiry (for year 4) |
Mar 18 2020 | 2 years to revive unintentionally abandoned end. (for year 4) |
Mar 18 2021 | 8 years fee payment window open |
Sep 18 2021 | 6 months grace period start (w surcharge) |
Mar 18 2022 | patent expiry (for year 8) |
Mar 18 2024 | 2 years to revive unintentionally abandoned end. (for year 8) |
Mar 18 2025 | 12 years fee payment window open |
Sep 18 2025 | 6 months grace period start (w surcharge) |
Mar 18 2026 | patent expiry (for year 12) |
Mar 18 2028 | 2 years to revive unintentionally abandoned end. (for year 12) |