A pipelined instruction dispatch or grouping circuit allows instruction dispatch decisions to be made over multiple processor cycles. In one embodiment, the grouping circuit performs resource allocation and data dependency checks on an instruction group, based on a state vector which includes a representation of the source and destination registers of instructions within said instruction group and corresponding state vectors for instruction groups of a number of preceding processor cycles.
1. A central processing unit, comprising:
a plurality of functional units, each functional unit adapted to execute an instruction of said central processing unit; and a grouping logic circuit, including a number of pipeline stages and receiving, at each processor cycle, a group of instructions and one or more state vectors each representing states of instructions previously received at said grouping logic circuit in a preceding processor cycle wherein, based on said state vectors, said grouping logic circuit dispatches each of said currently received instructions to be executed by one of said functional units, and provides a current state vector representing states of instructions of said currently received instructions.
2. A central processing unit as in
3. A central processing unit as in
4. A central processing unit as in
5. A central processing unit as in
6. A central processing unit as in
7. A central processing unit as in
8. A central processing unit as in
9. A central processing unit as in
This application is a
An embodiment of the present invention is illustrated by the block diagram of
To simplify the discussion below, the state of CPU 100 relevant to grouping logic 109 is summarized by a state variable S(t), which is defined below. Of course, the state of CPU 100 includes also other variables, such as those conventionally included in the processor status word. Those skilled in the art would appreciate the use and implementation of processor states. Thus, the state S(t) at time t of CPU 100 can be represented by:
where
ALU1(t) and ALU2(t) are the states, at time t, of arithmetic logic units 101 and 102 respectively; LS(t), LB(t) and SB(t) are the states, at time t, of load/store unit 103, load buffer 104 and store buffer 105 respectively; FA(t), FM(t), and FSD(t) are the states, at time t, of floating point adder 106, floating point multiplier 107 and floating point divider 108 respectively.
At any given time, the state of each functional unit can be represented by the source and destination registers specified in the instructions dispatched to the functional unit but not yet retired. Thus,
where
rs1(t), rs2(t) and rd(t) are respectively the first and second source registers, and the destination register, of the instruction executing at time t in arithmetic logic unit 101.
Similarly, the state of arithmetic logic unit 102 can be defined as:
For pipelined functional units, such as floating point adder 106, the state is relatively more complex, consisting of the source and destination registers of the instructions in their respective pipelines. Thus, for the pipelined units, i.e., load/store unit 103, load buffer 104, store buffer 105, floating point adder 106, and floating point multiplier 107, their respective states, at time t, LS(t), LB(t), SB(t), FA(t) and FM(t) can be represented by:
Finally, floating point divider 108's state FSD(t) can be represented by:
State variable S(t) can be represented by a memory element, such as a register or a content addressable memory unit, at either a centralized location or in a distributed fashion. For example, in the distributed approach, the portion of state S(t) associated with a given functional unit can be implemented with the control logic of the functional unit.
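The state description above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the names `Regs`, `CpuState` and `pending_destinations`, the field layout, and the use of plain integers for register numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (rs1, rs2, rd) of one in-flight instruction; None marks an empty slot.
Regs = Optional[Tuple[int, int, int]]


@dataclass
class CpuState:
    """Hypothetical container for S(t): one entry per functional unit."""
    alu1: Regs = None                               # arithmetic logic unit 101
    alu2: Regs = None                               # arithmetic logic unit 102
    ls: List[Regs] = field(default_factory=list)    # load/store unit 103
    lb: List[Regs] = field(default_factory=list)    # load buffer 104
    sb: List[Regs] = field(default_factory=list)    # store buffer 105
    fa: List[Regs] = field(default_factory=list)    # floating point adder 106
    fm: List[Regs] = field(default_factory=list)    # floating point multiplier 107
    fsd: Regs = None                                # floating point divider 108

    def pending_destinations(self) -> set:
        """Destination registers of every instruction not yet retired."""
        slots = [self.alu1, self.alu2, self.fsd]
        for pipe in (self.ls, self.lb, self.sb, self.fa, self.fm):
            slots.extend(pipe)
        return {s[2] for s in slots if s is not None}
```

Single-instruction units hold one register triple, while pipelined units hold one triple per stage, which mirrors the distinction drawn in the text between the two kinds of unit.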
In the prior art, a grouping logic circuit would determine from the current state, S(t) at time t, the next state S(t+1), which includes information necessary to dispatch the instructions of the next processor cycle at time t+1. For example, to avoid a read-after-write hazard, such a grouping circuit would exclude from the next state S(t+1) an instruction having an operand to be fetched from a register designated for storing a result of a yet incomplete instruction. As another example, such a grouping circuit would include in state S(t+1) no more than one floating point "add" instruction in each processor cycle, since only one floating point adder (i.e. floating point adder 106) is available. As discussed above, as complexity increases, the time required for propagating through the grouping logic circuit can become a critical path for the processor cycle. Thus, in accordance with the present invention, grouping logic circuit 109 is pipelined to derive, over τ processor cycles, a future state S(t+τ) based on the present state S(t). The future state S(t+τ) determines the instruction group to dispatch at time t+τ. Pipelining grouping logic 109 is possible because, as demonstrated below, (i) the values of most state variables in the state S(t+τ) can be estimated from corresponding values of state S(t) with sufficient accuracy, and (ii) for those state variables for which values cannot be accurately predicted, it is relatively straightforward to provide for all possible outcomes of state S(t+τ), or to use a conservative approach (i.e. not dispatching an instruction when such an instruction could have been dispatched) with a slight penalty on performance.
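The two single-cycle checks named above, the read-after-write exclusion and the one-adder limit, can be illustrated with a short sketch. The names `Instr` and `select_group`, the string unit tags, and the rule of stopping at the first blocked instruction (in-order dispatch) are assumptions made for illustration. Dependencies among the candidates themselves are not considered here.

```python
from typing import Iterable, List, NamedTuple, Set


class Instr(NamedTuple):
    unit: str   # hypothetical unit tag, e.g. "alu" or "fadd"
    rs1: int
    rs2: int
    rd: int


def select_group(candidates: Iterable[Instr],
                 pending_rd: Set[int],
                 free_units: dict) -> List[Instr]:
    """Pick, in program order, the candidates that can dispatch next cycle."""
    free = dict(free_units)          # copy so the caller's counts are untouched
    group: List[Instr] = []
    for ins in candidates:
        # Read-after-write: a source is the destination of an incomplete instruction.
        raw = ins.rs1 in pending_rd or ins.rs2 in pending_rd
        if raw or free.get(ins.unit, 0) == 0:
            break                    # assumed in-order: stop at the first blocked instruction
        free[ins.unit] -= 1
        group.append(ins)
    return group
```

With `free_units={"fadd": 1}`, two consecutive floating point adds yield a group containing only the first, matching the one-adder example in the text.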
The process for predicting state S(t+τ) is explained next. The following discussion will first show that most components of next state S(t+1) can be precisely determined from present state S(t), and the remaining components of state S(t+1) can be reasonably determined, provided that certain non-deterministic conditions are appropriately handled. By induction, it can therefore be shown that future state S(t+τ), where τ is greater than 1, can likewise be determined from state S(t).
Since an instruction in floating point adder 106 or floating point multiplier 107 completes after four processor cycles and an instruction in load/store unit 103 completes after two processor cycles, the states FA, FM and LS at time t+1 can be derived from the corresponding state S(t) at time t, the immediately preceding processor cycle. In particular, the relationships governing the source and destination registers of each instruction executing in floating point adder 106, floating point multiplier 107 and load/store unit 103 between time t+1 and time t are:
where k is the depth of the respective pipeline.
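The relations themselves do not survive in this text. One reading consistent with the fixed latencies and with k being the pipeline depth is a shift register: each stage at time t+1 holds what the preceding stage held at time t. The sketch below assumes that reading. The names `advance` and `predict`, and the convention that index 0 is the entry stage, are hypothetical.

```python
from typing import List, Optional, Tuple

Regs = Optional[Tuple[int, int, int]]   # (rs1, rs2, rd) or None for a bubble


def advance(pipe: List[Regs], entering: Regs = None) -> List[Regs]:
    """Return the pipeline contents one cycle later.

    The instruction in the last stage retires, every other instruction
    moves one stage deeper, and `entering` occupies the first stage.
    """
    if not pipe:
        return []
    return [entering] + pipe[:-1]


def predict(pipe: List[Regs], cycles: int) -> List[Regs]:
    """Apply `advance` repeatedly, assuming nothing new is dispatched."""
    for _ in range(cycles):
        pipe = advance(pipe)
    return pipe
```

Because each step depends only on the previous contents, applying it repeatedly gives the state several cycles ahead, which is the induction the text relies on for S(t+τ).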
The state FSD(t+1) of floating point divider 108, in which the time required to execute an instruction can exceed one processor cycle, is determined from state FSD(t) by:
Whether or not floating point divider 108 is in its last stage can be determined from, for example, a hardware counter or a state register, which keeps track of the number of processor cycles elapsed since the instruction in floating point divider 108 began execution.
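The divider update can be sketched with a cycle counter standing in for the hardware counter mentioned above. This is an assumed reading, since the governing expression is missing from this text: the state is unchanged until the last stage, after which the unit is free. The names `DividerState` and `next_divider_state` are hypothetical, and the default `latency` of 8 is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Regs = Tuple[int, int, int]   # (rs1, rs2, rd)


@dataclass
class DividerState:
    regs: Optional[Regs] = None   # instruction in the divider, if any
    remaining: int = 0            # cycles left, as a hardware counter might track


def next_divider_state(cur: DividerState,
                       dispatched: Optional[Regs] = None,
                       latency: int = 8) -> DividerState:
    """State one cycle later; `latency` is an illustrative value only."""
    if cur.regs is not None and cur.remaining > 1:
        # Not in the last stage: same instruction, one fewer cycle to go.
        # Any `dispatched` instruction is ignored because the unit is busy.
        return DividerState(cur.regs, cur.remaining - 1)
    # Idle, or in the last stage: the unit frees up and may accept new work.
    if dispatched is not None:
        return DividerState(dispatched, latency)
    return DividerState()
```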
In load buffer 104 and store buffer 105, since the pending read or write operation at the head of each queue need not complete within one processor cycle, the state LB(t+1) at time t+1 cannot be determined from the immediately previous state LB(t) at time t with certainty. However, since state LB(t+1) can only either remain the same, or reflect the movement of the pipeline by one stage, two possible approaches to determine state LB(t+1) can be used. First, a conservative approach would predict LB(t+1) to be the same as LB(t). Under this approach, when load buffer 104 is full, an instruction is not dispatched until the pipeline in load buffer 104 advances. When the prediction is incorrect, i.e. a load instruction completes during the processor cycle of time t, this conservative approach leads to a penalty of one processor cycle, since a load instruction could have been dispatched at time t+1. Alternatively, a more aggressive approach provides for both outcomes, i.e. load buffer 104 advances one stage, and load buffer 104 remains the same. Under this aggressive approach, grouping logic 109 is ready to dispatch a load instruction, such dispatch to be enabled by a control signal which indicates, at time t+1, whether a load instruction has in fact completed. This aggressive approach requires a more complex logic circuit than the conservative approach.
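The difference between the two approaches can be shown in a few lines. In this sketch the buffer is reduced to an occupancy count and a capacity, which is a simplification. The function names and the boolean `load_completed` signal are hypothetical stand-ins for the control signal described above.

```python
from typing import Tuple


def conservative_can_dispatch(lb_occupancy: int, capacity: int) -> bool:
    """Predict LB(t+1) == LB(t): dispatch a load only if a slot is free now."""
    return lb_occupancy < capacity


def aggressive_plan(lb_occupancy: int, capacity: int) -> Tuple[bool, bool]:
    """Prepare both outcomes for time t+1.

    Returns (dispatch_if_unchanged, dispatch_if_advanced). The real
    choice is made at t+1 by a signal saying whether a load completed.
    """
    if_unchanged = lb_occupancy < capacity
    if_advanced = max(lb_occupancy - 1, 0) < capacity
    return if_unchanged, if_advanced


def resolve(plan: Tuple[bool, bool], load_completed: bool) -> bool:
    """Select the prepared decision using the late-arriving control signal."""
    return plan[1] if load_completed else plan[0]
```

For a full four-entry buffer the conservative check refuses the load outright, whereas the aggressive plan is `(False, True)` and dispatches if the completion signal arrives. That is the one-cycle penalty the conservative approach pays when its prediction is wrong.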
Thus, the skilled person would appreciate that state S(t+1) of CPU 100 can be predicted from state S(t). Consequently, both the number of instructions and the types of instructions that can be dispatched at time t+1 (i.e. the instruction group at time t+1) based on predicted state S(t+1) can be derived, at time t, from state S(t), subject to additional handling based on the actual state SA(t+1) at time t+1.
The above analysis can be extended to allow state S(t+τ) at time t+τ to be derived from state S(t) at time t. The instruction group at time t+τ can be derived at time t, provided that, for each instruction group between time t and t+τ, all instructions from that instruction group are dispatched before any instruction from a subsequent instruction group is allowed to be dispatched (i.e. no instruction group merging).
Since instructions from different instruction groups are not merged, intra-group dependencies and inter-group dependencies can be checked in parallel. The instructions are fetched from either an instruction cache or an instruction buffer. An instruction buffer is preferable in a system in which not all accesses (e.g. branch instructions) to the instruction cache are aligned, and multiple entry points in the basic blocks of a program are allowed.
Once four candidate instructions for an instruction group are identified, intra-group data dependency checking can begin. Because of the constraint against instruction group merging described above, i.e., all instructions in an instruction group must be dispatched before an instruction from a subsequent instruction group can be dispatched, intra-group dependency checking can be accomplished in a pipelined fashion. That is, intra-group dependency checking can span more than one processor cycle, and all inter-group dependency checking can occur independently of intra-group dependency checking. For the purpose of intra-group dependency check, each instruction group can be represented by:
where W is the width of the machine, and res_i represents the resource utilization of instruction i. An example of a four-stage pipeline 200 is shown in FIG. 2. In
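An intra-group check over a group of W instructions, each carrying its source registers, destination register and resource utilization, might look like the following. This is a software sketch under assumptions, not the pipelined circuit of FIG. 2. The names `Instr` and `intra_group_conflicts` and the string resource tags are hypothetical. Only read-after-write and resource conflicts are checked; write-after-write ordering is not.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    rs1: int
    rs2: int
    rd: int
    res: str   # hypothetical resource tag, e.g. "alu" or "fadd"


def intra_group_conflicts(group: List[Instr], units: dict) -> List[int]:
    """Indices of instructions that conflict with an earlier one in the group."""
    blocked: List[int] = []
    used: dict = {}          # how many earlier instructions claimed each resource
    written: set = set()     # destination registers of earlier instructions
    for i, ins in enumerate(group):
        raw = ins.rs1 in written or ins.rs2 in written
        over = used.get(ins.res, 0) >= units.get(ins.res, 0)
        if raw or over:
            blocked.append(i)
        used[ins.res] = used.get(ins.res, 0) + 1
        written.add(ins.rd)
    return blocked
```

Because the check reads only the group itself, it needs no knowledge of earlier groups, which is why it can proceed independently of the inter-group check.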
The above detailed description is provided to illustrate the specific embodiments of the present invention and is not intended to be limiting. Numerous variations and modifications within the scope of the present invention are possible. The present invention is defined by the following claims.
| Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
| Apr 07 2003 | Sun Microsystems, Inc. | (assignment on the face of the patent) | / |