Data dependency collapsing hardware apparatus

RE35311

A multi-function ALU (arithmetic/logic unit) for use in digital data processing facilitates the execution of instructions in parallel, thereby enhancing processor performance. The proposed apparatus reduces the instruction execution latency that results from data dependency hazards in a pipelined machine. This latency reduction is accomplished by collapsing the interlocks due to these hazards. The proposed apparatus achieves performance improvement while maintaining compatibility with previous implementations designed using an identical architecture.


Patent RE35311
Priority Aug 18 1992
Filed Aug 18 1994
Issued Aug 06 1996
Expiry Aug 06 2013
Inventors Vassiliadi…
Assg.orig Internatio…
Assg.curr Internatio…
Entity Large
Referenced by 17
References 8
Maint.: EXPIRED


10. A multifunction ALU (arithmetic logic unit) for combining three operands to produce a single result in response to a pair of instructions, including:

a first set of logical elements for logically combining two operands to produce a first logical result;

an adder for arithmetically combining three operands to produce a single arithmetic result;

a circuit for inputting to the adder either all of said operands, two of said operands and a zero, one of said operands, a zero, and said first logical result, or two zeros and said first logical result;

a second set of logical elements for logically combining one of said operands with said single arithmetic result to produce a second logical result; and

a circuit for providing as an output either said arithmetic result or said second logical result.

1. In a computer architected for serial execution of a sequence of single scalar instructions in a succession of execution cycles, an apparatus for supporting parallel execution of a plurality of scalar instructions in a single instruction cycle, the apparatus comprising:

an instruction means for receiving a plurality of scalar instructions, a first of the scalar instructions producing a calculation result used as an operand by a second of the scalar instructions;

an operand means for substantially simultaneously providing a plurality of operands, at least two of said operands being used by the first and second scalar instructions;

a control means connected to the instruction means for generating control signals to indicate operations which execute the plurality of scalar instructions; and

an execution means connected to the operand means and to the control means and responsive to the control signals and to a plurality of operands including the two operands for producing, in a single execution cycle, a single result corresponding to the performance of said operations on said plurality of operands.

12. In a computer architected for serial execution of a sequence of scalar instructions in a succession of execution periods, an interlock-collapsing apparatus for supporting simultaneous parallel execution of a plurality of scalar instructions, the apparatus comprising:

an instruction register means for receiving a plurality of scalar instructions for simultaneous execution, a first instruction of the plurality of scalar instructions producing a result used as an operand by a second instruction of the plurality of scalar instructions;

an operand means for substantially simultaneously providing a plurality of operands used in executing the plurality of scalar instructions;

a control means connected to the instruction register means for generating control signals to indicate operations which execute the plurality of scalar instructions; and

an interlock-collapsing execution means connected to the operand means and to the control means and responsive to the control signals and to the plurality of operands for producing a single result corresponding to the simultaneous execution of said first and second instructions in a single execution period.

21. In a computer architected for serial execution of a sequence of scalar instructions in a succession of execution cycles, an execution apparatus for, in a single execution cycle, producing a result representing simultaneous execution of a first scalar instruction and a second scalar instruction in which the second scalar instruction requires a result produced by execution of the first scalar instruction, the execution apparatus comprising:

an instruction register means for receiving the first and second scalar instructions;

an operand means for substantially simultaneously providing a plurality of operands, at least two of the plurality of operands being used in executing the first and second scalar instructions;

a control means connected to the instruction register means for generating control signals which indicate execution of the first scalar instruction and the second scalar instruction;

a first execution means connected to the operand means and to the control means and responsive to the control signals and to the two operands for producing, in an execution cycle, a result corresponding to the execution of the first instruction; and

2. The apparatus of claim 1, wherein the execution means includes an adder which produces a single adder result in response to three operands.

3. The apparatus of claim 2, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry look ahead adder, connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder.

4. The apparatus of claim 2, wherein the execution means further includes logical means connected to the operand means and to the adder for performing a logic function on the operands to produce a logic result, the adder producing said single adder result in response to the logic result and one of the operands.

5. The apparatus of claim 2, wherein the execution means further includes logic means connected to the operand means and to the adder for performing a logic function on a first and second operand to produce a logic result, the execution means producing the single result in response to the logic result and the single adder result.

6. The apparatus of claim 1 wherein the first scalar instruction is a logical instruction and the second scalar instruction is an arithmetic instruction and the execution means includes logical means for combining first and second operands to produce a logical result required by said logical instruction and arithmetic means for combining the logical result with a third operand to produce said single result, said single result being required by the arithmetic instruction.

7. The apparatus of claim 1 wherein the first scalar instruction is an arithmetic instruction and the second scalar instruction is a logical instruction and the execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logical means for combining the arithmetic result with a third operand to produce said single result, said single result being required by the logical instruction.

8. The apparatus of claim 1, wherein the first scalar instruction is an arithmetic instruction and the second scalar instruction is an arithmetic instruction and the execution means includes arithmetic means for combining the three operands to produce a single arithmetic result, said single arithmetic result being provided as said single result.

9. The apparatus of claim 1 wherein the first scalar instruction is a logical instruction and the second scalar instruction is a logical instruction and the execution means includes logical means for combining first and second operands to produce a first logical result, said first logical result required by said first logical instruction, and second logical means for combining the first logical result with a third operand to produce a second logical result, said second logical result being required by the second scalar instruction and said second logical result being provided as said single result.

11. The multifunction ALU of claim 10, wherein said adder includes:

a carry-save adder for producing two outputs in response to three operands; and

a carry look ahead adder connected to said carry save adder for producing one output in response to said two outputs.

13. The apparatus of claim 12, wherein the interlock-collapsing execution means includes an adder which produces a single adder result in response to three operands.

14. The apparatus of claim 13, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry lookahead adder connected to the carry save adder which produces one output in response to the two outputs of the carry save adder.

15. The apparatus of claim 13, wherein the interlock-collapsing execution means further includes logic means connected to the operand means and to the adder for performing a logic function on the operands to produce a logic result, the adder producing the single adder result in response to the logic result and one of the operands.

16. The apparatus of claim 13, wherein the interlock-collapsing execution means further includes logic means connected to the operand means and to the adder for performing a logic function on a first operand and a second operand to produce a logic result, the interlock-collapsing execution means producing the single result in response to the logic result and the single adder result.

17. The apparatus of claim 12, wherein the first instruction is a logical instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes logical means for combining first and second operands to produce a logic result required by the logical instruction and arithmetic means for combining the logic result with a third operand to produce the single result, the single result representing execution of the arithmetic instruction.

18. The apparatus of claim 12, wherein the first instruction is an arithmetic instruction and the second instruction is a logic instruction and the interlock-collapsing execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logic means for combining the arithmetic result with a third operand to produce the single result, the single result representing execution of the logical instruction.

19. The apparatus of claim 12, wherein the first instruction is an arithmetic instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes arithmetic means for combining three operands to produce a single arithmetic result, the three operands including two operands used in the execution of the first and second instructions.

20. The apparatus of claim 12, wherein the first instruction is a first logic instruction and the second instruction is a second logic instruction and the interlock-collapsing execution means includes logic means for combining first and second operands to produce a first logic result, the first logic result being required by the first logic instruction, and second logic means for combining the first logic result with a third operand to produce a second logic result, the second logic result representing execution of the second logic instruction and the second logic result being provided as the single result.

22. The apparatus of claim 21, wherein the first execution means includes an adder which produces a single adder result in response to two operands.

23. The apparatus of claim 21, wherein the second execution means includes an adder which produces a single adder result in response to three operands.

24. The apparatus of claim 23, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry lookahead adder connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder.

25. In a computer system, an apparatus for supporting parallel execution of a plurality of instructions in a single execution cycle, the apparatus comprising:

an instruction means for receiving a plurality of instructions, a first of the instructions producing a calculation result used as an operand by a second of the instructions;

an operand means for substantially simultaneously providing a plurality of operands;

a control means connected to the instruction means for generating control signals to indicate operations which execute the plurality of instructions; and

an execution means connected to the operand means and to the control means and responsive to the control signals and to the plurality of operands for producing a single result corresponding to the performance of said operations, including execution of the first and second of the instructions, on said plurality of operands in a single execution cycle.

26. The apparatus of claim 25, wherein the execution means includes an adder which produces a single adder result in response to three operands in the single execution cycle.

27. The apparatus of claim 26, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry look ahead adder, connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder.

28. The apparatus of claim 26, wherein the execution means further includes logical means connected to the operand means and to the adder for performing a logic function on the operands to produce a logic result, the adder producing said single adder result in response to the logic result and one of the operands.

29. The apparatus of claim 26, wherein the execution means further includes logic means connected to the operand means and to the adder for performing a logic function on a first and second operand to produce a logic result, the execution means producing the single result in response to the logic result and the single adder result.

30. The apparatus of claim 25, wherein the first instruction is a logical instruction and the second instruction is an arithmetic instruction and the execution means includes logical means for combining first and second operands to produce a logical result required by said logical instruction and arithmetic means for combining the logical result with a third operand to produce said single result, said single result being required by the arithmetic instruction.

31. The apparatus of claim 25 wherein the first instruction is an arithmetic instruction and the second instruction is a logical instruction and the execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logical means for combining the arithmetic result with a third operand to produce said single result, said single result being required by the logical instruction.

32. The apparatus of claim 25, wherein the first instruction is an arithmetic instruction and the second instruction is an arithmetic instruction and the execution means includes arithmetic means for combining the three operands to produce a single arithmetic result, said single arithmetic result being provided as said single result.

33. The apparatus of claim 25, wherein the first instruction is a logical instruction and the second instruction is a logical instruction and the execution means includes logical means for combining first and second operands to produce a first logical result, said first logical result required by said first logical instruction, and second logical means for combining the first logical result with a third operand to produce a second logical result, said second logical result being required by the second instruction and said second logical result being provided as said single result.

34. In a computer system, an interlocking-collapsing apparatus for supporting simultaneous parallel execution of a plurality of instructions, the apparatus comprising:

an instruction register means for receiving a plurality of instructions for simultaneous execution, a first instruction of the plurality of instructions producing a result used as an operand by a second instruction of the plurality of instructions;

an operand means for substantially simultaneously providing a plurality of operands used in executing the plurality of instructions;

a control means coupled to the instruction register means for generating control signals to indicate operations which execute the plurality of instructions; and

an interlock-collapsing execution means coupled to the operand means and to the control means and responsive to the control signals and to the plurality of operands for producing a result corresponding to the simultaneous execution of first and second instructions in a single execution period.

35. The apparatus of claim 34, wherein the interlock-collapsing execution means includes a carry save adder which produces two outputs in response to receiving three operands and a carry look ahead adder connected to the carry save adder which produces a single adder result in the execution period.

36. The apparatus of claim 35 further including logic means connected to the operand means and to the interlock-collapsing execution means for performing a logic function on the operands to produce a logic result, the interlock-collapsing execution means including an adder which produces the single adder result in response to the logic result and one of the operands.

37. The apparatus of claim 35 further including logic means connected to the operand means and to the interlock-collapsing execution means for performing a logic function on a first operand and a second operand to produce a logic result, the interlock-collapsing execution means producing the result in response to the logic result and the single adder result.

38. The apparatus of claim 34, wherein the interlock-collapsing execution means includes a carry save adder which produces two outputs in response to receiving three operands and a carry look ahead adder connected to the carry save adder which produces one output in response to the two outputs of the carry save adder.

39. The apparatus of claim 34, wherein the first instruction is a logical instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes logic means for combining first and second operands to produce a logic result required by the logical instruction and arithmetic means for combining the logic result with a third operand to produce a single result, the single result representing execution of the arithmetic instruction.

40. The apparatus of claim 34, wherein the first instruction is an arithmetic instruction and the second instruction is a logic instruction and the interlock-collapsing execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logic means for combining the arithmetic result with a third operand to produce a single result, the single result representing execution of the logical instruction.

41. The apparatus of claim 34, wherein the first instruction is an arithmetic instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes arithmetic means for combining three operands to produce a single arithmetic result, the three operands including two operands used in the execution of the first and second instructions.

42. The apparatus of claim 34, wherein the first instruction is a first logic instruction and the second instruction is a second logic instruction and the interlock-collapsing execution means includes logic means for combining first and second operands to produce a first logic result, the first logic result being required by the first logic instruction, and second logic means for combining the first logic result with a third operand to produce a second logic result, the second logic result representing execution of the second logic instruction and the second logic result being provided as the single result.

43. In a computer system, an execution apparatus for, in a single execution cycle, producing a result representing simultaneous execution of a first instruction and a second instruction in which the second instruction requires a result produced by execution of the first instruction, the execution apparatus comprising:

an instruction register means for receiving the first and second instruction;

an operand means for substantially simultaneously providing a plurality of operands, at least two of the plurality of operands being used in execution of the first and second instructions;

a control means connected to the instruction register means for generating control signals which indicate execution of the first instruction and the second instruction;

a second execution means connected to the operand means and to the control means and responsive to the control signals and to a plurality of operands including the two operands for producing, in said execution cycle, a single result corresponding to the execution of the first and second instructions.

44. The apparatus of claim 43, wherein the first execution means includes an adder which produces a single adder result in response to two operands.

45. The apparatus of claim 43, wherein the second execution means includes an adder which produces a single adder result in response to three operands.

46. The apparatus of claim 45, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry look ahead adder connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder.

70 870 which decodes the first instruction in the instruction field 52 to provide register select signals to a conventional cross connect 872 which is also connected to the general purpose registers 63. The logic 870 also provides function select signals on output 874 to a conventional two-operand ALU 875. This ALU apparatus is provided for execution of the instruction in instruction field 52, while the second instruction in instruction field 54 is executed by the ALU 65. As described below, the ALU 65 can execute the second instruction whether or not one of its operands depends upon result data produced by execution of the first instruction. Both ALUs therefore operate in parallel to provide concurrent execution of two instructions, whether or not compounded.

Returning to the compounded instructions 52 and 54 and the register 50, the existence of a compounder is presumed. It is asserted that the compounder pairs or compounds the instructions from an instruction stream including a sequence of scalar instructions input to a scalar computing machine in which the compounder resides. The compounder groups instructions according to the discussion above. For example, category 1 instructions (FIG. 5) are grouped in logical/add, add/logical, logical/logical, and add/add pairs in accordance with Table 5. To each instruction of a compound set there is added a tag containing control information. The tag includes compounding bits which refer to the part of a tag used specifically to identify groups of compound instructions. Preferably, in the case of compounding two instructions, the following procedure is used to indicate where compounding occurs. In the System/370 machines, all instructions are aligned on a half word boundary and their lengths are either 2, 4 or 6 bytes. In this case, a compounding tag is needed for every half word. A one-bit tag is sufficient to indicate whether an instruction is or is not compounded. Preferably, a "1" indicates that the instruction that begins in a byte under consideration is compounded with the following instruction. A "0" indicates no compounding. The compounding bit associated with half words that do not contain the first byte of an instruction is ignored. The compounding bit for the first byte of the second instruction in the compound pair is also ignored. Consequently, only one bit of information is needed to identify and appropriately execute compounded instructions. Thus, the tag bits 56 and 58 are sufficient to inform the decode and control logic 60 that the instructions in register fields 52 and 54 are to be compounded, that is, executed in parallel.
The decode and control logic 60 then inspects the instructions 52 and 54 to determine what their execution sequence is, what interlock conditions, if any, obtain, and what functions are required. This determination is illustrated for category 1 instructions in FIG. 5. The decode and control logic also determines the functions required to collapse any data hazard interlock as per FIGS. 6A and 6B. These determinations are consolidated in FIGS. 7A and 7B. In FIGS. 7A and 7B, assuming that the decode and control logic 60 has, from the tag bits, determined that instructions in fields 52 and 54 are to be compounded, the logic 60 sends out a function select signal on output 66 indicating the desired operation according to the left-most column of FIG. 7A. The OP codes of the instructions are explicitly decoded to provide, in the function select output, the specific operations in the columns headed OP1 and OP2 of FIGS. 7A and 7B. The register select signals on output 62 route the registers in FIG. 8 by way of the cross-connect 64 as required in the AI0, AI1, and AI2 columns of FIGS. 7A and 7B. Thus, for example, assume that the first instruction in field 52 is ADD R1, R2, and that the second instruction is ADD R1, R4. The eighteenth line in FIG. 7A shows the ALU operations which the decode and control circuit indicates by OP1=+ and OP2=+, while register R2 is routed to input AI0, register R4 to input AI1, and register R1 to input AI2.
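
The one-bit-per-half-word tagging rule described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `tag_halfwords` and `compounded_starts`, the list-based representation of the instruction stream, and the choice to store ignored tag bits as 0 are all hypothetical and are not taken from the patent.

```python
def tag_halfwords(lengths, compounded_first):
    """Build one tag bit per half word for an instruction stream.

    lengths          -- instruction lengths in bytes (2, 4 or 6), in stream order
    compounded_first -- indices of instructions that are the FIRST member of a
                        compound pair (hypothetical input from a compounder)
    """
    tags = []
    for index, length in enumerate(lengths):
        if length not in (2, 4, 6):
            raise ValueError("instruction length must be 2, 4 or 6 bytes")
        # Tag of the half word holding the first byte of the instruction.
        tags.append(1 if index in compounded_first else 0)
        # Remaining half words of the instruction carry ignored bits
        # (stored as 0 here; the decoder never looks at them).
        tags.extend([0] * (length // 2 - 1))
    return tags


def compounded_starts(lengths, tags):
    """Recover which instructions start a compound pair from the tag bits.

    Only the tag on the first half word of an instruction is consulted, and
    the tag of the second instruction of a pair is skipped, as in the text.
    """
    starts = []
    halfword = 0
    index = 0
    while index < len(lengths):
        if tags[halfword] == 1 and index + 1 < len(lengths):
            starts.append(index)
            # Skip both members of the pair.
            halfword += lengths[index] // 2 + lengths[index + 1] // 2
            index += 2
        else:
            halfword += lengths[index] // 2
            index += 1
    return starts


if __name__ == "__main__":
    lengths = [4, 2, 6, 2]
    tags = tag_halfwords(lengths, {0, 2})
    print(tags)
    print(compounded_starts(lengths, tags))
```

Because the decoder reads only the tag on the first half word of the first instruction of each pair, arbitrary values in the ignored positions do not change the outcome.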

Refer now to FIG. 9 for an understanding of the structure and operation of the data dependency collapsing ALU 65. In FIG. 9, a three-operand, single-result adder 70, corresponding to the adder of FIG. 3 is shown. The adder 70 obtains inputs through circuits connected between the adder inputs and the ALU inputs AI0, AI1, and AI2. From the input AI2, an operand is routed through three logic functional elements 71, 72 and 73 corresponding to logical AND, logical OR, and logical EXCLUSIVE-OR, respectively. This operand is combined in these logical elements with one of the other operands and routed to AI0 or AI1 according to the setting of the multiplexer 80. The multiplexer 75 selects either the unaltered operand connected to AI2 or the output of one of the logical elements 71, 72, or 73. The input selected by the multiplexer 75 is provided to an inverter 77, and the multiplexer 78 connects to one input of the adder 70 either the output of the inverter 77 or the uninverted output of the multiplexer 75. The second input to the adder 70 is obtained from ALU input AI1 by way of a multiplexer 82 which selects either "0" or the operand connected to ALU input AI1. The output of the multiplexer is inverted through inverter 84 and the multiplexer 85 selects either the noninverted or the inverted output of the multiplexer 82 as a second operand input to the adder 70. The third input to the adder 70 is obtained from input AI0 which is inverted through inverter 87. The multiplexer 88 selects either "0", the operand input to AI0, or its inverse provided as a third input to the adder 70. The ALU output is obtained through the multiplexer 95 which selects the output of the adder 70 or the output of one of the logical elements 90, 92 or 93. The logical elements 90, 92, and 93 combine the output of the adder by means of the indicated logical operation with the operand input to AI1.

It should be evident that the function select signal consists essentially of the multiplexer select signals A B C D E F G and the "hot" 1/0 selections input to the adder 70. It will be evident that the multiplexer select signals range from a single bit for signals A, B, E, and F to two-bit signals for C, D, and G.

The states of the complex control signal (A B C D E F G 1/0 1/0) are easily derived from FIGS. 7A and 7B. For example, following the ADD R1, R2 ADD R1, R4 example given above, the OP1 signal would set multiplexer signal C to select the signal present on AI2, while the F signal would select the noninverted output of the multiplexer 75, thereby providing the operand in R1 to the right-most input of the adder 70. Similarly, the multiplexer signals B and E would be set to provide the operand available at AI1 in uninverted form to the middle input of the adder 70, while the multiplexer signal D would be set to provide the operand at AI0 to the left-most input of the adder 70, without inversion. Last, the two "1/0" inputs are set appropriately for the two add operations. With these inputs, the output of the adder 70 is simply the sum of the three operands, which corresponds to the desired output of the ALU. Therefore, the control signal G would be set so that the multiplexer 95 would output the result produced by the adder 70, which would be the sum of the operands in registers R1, R2, and R4.
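
As an illustration of the ADD R1, R2 / ADD R1, R4 example, the following sketch compares serial execution of the two adds with a single three-operand addition. The carry-save stage followed by a final two-input addition mirrors the adder structure recited in the claims; the 32-bit width, the function names, and the use of Python's `+` in place of a carry look-ahead adder are assumptions made for this sketch only.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data path


def carry_save(a, b, c):
    """3-to-2 reduction: three operands in, a sum word and a carry word out."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word & MASK, carry_word & MASK


def add3(a, b, c):
    """Three-operand, single-result add (carry-save stage plus final add)."""
    sum_word, carry_word = carry_save(a, b, c)
    # Python's + stands in for the carry look-ahead adder here.
    return (sum_word + carry_word) & MASK


def sequential(r1, r2, r4):
    """Serial execution: ADD R1,R2 followed by ADD R1,R4."""
    r1 = (r1 + r2) & MASK
    r1 = (r1 + r4) & MASK
    return r1


if __name__ == "__main__":
    r1, r2, r4 = 5, 7, 9
    print(sequential(r1, r2, r4), add3(r1, r2, r4))
```

Both paths give the same value modulo the word size, so the dependent second add does not have to wait for the first add to complete.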

When compounding a logical/add sequence, the logical function would be selected by the multiplexer 75 and provided through the multiplexer 78 to the adder 70, while the operand to be added to the logical operation would be guided through one of the multiplexers 85 or 88 to one of the other inputs of the adder 70, with a 0 being provided to the third input. In this case, the multiplexer 95 would be set to select the output of the adder 70 as the result.

Last, in an add/logical compound sequence, the two operands to be first added will be guided to two of the inputs of the adder 70, while the 0 will be provided to the third input. The output of the adder is instantaneously combined with the non-selected operand in logical elements 90, 92 and 93. The control signal G will be set to select the output of the element whose operation corresponds to the second instruction of the compound set.

More generally, FIG. 9 presents a logical representation of the data dependency collapsing ALU 65. In deriving this dataflow, the decision was made not to support interlocks in which the result of the first instruction is used as both operands of the second instruction. More discussion of this can be found in the "Idiosyncrasies" section. That this representation implements the other operations required by LOGICAL-ADD compoundings can be seen by comparing the dataflow with the function column of FIG. 5. In this column, a LOGICAL-type operation upon two operands is followed by an ADD-type operation between the LOGICAL result and a third operand. This is performed by routing the operands to be logically combined to AI0 and AI2 of FIG. 9 and through the appropriate one of logical blocks 71, 72, or 73, routing this result to the adder 70, and routing the third operand through AI1 to the adder. Inversions and provision of hot 1's or 0's are provided as part of the function select signal as required by the arithmetic operation specified. In other cases, an ADD-type operation between two operands is followed by a LOGICAL-type operation between the result of the ADD-type and a third operand. This is performed by routing the operands for the ADD-type operation to AI0 and AI2, routing these inputs to the adder, routing the output of the adder to the post-adder logical blocks 90, 92 and 93, and routing the third operand through AI3 to these post-adder logical blocks. LOGICAL-type followed by LOGICAL-type operations are performed by routing the two operands for the first LOGICAL-type to AI0 and AI2 which are routed to the pre-adder logical blocks, routing the results from the pre-adder logical blocks through the ALU without modification by addition to zero to the post-adder logical block, and routing the third operand through AI3 to the post-adder logical block. For an ADD-type operation followed by an ADD-type operation, the three operands are routed to the inputs of the adder, and the output of the adder is presented to the output of the ALU.
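The four routing cases described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of FIG. 9: the function name `collapsing_alu`, the 32-bit width, and the choice of AND, OR, and XOR for the three logical blocks are assumptions made for illustration, and subtraction (the inversions and hot 1's) is left out.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit datapath (illustrative)

# Assumed functions of the three pre-adder / post-adder logical blocks.
LOGICAL = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def collapsing_alu(first_op, second_op, a, b, c):
    """Return second_op(first_op(a, b), c) in a single pass.

    first_op and second_op are "ADD" or a key of LOGICAL.
    a and b are the operands of the first instruction; c is the
    remaining operand of the dependent second instruction.
    """
    if first_op == "ADD" and second_op == "ADD":
        # ADD then ADD: all three operands go straight to the adder.
        return (a + b + c) & MASK
    if first_op == "ADD":
        # ADD then LOGICAL: adder sees a, b and a zero; the sum is
        # combined with c in the post-adder logical block.
        total = (a + b + 0) & MASK
        return LOGICAL[second_op](total, c)
    # The first instruction is LOGICAL: use the pre-adder logical block.
    pre = LOGICAL[first_op](a, b)
    if second_op == "ADD":
        # LOGICAL then ADD: logical result, third operand, and a zero.
        return (pre + c + 0) & MASK
    # LOGICAL then LOGICAL: flow the first result through the adder
    # unchanged (addition to zero), then apply the post-adder block.
    passed = (pre + 0 + 0) & MASK
    return LOGICAL[second_op](passed, c)
```

In each case the value returned equals what serial execution of the two instructions would produce, which is the property the compounded pair must preserve.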

The operation of the ALU 65 to execute the second instruction in instruction field 54 when there is no data dependency between the first and second instructions is straightforward. In this case, only two operands are provided to the ALU. Therefore, if the second instruction is an add instruction, the two operands will be provided to the adder 70, together with a zero in the place of the third operand, with the output of the adder being selected through the multiplexer 95 as the output of the ALU. If the second instruction is a logical instruction, the logical operation can be performed by routing the two operands to the logical elements 71, 72, and 73, selecting the appropriate output, and then flowing the result through the adder 70 by providing zeros to the other two adder inputs. In this case, the output of the adder would be equal to the logical result and would be selected by the multiplexer 95 as the output of the ALU. Alternatively, one operand can be flowed through the adder by addition of two zeros, which will result in the adder 70 providing this operand as an output. This operand is combined with the other operand in the logical elements 90, 92, and 93, with the appropriate logical element output being selected by the multiplexer 95 as the output of the ALU.

When instructions are compounded as illustrated in FIG. 8, whether or not dependency exists, the instruction in instruction field 52 of register 50 will be conventionally executed by decoding of the instruction 870, 874, selection of its operands by 70, 871, 872, and performance of the selected operation on the selected operands in the ALU 875. Since the ALU 875 is provided for execution of a single instruction, two operands are provided from the selected register through the inputs AI0 and AI1, with the indicated result being provided on the output 877.

Thus, with the configuration illustrated in FIG. 8, the data dependency collapsing ALU 65, in combination with the conventional ALU 875, supports the concurrent (or parallel) execution of two instructions, even when a data dependency exists between the instructions.

AHAZ-COLLAPSING ALU

Address generation can also be affected by data hazards which will be referred to as address hazards, AHAZ. The following sequence represents a compounded sequence of System/370 instructions that is free of address hazards:

AR R1,R2

S R3,D(R4,R5)

where D represents a three nibble displacement. No AHAZ exists since R4 and R5 which are used in the address calculation were not altered by the preceding instruction. Address hazards do exist in the following sequences:

AR R1,R2

S R3,D(R1,R5)

AR R1,R2

S R3,D(R4,R1)

The above sequences demonstrate the compounding of an RR instruction (category 1 in FIG. 5) with RX instructions (category 9) presenting AHAZ. Other combinations include RR instructions compounded with RS and SI instructions.

For an interlock collapsing ALU, new operations arising from collapsing AHAZ interlocks must be derived by analyzing all combinations of instruction sequences and address operand conflicts. Analysis indicates that common interlocks, such as the ones contained in the above instruction sequences, can be collapsed with a four-to-one ALU.

The functions that would have to be supported by an ALU to collapse all AHAZ interlocks for a System/370 instruction level architecture are listed in FIG. 10. For those cases where four inputs are not specified, an implicit zero is to be provided. The logical diagram of an AHAZ interlock collapsing ALU defined by FIG. 10 is given in FIG. 11. A large subset, but not all, of the functions specified in FIG. 10 are supported by the illustrated ALU. This subset consists of the functions given in rows one to 21 of FIG. 10. The decision as to which functions to include is an implementation decision whose discussion is deferred to the "Idiosyncrasies" section.

As FIG. 11 shows, the illustrated ALU includes an adder 100 in which two three-input, two-output carry save adders 101 and 102 are cascaded with a two-input, single-output carry look ahead adder 103 in such a manner that the adder 100 is effectively a four-operand, single-result adder necessary for operation of the ALU in FIG. 11.
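A behavioral sketch of such a cascade may help. The wiring below (first carry-save adder takes three operands, the second takes its two outputs plus the fourth operand) is one plausible arrangement and is an assumption, as are the function names and the 32-bit width; the final `+` stands in for the carry look ahead adder.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit datapath (illustrative)


def carry_save_add(x, y, z):
    """Three-input, two-output carry-save adder.

    Returns (sum_bits, carry_bits) such that
    sum_bits + carry_bits == x + y + z (mod 2**32).
    """
    sum_bits = x ^ y ^ z
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1
    return sum_bits & MASK, carry_bits & MASK


def four_operand_add(a, b, c, d):
    """Four operands, single result: two CSA stages, then one
    carry-propagating addition."""
    s1, c1 = carry_save_add(a, b, c)    # first CSA stage
    s2, c2 = carry_save_add(s1, c1, d)  # second CSA stage
    return (s2 + c2) & MASK             # final two-input adder
```

No carry propagates inside either carry-save stage; only the last two-input addition ripples carries across the word, which is why the extra operands add comparatively little delay.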

In generating FIG. 10, the complexity of the ALU structure was simplified at the expense of the control logic. This is best explained by example. Consider the two following System/370 instruction sequences:

NR R1,R2 (4)

S R3,D(R1,R5)

and

NR R1,R2 (5)

S R3,D(R4,R1).

Let the general notation for this sequence be

NR r1,r2

S r3,D(r4,r5).

For the first sequence, the address of the operand is:

OA=D+(R1∩R2)+R5

while that for the second sequence is:

OA=D+R4+(R1∩R2)

To simplify the execution controls at the expense of ALU complexity, the following two operations would need to be executed by the ALU:

OA=AGI0+(AGI1∩AGI2)+AGI3

OA=AGI0+AGI2+(AGI1∩AGI3)

in which D is fed to AGI0, r2 is fed to AGI1, r4 is fed to AGI2 and r5 is supplied to AGI3. The ALU could be simplified, however, if the controls detect which of r4 and r5 possesses a hazard with r1 and dynamically route this register to AGI2. The other register would be fed to AGI3. Under this assumption, the ALU must only support the operation:

OA=AGI0+(AGI1∩AGI2)+AGI3

Trade-offs such as these are made in favor of reducing the complexity of the address generation ALU as well as the execution and branch determination ALU's.
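The routing trade-off can be sketched in software. The helper names `route_address_inputs` and `collapsed_address`, the register-file dictionary, and the 32-bit mask are illustrative assumptions; the sketch also ignores the System/370 convention that register 0 in a base or index field contributes zero.

```python
MASK = 0xFFFFFFFF  # assumed width; real address widths differ (illustrative)


def route_address_inputs(d, regs, r1, r2, r4, r5):
    """Control-side routing for NR r1,r2 followed by S r3,D(r4,r5).

    regs maps register numbers to their values before NR executes.
    Returns the tuple (AGI0, AGI1, AGI2, AGI3).
    """
    if r4 == r1 and r5 == r1:
        # Both address registers conflict: outside the single function
        # modeled here.
        raise ValueError("both address registers conflict with r1")
    if r4 == r1:
        hazard, other = r4, r5
    elif r5 == r1:
        hazard, other = r5, r4
    else:
        raise ValueError("no address hazard to collapse")
    # The conflicting register always goes to AGI2, the other to AGI3.
    return d, regs[r2], regs[hazard], regs[other]


def collapsed_address(agi0, agi1, agi2, agi3):
    """The single ALU function left once the controls do the routing."""
    return (agi0 + (agi1 & agi2) + agi3) & MASK
```

For example, with `regs = {1: 0b1100, 2: 0b1010, 4: 0x100, 5: 0x200}` and a displacement of `0x10`, both sequences (4) and (5) are served by the same `collapsed_address` function; only the routing differs.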

The ALU of FIG. 11 can be substituted for the ALU 65 in FIG. 8. In this case, the decode and control logic 60 would appropriately reflect the functions of FIG. 10.

BRANCH HAZARD-COLLAPSING ALU

Analyses similar to those for the interlock collapsing ALU's for execution and address generation must be performed to derive the effects of compounding on a branch determination ALU, which is given by FIGS. 12 and 13. The branch determination ALU covers functions required by instructions comparing register values. This includes the branch instructions BXLE, BXH, BCT, and BCTR, in which a register value is incremented by the contents of a second register (BXLE and BXH) or is decremented by one (BCT and BCTR) before being compared with a register value (BXLE and BXH) or 0 (BCT and BCTR) to determine the result of the branch. Conditional branches are not executed by this ALU.

The ALU illustrated in FIG. 13 includes a multi-stage adder 110 in which two carry save adders 111 and 112 are cascaded, with the two outputs of the carry save adder 112 providing the two inputs for the carry look ahead adder 113. This combination effectively provides the four-input, single result adder provided for the ALU of FIG. 13.

As an example of the data hazards that can occur, consider the following instruction sequence:

AL R1,D(R2,R3)

BCT R1,D(R2,R3)

Let [x] denote the contents of memory location x. The results following execution are:

R1=R1+[D+R2+R3]-1

Branch if (R1+[D+R2+R3])-1=0

This comparison could be done by performing the operation:

R1+[D+R2+R3]-1.
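As a rough illustration, the decrement by one can be treated as one more adder input (the two's complement of one), so the value to be tested comes from a single multi-operand addition. The function names and the 32-bit width below are assumptions, and the memory operand is passed in as an already-fetched value.

```python
MASK = 0xFFFFFFFF      # assumed 32-bit registers (illustrative)
MINUS_ONE = MASK       # two's complement representation of -1


def serial_al_then_bct(r1, operand):
    """Reference: execute the add, then the decrement, one after the other."""
    r1 = (r1 + operand) & MASK  # AL R1,D(R2,R3)
    r1 = (r1 - 1) & MASK        # decrement performed by BCT
    return r1


def collapsed_al_bct(r1, operand):
    """Collapsed form: one addition of three inputs yields the value
    that the branch determination logic then compares with zero."""
    return (r1 + operand + MINUS_ONE) & MASK
```

The two functions agree for all inputs, so the comparison with zero can be made without waiting for the first instruction to write R1.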

The results of analyses for the branch determination ALU are provided in FIGS. 12 and 13 without further discussion. The functions supported by the dataflow include those specified by rows one to 25 of FIG. 12.

The ALU of FIG. 13 can be substituted for the ALU 65 in FIG. 8. In this case, the decode and control logic 60 would appropriately reflect the functions of FIG. 12.

IDIOSYNCRASIES

Some of the functions that arise from operand conflicts are more complicated than others. For example, the instruction sequence:

AR R1,R2

AR R1,R1

requires a four-to-one ALU, along with its attendant complexity, to collapse the data interlock because its execution results in:

R1=(R1+R2)+(R1+R2).

Other sequences result in operations that require additional delay to be incorporated into the ALU in order to collapse the interlock. A sequence which illustrates increased delay is:

SR R1,R2

LPR R1,R1

which results in the operation

R1=|R1-R2|.

This operation does not lend itself to parallel execution because the results of the subtraction are needed to set up the execution of the absolute value.

Rather than collapse all interlocks in the ALU, an instruction issuing logic or a preprocessor can be provided which is designed to detect instruction sequences that lead to these more complicated functions. Preprocessor detection avoids adding delay to the issue logic which is often a near-critical path. When such a sequence is detected, the issuing logic or preprocessor would revert to issuing the sequence in scalar mode, avoiding the need to collapse the interlock. The decision as to which instruction sequences should or should not have their interlocks collapsed is an implementation decision which is dependent upon factors beyond the scope of this invention. Nevertheless, the trade-off between ALU implementation complexity and issuing logic complexity should be noted.
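A minimal sketch of such a detection step follows. The `Instr` record, the `issue_mode` function, and the `RESULT_DEPENDENT` set are hypothetical names introduced for illustration; which sequences actually revert to scalar mode is, as stated above, an implementation decision.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    opcode: str
    target: int               # register written by the instruction
    sources: Tuple[int, ...]  # registers read by the instruction


# Hypothetical set of second-instruction opcodes whose setup needs the
# first result itself (e.g. its sign), as in the SR / LPR example.
RESULT_DEPENDENT = {"LPR"}


def issue_mode(first: Instr, second: Instr) -> str:
    """Return "compound" or "scalar" for an instruction pair."""
    hits = sum(1 for reg in second.sources if reg == first.target)
    if hits == 0:
        return "compound"  # no interlock to collapse
    if hits >= 2 or second.opcode in RESULT_DEPENDENT:
        return "scalar"    # one of the more complicated functions
    return "compound"      # simple interlock, collapsed in the ALU
```

Here `AR R1,R2` is written as `Instr("AR", 1, (1, 2))`. Followed by `Instr("AR", 3, (3, 1))` the pair is compounded, while `Instr("AR", 1, (1, 1))` or an `LPR` reading R1 sends the pair back to scalar issue.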

Hazards present in address generation also give rise to implementation trade-offs. For example, most of the address generation interlocks can be collapsed using a four-to-one ALU as discussed previously. The following sequence

AR R1,R2

S R3,D(R1,R1);

however, does not fit in this category. For this case, a five-to-one ALU is required to collapse the AHAZ interlock because the resulting operation is:

OA=D+(R1+R2)+(R1+R2)

where OA is the resulting operand address. As before, inclusion of this function in the ALU is an implementation decision which depends on the frequency of the occurrence of such an interlock. Similar results also apply to the branch determination ALU.

GENERALIZATION OF THE ADDER

Analyses similar to those presented can be performed to derive interlock collapsing hardware for the most general case of n interlocks. For this discussion, refer to FIG. 14. Assuming simple data interlocks such as:

AR R1,R2

AR R3,R1

in which the altered register from the first instruction is used as only one of the operands of the second instruction, an (n+1)-to-one ALU would be required to collapse the interlock. To collapse three interlocks, for example, using the above assumption would require a four-to-one ALU. This would also require an extra CSA stage in the ALU.

The increase in the number of CSA stages required in the ALU, however, is not linear. An ALU designed to handle nine operands as a single execution unit would take four CSA stages and one CLA stage. This can be seen from FIG. 14 in which each vertical line represents an adder input and each horizontal line indicates an adder. Carry-save adders are represented by the horizontal lines 200-206, while the carry look-ahead adder is represented by line 209. Each CSA adder produces two outputs from three inputs. The reduction in input streams continues from stage to stage until the last CSA reduces the streams to two. The next adder is a CLA which produces one final output from two inputs. Assuming only arithmetic operations, a one stage CLA adder, and a four stage CSA adder, the execution of nine operands as a single unit using the proposed apparatus could be accomplished, to a first order approximation, in an equivalent time as the solution proposed by Wulf in the reference cited above.
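The stage count can be checked with a short calculation. The sketch below assumes each stage groups the current operand streams into as many three-input carry-save adders as possible and passes any leftover streams through unchanged; the function name `csa_tree` is illustrative.

```python
def csa_tree(n_operands):
    """Return (csa_stages, csa_count) needed to reduce n_operands
    input streams to the two inputs of the final CLA."""
    stages = 0
    count = 0
    streams = n_operands
    while streams > 2:
        groups, leftover = divmod(streams, 3)
        count += groups                  # one CSA per group of three
        streams = 2 * groups + leftover  # each CSA emits two streams
        stages += 1
    return stages, count
```

Under this assumption nine operands shrink as 9, 6, 4, 3, 2, giving four CSA stages and seven carry-save adders in total, while four operands need two stages and three operands need one.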

Data hazard interlocks degrade the performance obtained from pipelined machines by introducing stalls into the pipeline. Some of these interlocks can be relieved by code movement and instruction scheduling. Another proposal to reduce the degradation in performance is to define instructions that handle data interlocks. This proposal suffers from limitations on the number of interlocks that can be handled in a reasonable instruction size. In addition, this solution is not available for 370 architecture compatible machines.

In this invention, an alternative solution for relieving instruction interlocks has been presented. This invention offers the advantages of requiring no architectural changes, not requiring all possible instruction pairs and their interlocks to be architected into an instruction set, presenting only modest or no impact on the cycle time of the machine, requiring less hardware than is required in the prior art solution of FIG. 1, and being compatible with System/370-architected machines.

While the invention has been particularly shown and described with reference to the preferred embodiment thereof, it will be understood by those skilled in the art that many changes in form and details may be made therein without departing from the spirit and scope of the invention.

INVENTORS:

Vassiliadis, Stamatis, Phillips, James E., Blaner, Bartholomew

THIS PATENT IS REFERENCED BY THESE PATENTS:

Patent	Priority	Assignee	Title
5963461,	Sep 04 1997	Sun Microsystems, Inc.	Multiplication apparatus and methods which generate a shift amount by which the product of the significands is shifted for normalization or denormalization
5991869,	Apr 12 1995	Advanced Micro Devices, Inc.	Superscalar microprocessor including a high speed instruction alignment unit
6112289,	Nov 09 1994	ADVANCED PROCESSOR TECHNOLOGIES LLC	Data processor
6178492,	Nov 09 1994	ADVANCED PROCESSOR TECHNOLOGIES LLC	Data processor capable of executing two instructions having operand interference at high speed in parallel
6889318,	Aug 07 2001	VERISILICON HOLDINGSCO , LTD	Instruction fusion for digital signal processor
7840627,	Dec 29 2003	XILINX, Inc.	Digital signal processing circuit having input register blocks
7840630,	Dec 29 2003	XILINX, Inc.	Arithmetic logic unit circuit
7844653,	Dec 29 2003	XILINX, Inc.	Digital signal processing circuit having a pre-adder circuit
7849119,	Dec 29 2003	XILINX, Inc.	Digital signal processing circuit having a pattern detector circuit
7853632,	Dec 29 2003	XILINX, Inc.	Architectural floorplan for a digital signal processing circuit
7853634,	Dec 29 2003	XILINX, Inc.	Digital signal processing circuit having a SIMD circuit
7853636,	Dec 29 2003	XILINX, Inc.	Digital signal processing circuit having a pattern detector circuit for convergent rounding
7860915,	Dec 29 2003	XILINX, Inc.	Digital signal processing circuit having a pattern circuit for determining termination conditions
7865542,	Dec 29 2003	XILINX, Inc.	Digital signal processing block having a wide multiplexer
7870182,	Dec 29 2003	Xilinx Inc.	Digital signal processing circuit having an adder circuit with carry-outs
7882165,	Dec 29 2003	XILINX, Inc.	Digital signal processing element having an arithmetic logic unit
8479133,	Jan 27 2009	XILINX, Inc.; Xilinx, Inc	Method of and circuit for implementing a filter in an integrated circuit

THIS PATENT REFERENCES THESE PATENTS:

Patent	Priority	Assignee	Title
4675806,	Mar 03 1982	Fujitsu Limited	Data processing unit utilizing data flow ordered execution
4754412,	Oct 07 1985	SCHLUMBERGER TECHNOLOGIES, INC	Arithmetic logic system using the output of a first alu to control the operation of a second alu
4766416,	Jul 16 1987	General Electric Company	Circuit for generating the square of a function without multipliers
4775952,	May 29 1986	Intersil Corporation	Parallel processing system apparatus
4809159,	Feb 10 1983	Omron Tateisi Electronics Co.	Control token mechanism for sequence dependent instruction execution in a multiprocessor
4819155,	Jun 01 1987	WILLIAM A WULF, 6052 GRAFTON STREET, PITTSBURGH, PENNSYLVANIA 15206	Apparatus for reading to and writing from memory streams of data while concurrently executing a plurality of data processing operations
4852040,	Mar 04 1987	NEC Corporation	Vector calculation circuit capable of rapidly carrying out vector calculation of three input vectors
5021947,	Mar 31 1986	Hughes Aircraft Company	Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing

ASSIGNMENT RECORDS

Executed on	Assignor	Assignee	Conveyance	Frame	Reel	Doc
Aug 18 1994		International Business Machines Corporation	(assignment on the face of the patent)

MAINTENANCE FEES AND DATES:

Date	Maintenance Fee Events
Dec 02 1998	M184: Payment of Maintenance Fee, 8th Year, Large Entity.
Sep 28 2000	ASPN: Payor Number Assigned.

Date	Maintenance Schedule
Aug 06 1999	4 years fee payment window open
Feb 06 2000	6 months grace period start (w surcharge)
Aug 06 2000	patent expiry (for year 4)
Aug 06 2002	2 years to revive unintentionally abandoned end. (for year 4)
Aug 06 2003	8 years fee payment window open
Feb 06 2004	6 months grace period start (w surcharge)
Aug 06 2004	patent expiry (for year 8)
Aug 06 2006	2 years to revive unintentionally abandoned end. (for year 8)
Aug 06 2007	12 years fee payment window open
Feb 06 2008	6 months grace period start (w surcharge)
Aug 06 2008	patent expiry (for year 12)
Aug 06 2010	2 years to revive unintentionally abandoned end. (for year 12)