A double indirect method of accessing a block of data in a register file is used to allow efficient implementations without the use of specialized vector processing hardware. In addition, the automatic modification of the register addressing is not tied to a single vector instruction, nor to repeat or loop instructions. Rather, the technique, termed register file indexing (rfi), allows full programmer flexibility in control of the block data operational facility and provides the capability to mix non-rfi instructions with rfi instructions. The block-data operation facility is embedded in the iVLIW ManArray architecture, allowing its generalized use across the instruction set architecture without specialized vector instructions and without being limited to use only with repeat or loop instructions. The use of rfi in a processor containing multiple heterogeneous execution units which operate in parallel, such as VLIW or iVLIW processors, allows for efficient pipelining of algorithms across multiple execution units while minimizing the number of VLIW instructions required.
10. A method of register file index (rfi) control comprising the steps of:
establishing an rfi control specification in rfi control registers to specify rfi control and address information for at least one register port used by a particular execution unit or units;
establishing rfi initialization control;
performing rfi update control for updating a register port address in one of the rfi control registers associated with the at least one register port;
executing an rfi instruction as part of a first indirect approach to select an instruction for execution; and
specifying the register port address utilizing the updated register port address as part of a second indirect approach to select the specification of the register port address.
36. A method of operating in both a register file index (rfi) mode and non-rfi mode, the method comprising:
receiving a first instruction having a first operand address;
receiving a signal indicating rfi mode;
calculating a second operand address based on the first operand address;
selecting the first operand address;
retrieving an operand using the first operand address;
executing the first instruction with the retrieved operand;
selecting the second operand address;
retrieving an operand using the second operand address;
executing the first instruction with the retrieved operand;
receiving a second instruction;
receiving a signal indicating non-rfi mode;
selecting a third operand address carried in the second instruction;
retrieving an operand using the third operand address; and
executing the second instruction with the retrieved operand.
1. A data processor with register file indexing comprising:
an instruction sequencer and n execution units capable of executing up to n instructions in parallel;
a plurality of register files with registers which contain data operands read and written by the n execution units, each register file having read ports to and write ports from the n execution units; and
read and write ports associated with each execution unit which have associated control circuitry and register file index (rfi) control registers which control the selection of a first addressing approach and a second indirect addressing approach and allow registers to be addressed using both the first addressing approach in which fields of an instruction word made available to a particular execution unit directly specify addresses, and the second indirect addressing approach in which the contents of register file index look ahead registers are utilized in specifying the addresses.
22. A control circuit apparatus for operating in both a register file index (rfi) mode and non-rfi mode, the control circuit apparatus comprising:
a register file storing a plurality of operands;
an instruction register holding a first instruction and a first operand address of the register file for execution with the first instruction;
rfi circuitry for calculating and holding a second operand address of the register file for execution with the first instruction; and
a multiplexer having two inputs and an output, one of the two inputs connecting to the instruction register and the other of the two inputs connecting to the rfi circuitry, the output connecting to the register file; and in response to a signal signaling rfi mode, the multiplexer selecting the first operand address during a first execution cycle and the second operand address during a second execution cycle; and upon loading the instruction register with a second instruction having a third operand address and in response to the rfi signal signaling non-rfi mode, the multiplexer selecting the third operand address; and the selected operand address specifying the operand from the register file for use by an execution unit when executing the first instruction or the second instruction in a third execution cycle.
19. A method for data processing with register file indexing (rfi), the method including:
receiving a plurality of instruction words for execution;
reading, based on a start of an rfi sequence indication stored in an rfi control register, a field in each of the plurality of instruction words to directly specify a first plurality of operand addresses of a plurality of registers, the plurality of registers as addressed by the first plurality of operand addresses containing a first plurality of data operands;
writing a second plurality of operand addresses to a look ahead register based on the first plurality of operand addresses as controlled by control circuitry and rfi control registers;
executing the plurality of instruction words in parallel utilizing the first plurality of data operands;
clearing the start of the rfi sequence indication;
specifying the second plurality of operand addresses of the plurality of registers by reading, based on the cleared start of the rfi sequence indication, the contents of the look ahead register, the plurality of registers as addressed by the second plurality of operand addresses containing a second plurality of data operands; and
executing the plurality of instruction words in parallel utilizing the second plurality of data operands.
4. The apparatus of
5. The apparatus of
6. The apparatus of
7. The apparatus of
8. The apparatus of
9. The apparatus of
11. The method of
12. The method of
13. The method of
writing control information into an rfi control register; and
setting a bit in an rfi reset register (RFIRR) corresponding to a particular rfi control group and particular execution unit.
14. The method of
updating an rfi look ahead register for the next cycle by adding or subtracting a constant from the register port address stored in the look ahead register while maintaining the register port address within a particular set of register addresses.
15. The method of
16. The method of
17. The method of
18. The method of
20. The method of
adding or subtracting a constant to the first plurality of operand addresses.
21. The method of
initializing an rfi control register from a register field specified in one of the plurality of instruction words.
23. The control circuit apparatus of
24. The control circuit apparatus of
25. The control circuit apparatus of
26. The control circuit apparatus of
27. The control circuit apparatus of
28. The control circuit apparatus of
29. The control circuit apparatus of
30. The control circuit apparatus of
31. The control circuit apparatus of
32. The control circuit apparatus of
33. The control circuit apparatus of
34. The control circuit apparatus of
a modulo adder circuit for calculating the second operand address based on a current operand address, the increment value, and a block size; and
a look ahead register storing the second operand address and supplying the second operand address to the multiplexer.
35. The control circuit apparatus of
37. The method of
38. The method of
39. The method of
40. The method of
41. The method of
receiving a third instruction;
receiving a signal indicating non-rfi mode;
selecting a fourth operand address carried in the third instruction;
retrieving an operand address using the fourth operand address; and
executing a third instruction, the third instruction operating in a non-rfi mode.
42. The method of
43. The method of
44. The method of
initializing the second operand address prior to rfi operation with a pre-setup initial second operand address.
45. The method of
updating the second operand address according to an increment value.
46. The method of
calculating in a modulo adder circuit the second operand address based on a current value of the second operand address, the increment value, and a block size.
47. The method of
calculating the second operand address Rnext upon each receipt of the signal indicating rfi mode according to Rnext=((Rcurrent+k)mod M)+Q*M, wherein Rcurrent is the current value of the second operand address prior to calculating, k is an increment value, M is a block size, and Q is a floor quotient ⌊Rs/M⌋ for a starting register Rs and wherein Rcurrent is equal to Rs for the first calculation of an rfi sequence.
The present application claims the benefit of U.S. Provisional Application Ser. No. 60/077,766 filed Mar. 12, 1998 and entitled “Register File Indexing Methods and Apparatus for Providing Indirect Control of Register in a VLIW Processor.”
The present invention relates generally to improvements in very long instruction word (VLIW) processing, and more particularly to advantageous register file indexing (RFI) techniques for providing indirect control of register addressing in a VLIW processor.
One important processor model is that of vector processing. This model has been used in prior art super computers for many years. Typical features of this model are the use of specialized vector instructions, specialized vector hardware, and the ability to efficiently operate on blocks of data. It is this very ability to operate typically only on vector data types that makes the model inflexible and unable to efficiently handle diverse processing requirements. In addition, in prior art vector processors, support for control scalar processing was typically done in separate hardware or in a separate control processor. Another processor model is the prior art very long instruction word (VLIW) processor model which represents a parallel processing model based on the concatenation of standard uniprocessor type single function operations into a long instruction word with no specialized multicycle vector processing facilities. To efficiently operate a block-data vector pipeline, it is important to have an efficient interface to deliver the individual vector elements. For this purpose, a successful class of prior art vector machines have been register based. The register based vector processors provide high performance registers for the vector elements allowing efficient access of the elements by the functional execution units. A single vector instruction tied to an implementation specific vector length value causes a block data multicycle operation. In addition, many vector machines have provided a chaining facility where operations on the individual vector elements are directly routed to other vector functional units to improve performance. These previous features and capabilities provide the background for the present invention. It is an object of the present invention to incorporate scalar, VLIW, and flexible vector processing capabilities efficiently in an indirect VLIW processor.
In typical reduced instruction set computer (RISC) and VLIW processors, the access of register operands is determined from short instruction word (SIW) bit-fields that represent the register address of operands stored in a register file. In register-based vector processors, specialized hardware is used. This hardware is initiated by a single vector instruction and automates the accessing of vector elements (operand data) from the dedicated vector registers. The multicycle execution on the block of data is also automated.
In the prior art, there have also been specialized hardware techniques used to support the automatic accessing of register operand data. For example, U.S. Pat. No. 5,680,600 describes a technique for accessing a register file using a loop or repeat instruction to automate the register file addressing. This approach ties the register addressing to a loop or repeat instruction which causes a load or store instruction to be repeated while directing the register address to increment through a register file's address space. An electronic circuit is specified for reducing controller memory requirements for multiple sequential instructions. Thus, this prior art approach appears to be applied only to load and store type operations invoked by a special loop or repeat instruction. As such, it is not readily applicable to indirect VLIW ManArray processors as addressed further below.
A ManArray family of processors may suitably consist of multiple “indirect VLIW” (iVLIW) processors and processor elements (PEs) that utilize a fixed length short instruction word (SIW) of 32-bits. An SIW may be executed individually by one of up to eight execution units per processor and in synchronism in multiple PEs in a SIMD mode of operation. Another type of SIW is able to reference a VLIW indirectly to cause the issuance of up to eight SIW instructions in parallel in each processor and in synchronism in multiple PEs to be executed in parallel.
Operands are stored in register files and each execution unit has one or more read and write ports connected to the register file or files. In most processors, the registers selected for each port are addressed using bit fields in the instruction. With the indirect VLIW technique employed in the ManArray processor, the SIWs making up a VLIW are stored in a VLIW memory. Since each SIW fixes a register operand field by definition for a single operation on register accessed operand data, multiple VLIWs are required whenever a single operand field must be different as required by a processing algorithm. Thus, a suitable register file indexing technique for operation on blocks of data for use in conjunction with such processors and extendible more generally to parallel array processors will be highly advantageous.
This operand-data fixed register specification problem is solved by the present invention by providing a compact means of achieving pipelined computation on blocks of data using indirect VLIW instructions. A double indirect method of accessing the block of data in a register file is used to allow efficient implementations without the use of specialized vector processing hardware. In addition, the automatic modification of the register addressing is not tied to a single vector instruction, nor to repeat or loop instructions. Rather, the present technique, termed register file indexing (RFI) allows full programmer flexibility in control of the block data operational facility and provides the capability to mix non-RFI instructions with RFI instructions. The block-data operation facility is embedded in the iVLIW ManArray architecture allowing its generalized use across the instruction set architecture without specialized vector instructions, and without being limited to use only with repeat or loop instructions. Utilizing the present invention, chaining operations are inherently available without any direct routing between functional units further simplifying implementations. In addition, the present register file indexing architecture reduces the VLIW memory requirements which can be particularly significant depending on the types of algorithms to be coded.
Further, when expressed as unrolled loops of VLIW instructions, many computations exhibit clear register usage patterns. These patterns are characteristic of computational pipelines and can be taken advantage of with the ManArray indirect vector processing embedded in an indirect VLIW processor as adapted as described further herein.
Among its other aspects, the present invention provides a unique initialization method for generating an operand register address, a unique double-indirect execution mechanism, and a unique controlling method, and allows a register file to be partitioned into independent circular buffers. It also allows the mixing of RFI and non-RFI instructions, and provides a scalable design applicable to multiple array organizations of VLIW processing elements. As addressed in further detail below, the invention reduces both the VLIW memory and, as a consequence, SIW memory requirements for parallel instruction execution in an iVLIW array processor.
These and other features, aspects and advantages of the invention will be apparent to those skilled in the art from the following detailed description taken together with the accompanying drawings.
Further details of a presently preferred ManArray architecture for use in conjunction with the present invention are found in U.S. patent application Ser. No. 08/885,310 filed Jun. 30, 1997, U.S. patent application Ser. No. 08/949,122 filed Oct. 10, 1997, U.S. patent application Ser. No. 09/169,255 filed Oct. 9, 1998, U.S. patent application Ser. No. 09/169,256 filed Oct. 9, 1998, U.S. patent application Ser. No. 09/169,072 filed Oct. 9, 1998, U.S. patent application Ser. No. 09/187,539 filed Nov. 6, 1998, U.S. patent application Ser. No. 09/205,558 filed Dec. 4, 1998, U.S. patent application Ser. No. 09/215,081 filed Dec. 18, 1998, U.S. patent application Ser. No. 09/228,374 filed Jan. 12, 1999, and U.S. patent application Ser. No. 09/238,446 filed Jan. 28, 1999, as well as, Provisional Application Serial No. 60/092,130 entitled “Methods and Apparatus for Instruction Addressing in Indirect VLIW Processors” filed Jul. 9, 1998, Provisional Application Serial No. 60/103,712 entitled “Efficient Complex Multiplication and Fast Fourier Transform (FFT) Implementation on the ManArray” filed Oct. 9, 1998, Provisional Application Ser. No. 60/106,867 entitled “Methods and Apparatus for Improved Motion Estimation for Video Encoding” filed Nov. 3, 1998, Provisional Application Serial No. 60/113,637 entitled “Methods and Apparatus for Providing Direct Memory Access (DMA) Engine” filed Dec. 23, 1998, and Provisional Application Serial No. 60/113,555 entitled “Methods and Apparatus Providing Transfer Control” filed Dec. 23, 1998, respectively, and incorporated by reference herein in their entirety.
In a presently preferred embodiment of the present invention, a ManArray 2×2 iVLIW single instruction multiple data stream (SIMD) processor 100 shown in
In this exemplary system, common elements are used throughout to simplify the explanation, though actual implementations are not so limited. For example, the execution units 131 in the combined SP/PE0 101 can be separated into a set of execution units optimized for the control function, e.g., fixed point execution units, and the PE0, as well as the other PEs 151, 153 and 155, can be optimized for a floating point application. For the purposes of this description, it is assumed that the execution units 131 are of the same type in the SP/PE0 and the other PEs. In a similar manner, SP/PE0 and the other PEs use a five instruction slot iVLIW architecture which contains a very long instruction word memory (VIM) 109 and an instruction decode and VIM controller function unit 107 which receives instructions as dispatched from the SP/PE0's I-Fetch unit 103 and generates the VIM addresses-and-control signals 108 required to access the iVLIWs, identified by the letters SLAMD in 109, stored in the VIM. The ManArray pipeline design provides an indirect VLIW memory access mechanism without increasing branch latency by providing a dynamically reconfigurable instruction pipeline for the indirect execute iVLIW (XV) instructions as described in further detail in U.S. patent application Ser. No. 09/228,374 entitled “Methods and Apparatus to Dynamically Reconfigure the Instruction Pipeline of an Indirect Very long Instruction Word Scalable Processor”. The loading of the iVLIWs is described in further detail in U.S. patent application Ser. No. 09/187,539 entitled “Methods and Apparatus for Efficient Synchronous MIMD Operations with iVLIW PE-to-PE Communication”. Also contained in the SP/PE0 and the other PEs is a common PE configurable register file 127 which is described in further detail in U.S. patent application Ser. No. 09/169,255 entitled “Methods and Apparatus for Dynamic Instruction Controlled Reconfiguration Register File with Extended Precision”.
Due to the combined nature of the SP/PE0, the data memory interface controller 125 must handle the data processing needs of both the SP controller, with SP data in memory 121, and PE0, with PE0 data in memory 123. The SP/PE0 controller 125 also is the source of the data that is sent over the 32-bit broadcast data bus 126. The other PEs 151, 153, and 155 contain common physical data memory units 123′, 123″ and 123′″ though the data stored in them is generally different as required by the local processing done on each PE. The interface to these PE data memories is also a common design in PEs 1, 2, and 3 and indicated by PE local memory and data bus interface logic 157, 157′ and 157″. Interconnecting the PEs for data transfer communications is the cluster switch 171 more completely described in U.S. patent application Ser. No. 08/885,310 entitled “Manifold Array Processor”, U.S. application Ser. No. 08/949,122 entitled “Methods and Apparatus for Manifold Array Processing”, and U.S. application Ser. No. 09/169,256 entitled “Methods and Apparatus for ManArray PE-to-PE Switch Control”. The interface to a host processor, other peripheral devices, and/or external memory can be done in many ways. The primary mechanism shown for completeness is contained in the DMA control unit 181 that provides a scalable ManArray data bus 183 that connects to devices and interface units external to the ManArray core. The DMA control unit 181 provides the data flow and bus arbitration mechanisms needed for these external devices to interface to the ManArray core memories via bus 185.
All of the above noted patents are assigned to the assignee of the present invention and incorporated herein by reference in their entirety.
Turning now to specific details of the ManArray processor apparatus as adapted to the present invention, this approach advantageously provides an efficient and flexible block-data operation capability through a double indirect mechanism.
Register File Indexing Programming View
Register file indexing (RFI) in accordance with one aspect of the present invention refers to methods and apparatus in each processing element and in the array controller for addressing the operand register file through a double indirect mechanism rather than directly through fields of an SIW, or through specialized vector instructions and vector hardware or with a required repeat or loop instruction. Each execution unit operates read and write ports of one or more register files. A read or write port consists of register selection address and control lines supplied to the register file, a data bus for register data being read from the register file for a read port, and a data bus for register data being written to the register file for a write port. The inputs to the register selection logic of these ports typically came only from bit-fields of the instruction being executed as shown in the prior art apparatus of FIG. 1B. In
In addition to this typical method for register selection, RFI operation in accordance with the present invention allows each register file port of each execution unit to also be independently controlled through a double indirect mechanism using simple control circuitry as addressed further below.
RFI Operation
RFI operation may advantageously be embedded in the ManArray iVLIW architecture and invoked by a double indirect mechanism. An exemplary execute VLIW (XV) instruction 200 having 32 bit encoding format 201 is shown in
In further detail, the XV instruction 200 is used to indirectly cause individual instruction slots of a specified SP or PE VLIW Memory (VIM) to be executed. The VIM address is computed as the sum of a base VIM address register Vb (V0 or V1) plus an unsigned 8-bit offset VIMOFFS. Any combination of individual instruction slots may be executed via the execute slot parameter ‘E={SLAMD}’, where S=Store Unit (SU), L=Load Unit (LU), A=Arithmetic Logic Unit (ALU), M=Multiply-Accumulate Unit (MAU), and D=Data Select Unit (DSU). A blank ‘E=’ parameter does not execute any slots. The unit affecting flags (UAF) parameter ‘F=[AMDN]’ overrides the UAF specified for the VLIW when it was loaded via a load VLIW (LV) instruction. The override selects which arithmetic instruction slot (A=ALU, M=MAU, D=DSU) or none (N=NONE) is allowed to set condition flags for this execution of the VLIW. The override does not affect the UAF setting specified via the LV instruction. A blank ‘F=’ selects the UAF specified when the VLIW was loaded. The register file indexing (RFI) parameter ‘R=[01N]’ is used to enable or disable RFI for this XV's indirect execution of the instruction slots. With ‘R=0’ (the RFI operation bits 202=00 in FIG. 2A), RFI operation is enabled and the RFI control register group 0 is selected. With ‘R=1’ (the bits 202=01), RFI operation is enabled and the RFI control register group 1 is selected. With ‘R=N’ (the bits 202=11), RFI operation is disabled.
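The address computation and the ‘R=’ selection described above can be illustrated with a minimal Python sketch. It assumes that the RFI bits 202 occupy bits 21 and 20 of the 32-bit XV word with bit 21 as the more significant bit, and that the encoding 10, which the text does not describe, is treated as reserved. The names `RfiMode`, `decode_rfi_bits`, and `vim_address` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class RfiMode(Enum):
    GROUP0 = "R=0"    # RFI enabled, control register group 0
    GROUP1 = "R=1"    # RFI enabled, control register group 1
    DISABLED = "R=N"  # RFI disabled


def decode_rfi_bits(xv_word: int) -> RfiMode:
    """Decode the two RFI bits of an XV word (assumed to be bits 21:20)."""
    bits = (xv_word >> 20) & 0b11
    if bits == 0b00:
        return RfiMode.GROUP0
    if bits == 0b01:
        return RfiMode.GROUP1
    if bits == 0b11:
        return RfiMode.DISABLED
    # Encoding 10 is not described in the text; treated as reserved here.
    raise ValueError("RFI bits 10 are not defined in this sketch")


def vim_address(vb_base: int, vimoffs: int) -> int:
    """VIM address = base register Vb plus an unsigned 8-bit offset."""
    if not 0 <= vimoffs <= 0xFF:
        raise ValueError("VIMOFFS must fit in 8 unsigned bits")
    return vb_base + vimoffs
```

For example, `decode_rfi_bits(1 << 20)` selects control group 1, and `vim_address(0x40, 5)` evaluates to `0x45`.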
The XV instruction with RFI enabled causes a second indirect operation to be initiated. The second indirect operation comes into play on the next XV instruction that is executed, wherein the register port addresses are indirectly specified through automatically incrementing hardware controlled in a manner specified by separate RFI control parameters. The RFI operation is described below in the context of the ManArray pipeline, and is primarily concerned with the decode and execute phases of the pipeline. RFI control consists of four parts: 1) RFI control specification; 2) RFI initialization control; 3) RFI update control; and 4) RFI instruction execution.
RFI Control Specification
RFI control specification is preferably performed through RFI control registers. Each control register specifies all the RFI control information for the register ports used by a particular execution unit. There is a control field in the control register for each port and this field specifies whether or not the RFI operation is enabled for that particular port and, if enabled, specifies the RFI register update policy.
The RFI control registers are accessed through a ManArray miscellaneous register file (MRF) 300 illustrated in FIG. 3A. This register file is unique in that additional registers can be added within the restricted MRF address space by address mapping additional registers to a single MRF address. The MRF extension registers 305 and 315, shown in
MRFX Addr1, 402 (FIG. 4A): MRF Extension Register Address-1. This field contains the address of a register within the MRF extension register group-1 of FIG. 3B. When the MRFXDR1 302 of [...] MRFX1 register in [...] address is the target of the read or write operation.

MRFX Addr2, 406 (FIG. 4A): MRF Extension Register Address-2. This field contains the address of a register within the MRF extension register group-2 of FIG. 3C. When the MRFXDR2 303 of [...] MRFX2 register in [...] address is the target of the read or write operation.

AutoIncrement (AJ1 or AJ2), 404 or 408 (FIG. 4A): When set, this bit causes the MRFX Address field 1 402 or field 2 406 of [...] after each read or write access to the MRFXDR1 302 or MRFXDR2 303 of [...]

MRFX Data (MRFX1 or MRFX2), 420 (FIG. 4B): A Load/Store or DSU operation (COPY, BIT op) which targets the MRFXDR1 302 or MRFXDR2 303 of [...] whose address is contained in bits [2:0] of the MRFXAR1 402 or bits [8:6] of the MRFXAR2 406 of FIG. 4A. If the auto increment bit 404 or 408 of the selected MRFXAR is set, then the access will also cause the address in the MRFXAR1 or MRFXAR2 to be incremented after the access.
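The indirect access through an MRF extension address field and data register can be modeled with the following Python sketch. It is a toy model only: the class name `MrfExtensionPort` is hypothetical, the group size of eight registers is inferred from the 3-bit address fields ([2:0] and [8:6]), and the wrap-around of the address after the last register is an assumption that the text does not state.

```python
class MrfExtensionPort:
    """Toy model of one MRFXAR address field and its MRFXDR data window."""

    def __init__(self, num_regs: int = 8, auto_increment: bool = False):
        self.regs = [0] * num_regs      # extension registers in one group
        self.addr = 0                   # MRFX Addr field (3 bits assumed)
        self.auto_increment = auto_increment

    def _advance(self) -> None:
        # With AutoIncrement set, the address moves on after every access.
        # Wrapping at the end of the group is an assumption of this sketch.
        if self.auto_increment:
            self.addr = (self.addr + 1) % len(self.regs)

    def write(self, value: int) -> None:
        """Write through the data register to the addressed register."""
        self.regs[self.addr] = value
        self._advance()

    def read(self) -> int:
        """Read through the data register from the addressed register."""
        value = self.regs[self.addr]
        self._advance()
        return value
```

With auto increment set, a sequence of writes to the data register fills consecutive extension registers without reloading the address field, which is the benefit the AutoIncrement bit is described as providing.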
In a presently preferred embodiment, five execution units have RFI control.
The registers are used in two control groups (510-540), two save and restore context registers (550 and 560), and one register 580 to control the initialization of the RFI controls for each control group. A reserved register 570 is also shown. The first control group 0 includes RFIDLS0 310 and RFIAM0 320 in FIG. 3C. Further details are shown in registers 510 and 520 of FIG. 5. The second control group 1 includes RFIDLS1 330 and RFIAM1 340 with further details in registers 530 and 540.
When an iVLIW is executed, one of the control groups is specified in the XV instruction via bits 21 and 20, the RFI bits 202 of instruction 200 of
Specifically, in control group 0, RFIDLS0 510 in
Note that the control parameters may have any format that allows a required set of control information to be represented, as the invention does not require a particular format. An exemplary format 600 for a register file port is shown in greater detail in FIG. 6. The RFI parameters are encoded into 4-bits as shown in columns 601 and 602. This control information specifies the type of update to be applied to generate the address of the next register to be selected on the next RFI instruction execution. In the presently preferred embodiment, the control parameters are used to select an update increment value 603 to be added to the register address, and to specify the maximum sequential (incrementing by one) register file address range (RFBS) that can be selected 604. As described further below, the starting register along with these parameters determines the actual register set which may be selected by the index. Columns 605-611 are used to describe the operation of the indirect vector apparatus shown in
RFI Initialization Control
RFI initialization takes place in two steps, which are best understood with reference to
First, control information as illustrated in
For purposes of clarity, the LIM data path from instruction register 804 H0 halfword bits 15-0 is not shown. This data path is selectively controlled to load the H0 halfword of the LIM instruction to either the low or high halfword portion of any of the MRF extension registers listed in FIG. 5. For example, a LIM instruction could cause the loading of its H0 halfword to the H1 portion of the RFIAM0 register 520 of FIG. 5. In reference to the common arithmetic RFI port control logic of
The word form of the LIM instruction loads a sign-extended 17-bit immediate value into the target register. The 17-bit signed value may be any value in the range −65536 to 65535. The encoding for the word form of LIM puts the magnitude of the value into the IMM16 field and the sign bit in the LOC field, bits 23 and 22, shown in FIG. 7A. The LOC field determines whether the upper halfword is filled with all one or all zero bits.
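The range stated above follows directly from filling the upper halfword with all ones or all zeros. The short Python sketch below illustrates this under two assumptions: IMM16 is taken to hold the low 16 bits of the two's-complement result, and the sign selection carried by the LOC field is modeled as a plain boolean because the exact LOC encoding is not given here. The function name `lim_word_value` is hypothetical.

```python
def lim_word_value(imm16: int, negative: bool) -> int:
    """Return the signed 32-bit value loaded by a word-form LIM.

    imm16    -- 16-bit immediate field, 0..0xFFFF
    negative -- True if the upper halfword is filled with ones
                (stand-in for the sign carried in the LOC field)
    """
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("IMM16 must fit in 16 bits")
    upper = 0xFFFF if negative else 0x0000
    word = (upper << 16) | imm16
    # Interpret the 32-bit pattern as a two's-complement signed value.
    return word - (1 << 32) if word & 0x80000000 else word
```

The extremes of the range are then `lim_word_value(0xFFFF, False)`, which is 65535, and `lim_word_value(0x0000, True)`, which is −65536.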
In the second step of RFI initialization, a start bit, e.g. bit 583 for the DSU 854, is set in the RFI Start Register, RFIStart of
Upon the next issuance of an RFI XV instruction, the operands are indirectly specified from the RFI logic. This is the second indirect specification in the operational sequence. The first indirect specification is through the RFI XV instruction, which indirectly specifies the SIW, and the second indirect specification is through the RFI logic as set up via the RFI control parameters. In order to accomplish this operation, update control register 0 810, update adder logic 830, indexed port look ahead register 820, multiplexers 814 and 822, and update control logic 824 are used to generate the updated port address to be used in following RFI instruction executions.
The basic concept is that the address output 811 of the multiplexer 814 is available early enough in the decode cycle so that the update adder logic 830 can update the address based upon the update control logic 824 signals. The updated address 819 is selected by mux control signals 815 to pass through multiplexer 822 and is loaded into the indexed port look ahead register 820 at the end of decode at the same time the present port address 811 is loaded into the port address register 816. On the next RFI instruction, the look ahead register value 821 is used in place of the fetched SIW operand port address value and latched into the port address register 816 for the next execute cycle, while the update adder logic is again preparing the next port address to be used. After the first RFI instruction following the setting of the RFI start bit(s), the start bit(s) are cleared, causing subsequent RFI instructions to have their SIW operand registers selected by corresponding indexed port look ahead registers. The start bit and mux control block 812 provide the control for determining whether an instruction's registers are selected by instruction fields or by RFI indexed port look ahead registers. Its inputs come from the instruction opcode 807, the update control register 0 810, and an RFI enable signal 825. These signals, along with pipeline control signals (not shown) indicating an instruction's progress in the pipeline, determine the register selection source via the multiplexer 814.
The use of the indexed port look ahead register 820 allows non-RFI instructions to be intermixed between RFI operations without affecting the RFI register address sequence. When a non-RFI instruction is detected, the RFI logic preserves the required RFI state while the non-RFI instructions are executing.
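The look ahead behavior described above can be illustrated with a minimal software sketch. It is only a behavioral model under stated assumptions, not the hardware of the embodiment: the class name IndexedPort, its fields, and the single-call decode step are hypothetical, and the wrap-around rule is assumed to keep the address inside the block containing the starting register.

```python
class IndexedPort:
    """Toy model of one register port under RFI control (names are illustrative)."""

    def __init__(self, increment, block_size):
        self.increment = increment      # k, from the RFI control parameters
        self.block_size = block_size    # M, the register file block size (RFBS)
        self.start_bit = True           # set when RFI control is initialized
        self.look_ahead = None          # models look ahead register 820
        self.block_base = None          # first register of the active block

    def decode(self, field_address, is_rfi):
        """Return the port address used for execute (models register 816)."""
        if not is_rfi:
            # Non-RFI instruction: use its own operand field, keep RFI state.
            return field_address
        if self.start_bit:
            # First RFI instruction after setup: the SIW field supplies the address.
            selected = field_address
            self.block_base = (selected // self.block_size) * self.block_size
            self.start_bit = False
        else:
            # Later RFI instructions: the look ahead value replaces the SIW field.
            selected = self.look_ahead
        # Update adder: prepare the address for the next RFI instruction.
        offset = (selected + self.increment) % self.block_size
        self.look_ahead = self.block_base + offset
        return selected


# Mixed stream of (operand field, is_rfi) pairs; two non-RFI instructions intervene.
port = IndexedPort(increment=1, block_size=4)
stream = [(7, True), (12, False), (7, True), (7, True), (3, False), (7, True), (7, True)]
used = [port.decode(field, is_rfi) for field, is_rfi in stream]
print(used)
```

In this sketch the RFI instructions see R7, R4, R5, R6, R7 in turn, while the two non-RFI instructions use their own fields (R12 and R3) and leave the sequence undisturbed.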
RFI Update Control
When an RFI operation is invoked, the address of one or more registers in the register file 818 is supplied by the RFI logic. This logic updates the register address for the next cycle by adding or subtracting a constant from an address available in the early stages of the decode cycle while maintaining the generated port address within a particular set of register addresses. In the presently preferred embodiment, this is done by specifying an increment value and a register file block size (RFBS) 604 as shown in
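Rnext = ((Rcurrent + k) mod M) + Q*M,
where k is the increment, M is the RFBS, and Q = floor(Rs/M) for starting register port address Rs. Because the remainder of Rs/M is ignored due to the floor operation, the value of Q*M is in general not equal to Rs; Q*M is the lowest register address of the block containing Rs.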
As an example, assume that the starting register port address is 5, i.e. Rs=R5 which also equals Rcurrent for the first operation. Also, assume the update increment is k=2, and the RFBS is M=8. In
In a VLIW processor, it is possible to have all ports of the register file under RFI control for a single instruction, such as the presently described XV instruction. Since the RFI port logic is independent between execution units, the ports can be individually controlled by SIW execution-unit-specific instructions. This means that if another instruction or group of instructions requires independent RFI control (i.e. a different set of control parameters) in addition to the XV instruction, another group of control registers could be assigned. Since the RFI set up latency is relatively small, the control register set as described in
Another register file indexing apparatus 1100 is shown in FIG. 11. This RFI mechanism still uses the double indirect mechanism outlined in the other RFI approaches discussed relative to
The operation of the apparatus 1100 of
In addition to the XV RFI enabling apparatus, other means of enabling RFI are used. The purpose of this additional mechanism is to decouple RFI sequencing from the VLIW (XV) programming model. It is desired to support block load, block store, and block move operations with single instruction execution, which can be done independently in the SP or concurrently in the PEs. Rather than use additional bits in SIWs to specify this operation, though this is not precluded by this invention, an alternate indirect mechanism to enable RFI is used. This saving of bits in the SIWs allows better use of the instruction format for standard operation encoding while not precluding the RFI functionality provided by the present invention. This alternative mechanism operates with any SIW that can address a specific location in the MRF. Though multiple locations in the MRF could be provided for this purpose, there are other uses in specific implementations which may preclude this. For the purposes of describing this alternate RFI enabling mechanism, one location in the MRF is used, as shown for RFILSD 304 in FIG. 3A.
To use the RFI enabling mechanism, the hardware decode logic is extended to generate the RFI enable signal not only when an XV RFI instruction is received but also whenever a load, store, or DSU instruction is received in the SP or PE instruction register which specifies the RFILSD address as the load Rt, store Rs, or DSU Rt or Rs operands. Prior to using this alternate RFI enabling mechanism, the RFI control registers are required to be set up specifying the initial registers to be used in a block load, store, or DSU operation. No start bit is used in this alternate RFI enabling mechanism as the starting address of the block sequence is stored in the port control registers. Upon receiving a load, store, or DSU instruction, which uses the RFILSD bits as an operand address, the RFI mode is enabled and each register operand address is substituted with the pre-setup port (operand) addresses by the RFI port logic as shown in the representative RFI logic of
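The operand substitution used by this alternate enabling mechanism can be sketched in software as follows. This is a hedged illustration only: the numeric value chosen for RFILSD, the class name BlockPort, and the helper block_load are hypothetical, and the sketch assumes the same block wrap-around rule as the XV case. It models a block load in which every load names RFILSD as its Rt operand.

```python
RFILSD = 63  # hypothetical MRF address reserved for RFI enabling (actual value not given)


class BlockPort:
    """Port control state for the start-bit-free enabling mechanism (illustrative)."""

    def __init__(self, start, increment, block_size):
        self.address = start            # starting address pre-set in the port control register
        self.increment = increment      # k
        self.block_size = block_size    # M (RFBS)
        self.block_base = (start // block_size) * block_size

    def substitute(self, operand):
        """Return the register actually accessed for a load/store/DSU operand."""
        if operand != RFILSD:
            return operand              # ordinary operand: RFI not enabled
        selected = self.address
        offset = (selected + self.increment) % self.block_size
        self.address = self.block_base + offset
        return selected


def block_load(port, memory, registers, count):
    """Model `count` loads that all name RFILSD as their target register."""
    for i in range(count):
        rt = port.substitute(RFILSD)
        registers[rt] = memory[i]


registers = [0] * 64
port = BlockPort(start=4, increment=1, block_size=4)
block_load(port, [10, 20, 30, 40], registers, 4)
print(registers[4:8])
```

A load that names any other register passes through substitute unchanged, which mirrors the text's point that no extra SIW bits are needed to select the mode.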
RFI Instruction Execution
RFI operation is enabled through control information contained in instruction words. This control information specifies whether conventional register address selection fields (operand address fields contained in the instruction) are to be used or whether the RFI selection of registers is to be used. In the presently preferred embodiment, the control information in the instruction, indirect VLIW XV instruction bits 21 and 20 (202) of
It is noted that the ManArray processor finishes the execution phase of its pipeline with a write back to the register file. This approach allows the next cycle after the write-back cycle to use the results in the next operation. By judicious programming, chaining of vector operations is then inherent in the architecture. No separate bypass paths need be provided in the execution units to support chaining.
A discussion concerning an exemplary use of RFI in accordance with the present invention is now presented to illustrate several advantageous aspects of the invention. Assuming an increment value of 1, an RFBS value (M) that is a power of 2, and starting register R2, an RFBS of 2 makes the register addresses alternate between two registers, the even register R2 and its corresponding odd register (address+1) R3. For RFBS=4, the register addresses cycle among 4 values with an increment of 1. The following table shows some address sequences.
Start Register | Increment | Register File Block Size | Sequence
R2 | 1 | 2 | R2, R3, R2, R3, . . .
R2 | 1 | 4 | R2, R3, R0, R1, R2, . . .
R5 | 1 | 4 | R5, R6, R7, R4, R5, . . .
R5 | 2 | 4 | R5, R7, R5, R7, . . .
R5 | 2 | 8 | R5, R7, R1, R3, R5, . . .
R6 | 2 | 8 | R6, R0, R2, R4, R6, . . .
R0 | 1 | 1 | R0, R1, R2, R3, . . . R31, R0, R1 . . . for non-Load/Store units; R0, R1, R2, R3, . . . R63 (cycles ALL registers) for Load/Store units
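The address sequences in the table can be reproduced with a short sketch of the update rule Rnext = ((Rcurrent + k) mod M) + Q*M. The function names rfi_next and rfi_sequence are hypothetical, and Q = floor(Rs/M) is taken as an assumption inferred from the text's reference to a floor operation on Rs/M. The last table row, which cycles through all registers, is not modeled here.

```python
def rfi_next(current, start, increment, block_size):
    """Next port address: step by `increment`, wrapping inside the block holding `start`."""
    q = start // block_size                      # Q = floor(Rs / M), assumed definition
    return (current + increment) % block_size + q * block_size


def rfi_sequence(start, increment, block_size, count):
    """First `count` register addresses produced from starting register `start`."""
    sequence = []
    current = start
    for _ in range(count):
        sequence.append(current)
        current = rfi_next(current, start, increment, block_size)
    return sequence


# Rows of the table as (start register, increment, RFBS).
for row in [(2, 1, 2), (2, 1, 4), (5, 1, 4), (5, 2, 4), (5, 2, 8), (6, 2, 8)]:
    print(row, ["R%d" % r for r in rfi_sequence(*row, count=5)])
```

For example, with start R5, increment 1, and RFBS 4, Q is 1, so the addresses stay in the block R4 through R7 and the sketch yields R5, R6, R7, R4, R5.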
Assume it is desired to calculate a simple matrix-vector multiplication on a 4-PE SIMD VLIW ManArray processor such as processor 100 of FIG. 1A. Further assume that the following instruction types are available.
Pseudo Instruction | Operation
LDB RN, PJ+ | Load Broadcast: Loads from a memory location specified by the address register PJ in SP memory and stores the value into register RN of each PE (all receive the same value). PJ is post-incremented by 1.
MAC RT, RX, RY | Multiply-Accumulate: All PEs execute in SIMD fashion the operation RT = RT + (RX * RY)
ST RS, PJ+ | Store: All PEs store source register RS to the local PE memory location specified by PJ. PJ is post-incremented by 1.
REP N, M | Execute the following N instructions M times
Also, assume that a 4×4 matrix A is distributed to the 4 PEs, PE0, PE1, PE2 and PE3, such that each PE contains a row of the matrix in registers R4, R5, R6 and R7 (PE0 gets row 0, PE1 gets row 1, etc.) as shown in the following table.
Register → | R4 | R5 | R6 | R7
PE0 | a00 | a01 | a02 | a03
PE1 | a10 | a11 | a12 | a13
PE2 | a20 | a21 | a22 | a23
PE3 | a30 | a31 | a32 | a33
If a sequence of 4×1 vectors is read in from main (SP) memory 105, multiplied by the matrix, and the results stored in local PE memory 123, 123′, 123″ and 123″′, an appropriate sequential algorithm might appear as follows if it is assumed R2 is zero initially:
LDB R0, P0+ ; load first element of input vector, x0
MAC R2, R4, R0 ; accumulate product: ai0 * x0 (i is row index and PE ID)
LDB R0, P0+ ; load second element of input vector, x1
MAC R2, R5, R0 ; accumulate product: ai1 * x1
LDB R0, P0+ ; load third element of input vector, x2
MAC R2, R6, R0 ; accumulate product: ai2 * x2
LDB R0, P0+ ; load last element of input vector, x3
MAC R2, R7, R0 ; accumulate product: ai3 * x3
ST R2, P1+ ; store results: each local memory gets an element of output vector
Performing this algorithm with VLIW instructions yields:
VLIW | SIW | SIW | Execute Action
 | LDB R0, P0+ | | Load
1 | LDB R0, P0+ | MAC R2, R4, R0 | Load PEs and MAC x0 * a[i][0]
2 | LDB R0, P0+ | MAC R2, R5, R0 | Load PEs and MAC x1 * a[i][1]
3 | LDB R0, P0+ | MAC R2, R6, R0 | Load PEs and MAC x2 * a[i][2]
4 | LDB R0, P0+ | MAC R2, R7, R0 | Load PEs and MAC x3 * a[i][3]
 | ST R2, P1+ | | All PEs store result
This requires 4 VLIW-type instructions, plus a single load LDB and a single store ST instruction, even though the only difference between these VLIW instructions is the second register specification of the MAC instruction.
Now if the example is performed using RFI, the process is as follows: Assume R2 and R0 are both initialized to zero and register file indexing is used with the following parameters associated with the VLIW indirectly executed by an XV instruction:
Execution Unit Register Port | Increment | RFBS
Load Write Port | 0 | 1
MAU Rx Read Port | 1 | 4
Now the code can be written in compact VLIW form where the second register RFI sequence starts with R7→R4→R5→R6→R7, etc.
VLIW | SIW | SIW | Execute Action
 | LD RFIC, P1, ctrl | | Initialize RFI control for MAU reg port
 | REP 1, 5 | | Repeat 1 instruction 5 times
1 | LDB R0, P0+ | MAC R2, R7, R0 | Load and MAC: first MAC is 0 and last load reads into next vector (or garbage)
 | ST R2, P1+ | | Store results
The net effect is to reduce the 9 instructions of the sequential algorithm to 4 instructions. Using fewer VLIWs reduces both the number of iVLIWs executed and the number of VLIWs that must be loaded in the ManArray architecture. These savings are indirect, but not insignificant, since the VLIW memory (VIM) represents an expensive on-chip resource. The RFI operation reduces the amount of VLIW memory needed, thus allowing for less-expensive chips.
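The compact RFI loop can be checked with a small simulation of the 4-PE example. This is a sketch under assumptions rather than a model of the actual pipeline: the function names are hypothetical, Q = floor(Rs/M) is assumed for the block wrap, and the MAC is assumed to read R0 before the load issued in the same VLIW writes it, which is what makes the first MAC contribute zero and the fifth load read past the vector.

```python
def rfi_next(current, start, increment, block_size):
    """Next port address, wrapping inside the block that contains `start`."""
    return (current + increment) % block_size + (start // block_size) * block_size


def matvec_rfi(matrix, sp_memory):
    """Run the single VLIW (LDB in parallel with MAC) five times, then store R2."""
    # One 8-entry register file per PE; R4..R7 hold that PE's matrix row.
    pes = [[0] * 8 for _ in matrix]
    for regs, row in zip(pes, matrix):
        regs[4:8] = row                 # R0 and R2 remain zero, as the text assumes
    load_rt, mac_rx = 0, 7              # starting addresses from the SIW fields
    p0 = 0                              # SP memory address register P0
    for _ in range(5):                  # REP 1, 5
        value = sp_memory[p0]           # LDB R0, P0+ reads SP memory
        p0 += 1
        for regs in pes:
            # MAC R2, Rx, R0 uses the R0 value from before this cycle's load.
            regs[2] += regs[mac_rx] * regs[load_rt]
            regs[load_rt] = value       # broadcast load lands in every PE
        load_rt = rfi_next(load_rt, 0, 0, 1)   # Load Write Port: increment 0, RFBS 1
        mac_rx = rfi_next(mac_rx, 7, 1, 4)     # MAU Rx Read Port: increment 1, RFBS 4
    return [regs[2] for regs in pes]    # ST R2, P1+: one result element per PE


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, 2, 1]
result = matvec_rfi(A, x + [99])        # 99 stands in for the next vector (or garbage)
expected = [sum(a * b for a, b in zip(row, x)) for row in A]
print(result, expected)
```

The MAU read port visits R7, R4, R5, R6, R7 over the five iterations, so the single VLIW stands in for the four distinct VLIWs of the non-RFI version.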
While the present invention has been disclosed in the context of various aspects of presently preferred embodiments, it will be recognized that the invention may be suitably applied to other environments and applications consistent with the claims which follow.
Marchand, Patrick R., Pechanek, Gerald George, Barry, Edwin Franklin
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jun 03 2004 | Altera Corporation | (assignment on the face of the patent) | / | |||
Aug 24 2006 | PTS Corporation | Altera Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 018184 | /0423 |