A SIMD machine employing a plurality of parallel processing elements (PEs) in which communication hazards are eliminated in an efficient manner. An indirect Very Long Instruction Word instruction memory (VIM) is employed along with execute and delimiter instructions. A masking mechanism may be employed to control which PEs have their VIMs loaded. Further, a receive model of operation is preferably employed. In one aspect, each PE operates to control a switch that selects the PE from which it receives. The present invention addresses a better machine organization for execution of parallel algorithms that reduces hardware cost and complexity while maintaining the best characteristics of both SIMD and MIMD machines and minimizing communication latency. This invention brings a level of MIMD computational autonomy to SIMD indirect Very Long Instruction Word (iVLIW) processing elements while maintaining the single thread of control used in the SIMD machine organization. Consequently, the term Synchronous-MIMD (SMIMD) is used to describe the present approach.
1. An indirect very long instruction word (vliw) processing system comprising:
a first processing element (pe) having a vliw instruction memory (vim) for storing function instructions in slots within a vim memory location;
a first register for storing a control instruction and a function instruction, the function instruction having a plurality of definition bits defining both the control instruction type and an execution unit type of the function instruction;
a predecoder for decoding the plurality of definition bits; and
a load mechanism for loading the function instruction in one of said slots in vim based upon both said decoding and a control instruction defining a load operation.
0. 38. A processing system comprising:
a first processing element (pe) including a first instruction memory for storing a first very long instruction word (vliw) to be executed by said first pe; and
a second processing element (pe) including a second instruction memory for storing a second vliw to be executed by said second pe, said second vliw and said first vliw defining different operations;
wherein the first vliw and the second vliw are both stored at the same address location in each memory;
wherein the first pe and the second pe are operable for simultaneously executing the first vliw and the second vliw, respectively, in response to each pe receiving an execute very long instruction word (vliw) instruction.
0. 26. A processing system comprising:
a plurality of processing elements (pes) communicatively connected to each other, each of said pes including a very long instruction word (vliw) memory (vim) for storing vliws to be executed by each pe; and
a sequence processor (SP) operable for concurrently initiating indirect execution of a vliw stored at a first address in the vim of each pe, in response to the SP issuing an indirect instruction to initiate concurrent execution by each pe, each pe of said plurality of pes concurrently executing the vliw stored at the first address in the vim associated with each pe, and
at least one of said plurality of pes concurrently executing a vliw at the first address of its vim which defines a different operation from a vliw concurrently executed by another pe of said plurality of pes.
0. 45. A processing method for a processing system comprising a first processing element (pe) including a first very long instruction word memory (vim), the first pe communicatively connected to a second pe including a second vim, the method comprising:
loading a first function instruction in the first vim at a first address;
loading a second function instruction in the second vim at the first address;
receiving an execute vliw instruction; and
concurrently executing the first function instruction by the first pe and the second function instruction by the second pe, in response to the received execute vliw instruction;
wherein the first function instruction stored in the first vim at the first address and the second function instruction stored in the second vim at the first address define different operations.
2. The system of
3. The system of
4. The system of
5. The system of
6. The system of
7. The system of
8. The system of
9. The system of
10. The system of
11. The system of
12. The system of
13. The system of
14. The system of
15. The system of
16. The system of
17. The system of
18. The system of
19. The system of
20. The system of
21. The system of
22. The system of
23. The system of
24. The system of
25. The system of
0. 27. The processing system of
the plurality of pes execute instructions which define the same operation.
0. 28. The processing system of
concurrently initiating the execution of instructions stored in a vliw at a third address in the VIMs such that the first pe executes a first instruction stored in a vliw at the third address in the vim associated with the first pe, and the second pe executes a second instruction stored in a vliw at the third address in the vim associated with the second pe.
0. 29. The processing system of
the first instruction and the second instruction define different operations.
0. 30. The processing system of
the first instruction and the second instruction define the same operation.
0. 31. The processing system of
0. 32. The processing system of
0. 33. The processing system of
0. 34. The processing system of
0. 35. The processing system of
0. 36. The processing system of
each pe includes a base address register; and
each pe determines the first address utilizing the base address register associated with each pe and an offset value contained in the execute vliw instruction.
0. 37. The processing system of
each pe is operable to receive data from other pes; and
each pe is operable to control from which pe data is received.
0. 39. The processing system of
0. 40. The processing system of
each pe includes a base address register; and
each pe determines the first address utilizing both the base address register associated with each pe and an offset value contained in the execute vliw instruction.
0. 41. The processing system of
the first and second instructions comprise very long instruction word (vliw) instructions; and
each vliw instruction comprises a plurality of simplex instructions.
0. 42. The processing system of
each pe comprises a plurality of execution units; and
each simplex instruction is adapted for being executed by at least one of the execution units.
0. 43. The processing system of
an instruction register for storing the execute vliw instruction; and
a predecoder for decoding if the instruction stored in the instruction register is an execute vliw instruction.
0. 44. The processing system of
0. 46. The method of
receiving a load vliw instruction which contains an address offset;
predecoding the load vliw instruction; and
determining the first address utilizing the address offset and the base address register.
0. 47. The method of
receiving the first function instruction; and
predecoding the first function instruction to determine into which slot the first instruction is to be loaded.
0. 48. The method of
determining if any of said plurality of slots are to be disabled; and
if any of said plurality of slots are to be disabled, loading a disable bit in a storage bit for each slot which is to be disabled.
0. 49. The method of
0. 50. The method of
removing the at least one group bit and the at least one unit field bit from the first function instruction before the first function instruction is loaded into the first vim; and
adding at least one replacement bit to the first function instruction.
0. 51. The method of
predecoding the execute vliw instruction; and
determining the first address utilizing the address offset and the base address register.
0. 52. The processing method of
masking the first pe to be enabled; and
masking the second pe to be disabled.
0. 53. The processing method of
masking the first pe to be disabled; and
masking the second pe to be enabled.
The present application is a continuation of Ser. No. 09/187,539, filed on Nov. 6, 1998, now U.S. Pat. No. 6,151,668.
The present invention claims the benefit of U.S. Provisional Application Ser. No. 60/064,619 entitled “Methods and Apparatus for Efficient Synchronous MIMD VLIW Communication” and filed Nov. 7, 1997.
For any Single Instruction Multiple Data stream (SIMD) machine with a given number of parallel processing elements, there will exist algorithms which cannot make efficient use of the available parallel processing elements, or in other words, the available computing resources. Multiple Instruction Multiple Data stream (MIMD) class machines execute some of these algorithms with more efficiency but require additional hardware to support a separate instruction stream on each processor and lose performance due to communication latency with lightly coupled program implementations. The present invention addresses a better machine organization for execution of these algorithms that reduces hardware cost and complexity while maintaining the best characteristics of both SIMD and MIMD machines and minimizing communication latency. The present invention provides a level of MIMD computational autonomy to SIMD indirect Very Long Instruction Word (iVLIW) processing elements while maintaining the single thread of control used in the SIMD machine organization. Consequently, the term Synchronous-MIMD (SMIMD) is used to describe the invention.
There are two primary parallel programming models, the SIMD and the MIMD models. In the SIMD model, there is a single program thread which controls multiple processing elements (PEs) in a synchronous lock-step mode. Each PE executes the same instruction but on different data. This is in contrast to the MIMD model, where multiple program threads of control exist and any inter-processor operations must contend with the latency that occurs when communicating between the multiple processors due to requirements to synchronize the independent program threads prior to communicating. The problem with SIMD is that not all algorithms can make efficient use of the available parallelism existing in the processor. The amount of parallelism inherent in different algorithms varies, leading to difficulties in efficiently implementing a wide variety of algorithms on SIMD machines. The problem with MIMD machines is the latency of communications between multiple processors, leading to difficulties in efficiently synchronizing processors to cooperate on the processing of an algorithm. Typically, MIMD machines also incur a greater cost of implementation as compared to SIMD machines since each MIMD PE must have its own instruction sequencing mechanism, which can amount to a significant amount of hardware. MIMD machines also have an inherently greater complexity of programming control required to manage the independent parallel processing elements. Consequently, levels of programming complexity and communication latency occur in a variety of contexts when parallel processing elements are employed. It would be highly advantageous to efficiently address such problems as discussed in greater detail below.
The present invention is preferably used in conjunction with the ManArray architecture various aspects of which are described in greater detail in U.S. patent application Ser. No. 08/885,310 filed Jun. 30, 1997, now U.S. Pat. No. 6,023,753, U.S. Ser. No. 08/949,122 filed Oct. 10, 1997, now U.S. Pat. No. 6,167,502, U.S. Ser. No. 09/169,255 filed Oct. 9, 1998, now U.S. Pat. No. 6,343,356, U.S. Ser. No. 09/169,256 filed Oct. 9, 1998 now U.S. Pat. No. 6,167,501 and U.S. Ser. No. 09/169,072 filed Oct. 9, 1998, now U.S. Pat. No. 6,219,776, Provisional Application Ser. No. 60/067,511 entitled “Method and Apparatus for Dynamically Modifying Instructions in a Very Long Instruction Word Processor” filed Dec. 4, 1997, Provisional Application Ser. No. 60/068,021 entitled “Methods and Apparatus for Scalable Instruction Set Architecture” filed Dec. 18, 1997, Provisional Application Ser. No. 60/071,248 entitled “Methods and Apparatus to Dynamically Expand the Instruction Pipeline of a Very Long Instruction Word Processor” filed Jan. 12, 1998, Provisional Application Ser. No. 60/072,915 entitled “Methods and Apparatus to Support Conditional Execution in a VLIW-Based Array Processor with Subword Execution” filed Jan. 28, 1998, Provisional Application Ser. No. 60/077,766 entitled “Register File Indexing Methods and Apparatus for Providing Indirect Control of Register in a VLIW Processor”, filed Mar. 12, 1998, Provisional Application Ser. No. 60/092,130 entitled “Methods and Apparatus for Instruction Addressing in Indirect VLIW Processors” filed on Jul. 9, 1998, Provisional Application Ser. No. 60/103,712 entitled “Efficient Complex Multiplexing and Fast Fourier Transform (FFT) Implementation on the ManArray” filed on Oct. 9, 1998, and Provisional Application Ser. No. 60/106,867 entitled “Methods and Apparatus for Improved Motion Estimation for Video Encoding” filed on Nov. 
3, 1998, respectively, all of which are assigned to the assignee of the present invention and incorporated herein in their entirety.
A ManArray processor suitable for use in conjunction with ManArray indirect Very Long Instruction Words (iVLIWs) in accordance with the present invention may be implemented as an array processor that has a Sequence Processor (SP) acting as an array controller for a scalable array of Processing Elements (PEs) to provide an indirect Very Long Instruction Word architecture. Indirect Very Long Instruction Words (iVLIWs) in accordance with the present invention may be composed in an iVLIW Instruction Memory (VIM) by the SIMD array controller Sequence Processor, or SP. Preferably, VIM exists in each Processing Element or PE and contains a plurality of iVLIWs. After an iVLIW is composed in VIM, another SP instruction, designated XV for “execute iVLIW” in the preferred embodiment, concurrently executes the iVLIW at an identical VIM address in all PEs. If all PE VIMs contain the same instructions, SIMD operation occurs. A one-to-one mapping exists between the XV instruction and the single identical iVLIW that exists in each PE.
To increase the efficiency of certain algorithms running on the ManArray, it is possible to operate indirectly on VLIW instructions stored in a VLIW memory with the indirect execution initiated by an execute VLIW (XV) instruction and with different VLIW instructions stored in the multiple PEs at the same VLIW memory address. When the SP instruction causes this set of iVLIWs to execute concurrently across all PEs, Synchronous MIMD or SMIMD operation occurs. A one-to-many mapping exists between the XV instruction and the multiple different iVLIWs that exist in each PE. No specialized synchronization mechanism is necessary since the multiple different iVLIW executions are instigated synchronously by the single controlling point SP with the issuance of the XV instruction. Due to the use of a Receive Model to govern communication between PEs and a ManArray network, the communication latency characteristic common to MIMD operations is avoided as discussed further below. Additionally, since there is only one synchronous locus of execution, additional MIMD hardware for separate program flow in each PE is not required. In this way, the machine is organized to support SMIMD operations at a reduced hardware cost while minimizing communication latency.
A ManArray indirect VLIW or iVLIW is preferably loaded under program control, although the alternatives of direct memory access (DMA) loading of the iVLIWs and implementing a section of VIM address space with ROM containing fixed iVLIWs are not precluded. To maintain a certain level of dynamic program flexibility, a portion of VIM, if not all of the VIM, will typically be of the random access type of memory. To load the random access type of VIM, a delimiter instruction, LV for Load iVLIW, specifies that a certain number of instructions that follow the delimiter are to be loaded into the VIM rather than executed. For SIMD operation, each PE gets the same instructions for each VIM address. To set up for SMIMD operation it is necessary to load different instructions at the same VIM address in each PE.
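The LV/XV interplay described above can be illustrated with a minimal software sketch. This is an assumption-laden analogue, not the patented hardware: the `PE` class, its method names, and the VIM size are all hypothetical, and "execution" is modeled simply as returning the stored instruction slots.

```python
# Hypothetical software model of LV-style delimiter loading and XV-style
# indirect execution; all names here are illustrative, not from the patent.

class PE:
    def __init__(self, vim_size=32):
        self.vim = [None] * vim_size   # indirect VLIW memory (VIM)
        self.enabled = True            # mask bit: ON -> participates in loads

    def load_vliw(self, address, instructions):
        # LV semantics: the delimiter directs that the following
        # instructions be loaded into VIM rather than executed.
        if self.enabled:
            self.vim[address] = list(instructions)

    def execute_vliw(self, address):
        # XV semantics: indirectly trigger the iVLIW stored at `address`.
        return self.vim[address]

pes = [PE() for _ in range(4)]

# SIMD setup: every PE receives the same instructions at VIM address 27.
for pe in pes:
    pe.load_vliw(27, ["li.p.w", "fmpy.pm.1fw"])

# A single XV concurrently initiates the iVLIW at address 27 in all PEs.
results = [pe.execute_vliw(27) for pe in pes]
assert all(r == ["li.p.w", "fmpy.pm.1fw"] for r in results)
```

Because every VIM holds identical instructions at the address, this run models pure SIMD operation; SMIMD arises only once the loads differ per PE.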
In the presently preferred embodiment, this is achieved by a masking mechanism that functions such that the loading of VIM only occurs on PEs that are masked ON. PEs that are masked OFF do not execute the delimiter instruction and therefore do not load the specified set of instructions that follow the delimiter into the VIM. Alternatively, different instructions could be loaded in parallel from the PE local memory or the VIM could be the target of a DMA transfer. Another alternative for loading different instructions into the same VIM address is through the use of a second LV instruction, LV2, which has a second 32-bit control word that follows the LV instruction. The first and second control words rearrange the bits between them so that a PE label can be added. This second LV2 approach does not require the PEs to be masked and may provide some advantages in different system implementations. By selectively loading different instructions into the same VIM address on different PEs, the ManArray is set up for the SMIMD operation.
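The mask-controlled loading just described can likewise be sketched in a few lines. This is one possible software analogue under stated assumptions (the `vims` mapping, `lv` helper, and mask sets are invented for illustration): only PEs masked ON accept the delimiter's payload, so repeated loads under different masks leave different instructions at the same VIM address.

```python
# Illustrative sketch of mask-gated VIM loading (hypothetical names).

vims = {pe_id: {} for pe_id in range(4)}   # per-PE VIM: address -> iVLIW
mask_on = {1, 2, 3}                        # PE0 masked OFF

def lv(address, instructions, mask):
    # PEs masked OFF do not execute the delimiter and so ignore
    # the instructions that follow it.
    for pe_id in mask:
        vims[pe_id][address] = instructions

lv(27, ("li.p.w", "fmpy.pm.1fw"), mask_on)   # load PEs 1-3
lv(27, ("si.p.w",), {0})                     # remask, load PE0 only

# Same VIM address, different operations per PE: the SMIMD setup.
assert vims[0][27] != vims[1][27]
assert vims[1][27] == vims[2][27] == vims[3][27]
```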
One problem encountered when implementing SMIMD operation is in dealing with inter-processing-element communication. In SIMD mode, all PEs in the array are executing the same instruction. Typically, these SIMD PE-to-PE communication instructions are thought of as following a Send Model. That is to say, the SIMD Send Model communication instructions indicate in which direction, or to which target PE, each PE should send its data. When a communication instruction such as SEND-WEST is encountered, each PE sends data to the PE topologically defined as being its western neighbor. The Send Model specifies both sender and receiver PEs. In the SEND-WEST example, each PE sends its data to its West PE and receives data from its East PE. In SIMD mode, this is not a problem.
In SMIMD mode of operation, using a Send Model, it is possible for multiple processing elements to all attempt to send data to the same neighbor. This attempt presents a hazardous situation because processing elements such as those in the ManArray may be defined as having only one receive port, capable of receiving from only one other processing element at a time. When each processing element is defined as having one receive port, such an attempted operation cannot complete successfully and results in a communication hazard.
To avoid the communication hazard described above, a Receive Model is used for the communication between PEs. Using the Receive Model, each processing element controls a switch that selects from which processing element it receives. It is impossible for communication hazards to occur because it is impossible for any two processing elements to contend for the same receive port. By definition, each PE controls its own receive port and makes data available without target PE specification. For any meaningful communication to occur between processing elements using the Receive Model, the PEs must be programmed to cooperate in the receiving of the data that is made available. Using Synchronous MIMD (SMIMD), this is guaranteed to occur if the cooperating instructions all exist at the same iVLIW location. Without SMIMD, a complex mechanism would be necessary to synchronize communications and use the Receive Model.
A more complete understanding of the present invention, as well as further features and advantages of the invention will be apparent from the following Detailed Description and the accompanying drawings.
FIGS. 4F1 and 4F2 illustrate slot storage for three Synchronous MIMD iVLIWs in a 2×2 ManArray configuration;
One set of presently preferred indirect Very Long Instruction Word (iVLIW) control instructions for use in conjunction with the present invention is described in detail below.
The SP 102 and each PE 104 in the ManArray architecture as adapted for use in accordance with the present invention contains a quantity of iVLIW memory (VIM) 106 as shown in FIG. 1. Each VIM 106 contains storage space to hold multiple VLIW instruction Addresses 103, and each Address is capable of storing up to eight simplex instructions. Presently preferred implementations allow each iVLIW instruction to contain up to five simplex instructions: one associated with each of the Store Unit 108, Load Unit 110, Arithmetic Logic Unit 112 (ALU), Multiply-Accumulate Unit 114 (MAU), and Data-Select Unit 116 (DSU). For example, an iVLIW instruction at VIM address “i” 105 contains the five instructions SLAMD.
iVLIW instructions can be loaded into an array of PE VIMs collectively, or, by using special instructions to mask a PE or PEs, each PE VIM can be loaded individually. The iVLIW instructions in VIM are accessed for execution through the Execute VLIW (XV) instruction, which, when executed as a single instruction, causes the simultaneous execution of the simplex instructions located at the VIM memory address. An XV instruction can cause the simultaneous execution of:
Only two control instructions are necessary to load/modify iVLIW memories, and to execute iVLIW instructions. They are:
The LV instruction 400 shown in
Any combination of individual instruction slots may be disabled via the disable slot parameter ‘d={SLAMD}’, where S=Store Unit (SU), L=Load Unit (LU), A=Arithmetic Logic Unit (ALU), M=Multiply-Accumulate Unit (MAU) and D=Data Select Unit (DSU). A blank ‘d=’ parameter does not disable any slots. Specified slots are disabled prior to any instructions being loaded.
The number of instructions to load is specified utilizing an InstrCnt parameter. For the present implementation, valid values are 0-5. The next InstrCnt instructions following LV are loaded into the specified VIM. The Unit Affecting Flags (UAF) parameter ‘F=[AMD]’ selects which arithmetic instruction slot (A=ALU, M=MAU, D=DSU) is allowed to set condition flags for the specified VIM when it is executed. A blank ‘F=’ selects the ALU instruction slot. During processing of the LV instruction no arithmetic flags are affected, and the number of cycles is one plus the number of instructions loaded.
The XV instruction 425 shown in
Any combination of individual instruction slots may be executed via the execute slot parameter ‘E={SLAMD}’, where S=Store Unit (SU), L=Load Unit (LU), A=Arithmetic Logic Unit (ALU), M=Multiply-Accumulate Unit (MAU), D=Data Select Unit (DSU). A blank ‘E=’ parameter does not execute any slots. The Unit Affecting Flags (UAF) parameter ‘F={AMDN}’ overrides the UAF specified for the VLIW when it was loaded via the LV instruction. The override selects which arithmetic instruction slot (A=ALU, M=MAU, D=DSU) or none (N=NONE) is allowed to set condition flags for this execution of the VLIW. The override does not affect the UAF setting specified by the LV instruction. A blank ‘F=’ selects the UAF specified when the VLIW was loaded.
Condition flags are set by the individual simplex instruction in the slot specified by the setting of the ‘F=’ parameter from the original LV instruction or as overridden by an ‘F=[AMD]’ parameter in the XV instruction. Condition flags are not affected when ‘F=N’. Operation occurs in one cycle. Pipeline considerations must be taken into account based upon the individual simplex instructions in each of the slots that are executed. Descriptions of individual fields in these iVLIW instructions are shown in
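The UAF selection rule described above reduces to a small decision function, sketched here as a hypothetical helper (the function name and encoding are assumptions, not part of the instruction set): a blank ‘F=’ on XV defers to the LV-time setting, an explicit A/M/D overrides it for that one execution, and N suppresses flag setting entirely.

```python
# Hypothetical sketch of the one-shot UAF override rule (names invented).

def flag_slot(lv_uaf, xv_uaf=None):
    # lv_uaf: slot chosen at LV time ('A', 'M', or 'D').
    # xv_uaf: XV override: 'A', 'M', 'D', 'N' (none), or None for blank F=.
    if xv_uaf is None:
        return lv_uaf                  # blank F= -> keep the LV setting
    return None if xv_uaf == "N" else xv_uaf

assert flag_slot("A") == "A"           # blank override keeps LV setting
assert flag_slot("A", "M") == "M"      # one-shot override to the MAU slot
assert flag_slot("A", "N") is None     # no slot sets flags this execution
```

Note that, as the text states, the override is per execution only: the stored LV-time setting (the first argument) is never modified.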
The ADD instruction 450 shown in
Individual, Group, and “Synchronous MIMD” PE iVLIW Operations
The LV and XV instructions may be used to load, modify, disable, or execute iVLIW instructions in individual PEs or PE groups defined by the programmer. To do this, individual PEs are enabled or disabled by an instruction which modifies a Control Register located in each PE which, among other things, enables or disables each PE. To load and operate an individual PE or a group of PEs, the control registers are modified to enable individual PE(s), and to disable all others. Normal iVLIW instructions will then operate only on PEs that are enabled.
Referring to
Upon receipt of an XV instruction in IR1 510, the VIM address 511 is calculated by use of the specified Vb register 502 added by adder 504 to the offset value included in the XV instruction via path 503. The resulting VIM Address 507 is passed through multiplexer 508 to address the VIM. The iVLIW at the specified address is read out of the VIM 516 and passes through the multiplexers 530, 532, 534, 536, and 538, to the IR2 registers 514. As an alternative to minimize the read VIM access timing critical path, the output of VIM 516 can be latched into a register whose output is passed through a multiplexer prior to the decode stage logic.
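The base-plus-offset address computation just described can be expressed as a one-line sketch. The helper below is hypothetical (the function name, wrap-around behavior, and VIM size are assumptions for illustration only):

```python
# Illustrative sketch of the XV VIM addressing: Vb base register plus
# the offset carried in the XV instruction (names and sizes assumed).

def vim_address(vb_register, xv_offset, vim_size=64):
    # The adder combines the selected Vb base register with the
    # offset field of the XV instruction to form the VIM address.
    return (vb_register + xv_offset) % vim_size

v0 = 0   # base register loaded with zero, as in the example code below
assert vim_address(v0, 27) == 27
assert vim_address(v0, 29) == 29
```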
For execution of the XV instruction, the IR2MUX1 control signal 533 in conjunction with the pre-decode XVc1 control signal 517 causes all the IR2 multiplexers, 530, 532, 534, 536, and 538, to select the VIM output paths, 541, 543, 545, 547, and 549. At this point, the five individual decode and execution stages of the pipeline, 540, 542, 544, 546, and 548, are completed in synchrony, providing the iVLIW parallel execution performance. To allow a single 32-bit instruction to execute by itself in the PE or SP, the bypass VIM path 535 is shown. For example, when a simplex ADD instruction is received into IR1 510 for parallel array execution, the pre-decode function 512 generates the IR2MUX1 control signal 533, which in conjunction with the instruction type pre-decode signal 523, in the case of an ADD, and lack of an XV 517 or LV 515 active control signal, causes the ALU multiplexer 534 to select the bypass path 535.
Since a ManArray can be configured with a varying number of PEs,
It is noted that in the previous discussion, covered by
To allow a single 32-bit instruction to execute by itself in the iVLIW PE or iVLIW SP, the bypass VIM path 835 is shown in FIG. 8A. For example, when a simplex ADD instruction is received into IR1 810 for parallel array execution, the pre-decode function 812 generates the IR2MUX2 control signal 833, which in conjunction with the instruction type pre-decode signal 823, in the case of an ADD, and lack of an XV 817 or LV 815 active control signal, causes the ALU multiplexer 834 to select the bypass path 835. Since, as described herein, the bypass operation is to occur during a full stage of the pipeline, it is possible to replace the group bits and the unit field bits in the bypassed instructions as they enter the IR2 latch stage. This is indicated in
It is noted that alternative formats for VIM iVLIW storage are possible and may be preferable depending upon technology and design considerations. For example,
In a processor consisting of an SP controller 102 as in
In attempting to extend the Send Model into the SMIMD mode, other problems may occur. One such problem is that in SMIMD mode it is possible for multiple processing elements to all attempt to send data to a single PE, since each PE can receive a different inter-PE communication instruction. The two attributes of the SIMD Send Model break down immediately, namely having a common inter-PE instruction and specifying both source and target, or, in other words, both sender and receiver. It is a communications hazard to have more than one PE target the same PE in a SIMD model with single cycle communication. This communication hazard is shown in
This arrangement is shown for a 2×2 array processor 1100 in
For example, VIM entry number 29 in PE2 495 is loaded with the four instructions li.p.w R3, A1+, A7, fmpy.pm.1fw R5, R2, R31, fadd.pa.1fw R9, R7, R5, and pexchg.pd.w R8, R0, 2×2_PE3. These instructions are those found in the next to last row of FIG. 4F. That same VIM entry (29) contains different instructions in PEs 0, 1, and 3, as can be seen by the rows corresponding to these PEs on VIM entry 29, for PE0 491, PE1 493, and PE3 497.
The following example 1-1 shows the sequence of instructions which load the PE VIM memories as defined in FIG. 4F. Note that PE Masking is used in order to load different instructions into different PE VIMs at the same address.
! first load in instructions common to PEs 1, 2, 3
lim.s.h0 SCR1, 1
! mask off PE0 in order to load in 1, 2, 3
lim.s.h0 VAR, 0
! load VIM base address reg v0 with zero
lv.p v0, 27, 2, d=, f=
! load VIM entry v0+27 (=27) with the
! next two instructions; disable no
! instrs; default flag setting to ALU
li.p.w R1, A1+, A7
! load instruction into LU
fmpy.pm.1fw R6, R3, R31
! mpy instruction into MAU
lv.p v0, 28, 2, d=, f=
! load VIM entry v0+28 (=28) with the
! next two instructions; disable no
! instrs; default flag setting to ALU
li.p.w R2, A1+, A7
! load instruction into LU
fmpy.pm.1fw R4, R1, R31
! mpy instruction into MAU
lv.p v0, 29, 2, d=, f=
! load VIM entry v0+29 (=29) with the
! next two instructions; disable no
! instrs; default flag setting to ALU
li.p.w R3, A1+, A7
! load instruction into LU
fmpy.pm.1fw R5, R2, R31
! mpy instruction into MAU
! now load in instructions unique to PE0
lim.s.h0 SCR1, 14
! mask off PEs 1, 2, 3 to load PE0
nop
! one cycle delay to set mask
lv.p v0, 27, 1, d=!mad, f=
! load VIM entry v0+27 (=27) with the
! next instruction; disable instrs
! in LU, MAU, ALU, DSU slots; default
! flag setting to ALU
si.p.w R1, A2+, R28
! store instruction into SU
lv.p v0, 28, 1, d=!mad, f=
! load VIM entry v0+28 (=28) with the
! next instruction; disable instrs
! in LU, MAU, ALU, DSU slots; default
! flag setting to ALU
si.p.w R1, A2+, R28
! store instruction into SU
lv.p v0, 29, 1, d=!mad, f=
! load VIM entry v0+29 (=29) with the
! next instruction; disable instrs
! in LU, MAU, ALU, DSU slots; default
! flag setting to ALU
si.p.w R1, A2+, R28
! store instruction into SU
! now load in instructions unique to PE1
lim.s.h0 SCR1, 13
! mask off PEs 0, 2, 3 to load PE1
nop
! one cycle delay to set mask
lv.p v0, 27, 3, d=, f=
! load VIM entry v0+27 (=27) with the
! next three instructions; disable no
! instrs; default flag setting to ALU
fadd.pa.1fw R10, R9, R8
! add instruction into ALU
pexchg.pd.w R7, R0, 2x2_PE3
! pe comm instruction into DSU
si.p.w R10, +A2, A6
! store instruction into SU
lv.p v0, 28, 2, d=s, f=
! load VIM entry v0+28 (=28) with the
! next two instructions; disable instr
! in SU slot; default flag setting to ALU
fadd.pa.1fw R9, R7, R4
! add instruction into ALU
pexchg.pd.w R8, R5, 2x2_PE2
! pe comm instruction into DSU
lv.p v0, 29, 3, d=, f=
! load VIM entry v0+29 (=29) with the
! next three instructions; disable no
! instrs; default flag setting to ALU
fcmpLE.pa.1fw R10, R0
! compare instruction into ALU
pexchg.pd.w R15, R6, 2x2_PE1
! pe comm instruction into DSU
t.sii.p.w R0, A2+, 0
! store instruction into SU
! now load in instructions unique to PE2
lim.s.h0 SCR1, 11
! mask off PEs 0, 1, 3 to load PE2
nop
! one cycle delay to set mask
lv.p v0, 27, 3, d=, f=
! load VIM entry v0+27 (=27) with the
! next three instructions; disable no
! instrs; default flag setting to ALU
fcmpLE.pa.1fw R10, R0
! compare instruction into ALU
pexchg.pd.w R15, R6, 2x2_PE2
! pe comm instruction into DSU
t.sii.p.w R0, A2+, 0
! store instruction into SU
lv.p v0, 28, 3, d=, f=
! load VIM entry v0+28 (=28) with the
! next three instructions; disable no
! instrs; default flag setting to ALU
fadd.pa.1fw R10, R9, R8
! add instruction into ALU
pexchg.pd.w R7, R4, 2x2_PE1
! pe comm instruction into DSU
si.p.w R10, +A2, A6
! store instruction into SU
lv.p v0, 29, 2, d=s, f=
! load VIM entry v0+29 (=29) with the
! next two instructions; disable instr
! in SU slot; default flag setting to ALU
fadd.pa.1fw R9, R7, R5
! add instruction into ALU
pexchg.pd.w R8, R0, 2x2_PE3
! pe comm instruction into DSU
! now load in instructions unique to PE3
lim.s.h0 SCR1, 7
! mask off PEs 0, 1, 2 to load PE3
nop
! one cycle delay to set mask
lv.p v0, 27, 2, d=s, f=
! load VIM entry v0+27 (=27) with the
! next two instructions; disable instr
! in SU slot; default flag setting to ALU
fadd.pa.1fw R9, R7, R6
! add instruction into ALU
pexchg.pd.w R8, R4, 2x2_PE2
! pe comm instruction into DSU
lv.p v0, 28, 2, d=d, f=
! load VIM entry v0+28 (=28) with the
! next 2 instructions; disable instr in
! DSU slot; default flag setting to ALU
fcmpLE.pa.1fw R10, R0
! compare instruction into ALU
t.sii.p.w R0, A2+, 0
! store instruction into SU
lv.p v0, 29, 3, d=, f=
! load VIM entry v0+29 (=29) with the
! next three instructions; disable no
! instrs; default flag setting to ALU
fadd.pa.1fw R10, R9, R8
! add instruction into ALU
pexchg.pd.w R7, R5, 2x2_PE1
! pe comm instruction into DSU
si.p.w R10, +A2, A6
! store instruction into SU
lim.s.h0 SCR1, 0
! reset PE mask so all PEs are on
nop
! one cycle delay to set mask
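The mask-then-load pattern used throughout the listing above can be sketched in a few lines. The following Python model is a simplified illustration (the `PE` class, `load_vliw` function, and mask handling are assumptions for this sketch, not ManArray definitions): a mask register selects which PEs latch the instructions that follow an LV delimiter, so the same VIM address can hold different VLIWs on different PEs.

```python
# Hypothetical model of masked LV loading: only PEs whose mask bit is
# clear latch the instructions following an LV delimiter into their VIM.
# Class and function names are illustrative, not part of the ManArray ISA.

class PE:
    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.vim = {}          # VIM entry number -> list of slot instructions

def load_vliw(pes, mask, entry, slots):
    """Load `slots` into VIM `entry` of every unmasked PE.

    `mask` mirrors the SCR1 usage in the listing: bit i set = PE i masked off.
    """
    for pe in pes:
        if not (mask >> pe.pe_id) & 1:     # only unmasked PEs participate
            pe.vim[entry] = list(slots)

pes = [PE(i) for i in range(4)]

# Mask 11 (0b1011) masks PEs 0, 1, 3 so that only PE2 loads entry 27.
load_vliw(pes, 11, 27, ["fcmpLE -> ALU", "pexchg -> DSU", "t.sii -> SU"])
# Mask 7 (0b0111) masks PEs 0, 1, 2 so that only PE3 loads entry 27.
load_vliw(pes, 7, 27, ["fadd -> ALU", "pexchg -> DSU"])

# The same VIM address now holds different operations on different PEs,
# which is the essence of the Synchronous-MIMD approach.
```

The point of the model is that after the three masked load passes, one `xv` broadcast to entry 27 issues a different VLIW on each PE while preserving the single SIMD thread of control.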
The following Example 1-2 shows the sequence of instructions that execute the PE VIM entries as loaded by the Example 1-1 code in FIG. 4F. Note that no PE masking is necessary: the specified VIM entry is executed in each of the PEs, PE0, PE1, PE2, and PE3.
! address register, loop, and other setup would be here
. . .
! startup VLIW execution
! f= parameter indicates default to LV flag setting
xv.p v0, 27, e=l, f=
! execute VIM entry V0+27, LU only
xv.p v0, 28, e=lm, f=
! execute VIM entry V0+28, LU, MAU only
xv.p v0, 29, e=lm, f=
! execute VIM entry V0+29, LU, MAU only
xv.p v0, 27, e=lmd, f=
! execute VIM entry V0+27, LU, MAU,
! DSU only
xv.p v0, 28, e=lamd, f=
! execute VIM entry V0+28, all units
! except SU
xv.p v0, 29, e=lamd, f=
! execute VIM entry V0+29, all units
! except SU
xv.p v0, 27, e=lamd, f=
! execute VIM entry V0+27, all units
! except SU
xv.p v0, 28, e=lamd, f=
! execute VIM entry V0+28, all units
! except SU
xv.p v0, 29, e=lamd, f=
! execute VIM entry V0+29, all units
! except SU
! loop body - mechanism to enable looping has been previously set up
loop_begin: xv.p v0, 27, e=slamd, f=
! execute v0+27, all units
xv.p v0, 28, e=slamd, f=
! execute v0+28, all units
loop_end: xv.p v0, 29, e=slamd, f=
! execute v0+29, all units
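The e= parameter on xv gates which execution-unit slots of a VIM entry actually issue, which is what lets the startup sequence above fill the software pipeline one stage at a time before the loop body runs all units. A minimal Python sketch of that gating (the dictionary layout and function name are assumptions for illustration):

```python
# Sketch of xv unit-enable gating: only the slots named in the e=
# parameter issue from a VIM entry; the remaining slots behave as NOPs.

UNIT_NAMES = {"s": "SU", "l": "LU", "a": "ALU", "m": "MAU", "d": "DSU"}

def execute_xv(vim_entry, enable):
    """Return the subset of a VLIW's slots that issue under an e= mask."""
    enabled = {UNIT_NAMES[c] for c in enable}
    return {unit: op for unit, op in vim_entry.items() if unit in enabled}

# A full five-slot entry, as in the steady-state loop body.
entry = {"SU": "store", "LU": "load", "ALU": "fadd",
         "MAU": "fmpy", "DSU": "pexchg"}

startup_pass = execute_xv(entry, "l")        # e=l: loads only
ramp_up_pass = execute_xv(entry, "lamd")     # e=lamd: all units except SU
steady_state = execute_xv(entry, "slamd")    # e=slamd: every slot issues
```

Successive xv invocations widen the enable set, so no slot stores a result before the data it depends on has flowed through the earlier pipeline stages.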
Description of Exemplary Algorithms Being Performed
The iVLIWs defined in the preceding example operate on three variable vectors at a time in order to avoid redundant calculations or idle PEs. Due to the distribution of the vector components over the PEs, it is not feasible to use PE0 to compute a fourth vector dot product. PE0 is instead advantageously employed to take care of some setup for a future algorithm stage. This can be seen in the iVLIW load slots: vector 1 is loaded in iVLIW 27 (component-wise across the PEs, as described above), vector 2 in iVLIW 28, and vector 3 in iVLIW 29 (lo.p.w R*, A1+, A7). PE1 computes the x component of the dot product for each of the three vectors, PE2 computes the y component, and PE3 computes the z component (fmpy.pm.1fw R*, R*, R31). At this point, communication among the PEs must occur in order to deliver the y and z components of the vector 1 dot product to PE1, the x and z components of the vector 2 dot product to PE2, and the x and y components of the vector 3 dot product to PE3. This communication occurs in the DSU via the pexchg instruction. In this way, each PE simultaneously sums (fadd.pa.1fw R9, R7, R* and fadd.pa.1fw R10, R9, R8) the components of a unique dot product result. These results are then stored (si.p.w R10, +A2, A6) into the PE memories. Note that each PE computes and stores every third result. The final set of results is then accessed in round-robin fashion from PEs 1, 2, and 3.
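The data flow just described can be checked with a small numeric model. This Python sketch (function and variable names are illustrative assumptions) mimics the three stages: per-component multiplies on PE1..PE3, the pexchg routing of partial products, and the per-PE sums:

```python
# Hypothetical numeric model of the distributed dot products: PE1/PE2/PE3
# each multiply one component of all three vectors, exchange partial
# products, then each sums one complete, unique dot product.

def smimd_dot_products(v1, v2, v3, w):
    """Compute dot(v1,w), dot(v2,w), dot(v3,w) the way the iVLIWs do."""
    vectors = [v1, v2, v3]
    # Multiply stage: PE k (k = 1..3) forms component k-1 of every product.
    partials = [[vec[k] * w[k] for vec in vectors] for k in range(3)]
    # pexchg stage: route partials so PE k ends up holding all three
    # components of vector k's dot product.
    per_pe = [[partials[c][k] for c in range(3)] for k in range(3)]
    # fadd stage: each PE sums its own, unique result.
    return [sum(components) for components in per_pe]

results = smimd_dot_products([1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 1, 1])
```

Each list element of `results` corresponds to the value one PE stores, which is why consecutive results land in PE1, PE2, and PE3 memories in round-robin order.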
Additionally, each PE performs a comparison (fcmpLE.pa.1fw R10, R0) of its dot product result with zero (held in PE register R0), and conditionally stores a zero (t.sii.p.w R0, A2+, 0) in place of the computed dot product if that dot product was negative. In other words, the test determines whether the comparison "is R10 less than R0?" is true. This implementation of a dot product with removal of negative values is used, for example, in lighting calculations for 3D graphics applications.
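Functionally, the compare/conditional-store pair clamps each result at zero, the standard max(d, 0) step of a diffuse-lighting computation. A one-line Python equivalent (the function name is an assumption for this sketch):

```python
# The fcmpLE-against-zero test plus conditional store of zero is
# equivalent to clamping the dot product result at zero.

def clamp_dot(dot):
    """Store zero in place of a non-positive dot product result."""
    return 0.0 if dot <= 0.0 else dot
```

Doing the clamp with a conditional store rather than a branch keeps all PEs on the single SIMD instruction thread regardless of their individual compare outcomes.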
While the present invention has been disclosed in the context of presently preferred methods and apparatus for carrying out the invention, various alternative implementations and variations will be readily apparent to those of ordinary skill in the art. By way of example, the present invention does not preclude the ability to load an instruction into VIM and also execute the instruction. This capability was deemed an unnecessary complication for the presently preferred programming model, among other considerations such as instruction formats and hardware complexity. Consequently, the Load iVLIW delimiter approach was chosen.
Morris, Grayson, Revilla, Juan Guillermo, Pechanek, Gerald George, Strube, David, Drabenstott, Thomas L.
Executed on | Assignor | Assignee | Conveyance | Reel/Frame
Jun 21 2004 | | Altera Corp. | (assignment on the face of the patent) |
Aug 24 2006 | PTS Corporation | Altera Corporation | Assignment of assignors' interest (see document for details) | 018184/0423