Techniques are described for distributing data to, and collecting data from, multiple memories in a data processing system. The system may suitably be a manifold array (ManArray) processing system employing an array of processing elements. Virtual to physical processing element (PE) identifier translation is employed in conjunction with a ManArray PE interconnection topology to support a variety of communication models, such as the hypercube. Also, PE addressing modes are based upon logically nested parameterized loops. Mechanisms for updating loop parameters, as well as exemplary instruction formats, are also described.
0. 26. A method of accessing local memory of a plurality of processing elements (PEs), the method comprising:
receiving a transfer instruction for transferring data between system memory and the local memory of a plurality of processing elements (PEs);
running a process containing a set of nested loops, the set of nested loops having a plurality of parameters to be assigned values of fields carried in the transfer instruction;
decoding the transfer instruction to assign field values to the plurality of parameters;
assigning the field values to the plurality of parameters in order to control PE selection and address generation for accessing a memory location in local memory of each selected PE; and
generating addresses to access local memory of each PE in a defined pattern.
0. 18. An apparatus for accessing local memory of a plurality of processing elements (PEs), the apparatus comprising:
a transfer controller running a process containing a set of nested loops, the set of nested loops having a plurality of parameters to be specified by a transfer instruction, the plurality of parameters, when assigned, control PE selection and address generation for accessing a memory location in local memory of each selected PE; and
a means for receiving the transfer instruction for transferring data between system memory and local memory of the plurality of PEs, the transfer instruction having fields which specify values for the plurality of parameters, the transfer instruction indicating an addressing mode, the addressing mode specifying a particular pattern of accessing local memory of the plurality of PEs, wherein the transfer controller decodes the transfer instruction to assign values to the plurality of parameters, the process generating addresses for accessing a memory location in local memory of each selected PE in a particular pattern, wherein the particular pattern is based on the assigned parameters.
0. 1. An apparatus for performing virtual identification (VID) to physical identification (PID) translation for data elements to be accessed within local memory of a processing element (PE) whereby a direct memory access (DMA) controller can access PE local memories according to their VIDs, the apparatus comprising:
an array of multiple PEs each having local PE memory;
a DMA controller; and
a memory maintained in the DMA controller for storing a processing element VID-to-PID table mapping processing element VIDs to processing element PIDs utilized by the DMA controller to access local memories according to their VIDs.
0. 2. The apparatus of
0. 3. The apparatus of
0. 4. The apparatus of
0. 5. The apparatus of
0. 6. The apparatus of
0. 7. The apparatus of
0. 8. The apparatus of
0. 9. The apparatus of
a local memory interface unit for each processing element (PE) storing a VID for each PE.
0. 10. The apparatus of
0. 11. The apparatus of
0. 12. The apparatus of
0. 13. A processing apparatus comprising:
a plurality of processing elements (PEs) communicatively connected by a bus, each PE comprising a register storing a virtual identification number (VID) identifying the PE; and
a direct memory access (DMA) controller connected to the bus for accessing local data memory of the PEs, each data access at least partially identified by a VID;
wherein during a common data access to multiple PEs, a PE responds to the data access if the VID stored in the register matches the VID of the data access.
0. 14. The processing apparatus of
0. 15. The processing apparatus of
0. 16. The processing apparatus of
0. 17. The processing apparatus of
0. 19. The apparatus of
0. 20. The apparatus of
0. 21. The apparatus of
0. 22. The apparatus of
0. 23. The apparatus of
0. 24. The apparatus of
0. 25. The apparatus of
0. 27. The method of
0. 28. The method of
0. 29. The method of
0. 30. The method of
for i=0, 1, 2, . . . etc., and where base_address, stride and hold are parameters, and where division is integer division in which any remainder is discarded.
An “address generation unit (AGU)” is a hardware module that generates a sequence of addresses (a data access pattern) according to a programmed address mode.
“EOT” means “end-of-transfer” and refers to the state when a transfer execution unit (described in the following text) has completed its most recent transfer instruction by transferring the number of elements specified by the instruction's transfer count field.
The term “host processor” as used in the following description is any processor or device which can write control commands and read status from the DMA controller and/or which can respond to DMA controller messages and signals. In general, a host processor interacts with a DMA controller to control and synchronize the flow of data between devices and memories in the system in such a way as to avoid overrun and underrun conditions at the sources and destinations of data transfers.
The present invention provides a set of flexible addressing modes for supporting efficient data transfers to and from multiple memories, together with methods and apparatus for allowing data accesses to be directed to PEs according to virtual as opposed to physical IDs. This section describes an exemplary DMA controller and a system environment in which the present inventions may be effectively used. The following sections describe PE memory addressing, virtual-to-physical PE ID translation and its purpose, and a set of PE memory addressing modes or “PE addressing modes” which support numerous parallel algorithms with highly efficient data transfer.
In this representative system, the DMA controller also connects to two system busses, a system control bus (SCB) 235 and a system data bus (SDB) 240. The DMA controller is designed to transfer data between devices on the SDB 240, such as a system memory 250 and the DSP 203 local memories 210-215. The SCB 235 is used by an SCB master such as the DSP 203 or a host control processor (HCP) 245 to program the DMA controller 201 with read and write addresses and registers to initiate control operations and read status. The SCB 235 is also used by the DMA controller 201 to send synchronization messages to other SCB bus slaves such as the DSP control registers 225 and a host I/O block 255. Some registers in these slaves can be polled by the DSP and HCP to receive status from the DMA. Alternatively, DMA writes to some of these slave addresses can be programmed to cause interrupts to the DSP and/or HCP allowing DMA controller messages to be handled by interrupt service routines.
Each transfer controller within a ManArray DMA controller is designed to fetch its own stream of DMA instructions. DMA instructions are of five basic types: transfer; branch; load; synchronization; and state control. The branch, load, synchronization, and state control types of instructions are collectively referred to as “control instructions”, and distinguished from the transfer instructions which actually perform data transfers. DMA instructions are typically of multi-word length and require a variable number of cycles to execute although several control instructions require only a single word to specify. Although the presently preferred embodiment supports multiple DMA instruction types as described in further detail in U.S. patent application Ser. No. 09/471,217 filed Dec. 23, 1999, now U.S. Pat. No. 6,260,082, and incorporated by reference in its entirety herein, the present invention focuses on instructions and mechanisms which provide for flexible and efficient data transfers to and from multiple memories.
Referring further to system 400 of
A “transfer-system-inbound” (TSI) instruction moves data from the SDB 470 to the IDQ 405 and is executed by the STU. A “transfer-core-inbound” (TCI) instruction moves data from the IDQ 405 to the DMA Bus 425 and is executed by the CTU. A “transfer-core-outbound” (TCO) instruction moves data from the DMA Bus 425 to the ODQ 406 and is executed by the CTU. A “transfer-system-outbound” (TSO) instruction moves data from the ODQ 406 to the SDB 470 and is executed by the STU. Two transfer instructions are required to move data between an SDB system memory and one or more SP or PE local memories on the DMA bus, and both instructions are executed concurrently: a TSI, TCI pair or a TSO, TCO pair.
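The four transfer instruction types and their pairing can be summarized in a small model. This is a minimal sketch for illustration only: the `TransferType` class, the `TRANSFER_TYPES` table and the `is_valid_pair` helper are hypothetical names, not part of the described instruction set. It encodes only the source, destination and executing unit stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferType:
    """Hypothetical record describing one transfer instruction type."""
    mnemonic: str
    source: str
    destination: str
    unit: str  # executing unit: "STU" or "CTU"


# Source, destination and executing unit as stated in the text above.
TRANSFER_TYPES = {
    "TSI": TransferType("TSI", "SDB", "IDQ", "STU"),
    "TCI": TransferType("TCI", "IDQ", "DMA bus", "CTU"),
    "TCO": TransferType("TCO", "DMA bus", "ODQ", "CTU"),
    "TSO": TransferType("TSO", "ODQ", "SDB", "STU"),
}


def is_valid_pair(first: str, second: str) -> bool:
    """Return True if data written by `first` is the data read by `second`
    and the two instructions run on different units (one STU, one CTU)."""
    a, b = TRANSFER_TYPES[first], TRANSFER_TYPES[second]
    return a.destination == b.source and {a.unit, b.unit} == {"STU", "CTU"}


# Inbound path: SDB -> IDQ -> DMA bus.
print(is_valid_pair("TSI", "TCI"))
# Outbound path: DMA bus -> ODQ -> SDB.
print(is_valid_pair("TCO", "TSO"))
```

In this model the arguments are given in data-flow order, so the outbound pair is checked as TCO followed by TSO; in operation both instructions of a pair execute concurrently.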
The address parameter of STU transfer instructions TSI and TSO refers to addresses on the SDB while the address parameter of CTU transfer instructions refers to addresses on the DMA bus to PE and SP local memories.
While there are six memories 210, 211, 212, 213, 214, and 215 shown in
The ManArray architecture supports a unique interconnection network between processing elements (PEs) which uses PE virtual IDs (VIDs) to support useful single-cycle communication paths, for example, torus or hypercube paths. In some array organizations, a PE's physical and virtual IDs are equal. The VIDs are used in the architecture to specify the pattern for data distribution and collection. When data is distributed according to the pattern established by VID assignment, the efficient inter-PE communication required by the programmer becomes available. As an example, if a programmer needs to establish hypercube connectivity for a 16 PE ManArray processor, the data will be distributed according to a VID assignment in such a manner that the physical switch connections allow data to be transferred between PEs as though the switch topology were a hypercube, even if the switch connections between physical PEs do not support the full hypercube interconnect. The present invention describes two approaches whereby the DMA controller can access PE memories according to their VIDs, effectively mapping PE virtual IDs to PE physical IDs (PIDs). The first uses VID-to-PID translation within the CTU of a transfer controller. This translation can be performed either through table lookup or through logic permutations on the VID. The second approach associates a VID with a PE by providing a programmable register within the PE or the PE local memory interface unit (LMIU).
VID to PID Translation within the DMA Controller
With this approach, a PE VID-to-PID table is maintained in the DMA controller so that data may be distributed to the ManArray according to a programmer's view of the array. In the preferred embodiment, this table is maintained in the CTU of each transfer controller.
The approach of
In the presently preferred embodiment, a lookup table is used to perform the VID-to-PID translation. Two approaches are provided for initializing the translation table. The first is through a DMA instruction 800, shown in FIG. 8. When executed, DMA instruction 800 loads a PETABLE register 900 which is illustrated in FIG. 9. The second approach is through a direct write of the PETABLE register 900 via the SCB.
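The table-lookup form of translation can be sketched as follows. This is a minimal illustration, not the register layout of the PETABLE register 900: the `PeTable` class, its identity default and the requirement that the table be a permutation of the physical IDs are all assumptions made for the example.

```python
class PeTable:
    """Hypothetical model of a VID-to-PID translation table held in the CTU."""

    def __init__(self, num_pes: int):
        # Assumption: before any load, each VID maps to the same-numbered PID.
        self.vid_to_pid = list(range(num_pes))

    def load(self, pids):
        """Replace the whole table, as a table-load instruction or a direct
        register write might. Entry i is the PID assigned to VID i."""
        # Assumption: every physical PE appears exactly once.
        if sorted(pids) != list(range(len(self.vid_to_pid))):
            raise ValueError("table must be a permutation of the physical IDs")
        self.vid_to_pid = list(pids)

    def translate(self, vid: int) -> int:
        """Return the physical PE ID that a data access to `vid` is sent to."""
        return self.vid_to_pid[vid]


table = PeTable(4)
table.load([0, 1, 3, 2])   # swap the physical PEs behind VIDs 2 and 3
print(table.translate(2))  # 3
```

With such a table the transfer instruction itself only ever names VIDs; changing the programmer's view of the array means reloading the table, not rewriting the transfer instructions.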
PE Virtual IDs Stored in Local Memory Interface Units
The second approach to directing data access according to PE VID relies on distributing the PE VIDs to each PE local memory interface unit (LMIU). The VID for each PE might reside in a register either in the PE itself or in its LMIU. In this case, there is no translation table or logic in the DMA lane controllers. In common with the preceding approach, there is a PE ID component of the DMA bus which is driven by the transfer controllers and used by the LMIUs to compare for a match with the locally visible PE VID. When a match is detected in a PE, that PE accepts the access, which may be either a write or a read request. Means for updating the VIDs stored locally in the LMIUs may be provided through the use of registers visible in the PE register address space, or through a PE instruction which broadcasts the table to all PEs, which then select their VID using their hard-coded PID stored locally. This approach has advantages when VIDs are used for purposes other than data distribution and collection by a DMA controller.
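The match-and-accept behavior described for this approach can be sketched as below. The `Lmiu` class and its method names are hypothetical, and the broadcast table is assumed to be indexed by PID and to hold the VID for that PE; the text does not specify the table layout.

```python
class Lmiu:
    """Hypothetical local memory interface unit holding its PE's VID."""

    def __init__(self, pid: int, vid: int, mem_words: int = 16):
        self.pid = pid              # hard-coded physical ID
        self.vid = vid              # programmable virtual ID register
        self.memory = [0] * mem_words

    def write(self, bus_pe_id: int, offset: int, value: int) -> bool:
        """Accept the write only if the PE ID on the DMA bus matches the VID."""
        if bus_pe_id != self.vid:
            return False
        self.memory[offset] = value
        return True

    def read(self, bus_pe_id: int, offset: int):
        """Return the addressed word on a match, or None if not selected."""
        if bus_pe_id != self.vid:
            return None
        return self.memory[offset]

    def update_vid_from_broadcast(self, table):
        # Assumption: the broadcast table is indexed by PID and holds VIDs.
        self.vid = table[self.pid]


lmius = [Lmiu(pid, vid=pid) for pid in range(4)]
for lmiu in lmius:
    lmiu.update_vid_from_broadcast([0, 1, 3, 2])

# Every LMIU sees the same bus access; only the one whose VID is 3 accepts it.
accepted = [lmiu.write(3, 5, 99) for lmiu in lmius]
print(accepted)  # [False, False, True, False]
```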
CTU Addressing Modes
A CTU 408 shown in
Flexible PE Addressing Modes through Parameterizable Logical Loops
Many algorithms which are distributed across multiple PEs require complex data access patterns to achieve peak efficiency. The basis for our loop-based PE addressing modes is a logical view of data access consisting of a set of nested loops in which one component of the PE memory address is assigned to be updated at the end of each loop. As stated above, a PE memory address consists of three components called “address components”: a PE virtual ID (VID), a base value (Base) and an index value (Index). This model requires the following: a mechanism for assigning address components to logical loops; a mechanism for initializing address components; a mechanism for updating address components; and a mechanism for indicating a loop's exit condition.
Assignment of an address component to a loop specifies the order in which the three address components are updated. In an embodiment which uses a three-loop model, there are six possible orders for updating address components (i.e., six ways to order VID, Base and Index). In this embodiment, the index is defined to always be updated before the base, which reduces the number of possible orderings to three: because base and index are summed to form an offset into PE memory, loop assignments that update the base before the index would be redundant. An exemplary loop assignment is: update VID on inner loop; update index on middle loop; and update base on outer loop.
Thus, as PE addresses are generated, the VID component updates first (inner loop). When all VIDs have been used (VID loop exit condition has been reached), then the VID is reinitialized, the index is updated, and the VID loop is reentered. This looping continues until the number of index updates is exhausted (Index loop exit condition has been reached) at which point the index is reinitialized, the base is updated, the index loop is reentered, then the VID loop is reentered. This further looping continues until the transfer count is exhausted.
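The exemplary loop assignment (VID on the inner loop, index on the middle loop, base on the outer loop) can be written out as three nested loops. The following is a minimal sketch under stated assumptions: the function name and all parameter names are hypothetical, every component is updated by adding a constant, and each loop exits after a fixed number of updates.

```python
def pe_addresses(transfer_count, num_vids, index_count, index_step, base_step,
                 vid_init=0, index_init=0, base_init=0):
    """Yield (vid, offset) pairs, where offset = base + index, until
    transfer_count addresses have been generated."""
    if num_vids < 1 or index_count < 1:
        raise ValueError("each loop must run at least once")
    produced = 0
    base = base_init
    while True:                           # outer loop: base
        index = index_init                # index is reinitialized
        for _ in range(index_count):      # middle loop: index
            vid = vid_init                # VID is reinitialized
            for _ in range(num_vids):     # inner loop: VID
                if produced == transfer_count:
                    return                # transfer count exhausted
                yield vid, base + index
                produced += 1
                vid += 1                  # VID update (new = old + 1)
            index += index_step           # index update
        base += base_step                 # base update


# Two PEs, two index values per base, base advancing by 8 words.
for vid, offset in pe_addresses(10, num_vids=2, index_count=2,
                                index_step=1, base_step=8):
    print(vid, offset)
```

Moving the VID loop to the middle or outer position gives the other two permitted orderings; only the nesting of the three loops changes, not the update functions.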
Updating an address component is performed by selecting a new value for the component either based on the old value (e.g. new=old+1) or by some other means, such as by table lookup. A loop exit condition specifies what causes the loop to exit to the next-most outer loop in the model.
In summary, three different aspects of loop control are used to vary the sequence in which PE memories may be accessed. These are:
The following aspects of the loop formulation are noted. When the requested number of accesses are made (TC in
The functions used to update an address (see UpdateAddress( ) in
The function used to update the loop control variable, UpdateLoopControl( ), may be performed as part of the address update or as a separate operation as shown in
The function used to check for loop termination simply tests the loop termination variable for an end of loop condition. This condition may be a particular count value or the state of a mask register.
The initialization of address parameters (see Initialize( ) function:
The following discussion addresses instruction formats and describes PE addressing modes for one embodiment of the invention. It will be recognized that other instruction encodings may be used consistent with the teachings of the present invention. In the preferred embodiment, a transfer controller reads transfer instructions from a local memory and decodes them. Transfer instructions come in two types, those for the STU and those for the CTU. The STU transfer instructions specify the addressing mode and transfer count for accesses to the system data bus, while CTU transfer instructions specify the addressing mode and transfer count for accesses to the DMA bus and all SP and PE memories. The instruction formats addressed below are only those which control special PE memory addressing for the CTU. Instruction mnemonics are used to indicate the instruction type and addressing mode. “TCI” stands for “transfer, core-inbound”, while “TCO” stands for “transfer, core-outbound”. “TCx” stands for either TCI or TCO. The following PE addressing modes are described as illustrative of the present invention: PE Block-Cyclic, PE Select-Index, PE Select-PE, and PE Select-Index-PE.
PE Block-Cyclic Addressing
PE block-cyclic addressing provides the basic framework for all of the PE addressing modes. A Loop parameter specifies the assignment of address components to loops: BIP, BPI, or PBI.
PE Select-Index Addressing
The operation of the PE select-index address mode is similar to the PE block-cyclic address mode except that, rather than updating the index component of the address by adding a constant to it, the instruction specifies a table of index update values which are used sequentially to update the index.
An index select parameter allows finer-grained control over a sequence of index values to be accessed. In the example, this is done using a table of eight 4-bit index-update (IU) values. Each time the index loop is updated, an IU value is added to the effective address. These update values are accessed from the table sequentially starting from IU0 for IUCount updates. After IUCount updates, the index update loop is complete and the next outer loop (B or P) is activated. On the next entry of the index loop, IU values are accessed starting at the beginning of the table.
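The table-driven index update can be sketched as a running sum over the first IUCount table entries. This is an illustration only: the function name is hypothetical, and the sketch reports the index after each update without settling whether the value produced by the final update is itself accessed before the loop exits, which the text above does not specify.

```python
def index_after_updates(iu_table, iu_count, index_init=0):
    """Return the index value after each of the iu_count updates made during
    one pass through the index loop, starting from IU0."""
    if not 0 < iu_count <= len(iu_table) <= 8:
        raise ValueError("need 1..8 table entries and 1 <= iu_count <= len")
    if any(not 0 <= iu <= 15 for iu in iu_table):
        raise ValueError("each index-update value must fit in 4 bits")
    values = []
    index = index_init
    for iu in iu_table[:iu_count]:    # IU0, IU1, ... used sequentially
        index += iu
        values.append(index)
    return values


# Alternating steps of 1 and 3; only the first four table entries are used.
print(index_after_updates([1, 3, 1, 3, 0, 0, 0, 0], iu_count=4))  # [1, 4, 5, 8]
```

Calling the function again models the next entry of the index loop: the update values are read from the beginning of the table once more.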
PE Select-PE Addressing
The operation of the PE Select-PE address mode is similar to the PE block-cyclic address mode except that, rather than updating the PE VID component of the address by adding 1 to it, the instruction specifies a table of bit vectors, where each bit vector specifies the PEs to select for access. A bit set to “1” in a bit vector indicates, by its bit position, the VID of the PE to access. Bits in each bit vector are scanned from right to left (least to most significant when viewed in a first instruction format such as instruction format 1900 of FIG. 19). When there are no more “1” bits in a vector, the PE loop exits. The next iteration of the loop uses the next bit vector in the table.
The PE select fields together with the use of the PE translate table allow out of order access to PEs across multiple passes through them.
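The bit-vector scan can be sketched as follows. The function names are hypothetical, bit vectors are modeled as non-negative Python integers, and the optional `vid_to_pid` list stands in for the PE translate table; none of these names come from the instruction format.

```python
def vids_from_bit_vector(vector: int):
    """Yield the VIDs selected by one bit vector, scanning from the least
    significant bit to the most significant bit."""
    if vector < 0:
        raise ValueError("bit vector must be non-negative")
    vid = 0
    while vector:             # PE loop exits when no "1" bits remain
        if vector & 1:
            yield vid         # bit position is the VID
        vector >>= 1
        vid += 1


def select_pe_passes(bit_vectors, vid_to_pid=None):
    """Return, for each pass through the PE loop, the PEs accessed in order.
    If a translate table is given, VIDs are mapped to PIDs."""
    passes = []
    for vector in bit_vectors:            # one bit vector per PE-loop pass
        vids = list(vids_from_bit_vector(vector))
        if vid_to_pid is not None:
            vids = [vid_to_pid[v] for v in vids]
        passes.append(vids)
    return passes


# First pass selects VIDs 0 and 2, second pass selects VIDs 1 and 3.
print(select_pe_passes([0b0101, 0b1010]))                # [[0, 2], [1, 3]]
# With a reversing translate table the physical order is no longer ascending.
print(select_pe_passes([0b0101, 0b1010], [3, 2, 1, 0]))  # [[3, 1], [2, 0]]
```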
PE Select-Index-PE Addressing
This addressing mode combines both select-index and select-PE addressing. An exemplary instruction format 2100 is shown in FIG. 21. This form of addressing provides for complex-periodic data access patterns. An exemplary access pattern table 2200 for the PE-select-index-PE address mode is shown in FIG. 22.
Patent | Priority | Assignee | Title |
3593306, | |||
4538241, | Jul 14 1983 | Unisys Corporation | Address translation buffer |
4783736, | Jul 22 1985 | ALLIANT COMPUTER SYSTEMS CORPORATION, ACTON, MASSACHUSETTS, A CORP OF DE | Digital computer with multisection cache |
4794521, | Jul 22 1985 | ALLIANT COMPUTER SYSTEMS CORPORATION, ACTON, MA , A CORP OF MA | Digital computer with cache capable of concurrently handling multiple accesses from parallel processors |
5165023, | Dec 17 1986 | MASSACHUSETTS INSTITUTE OF TECHNOLOGY, A MA CORP | Parallel processing system with processor array and network communications system for transmitting messages of variable length |
5301287, | Mar 12 1990 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | User scheduled direct memory access using virtual addresses |
5418970, | Dec 17 1986 | Massachusetts Institute of Technology | Parallel processing system with processor array with processing elements addressing associated memories using host supplied address value and base register content |
5579493, | Dec 13 1993 | Hitachi, Ltd. | System with loop buffer and repeat control circuit having stack for storing control information |
5655151, | Jan 28 1994 | Apple Inc | DMA controller having a plurality of DMA channels each having multiple register sets storing different information controlling respective data transfer |
5659798, | Feb 02 1996 | TRUSTEES OF PRINCETON UNIVERSITY, THE | Method and system for initiating and loading DMA controller registers by using user-level programs |
5698913, | Jun 15 1995 | Kabushiki Kaisha Toshiba; Railway Technical Research Institute | Outer-rotor type electric rotary machine and electric motor vehicle using the machine |
5758182, | May 15 1995 | CORILLI CAPITAL LIMITED LIABILITY COMPANY | DMA controller translates virtual I/O device address received directly from application program command to physical i/o device address of I/O device on device bus |
5784706, | Dec 13 1993 | Hewlett Packard Enterprise Development LP | Virtual to logical to physical address translation for distributed memory massively parallel processing systems |
5802554, | Feb 28 1995 | Panasonic Corporation of North America | Method and system for reducing memory access latency by providing fine grain direct access to flash memory concurrent with a block transfer therefrom |
5802604, | Jun 06 1988 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | Method for addressing page tables in virtual memory |
5828856, | Jan 28 1994 | Apple Inc | Dual bus concurrent multi-channel direct memory access controller and method |
5828903, | Sep 30 1994 | Intel Corporation | System for performing DMA transfer with a pipeline control switching such that the first storage area contains location of a buffer for subsequent transfer |
5860025, | Jul 09 1996 | GLOBALFOUNDRIES Inc | Precharging an output peripheral for a direct memory access operation |
5864876, | Jan 06 1997 | Creative Technology, Ltd | DMA device with local page table |
5890201, | Feb 19 1993 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | Content addressable memory having memory cells storing don't care states for address translation |
5958048, | Oct 18 1996 | Elbrus International Limited | Architectural support for software pipelining of nested loops |
6047307, | Dec 13 1994 | Microsoft Technology Licensing, LLC | Providing application programs with unmediated access to a contested hardware resource |
6058437, | Aug 04 1997 | UNILOC 2017 LLC | D.M.A. device that handles cache misses by managing an address of an area allotted via a daemon processor |
6081854, | Mar 26 1998 | Nvidia Corporation | System for providing fast transfers to input/output device by assuring commands from only one application program reside in FIFO |
6145076, | Mar 14 1997 | WSOU Investments, LLC | System for executing nested software loops with tracking of loop nesting level |
6256683, | Dec 23 1998 | Altera Corporation | Methods and apparatus for providing direct memory access control |
6260082, | Dec 23 1998 | Altera Corporation | Methods and apparatus for providing data transfer control |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 22 2006 | Altera Corporation | (assignment on the face of the patent) | / |