In a parallel processor, a local area and an overlap area are assigned to the memory of each processing element (PE), and each PE makes calculations to update the data in both areas at runtime. Because the data in the overlap area is updated by processing that is closed within each PE, the data transfer between adjacent PEs can be reduced and the parallel processes can be performed at a high speed.
1. A program converting device for use in an information processing device for converting an input program into a program to be executed in a parallel processor comprising a plurality of processing elements and a communications network, comprising:
a detecting unit to detect in the input program a loop portion in which optimization can be realized using an overlap area; and a setting unit to convert the input program by assigning an overlap area to a processing element processing the loop portion and generating a code based on which data in the overlap area is updated through calculations in multiple iterations using only the data in the processing element, and to output a converted program.
6. A program converting device for use in an information processing device for converting an input program into a program to be executed in a parallel processor comprising a plurality of processing elements and a communications network, comprising:
a detecting unit to detect in the input program a loop portion in which optimization can be realized using an overlap area; and a setting unit to convert the input program by assigning an overlap area to a processing element processing the loop portion and generating a code based on which data is updated in the overlap area through single direction data transfer and data in the processing element is shifted in a direction of the data transfer, and to output a converted program.
11. A program converting device for use in an information processing device for converting an input program into a program to be executed in a parallel processor comprising a plurality of processing elements and a communications network, comprising:
a detecting unit detecting in the input program a loop portion in which optimization can be realized using an overlap area; and a setting unit converting the input program by assigning an overlap area to a processing element processing the loop portion and generating a code based on which data in the overlap area is updated through calculations in multiple iterations using only the data in the processing element and a code based on which the data in the overlap area is updated through single direction data transfer, and outputting a converted program.
2. The program converting device according to
3. The program converting device according to
4. The program converting device according to
further comprising: a size determining unit determining an optimum size of the overlap area after estimating process time for said loop portion, and wherein said setting unit assigns the overlap area of a size determined by said size determining unit to the processing element processing said loop portion.
5. The program converting device according to
7. The program converting device according to
further comprising: a communications change unit to change data dependency in processing said loop portion from bi-directional transfer to single direction transfer and to generate a subscript of an array in which the single direction data transfer can be performed, and wherein said setting unit generates a code for use in updating the data using the subscript in said array generated by said communications change unit.
8. The program converting device according to
9. The program converting device according to
10. The program converting device according to
1. Field of the Invention
The present invention relates to a method of executing a program at high speed on a distributed-memory parallel processor, and more specifically to a data updating method using an overlap area and a program converting device for converting a data update program.
2. Description of the Related Art
Recently, parallel processors have attracted attention as a means of realizing a high-speed processor such as a supercomputer in the form of a plurality of processing elements (hereinafter referred to as PEs or processors) connected through a network. In realizing high processing performance using such a parallel processor, it is important to reduce the overheads of the data communications to the lowest possible level. One effective method of reducing the overheads of the data communications between processors is to use an overlap area for specific applications.
The time required for data communications depends on the number of packet communications rather than the total volume of data. Therefore, aggregating the communications and vectorizing the messages (S. Hiranandani, K. Kennedy, and C. Tseng, "Compiler Optimizations for Fortran D on MIMD Distributed-Memory Machines," in Proc. Supercomputing '91, pp. 86-100, Nov. 1991) are important in reducing the communications overheads. An overlap area is a special type of buffer area for receiving vector data, and is assigned such that it encompasses a local data area (local area) to be used in computing data internally. The data values in the overlap area are determined by the adjacent processor.
The processor p(2, 2) has a considerably large area of a(64:129, 64:129) including an overlap area so that, when a(i, j) is calculated, the adjacent a(i, j+1), a(i, j-1), a(i+1, j), and a(i-1, j) can be locally accessed.
Without an overlap area, data must be read from adjacent processors inside the DO loop, and small volumes of data are communicated frequently, resulting in large communications overheads. However, having an overlap area allows the latest data to be copied to the overlap area by collectively transferring data before an updating process. Therefore, data can be locally updated and the communications overheads can be considerably reduced.
The overlap area can be explicitly specified in VPP Fortran ("Realization and Evaluation of VPP Fortran Process System for AP1000" Vol. 93-HPC-48-2, pp. 9-16. Aug. 1993 published at SWOPP Tomonoura '93 HPC Conference by Tatsuya Sindoh, Hidetoshi Iwashita, Doi, and Jun-ichi Ogiwara). Some compilers also generate an overlap area automatically as a form of optimization.
The data transmission patterns for performing parallel processes can be classified into two types. One is a single direction data transfer (SDDT), and the other is a bi-directional data transfer (BDDT).
In
The word "INIT" indicates an initial state and "Update" indicates the data communications between adjacent PEs to update the overlap area. Iter 1, 2, 3, and 4 indicate parallel processes for the update of data at each iteration of the DO loop. In
However, the data update method using the conventional overlap area has the following problems.
Each processor forming part of the parallel processor must update the data in the overlap area to the latest values before making a calculation using those values. The update is performed by reading the latest values from the adjacent processor through communications between processors. In parallel processors, the startup (rise) time of each communication carries heavy overheads. Therefore, the time required for the communications process depends on the number of data transfers rather than the amount of transferred data. If an overlap area is updated each time a calculation is made using the overlap area, the startup overhead is incurred for every such communication.
In a parallel processor connected through a torus network such as an AP1000 ("An Architecture of Highly Parallel Computer AP1000," by H. Ishihata, T. Horie, T. Shimizu, and S. Kato, in Proc. IEEE Pacific Rim Conf. on Communications, Computers, and Signal Processing, pp. 13-16, May 1991), the SDDT is superior to the BDDT because the SDDT reduces the number of data transfers and the overheads required for synchronization between adjacent processors more than the BDDT does. However, in the conventional data update process as shown in
3. Summary of the Invention
The present invention aims at updating data with reduced overheads for the communications between PEs in distributed-memory parallel processors, and at providing a program converting device for generating a data updating program.
The program converting device according to the present invention is provided in an information processing device, and converts an input program into a program for a parallel processor. The program converting device is provided with a detecting unit, a setting unit, a size determining unit, and a communications change unit.
The detecting unit detects, in the input program, a portion including the description of a loop where optimization can be realized using an overlap area. The setting unit assigns an overlap area to the memory of the PE that processes the program at the description of the loop, generates a program code for calculating the data in the area, and then adds it to the initial program. Thus, at the runtime of the converted program on the parallel processor, each PE updates the data in the local area managed by the PE, and also updates the data in the overlap area, which is managed by other PEs. An overlap area updated by a calculation closed within each PE requires no data transfer for its update, thereby improving the efficiency of the parallel process.
The size determining unit estimates the runtime for the description of the loop and determines the optimum size of the overlap area. Normally, the larger the overlap area is, the fewer times the data is transferred, but the longer it takes to update the data in the area. If the size of the overlap area is set such that the runtime is the shortest possible, the data update process can be performed efficiently.
The communications change unit checks the data dependency at the detected portion of the description of the loop. If the data is dependent bi-directionally, the description is rewritten such that the data is dependent in a single direction, and array subscripts are generated in the arrangement best suited for data transfer. Thus, each PE only has to communicate with the adjacent PE corresponding to either the upper limit or the lower limit of the subscripts in the array, thereby reducing the overheads of the communications.
Conventionally, the overlap area has been updated using data transferred externally. According to the present invention, it is instead updated by a calculation process within each PE, thereby reducing the overheads for the communications and performing the parallel process at a high speed.
The embodiments of the present invention are described in detail by referring to the attached drawings.
The detecting unit 1 detects, from the input program, a loop that can possibly be optimized using an overlap area.
The setting unit 2 assigns an overlap area to a PE for processing the loop, generates a code for calculating the data in the overlap area, and outputs a converted program.
The size determining unit 3 estimates the runtime for processing the loop and determines the optimum size of the overlap area. The setting unit 2 assigns the overlap area having the size determined by the size determining unit 3 to the PE for processing the loop.
The communications change unit 4 changes the data dependency of the process of the loop from the bi-directional dependency to the uni-directional dependency to generate the subscripts of the array through which data is transferred uni-directionally (SDDT). The setting unit 2 generates a code for updating data through the uni-directional data transfer and adds the generated code to the converted program.
The detecting unit 1, setting unit 2, size determining unit 3, and communications change unit 4 shown in
The detecting unit 1 scans an input program and detects a loop to which an overlap area can be applied. For example, a portion in which a calculation that can be processed in parallel by a plurality of PEs is enclosed in a serial DO loop is detected, as shown in
The setting unit 2 assigns an overlap area to each PE sharing the process for the detected loop, and generates a code for calculating and updating the data in the overlap area. Thus, at the runtime of a converted program, each PE locally calculates the data in the overlap area as well as the data in the local area. Therefore, the number of communications in which data in the overlap area is updated can be reduced, thereby also reducing the communications overheads.
The size determining unit 3 estimates the runtime for processing the loop, and determines, for example, the size of the overlap area that minimizes the runtime. If the overlap area of the optimum size is assigned to each PE, the throughput of the parallel processor can be considerably improved.
The setting unit 2 generates a code for updating data through the SDDT, not through the BDDT. As a result, each PE does not have to communicate with both of the adjacent PEs at the upper limit and the lower limit of the subscript in the array. Therefore, no synchronizing process is required with the PE that is not communicated with, which saves the overheads for the synchronization.
The communications change unit 4 checks the data-dependent vectors of the array data used in the above described loop, converts the data dependency from bi-directional transfer to uni-directional transfer, and generates a subscript in the array to perform the SDDT. The setting unit 2 generates a code for updating data using the subscripts in the converted array.
Thus, a program for use in a parallel processor is generated by the program converting device shown in FIG. 7. As a result, the overheads for the communications can be reduced.
Two methods are used for the embodiments of the present invention. One is to use an extended overlap area, and the other is to update data through the SDDT of the overlap area.
An extended overlap area is described by referring to
An extended overlap area is a data storage area whose size is a multiple of that of a common overlap area. Using an extended overlap area reduces the total number of communications performed to update the overlap area when a process that effectively uses the overlap area is repeated a plurality of times within a serial loop that encloses loops performed in parallel in a program. In the program according to the Jacobi relaxation shown in
For example, PE1 updates the elements in the range of a(9:16) of the array a, holds the elements in the local area, and contains the storage area as an extended overlap area for the elements in the range of a(6:8) and a(17:19) (INIT). PE1 first establishes communications with PE0 and PE2 and updates the data in the extended overlap area to the latest data (Update).
Then, the first calculation is made using the data in the range of a(6:19) to update the data a(9:16) in the local area and the data a(7:8) and a(17:18) in the extended overlap area (Iter 1). Then, the second calculation is made using the data in the range of a(7:18) to update the data a(9:16) in the local area and the data a(8) and a(17) in the extended overlap area (Iter 2). Then, the third calculation is made using the data in the range of a(8:17) to update only the data a(9:16) (Iter 3).
Since all data in the extended overlap area are dirty, that is, unavailable, PE1 establishes communications again with the adjacent PE to update the extended overlap area and similarly repeats the data updating process of and after Iter 4.
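The Update, Iter 1, Iter 2, Iter 3 cycle described above can be sketched as a small single-process simulation. This is a minimal sketch under stated assumptions, not the code generated by the program converting device: it assumes a one-dimensional three-point Jacobi stencil with fixed end values, an array split evenly over the PEs, and 0-based indexing. The function names `jacobi_serial` and `jacobi_extended_overlap` are hypothetical.

```python
def jacobi_serial(a, iters):
    """Reference: three-point Jacobi relaxation with fixed end values."""
    a = list(a)
    for _ in range(iters):
        new = list(a)
        for i in range(1, len(a) - 1):
            new[i] = 0.5 * (a[i - 1] + a[i + 1])
        a = new
    return a


def jacobi_extended_overlap(a, num_pes, e, iters):
    """Simulate PEs that own a block plus an overlap of width e on each side.

    Returns the final global array and the number of update
    (communication) phases that were needed.
    """
    n = len(a)
    block = n // num_pes
    owned = [(p * block, (p + 1) * block) for p in range(num_pes)]  # [lo, hi)
    local = [list(a[lo:hi]) for lo, hi in owned]
    phases = 0
    done = 0
    while done < iters:
        # Update: every PE copies the latest neighbour data into its
        # extended overlap area (simulated through a global snapshot).
        snapshot = [x for part in local for x in part]
        phases += 1
        steps = min(e, iters - done)
        bufs = []
        for lo, hi in owned:
            blo, bhi = max(0, lo - e), min(n, hi + e)
            bufs.append((blo, snapshot[blo:bhi]))
        # Iter 1..steps: purely local calculation, no communication.
        for p, (lo, hi) in enumerate(owned):
            blo, buf = bufs[p]
            for k in range(1, steps + 1):
                # The still-valid region shrinks by one element per side.
                ulo = max(1, lo - (e - k))
                uhi = min(n - 1, hi + (e - k))
                new = list(buf)
                for i in range(ulo, uhi):
                    new[i - blo] = 0.5 * (buf[i - 1 - blo] + buf[i + 1 - blo])
                buf = new
            local[p] = buf[lo - blo:hi - blo]
        done += steps
    return [x for part in local for x in part], phases
```

With 32 elements on 4 PEs, calling `jacobi_extended_overlap(a, 4, 3, 6)` needs two update phases for six iterations, whereas an overlap width of 1 (the conventional overlap area) needs six, and both return the same values as `jacobi_serial(a, 6)`.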
In
When the conventional overlap area is compared with the extended overlap area of the present invention, the total quantity of data transferred for the update of the overlap area remains the same. Since the communications overhead time depends more heavily on the number of transfers than on the total quantity of the transferred data, the communications overheads can be reduced more efficiently by using the extended overlap area.
Since each PE must calculate the part of the extended overlap area referred to by the subsequent iteration, it repeats calculations that the adjacent PE also performs to update its own data. Therefore, the larger the extended overlap area is, the more heavily the parallelism of the processes is impaired. As an extreme example, if an extended overlap area is set such that a single PE stores the data for all PEs, no communications are required, but no parallelism is obtained either. To determine the optimum size of the extended overlap area, the communications overheads must be traded off against the process parallelism.
If a process starts according to
Then, it is checked whether or not any other DO loops exist (step S1-6). If yes, the processes in and after step S1-1 are repeated. If not, the process terminates. If the determination results are "No" in steps S1-2, S1-3, and S1-4, control is passed to the process in step S1-6.
As a result of the extended overlap area applicable portion detecting process shown in
In
Assuming that the calculation time for a unit area (1 element) in a local area or an extended overlap area is a, the overhead time (prologue and epilogue time) taken for the activation and termination of one data transfer process is c, and that the time taken for a data transfer per unit area is d, then the data transfer time is calculated by the equation c+d×size of transferred area.
In the optimization using an extended overlap area, communications are established once for every e data updates (e iterations). The data to be communicated is the extended overlap area shown as a shadowed portion in FIG. 13. The area (number of elements) of this portion is 4ew(ew+l). Communications are established 8 times, with the eight PEs processing the array elements in the upper, lower, right, and left directions and in the four diagonal directions. Therefore, the total communications time required for the e iterations is calculated by the following equation:
The communications time for an iteration is obtained by dividing equation (1) by e as follows.
Then, the time taken for the calculations for the update of data is estimated. Since the number of calculation elements in the local area is l², the calculation time required for e iterations of the calculation for the local area is ael². If the size (width) of the updated area in the extended overlap area is kw in each iteration, the number of calculation elements in the extended overlap area is 4kw(kw+l), where k is a parameter representing how many times wider than the conventional overlap area the updated portion of the extended overlap area is in each iteration. If data is locally updated without communications, the extended overlap area becomes dirty sequentially from the outside to the inside, so the width of the updated portion in the extended overlap area decreases by w in each iteration. Accordingly, the calculation time taken for calculating the extended overlap area during the e iterations is obtained by the following equation:
Since the calculation time taken for calculating the extended overlap area at the (e-k)-th iteration is 4akw(kw+l), the sum over k from 1 to e-1 is calculated by equation (3). The calculation time per iteration is obtained by adding the calculation time for the local area for e iterations to the calculation time for the extended overlap area and then dividing the sum by e as follows.
According to equations (2) and (4), the runtime Titer(e) for one iteration of the serial loop using the extended overlap area is represented as a function of e as follows.
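The expressions referred to as equations (1) through (5) can be reconstructed from the quantities defined above. The following is a reconstruction under the stated definitions (a: calculation time per element, c: startup overhead per transfer, d: transfer time per element, w: width of the conventional overlap area, e: extension parameter), assuming that l denotes the side length of the square local area. It is not a copy of the original typeset equations, and the symbols T_comm and T_ovl are introduced here only for illustration.

```latex
\begin{align}
T_{\mathrm{comm}}(e) &= 8c + 4dew(ew + l) \tag{1}\\
\frac{T_{\mathrm{comm}}(e)}{e} &= \frac{8c}{e} + 4dw(ew + l) \tag{2}\\
T_{\mathrm{ovl}}(e) &= \sum_{k=1}^{e-1} 4akw(kw + l) \tag{3}\\
\frac{ael^{2} + T_{\mathrm{ovl}}(e)}{e} &= al^{2} + \frac{1}{e}\sum_{k=1}^{e-1} 4akw(kw + l) \tag{4}\\
T_{\mathrm{iter}}(e) &= \frac{8c}{e} + 4dw(ew + l) + al^{2}
  + \frac{1}{e}\sum_{k=1}^{e-1} 4akw(kw + l) \tag{5}
\end{align}
```

Using the standard sums of k and k² from 1 to e-1, the last term of (5) has the closed form (2/3)aw²(e-1)(2e-1) + 2awl(e-1), so the per-iteration runtime contains terms in e², e, 1/e, and a constant.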
If the calculation time is estimated, the host computer 11 determines the optimum size of the extended overlap area according to the estimate result (step S3). The optimum size of the extended overlap area refers to the size for the shortest possible runtime.
For example, assuming that, in equation (5), the coefficient of the term in e² is s, the coefficient of the term in e is t, the coefficient of the term in 1/e is u, and the term of order 0 in e (the constant term) is v, then equation (5) is rewritten as follows.
The obtained e0 is the value of the extension parameter that optimizes the size of the extended overlap area. The size of the extended overlap area is given by e0 × w.
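As a hedged illustration of the estimate and of the choice of the extension parameter, the sketch below evaluates the per-iteration runtime from the quantities defined above and searches for the integer e that minimizes it. The cost expression is reconstructed from the surrounding text (8 transfers with startup cost c, 4ew(ew+l) transferred elements, l² local elements, and 4akw(kw+l) overlap calculations for k from 1 to e-1), so it is an assumption rather than the patent's own typeset formula. The coefficients s, t, u, v are obtained by expanding that expression, and the function names and sample parameter values are hypothetical.

```python
def t_iter(e, a, c, d, w, l):
    """Estimated runtime per serial-loop iteration for extension parameter e."""
    comm = 8 * c / e + 4 * d * w * (e * w + l)          # communications
    overlap = sum(4 * a * k * w * (k * w + l) for k in range(1, e))
    return comm + a * l * l + overlap / e               # plus calculations


def coefficients(a, c, d, w, l):
    """Coefficients of s*e**2 + t*e + u/e + v from expanding t_iter."""
    s = 4 * a * w * w / 3
    t = 4 * d * w * w - 2 * a * w * w + 2 * a * w * l
    u = 8 * c
    v = 4 * d * w * l + a * l * l + 2 * a * w * w / 3 - 2 * a * w * l
    return s, t, u, v


def best_extension(a, c, d, w, l, e_max):
    """Integer extension parameter in 1..e_max with the smallest estimate."""
    return min(range(1, e_max + 1), key=lambda e: t_iter(e, a, c, d, w, l))


if __name__ == "__main__":
    # Hypothetical machine parameters: startup cost dominates per-element cost.
    params = dict(a=1.0, c=500.0, d=2.0, w=1, l=64)
    e0 = best_extension(e_max=20, **params)
    print("extension parameter:", e0, "overlap width:", e0 * params["w"])
```

For real-valued e, the minimum of s e² + t e + u/e + v satisfies 2se + t - u/e² = 0; the integer search above simply avoids solving that cubic and picks the best whole number of iterations between updates.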
Once the optimum size of the extended overlap area is determined, the host computer 11 assigns the extended overlap area of that size to each PE (step S4).
Thus, the extended overlap area of each PE is assigned data of the local area of another PE, in an amount equal to the size of the extended overlap area.
When the process in step S4 is completed, the host computer 11 inserts a program code for use in calculating an extended overlap area (step S5). Thus, a code is generated such that the range of the process of each PE in a parallel loop can be extended by the size of the extended overlap area. The generated code is put into the program.
For example, the range of the indices in an array managed by the PE1 shown in
Then, a program code is inserted to update the extended overlap area (step S6). In this process, a code is generated such that communications are established to update the data in the extended overlap area each time the serial loop encompassing a parallel loop has been repeated the number of times indicated by the extension parameter. Then, the code is put into the program of each PE.
For example, communications are established once every three iterations of the serial loop in the example shown in FIG. 10. In the example shown in
The update by the SDDT of the overlap area is described below by referring to
In the system using the SDDT, the storage position of the data is shifted in one direction, in a torus form, at each iteration. To convert the conventional system in which an overlap area is updated by the BDDT into the system using the SDDT, for one direction of the two-directional communications, the data required to obtain a new value is sent to the adjacent PE instead of the data required to calculate the new value being received from the adjacent PE. As a result, the communications are established uni-directionally and the overlap area needs to be provided on only one side of the local area.
For example, at the initial state, PE1 holds the elements in the range of a(9:16) in the array a, and has the storage area for the elements in the range of a(7:8) as an overlap area (INIT). Then, the PE1 receives the data from the PE0, updates the data in the overlap area, and transmits the data in the range of a(15:16) to the PE2 (Update).
Then, the PE1 makes the first calculation using the data in the range of a(7:16), updates the data in a(8:15) (Iter 1), and stores the data after shifting the storage position in the communications direction by 1. At this time, the data in a(16) initially stored by the PE1 is updated in parallel by the PE2. Since the data in the overlap area have all become dirty, the PE1 establishes uni-directional communications with the adjacent PE, updates the overlap area, and performs the data update process for Iter 2.
Since repeating these processes sequentially shifts the storage positions of all data over the PE0 through PE3 in a torus form, after data update process for Iter 4, the data in a (27:28) of the PE3 is transferred to the overlap area of the PE0.
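The shifting behaviour described above can be sketched as a single-process simulation. This is a minimal sketch under stated assumptions: a one-dimensional three-point Jacobi stencil, periodic (torus) boundary values, which is a simplification chosen here so that every PE behaves identically, and 0-based indexing. The function names `jacobi_periodic` and `jacobi_sddt` are hypothetical.

```python
def jacobi_periodic(a, iters):
    """Reference: three-point Jacobi relaxation on a ring of n elements."""
    n = len(a)
    a = list(a)
    for _ in range(iters):
        a = [0.5 * (a[(i - 1) % n] + a[(i + 1) % n]) for i in range(n)]
    return a


def jacobi_sddt(a, num_pes, iters):
    """Simulate SDDT updates with a two-element overlap on one side only.

    Returns the restored global array and the number of messages sent.
    """
    n = len(a)
    b = n // num_pes
    start = [p * b for p in range(num_pes)]       # global index of local[p][0]
    local = [list(a[p * b:(p + 1) * b]) for p in range(num_pes)]
    messages = 0
    for _ in range(iters):
        # Update: each PE receives two elements from its lower neighbour
        # only; nothing is received from the upper neighbour.
        overlap = [local[(p - 1) % num_pes][-2:] for p in range(num_pes)]
        messages += num_pes
        for p in range(num_pes):
            buf = overlap[p] + local[p]           # covers start-2 .. start+b-1
            # New values for start-1 .. start+b-2: the block shifts by one.
            local[p] = [0.5 * (buf[i - 1] + buf[i + 1]) for i in range(1, b + 1)]
            start[p] = (start[p] - 1) % n
    # Restore the data layout: put every element back at its global index.
    out = [0.0] * n
    for p in range(num_pes):
        for k, x in enumerate(local[p]):
            out[(start[p] + k) % n] = x
    return out, messages
```

Each PE exchanges one message per iteration instead of the two needed with bi-directional transfer, and the final loop plays the role of returning the shifted elements to their original positions.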
The result of the data updated by the SDDT shown in
When the process starts as shown in
Then, a computational transformation is made (step S12). In this process, the position of the data calculated according to the count of the outer serial loop is shifted and the SDDT is used in updating the overlap areas.
In the computational transformation, a loop nest for determining a computational space is converted such that all data-dependent vectors can be positive in the direction along the axis of the processor array 13. For the loop where the data-dependency is represented by a distance vector, the computational space conversion can be performed as an application of unimodular transformation (M. E. Wolf and M. S. Lam. "A loop transformation theory and an algorithm to maximize parallelism," in IEEE Transactions on Parallel and Distributed Systems, pp. 452-471, Oct. 1991). In this case, the transform matrix T can be represented as follows with the dimension of the array set to m, and with the parameter of the skew in each dimension set to a1, a2, . . . , am.
The skew vector S containing the parameters a1, a2, . . . , am of equation (8) can be defined as follows.
In the program shown in
Next, a conversion matrix T in which the time axis is removed to make the program loop nest fully permutable is obtained. Since the array is one-dimensional, the conversion matrix T in equation (8) forms a 2×2 matrix (2 rows by 2 columns) and the following equation holds.
With T set as shown above, the distance vectors (1, -1) and (1, 1) are converted as follows.
Equation (12) indicates that the distance vector (1, -1) is converted by T into the distance vector (1, a1-1). Equation (13) indicates that the distance vector (1, 1) is converted by T into the distance vector (1, a1+1).
To make the loop nest fully permutable, both components a1-1 and a1+1 of the converted distance vector should be equal to or larger than 0. This condition is represented by the following equation.
The minimum value of a1 satisfying the conditions to make the loop nest permutable is 1. The "T" in equation (11) is represented as follows.
The converted distance vectors obtained from equations (12) and (13) are (1, 0) and (1, 2) respectively, and the skew vector S (a scalar in this case) is S=a1=1 by equation (9).
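The effect of the skew on the two distance vectors can be checked with a few lines of code. This is a minimal sketch assuming that the 2×2 transformation has the lower-triangular form T = [[1, 0], [a1, 1]], which is consistent with the converted vectors (1, a1-1) and (1, a1+1) stated above; the function names `skew` and `minimal_skew` are hypothetical.

```python
def skew(a1, d):
    """Apply T = [[1, 0], [a1, 1]] to a distance vector d = (dt, dj)."""
    dt, dj = d
    return (dt, a1 * dt + dj)


def minimal_skew(distance_vectors):
    """Smallest non-negative integer a1 making every skewed component >= 0."""
    if any(dt < 1 for dt, _ in distance_vectors):
        # Without a positive time component the search below would not end.
        raise ValueError("each distance vector must be carried by the outer loop")
    a1 = 0
    while any(skew(a1, d)[1] < 0 for d in distance_vectors):
        a1 += 1
    return a1


if __name__ == "__main__":
    vectors = [(1, -1), (1, 1)]
    a1 = minimal_skew(vectors)
    print(a1, [skew(a1, d) for d in vectors])
```

For the two distance vectors of the example, the search stops at the first a1 for which a1-1 is non-negative, matching the condition that both converted components be equal to or larger than 0.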
If the "T" in equation (15) is applied to (i, j) of the program shown in
When the computational space transform is completed, the host computer 11 performs an index transformation (step S13). In this process, the data layout is shifted according to the changes in data dependency.
With the changes in data dependency, the data layout in the memory space of each PE should be aligned to the mapping used for the calculation process. However, with the static data layout declared in a data parallel language such as HPF, the data cannot be aligned to a mapping in which the calculation position is shifted, which prevents the use of the SDDT. As a result, the relationship between the virtual array and the actual array should be changed with time so that the data alignment to each PE can be shifted for each iteration of the serial loop.
Assuming that the subscript vector of the m-dimensional virtual array is Iν=(Iν1, Iν2, . . . , Iνm) and the subscript vector of the corresponding actual array is Ip=(Ip1, Ip2, . . . , Ipm), the index transform from Iν to Ip is represented as follows.
However, the time step t is used for the subscript in the virtual array before update while the time step t+1 is used for the subscript in the virtual array after update. After performing such indexing processes, the storage positions of all elements in an actual array can be shifted in each time step. However, since the elements at the upper and lower limits of the virtual array are not calculated and not updated unless new values are assigned, the storage positions should be shifted with the values of the elements stored. A code is inserted to ensure such consistency before applying the index conversion process.
where the conversion by equation (18) replaces all j's in the program with J after converting j appearing in the equations shown in
According to the latest program shown in
After the index conversion process, the host computer 11 inserts a code for use in restoring the data layout (step S14), and terminates the process. Restoring the data layout refers to the process of returning the storage position of each element shifted in the data update process using the SDDT to the initial position specified by the programmer.
For example, in the data update process shown in
In
For example, at the initial state, PE1 holds the elements in the range of a (9:16) in the array a in the local area, and has the storage area for the elements in the range of a (5:8) as an overlap area (INIT). Then, the PE1 receives the data from the PE0, updates the data in the overlap area, and transmits the data in the range of a (13:16) to the PE2 (Update).
Then, the PE1 makes the first calculation using the data in the range of a(5:16), updates the data in a(6:15) (Iter 1), and stores the data after shifting the storage position in the communications direction by 1. At this time, the data in a(16) initially stored by the PE1 is updated in parallel by the PE2. Then, the PE1 makes the second calculation using the data in the range of a(6:15), updates the data in a(7:14) (Iter 2), and stores the data after shifting the storage position in the communications direction by 1. At this time, the data in a(15) initially stored by the PE1 is updated in parallel by the PE2. Since the data in the extended overlap area have all become dirty, the PE1 establishes uni-directional communications with the adjacent PE, updates the extended overlap area, and performs the data update process for Iter 3.
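The combination of the extended overlap area with the SDDT can be sketched in the same simulated style. This is a minimal sketch, assuming a one-dimensional three-point Jacobi stencil with periodic (torus) boundary values and 0-based indexing; the one-sided overlap has width 2e, as in the a(5:8) example for e = 2, and the function names are hypothetical.

```python
def jacobi_periodic(a, iters):
    """Reference: three-point Jacobi relaxation on a ring of n elements."""
    n = len(a)
    a = list(a)
    for _ in range(iters):
        a = [0.5 * (a[(i - 1) % n] + a[(i + 1) % n]) for i in range(n)]
    return a


def jacobi_sddt_extended(a, num_pes, e, iters):
    """SDDT with a one-sided extended overlap: one update per e iterations.

    Requires 2 * e <= block size. Returns the restored global array and
    the number of update (communication) phases.
    """
    n = len(a)
    b = n // num_pes
    if 2 * e > b:
        raise ValueError("overlap wider than the neighbour's local area")
    start = [p * b for p in range(num_pes)]
    local = [list(a[p * b:(p + 1) * b]) for p in range(num_pes)]
    updates = 0
    done = 0
    while done < iters:
        steps = min(e, iters - done)
        width = 2 * steps
        # Update: one uni-directional transfer of `width` elements per PE.
        overlap = [local[(p - 1) % num_pes][-width:] for p in range(num_pes)]
        updates += 1
        for p in range(num_pes):
            buf = overlap[p] + local[p]
            for _ in range(steps):
                # Each iteration consumes one element at both ends of buf.
                buf = [0.5 * (buf[i - 1] + buf[i + 1])
                       for i in range(1, len(buf) - 1)]
            local[p] = buf                        # b elements, shifted by steps
            start[p] = (start[p] - steps) % n
        done += steps
    out = [0.0] * n
    for p in range(num_pes):
        for k, x in enumerate(local[p]):
            out[(start[p] + k) % n] = x
    return out, updates
```

With e = 2 the simulated PEs communicate once for every two iterations, and only with the neighbour on one side, while the restored result equals the reference relaxation.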
According to the data update shown in
According to the present invention, the overheads associated with communications and synchronization can be reduced when a parallel process is performed using an overlap area in a distributed-memory parallel processor, thereby realizing a high speed parallel process.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Dec 21 1999 | Fujitsu Limited | (assignment on the face of the patent) | / |