A method and apparatus for dynamically rerouting node processes on the compute nodes of a massively parallel computer system using hint bits to route around failed nodes or congested networks without restarting applications executing on the system. When a node has a failure or there are indications that it may fail, the application software on the system is suspended while the data on the failed node is moved to a backup node. The torus network traffic is routed around the failed node and traffic for the failed node is rerouted to the backup node. The application can then resume operation without restarting from the beginning.
1. A parallel computer system comprising:
a plurality of nodes connected by a three dimensional torus network with network connections to six adjacent nodes in X+, X−, Y+, Y−, Z+ and Z− directions that correspond to directions of a three dimensional cartesian array;
a node/network monitoring mechanism that monitors the nodes and network connections of the parallel computer system and creates a problem list of nodes and network connections; and
torus network hardware on a node that dynamically routes a data packet over the three dimensional torus network based on hint bits, wherein the hint bits are a plurality of binary values in the data packet with a binary value for each network direction from the node, namely the X+, X−, Y+, Y−, Z+ and Z− directions;
wherein a set hint bit indicates to the torus network hardware a direction to direct the data packet on the three dimensional torus network to avoid the nodes and network connections in the problem list.
8. A computer-readable program product for execution on a parallel computer system with a plurality of nodes connected by a three dimensional torus network with network connections to six adjacent nodes in X+, X−, Y+, Y−, Z+ and Z− directions that correspond to directions of a three dimensional cartesian array comprising:
a node/network monitoring mechanism that monitors the nodes and network connections of the parallel computer system and creates a problem list of nodes and network connections; and
a node routing mechanism that dynamically routes a data packet over the three dimensional torus network using hint bits wherein the hint bits are a plurality of values in a data packet with a binary value for each network direction from the node, namely the X+, X−, Y+, Y−, Z+ and Z− directions that when set indicates to the node routing mechanism a direction to direct the data packet over the three dimensional torus network to avoid the nodes and network connections in the problem list; and
computer storage media having computer program instructions operable for causing a computer to execute the node/network monitoring mechanism and the node routing mechanism.
4. A computer implemented method for dynamically rerouting node processes on compute nodes connected by a three dimensional torus network in a parallel computer system using hint bits without restarting applications executing on the system, where the method comprises the steps of:
(A) monitoring the nodes and the three dimensional torus network for problems and identifying problem nodes and problem network connections in a problem list, wherein the three dimensional torus network has network connections to six adjacent nodes in X+, X−, Y+, Y−, Z+ and Z− directions that correspond to directions of a three dimensional cartesian array;
(B) detecting when the problem list is updated;
(C) pausing execution of nodes executing an application that is using the torus network;
(D) setting at least one of the hint bits to isolate a node or network connection in the problem list, wherein the hint bits are a plurality of binary values in a data packet with a value for each network direction from the node, namely the X+, X−, Y+, Y−, Z+ and Z− directions, and the hint bits indicate a direction to direct the data packet over the network connections to avoid the nodes and network connections in the problem list; and
(E) notifying all nodes paused in step (C) to resume execution.
2. The parallel computer system of
3. The parallel computer system of
5. The computer implemented method of
6. The computer implemented method of
migrating the process of at least one failed node to at least one backup node.
7. The computer implemented method of
9. The program product of
10. The program product of
1. Technical Field
This invention generally relates to fault recovery in a parallel computing system, and more specifically relates to an apparatus and method for dynamically rerouting node traffic on the compute nodes of a massively parallel computer system using hint bits without restarting applications executing on the system.
2. Background Art
Efficient fault recovery is important to decrease down time and repair costs for sophisticated computer systems. On parallel computer systems with a large number of compute nodes, a failure of a single component may cause a large portion of the computer, or the entire computer, to be taken off line for repair. Restarting an application may waste the considerable amount of processing time that was completed prior to the failure.
Massively parallel computer systems are one type of parallel computer system that has a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack.
The Blue Gene/L supercomputer communicates over several communication networks. The 65,536 computational nodes are arranged into both a logical tree network and a 3-dimensional torus network. The logical tree network connects the computational nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer. Since the compute nodes are arranged in torus and tree networks that require communication with adjacent nodes, a hardware failure of a single node can bring a large portion of the system to a standstill until the faulty hardware can be repaired. For example, a single node failure could render inoperable a complete section of the torus network, where a section of the torus network in the Blue Gene/L system is half a rack or 512 nodes. Further, all the hardware assigned to the partition of the failure may also need to be taken off line until the failure is corrected.
On large parallel computer systems in the prior art, a failure of a single node during execution often requires that the software application be restarted from the beginning or from a saved checkpoint. When a failure event occurs, it would be advantageous to be able to move the processing of a failed node to another node so that the application can resume on the backup hardware with minimal delay, increasing the overall system efficiency. Without a way to more effectively recover from failed or failing nodes, parallel computer systems will continue to waste potential computer processing time, which increases operating costs.
An apparatus and method is described for dynamically rerouting node traffic on the compute nodes of a massively parallel computer system using hint bits to route around failed nodes or congested networks without restarting applications executing on the system. When a node has a failure or there are indications that it may fail, the application software on the system is suspended while the data on the failed node is moved to a backup node. The torus network traffic is routed around the failed node and traffic for the failed node is rerouted to the backup node. Similarly, network traffic can be routed around a congested network.
The examples and disclosure are directed to the Blue Gene architecture but extend to any parallel computer system with multiple processors arranged in a network structure where the node hardware handles cut-through traffic from other nodes.
The foregoing and other features and advantages will be apparent from the following more particular description, as illustrated in the accompanying drawings.
The disclosure will be described in conjunction with the appended drawings, where like designations denote like elements, and:
The disclosure and claims herein are directed to an apparatus and method for dynamically rerouting node traffic on the compute nodes of a massively parallel computer system using hint bits without restarting applications executing on the system. When a node has a failure or there are indications that it may fail, the application software on the system is suspended while the data on the failed node is moved to a backup node. The torus network traffic is routed around the failed node and traffic for the failed node is rerouted to the backup node. The examples will be described with respect to the Blue Gene/L massively parallel computer being developed by International Business Machines Corporation (IBM).
The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to 1024 compute nodes 110 is handled by each I/O node that has an I/O processor 170 connected to the service node 140. The I/O nodes have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide area network capabilities through a functional network (not shown). The functional network is connected to an I/O processor (or Blue Gene/L link chip) 170 located on a node board 120 that handles communication from the service node 140 to a number of nodes. The Blue Gene/L system has one or more I/O processors 170 on an I/O board (not shown) connected to the node board 120. The I/O processors can be configured to communicate with 8, 32 or 64 nodes. The connections to the I/O nodes are similar to the connections to the compute node except the I/O nodes are not connected to the torus network.
Again referring to
The service node 140 manages the control system network 150 dedicated to system management. The control system network 150 includes a private 100-Mb/s Ethernet connected to an Ido chip 180 located on a node board 120 that handles communication from the service node 140 to a number of nodes. This network is sometimes referred to as the JTAG network since it communicates using the JTAG protocol. All control, test, and bring-up of the compute nodes 110 on the node board 120 is governed through the JTAG port communicating with the service node. In addition, the service node 140 includes a node/network monitor 142 that maintains a problem list 144 that indicates nodes that have failed, nodes that may be failing, and network links to avoid. The node/network monitor comprises software in the service node 140 but may be assisted by operating system software executing on the nodes of the system.
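The interaction between the problem list and the pause, reroute and resume sequence can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the names ProblemList, recover, spares, migrated and log are hypothetical, nodes are represented by coordinate tuples, and the pausing, migration and hint bit steps are recorded as log entries rather than performed.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemList:
    """Hypothetical stand-in for the problem list kept by the monitor."""
    nodes: set = field(default_factory=set)   # failed or failing nodes
    links: set = field(default_factory=set)   # network links to avoid
    version: int = 0                          # bumped on every real change

    def report_node(self, node):
        if node not in self.nodes:
            self.nodes.add(node)
            self.version += 1

    def report_link(self, a, b):
        link = frozenset((a, b))
        if link not in self.links:
            self.links.add(link)
            self.version += 1


def recover(problems, seen_version, app_nodes, spares, migrated, log):
    """Run one pause/migrate/reroute/resume cycle if the list changed.

    Returns the problem list version that has now been handled.
    """
    if problems.version == seen_version:
        return seen_version                   # no update detected
    log.append("pause")                       # suspend the application
    pending = (problems.nodes & set(app_nodes)) - set(migrated)
    for failed in sorted(pending):
        if not spares:
            raise RuntimeError("no backup node available")
        migrated[failed] = spares.pop(0)      # move the process to a backup
        log.append(("migrate", failed, migrated[failed]))
    log.append("set_hint_bits")               # route around listed problems
    log.append("resume")                      # application continues
    return problems.version
```

The version counter is one assumed way to detect that the problem list was updated; the text does not say how the update is detected.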
The Blue Gene/L supercomputer communicates over several communication networks.
The Blue Gene/L torus interconnect connects each node to its six nearest neighbors (X+, X−, Y+, Y−, Z+, Z−) in a logical 3D Cartesian array. The connections to the six neighbors are made at the node level and at the midplane level. Each midplane is an 8×8×8 array of nodes. The six faces (X+, X−, Y+, Y−, Z+, Z−) of the node array in the midplane are each 8×8=64 nodes in size. Each torus network signal from the 64 nodes on each of the six faces is communicated through the link cards (not shown) connected to the midplane to the corresponding nodes in adjacent midplanes. The signals of each face may also be routed back to the inputs of the same midplane on the opposite face when the midplane is used in a partition with a depth of one midplane in any dimension.
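The neighbor relationship and the face size can be illustrated with a short sketch. It assumes a partition consisting of a single midplane, so that every dimension wraps back onto the opposite face. The names MIDPLANE_DIMS, DIRECTIONS, torus_neighbors and face_nodes are hypothetical, and the ASCII "-" stands in for the minus sign used in the direction names.

```python
from itertools import product

MIDPLANE_DIMS = (8, 8, 8)   # one midplane, as described in the text

# Unit step for each of the six torus directions.
DIRECTIONS = {
    "X+": (1, 0, 0), "X-": (-1, 0, 0),
    "Y+": (0, 1, 0), "Y-": (0, -1, 0),
    "Z+": (0, 0, 1), "Z-": (0, 0, -1),
}


def torus_neighbors(coord, dims=MIDPLANE_DIMS):
    """Return the six neighbors of coord, wrapping around at the faces."""
    return {
        name: tuple((c + d) % n for c, d, n in zip(coord, delta, dims))
        for name, delta in DIRECTIONS.items()
    }


def face_nodes(axis, side, dims=MIDPLANE_DIMS):
    """List the coordinates whose position along axis equals side."""
    return [c for c in product(*(range(n) for n in dims)) if c[axis] == side]
```

For example, torus_neighbors((0, 0, 7)) gives (7, 0, 7) for X- and (0, 0, 0) for Z+, and face_nodes(0, 7) lists the 64 nodes on the X+ face.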
Again referring to
The node compute chip 112, illustrated in
The torus network hardware 392 described above directs variable-size packets of data across the various torus networks.
Again referring to
As introduced above, the header 512 includes six “hint” bits 516. The hint bits 516 indicate the directions in which the packet may be routed in the three dimensions of the torus network. The hint bits are defined in XYZ order as follows: X+X−Y+Y−Z+Z−. For example, hint bits of 100100 mean that the packet can be routed in the X+ and Y− directions. Either the X+ or the X− hint bit can be set, but not both, because a set bit indicates which direction to direct the packet in that dimension. The default would be for all hint bits to be unset or 0 to indicate that the packet can be sent in any direction.
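The bit ordering and the one-direction-per-dimension rule can be made concrete with a small sketch. The function names and the use of a six-character string are assumptions for illustration; the actual header layout is not modeled, and the ASCII "-" stands in for the minus sign in the direction names.

```python
# Bit order from the text: X+ X- Y+ Y- Z+ Z-
HINT_ORDER = ("X+", "X-", "Y+", "Y-", "Z+", "Z-")


def encode_hint_bits(directions):
    """Build the six-character hint bit string for the given directions."""
    dirs = set(directions)
    unknown = dirs - set(HINT_ORDER)
    if unknown:
        raise ValueError(f"unknown directions: {sorted(unknown)}")
    for axis in "XYZ":
        # Only one direction may be set in each dimension.
        if axis + "+" in dirs and axis + "-" in dirs:
            raise ValueError(f"both {axis}+ and {axis}- are set")
    return "".join("1" if d in dirs else "0" for d in HINT_ORDER)


def decode_hint_bits(bits):
    """Return the direction names whose hint bits are set."""
    if len(bits) != 6 or set(bits) - {"0", "1"}:
        raise ValueError("expected six characters, each 0 or 1")
    dirs = [d for d, b in zip(HINT_ORDER, bits) if b == "1"]
    encode_hint_bits(dirs)   # reuse the per-dimension check
    return dirs
```

With these helpers, encode_hint_bits({"X+", "Y-"}) yields "100100", matching the example in the text, and decode_hint_bits("110000") is rejected because both X bits are set.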
In torus networks, there is typically a dimension order in which data flows between nodes. The dimension order in the examples herein is assumed to be XYZ, but other orders could also be used. The dimension order of XYZ means that data will flow from a node first in the X dimension, then through nodes in the Y dimension, then in the Z dimension. The XYZ hint bits are used in routing in the XYZ dimensions respectively.
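A deterministic XYZ dimension-ordered route can be sketched as follows. The function name dimension_order_path is hypothetical, the torus size defaults to one 8×8×8 midplane, and the choice to travel the shorter way around each ring (with ties going in the plus direction) is an assumption rather than something the text specifies.

```python
def dimension_order_path(src, dst, dims=(8, 8, 8), order=(0, 1, 2)):
    """List the nodes visited from src to dst, one dimension at a time.

    order gives the axis indices in routing order; (0, 1, 2) is XYZ.
    """
    path = [tuple(src)]
    cur = list(src)
    for axis in order:
        n = dims[axis]
        forward = (dst[axis] - cur[axis]) % n     # hops in the plus direction
        step = 1 if forward <= n - forward else -1
        hops = forward if step == 1 else n - forward
        for _ in range(hops):
            cur[axis] = (cur[axis] + step) % n    # wrap around the ring
            path.append(tuple(cur))
    return path
```

For a packet from (0, 0, 0) to (2, 7, 0), the sketch travels two hops in X+ and then one hop in Y− across the wraparound link, with no travel in Z.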
Each node maintains a set of software-configurable registers that control the torus functions (not shown). For example, a set of registers contains the coordinates of its neighbors. Hint bits are set to 0 when a packet leaves a node in a direction such that it will arrive at its destination in that dimension, as determined by the neighbor coordinate registers. These hint bits appear early in the header so that arbitration may be efficiently pipelined. The hint bits can be initialized by either software or hardware; if done by hardware, a set of two registers per dimension is used to determine the appropriate directions. These registers can be configured to provide minimal hop routing. The routing is accomplished entirely by examining the hint bits and virtual channels; i.e., there are no routing tables. Packets may be either dynamically or deterministically dimension-ordered (xyz) routed. That is, they can follow a path of least congestion based on other traffic, or they can be routed on a fixed path. Besides point-to-point packets, a bit in the header may be set that causes a packet to be broadcast down any Cartesian dimension and deposited at each node. Software can set the hint bits appropriately so that “dead” nodes or links are avoided as described further below. Full connectivity can be maintained when there are up to three noncolinear faulty nodes.
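How hint bits might be initialized for minimal hop routing, and how software might then override them, can be sketched as follows. Both function names are hypothetical. The sketch assumes that a dimension needing no travel is left as 00, and that one way for software to avoid a problem node is to reverse the travel direction in a dimension so the packet goes the other way around that ring; the text does not give the exact rule used.

```python
def minimal_hop_hint_bits(src, dst, dims=(8, 8, 8)):
    """Return hint bits (X+ X- Y+ Y- Z+ Z-) for the shortest way round."""
    bits = []
    for s, d, n in zip(src, dst, dims):
        forward = (d - s) % n          # hops needed in the plus direction
        if forward == 0:
            bits.append("00")          # already at destination coordinate
        elif forward <= n - forward:
            bits.append("10")          # plus direction is no longer
        else:
            bits.append("01")          # minus direction is shorter
    return "".join(bits)


def flip_dimension(bits, axis):
    """Reverse the travel direction in one dimension (0=X, 1=Y, 2=Z)."""
    i = 2 * axis
    pair = bits[i:i + 2]
    if pair == "00":
        return bits                    # no travel in this dimension
    return bits[:i] + pair[::-1] + bits[i + 2:]
```

For a packet from (0, 0, 0) to (2, 7, 0), minimal_hop_hint_bits returns "100100". If the X+ neighbor were on the problem list, flip_dimension("100100", 0) would give "010100", sending the packet the longer way around the X ring to the same destination.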
Again referring to
As introduced above, the hint bits can also be used to dynamically route around a congested network. As an example, we consider the network 710 between node8 622 and node5 618 illustrated in
The disclosure herein includes a method and apparatus for dynamically rerouting node traffic on the compute nodes of a massively parallel computer system using hint bits without restarting applications executing on the system. Dynamically rerouting node traffic can significantly decrease the amount of down time for increased efficiency of the computer system.
One skilled in the art will appreciate that many variations are possible within the scope of the claims. Thus, while the disclosure is particularly shown and described above, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the spirit and scope of the claims.
Peters, Amanda, Swartz, Brent Allen, Darrington, David L., Sidelnik, Albert, Smith, Brian Edward, McCarthy, Patrick Joseph
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Apr 10 2007 | MCCARTHY, PATRICK J | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 019177 | /0612 | |
Apr 10 2007 | PETERS, AMANDA | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 019177 | /0612 | |
Apr 10 2007 | SIDELNIK, ALBERT | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 019177 | /0612 | |
Apr 10 2007 | SMITH, BRIAN E | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 019177 | /0612 | |
Apr 11 2007 | DARRINGTON, DAVID L | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 019177 | /0612 | |
Apr 11 2007 | SWARTZ, BRENT A | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 019177 | /0612 | |
Apr 18 2007 | International Business Machines Corporation | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Dec 23 2009 | ASPN: Payor Number Assigned. |
Mar 29 2013 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Aug 18 2017 | REM: Maintenance Fee Reminder Mailed. |
Feb 05 2018 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |