A method and apparatus are provided for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment. IB architected multicast facilities are used to communicate between HCAs in a loop or string topology. Multiple HCAs in the network subscribe to a predetermined multicast address. Multicast messages sent by one HCA destined to the predetermined multicast address are received by other HCAs in the network. Intermediate TCA hardware, per IB architected multicast support, forwards the multicast messages on via hardware facilities, which do not require invocation of software facilities, thereby providing performance efficiencies. The messages flow until picked up by an HCA on the network. Architected higher level IB connections, such as IB supported Reliable Connections (RCs), are established using the multicast message flow, eliminating the need for an IB subnet manager (SM).
1. A method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment, an InfiniBand (IB) fabric including a plurality of IB links, each system including a host channel adapter (HCA) connected to respective IB links of the InfiniBand (IB) fabric, and each HCA including IB multicast facilities, said method comprising:
providing a predetermined multicast address used by the HCAs for communicating between HCAs;
a first HCA sending multicast messages destined to the predetermined multicast address on respective IB links of the IB fabric prior to local ID addresses (LIDs) being assigned in the non-InfiniBand (IB) compliant environment without IB switches and without a common subnet manager (SM) entity;
providing a flow of the multicast messages on the IB fabric until an HCA on the IB fabric picks up the multicast messages;
a second HCA receiving incoming packets on at least one respective IB link of the IB fabric, identifying the predetermined multicast address, and receiving the multicast messages without forwarding the multicast messages; and
responsive to the first HCA sending multicast messages destined to the predetermined multicast address, the second HCA sending multicast response messages, establishing higher level IB connections.
13. An apparatus for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment comprising:
an InfiniBand (IB) fabric including a plurality of IB links;
each system including a host channel adapter (HCA) connected to respective IB links of the InfiniBand (IB) fabric;
a predetermined multicast address used by each of the HCAs;
each HCA including IB multicast facilities for sending multicast messages on the respective IB links of the IB fabric;
a first HCA sending multicast messages destined to the predetermined multicast address on respective IB links of the IB fabric prior to local ID addresses (LIDs) being assigned in the non-InfiniBand (IB) compliant environment without IB switches and without a common subnet manager (SM) entity;
a second HCA receiving incoming packets, identifying the predetermined multicast address, and receiving the multicast messages without forwarding the multicast messages;
responsive to the first HCA sending multicast messages destined to the predetermined multicast address, the second HCA sending multicast response messages, establishing higher level IB connections; and
a plurality of intermediate target channel adapters (TCAs) connected to respective IB links of the InfiniBand (IB) fabric between the first HCA and the second HCA, and wherein each of the plurality of intermediate TCAs forwards the multicast messages on via hardware facilities without invocation of software facilities.
9. A non-transitory computer program product implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment, an InfiniBand (IB) fabric including a plurality of IB links, each system including a host channel adapter (HCA) connected to respective IB links of the InfiniBand (IB) fabric, and each HCA including IB multicast facilities, said non-transitory computer program product including a plurality of computer executable instructions stored on a non-transitory computer readable medium, wherein said instructions, when executed by an HCA, cause the HCA to perform the steps of:
providing a predetermined multicast address used by the HCAs for communicating between HCAs;
sending, by a first HCA, multicast messages destined to the predetermined multicast address on respective IB links of the IB fabric prior to local ID addresses (LIDs) being assigned in the non-InfiniBand (IB) compliant environment without IB switches and without a common subnet manager (SM) entity;
providing a flow of the multicast messages on the IB fabric until an HCA on the IB fabric picks up the multicast messages;
receiving, by a second HCA, incoming packets on at least one respective IB link of the IB fabric, identifying the predetermined multicast address, and receiving the multicast messages without forwarding the multicast messages; and
responsive to the first HCA sending multicast messages destined to the predetermined multicast address, the second HCA sending multicast response messages, establishing higher level IB connections.
8. A method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment, an InfiniBand (IB) fabric including a plurality of IB links, each system including a host channel adapter (HCA) connected to respective IB links of the InfiniBand (IB) fabric, and each HCA including IB multicast facilities, said method comprising:
providing a predetermined multicast address used by the HCAs for communicating between HCAs;
a first HCA sending multicast messages destined to the predetermined multicast address on respective IB links of the IB fabric prior to local ID addresses (LIDs) being assigned in the non-InfiniBand (IB) compliant environment without IB switches and without a common subnet manager (SM) entity;
providing a flow of the multicast messages on the IB fabric until an HCA on the IB fabric picks up the multicast messages;
a second HCA receiving incoming packets on at least one respective IB link of the IB fabric, identifying the predetermined multicast address, and receiving the multicast messages without forwarding the multicast messages;
responsive to the first HCA sending multicast messages destined to the predetermined multicast address, the second HCA sending multicast response messages, establishing higher level IB connections; and
a plurality of intermediate target channel adapters (TCAs) connected to respective IB links of the InfiniBand (IB) fabric between the first HCA and the second HCA, and each of the plurality of intermediate TCAs forwarding the multicast messages on via hardware facilities without invocation of software facilities or local I/O processor cycles.
2. The method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment as recited in
3. The method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment as recited in
4. The method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment as recited in
5. The method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment as recited in
6. The method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment as recited in
7. The method for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment as recited in
10. The non-transitory computer program product implementing system to system communication as recited in
11. The non-transitory computer program product implementing system to system communication as recited in
12. The non-transitory computer program product implementing system to system communication as recited in
14. The apparatus for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment as recited in
The present invention relates generally to the data processing field, and more particularly, relates to a method and apparatus for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment using InfiniBand unreliable datagram multicast facilities.
Input/output (I/O) networks, such as system buses, can be used for the processor of a computer to communicate with peripherals such as network adapters or with processors of other computers in the network. However, constraints in the architectures of common I/O networks, such as the Peripheral Component Interconnect (PCI) bus, limit the overall performance of the I/O network and the computers and I/O peripherals that it interconnects. As a result, new types of I/O networks have been introduced for interconnecting systems.
One recent type of I/O network is known as the InfiniBand (IB) network. The InfiniBand network replaces the PCI or other bus currently found in computers used for system level interconnects with a packet-switched network, complete with zero or more routers. A host channel adapter (HCA) couples the processor to a subnet, and target channel adapters (TCAs) couple the peripherals to the subnet. The subnet typically includes at least one switch, together with links that connect the HCA and the TCAs to the switch.
The IB fabric typically includes a plurality of endnodes, such as HCAs and TCAs, a plurality of switches, a plurality of routers, and a plurality of links. Ports on endnodes, switches, and routers are connected in a point to point fashion by links. In a known InfiniBand (IB) subnet, a Subnet Manager (SM) is responsible for initial discovery and configuration of the subnet. Another InfiniBand component known as the Subnet Administrator (SA) provided with the Subnet Manager (SM) provides services to members of the subnet including access to configuration and routing information determined by the SM. See InfiniBand Architecture Specification Volume 1 for more detail.
A need exists to establish communications over an InfiniBand (IB) fabric between Host Channel Adapters (HCAs) in distinct systems, such as processor nodes, in a network without IB switches and without a common Subnet Manager (SM) entity to assign unique local ID addresses (LIDs) to the HCA, i.e., a non-compliant IB network. The IB network may contain Target Channel Adapters (TCAs) which may or may not be IB-compliant. The network topology, being switchless, consists of multiple strings or a loop topology. Packets need to flow from source HCA to target HCA prior to LIDs being assigned with or without intermediate TCAs on the IB fabric.
Known solutions to this problem typically make use of external IB switches in a switched topology, which include a Subnet Manager function as part of the switch. The cost of the switch is a significant issue for the Small to Medium Business (SMB) environment. Also, the development, test, and maintenance costs for integrating a fully IB-compliant SM function in firmware in a processor node in a switchless environment can be significant.
A switchless solution, i.e., a string or loop topology, conventionally would require a Subnet Manager function to exist somewhere in the network, likely uniquely developed for one of the processor nodes and using the bandwidth and resources of that processor node, to manage LIDs in a multi-HCA topology. For an IB subnet, the Subnet Manager (SM) is responsible for initial discovery and configuration of the subnet. Tightly coupled with the SM is another InfiniBand component known as the Subnet Administrator (SA). The SA provides services to members of the subnet including access to configuration and routing information determined by the SM. The capabilities of the SM and SA can be sophisticated: they resolve all potential paths from all nodes with deadlock avoidance, they support many optional features of the InfiniBand Architecture (IBA), they provide quality of service (QOS) support, and the like.
Thus, full SM development and deployment is a considerable software development and system expense. Additionally, the TCAs may be non-IB compliant, forcing solutions that are not addressed by existing IB compliant SMs.
It may be possible that other unique solutions could be developed that would require unique software intervention at each intermediate TCA to look inside incoming packet headers, determine that a special HCA-only packet is on the wire, and then forward it out the egress port. However, in addition to unique code development, this requires TCA processor cycles to partially process each inbound packet.
Principal aspects of the present invention are to provide a method and apparatus for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment using InfiniBand unreliable datagram multicast facilities. Other important aspects of the present invention are to provide such a method and apparatus substantially without negative effect and overcoming many of the disadvantages of prior art arrangements.
In brief, a method and apparatus are provided for implementing system to system communication in a switchless non-InfiniBand (IB) compliant environment. IB architected multicast facilities are used to communicate between HCAs connected, for example, in a loop or string topology. Multiple HCAs in the network subscribe to a predetermined multicast address. Multicast messages sent by one HCA destined to the pre-determined multicast address are received by other HCAs in the network. The multicast messages flow until picked up by an HCA on the network.
In accordance with features of the invention, each intermediate TCA, per IB architected multicast support, forwards the multicast messages on via hardware facilities, which do not require invocation of software facilities, thereby providing performance efficiencies. Packets flow from source HCA to target HCA prior to LIDs being assigned, with or without intermediate TCAs on the IB fabric.
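As an informal illustration, the multicast flow described above can be sketched as a small simulation: HCAs subscribe to a predetermined multicast address, intermediate TCAs forward matching packets hop by hop without inspecting them in software, and the message flows until an HCA on the fabric picks it up. All names and structures here (Node, MCAST_LID, and so on) are hypothetical stand-ins for illustration, not the actual Hub or TCA hardware interfaces.

```python
# Hypothetical sketch of multicast discovery on a string topology.
MCAST_LID = 0xC001  # predetermined multicast address all HCAs subscribe to

class Node:
    def __init__(self, name, is_hca):
        self.name = name
        self.is_hca = is_hca
        self.next = None        # egress neighbour on the string/loop

    def receive(self, dlid, origin):
        if self is origin:      # packet has looped all the way around
            return None
        if self.is_hca and dlid == MCAST_LID:
            return self.name    # HCA consumes the multicast message
        # TCA: hardware-style forward, no software processing of the packet
        return self.next.receive(dlid, origin) if self.next else None

# Build a small string: HCA A -> TCA1 -> TCA2 -> HCA B
a = Node("HCA A", True)
t1 = Node("TCA1", False)
t2 = Node("TCA2", False)
b = Node("HCA B", True)
a.next, t1.next, t2.next = t1, t2, b

# HCA A sends out its port; the message flows until HCA B picks it up.
receiver = t1.receive(MCAST_LID, origin=a)
print(receiver)  # HCA B
```

Note that no LIDs are assigned anywhere in this sketch; delivery relies only on the shared multicast address, mirroring the switchless, SM-less environment described above.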
The present invention together with the above and other objects and advantages may best be understood from the following detailed description of the preferred embodiments of the invention illustrated in the drawings, wherein:
In accordance with features of the invention, a method and apparatus implement system to system communication in a switchless non-InfiniBand (IB) compliant environment using InfiniBand unreliable datagram multicast facilities. The method and apparatus of the invention establish communications over an InfiniBand (IB) fabric between Host Channel Adapters (HCAs) in distinct systems (processor nodes) in a network without IB switches and without a common Subnet Manager (SM) entity to assign unique local ID addresses (LIDs) to the HCA, i.e., a non-compliant IB network. The IB network may contain Target Channel Adapters (TCAs) which may or may not be IB-compliant. The network topology, being switchless, consists of multiple strings or a loop topology. Packets are enabled to flow from source HCA to target HCA prior to LIDs being assigned with or without intermediate TCAs on the IB fabric.
It should be noted that the driving force for using non-compliant devices in an IB network is two-fold. First, when building an internal proprietary network topology for restricted environments, it is desirable to take advantage of high usage industry standard parts where feasible for low cost. At the same time, where the environment does not call for interconnecting with a public network but requires unique chip development for devices, such as support for I/O drawers that may not be used widely in the industry, a lower cost design can be achieved by defining less complex non-compliant devices such as switches and bridge logic for the referenced I/O drawers. Secondly, this environment can also achieve significant savings in software development and support by greatly simplifying and reducing the role of IB compliant entities such as a Subnet Manager for network control.
Having reference now to the drawings, in
The illustrated non-compliant InfiniBand (IB) network 100 provides an example loop topology, while it should be understood that the present invention can be implemented with an IB network that includes multiple strings or the loop topology.
The non-compliant InfiniBand (IB) network 100 includes a first system 0 or Component Enclosure Complex (CEC) CEC0, 102 and a second system 1 or CEC1, 102, each including a Hub 104. The Hub hardware 104 along with the firmware used to control the Hub hardware is illustrated and described with respect to
The non-compliant InfiniBand (IB) network 100 includes a plurality of input/output (I/O) enclosures or I/O drawers 106, each including at least one bridge chip. As shown, each of the I/O drawers 106 includes a plurality of non-IB compliant IB to PCI bridge chips (NCBs) or target channel adapters (TCA) 108 with an associated PCI Host bridge 110 including one or more slots.
An InfiniBand (IB) fabric generally designated by the reference character 114 provides the example loop topology including a plurality of IB links 116, 118. The IB links 116 or IB cables 116 are point-to-point links connecting respective IB ports of the CEC0 or HCA A, 102 and CEC1 or HCA B, 102 to respective IB ports of adjacent I/O drawers 106. The IB links 118 are point-to-point links connecting respective adjacent NCB or TCAs 108.
Referring to
As shown in
As shown in
The LID Bit Array 202 shown in
It should be noted that alternative embodiments of this invention can be implemented with a single QP on each CEC serving as both send and receive QP functions. Also, the specific HCA design will dictate whether special features such as the Force Out Bit described above are required to force routing out the HCA ports versus routing internal to the HCA. It is only critical to the invention that the multicast messages are routed externally out an HCA port and not routed internally as if delivery is only required local to the HCA.
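The queue pair arrangements just described, including the single-QP alternative and the Force Out Bit, can be sketched informally as follows. The class and flag names are assumptions chosen for illustration; the actual Hub hardware interfaces are not specified here.

```python
# Illustrative sketch: each HCA uses a multicast send QP plus a receive QP,
# or optionally a single QP serving both roles. A force_out flag models the
# "Force Out Bit" that routes multicast packets out the physical HCA port
# rather than internally, since local-only delivery would defeat discovery.

class QueuePair:
    def __init__(self, role, force_out=False):
        self.role = role            # "mcast_send", "receive", or "both"
        self.force_out = force_out  # route externally, never loop back locally

def make_hca_qps(single_qp=False):
    if single_qp:
        # Alternative embodiment: one QP handles both send and receive.
        return [QueuePair("both", force_out=True)]
    return [QueuePair("mcast_send", force_out=True), QueuePair("receive")]

qps = make_hca_qps()
print([qp.role for qp in qps])  # ['mcast_send', 'receive']
```

Whether a separate force-out mechanism is needed depends on the specific HCA design, as noted above; the essential point modeled here is only that multicast sends leave the HCA port.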
Referring also to
Referring also to
At the receiving HCA B or CEC1, 102, the initial send and receive port manager INIT SR PORT MGR 402 posts a response message as indicated at line 1A) POST SENT RSP to the receive queue pair RCVD QP 406. As indicated at line 2A) MOVE RSP DATA, the Hub hardware moves the response data from the receive queue pair RCVD QP 406 to the send queue pair SEND QP 404 of the HCA A or CEC0, 102. A response interrupt is generated as indicated at line 3A) RSP INTERRUPT applied to the event queue EQ 408 of the HCA A or CEC0, 102 and as indicated at line 4A) applied to the completion queue CQ 410 and coupled to the send queue pair SEND QP 404 as indicated at line 5A). As indicated at line 6A) the send queue pair SEND QP 404 is coupled to the queue pair QP handler 412, which applies the received response message to the initial send and receive port manager INIT SR PORT MGR 402 of the HCA A or CEC0, 102 as indicated at line 7A) RES RECEIVED.
While generating an interrupt and response interrupt is illustrated in
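The polling alternative can be sketched with a simplified completion queue: firmware loops on the CQ for new work completions instead of taking an interrupt through the event queue. The structures below are simplified stand-ins for illustration, not the actual Hub hardware or firmware interfaces.

```python
# Sketch of the polling model for completion handling: rather than an
# interrupt driving EQ -> CQ -> QP handler, firmware polls the CQ directly.
from collections import deque

class CompletionQueue:
    def __init__(self):
        self.entries = deque()

    def post(self, wc):
        # Hardware posts a work completion when a message arrives.
        self.entries.append(wc)

    def poll(self):
        # Firmware polls; returns the next completion or None if empty.
        return self.entries.popleft() if self.entries else None

cq = CompletionQueue()
cq.post({"qp": "RCVD QP", "op": "mcast_recv", "status": "ok"})

wc = cq.poll()          # firmware polling loop finds the completion
print(wc["status"])     # ok
print(cq.poll())        # None: queue drained, keep polling
```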
Referring now to
HCA controlling firmware and structure objects 500 include an HCA manager 502 coupling information and controls to a HUB controller 504, an Event Queue (EQ) 506, and a Completion Queue (CQ) 508 as indicated at respective lines labeled 100) START IB BUS, and KNOWS_A, where KNOWS_A indicates a pointer to a resource or other object in a separate memory location. HCA firmware and structure objects 500 include an IB Bus 510 started by the HUB controller 504 as indicated at respective lines 100A) IPL GIVEN PORTS; and 101) CREATE SR LOOP MANAGER. Alternatively, as indicated at line 100ALT.), TAKE RECOVERABLE ERROR MSG is applied to the IB Bus 510.
The IB Bus 510 and HUB controller 504 are coupled to a lower level manager or SR Loop Manager 512, as indicated at respective lines 102) CTOR (C++ constructor in this implementation), and 201) CREATE BUS ADAPTER. The IB Bus 510 and HUB controller 504 are coupled to a lower level bus adapter or a SR Loop Bus Adapter 514, as indicated at line 202) CTOR, which is coupled to a SR Loop Bus Bucc 516 as indicated at line 203) CTOR. The SR Loop Bus Bucc 516 is coupled to a SR Loop Bus 518 as indicated at line 204) CTOR. The SR Loop Bus 518 is coupled to a Reliable Connection 520 as indicated at line 205) CTOR, which is coupled to a queue pair QP (APM support) 522 as indicated at line KNOWS_A.
The SR Loop Manager 512 is coupled to a lower level manager or an initial SR Loop Manager 524, as indicated at line 103) CTOR, which is coupled to a SR Loop LID Manager 526 as indicated at line 104) CTOR and is coupled to a SR Port Manager 528 as indicated at line 105) CTOR. The SR Port Manager 528 is coupled to a queue pair QP (Mcast Send) 530 as indicated at line 106) CTOR and to a queue pair QP (Receive) 532 as indicated at line 107) CTOR. The initial SR Loop Manager 524 is coupled to an initial SR Port Manager 534 as indicated at line 108) CTOR, which is coupled to a queue pair QP (Mcast Send) 536 as indicated at line 109) CTOR and to a queue pair QP (Receive) 538 as indicated at line 110) CTOR. The queue pair QP (Mcast Send) 530 and the queue pair QP (Mcast Send) 536 are a separate QP class for multicast messages. A multicast facility 540 is connected to each of the QP (Receive) 532 and the QP (Receive) 538. The multicast facility 540 is provided under the QP objects 530, 532, 536, 538.
Referring now to
Referring now to
A sequence of program instructions or a logical assembly of one or more interrelated modules defined by the recorded program means 704, 706, 708, 710, direct the systems or CEC0, CEC1, 102 for establishing communications over a non-compliant InfiniBand (IB) network of the preferred embodiment.
While the present invention has been described with reference to the details of the embodiments of the invention shown in the drawing, these details are not intended to limit the scope of the invention as claimed in the appended claims.
Block, Timothy Roy, Sand, Thomas Rembert, Schimke, Timothy Jerry
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 25 2008 | BLOCK, TIMOTHY ROY | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 021599 | 0189 | |
Sep 25 2008 | SAND, THOMAS REMBERT | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 021599 | 0189 | |
Sep 25 2008 | SCHIMKE, TIMOTHY JERRY | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 021599 | 0189 | |
Sep 29 2008 | International Business Machines Corporation | (assignment on the face of the patent) |
Date | Maintenance Fee Events |
Mar 04 2016 | REM: Maintenance Fee Reminder Mailed. |
Jul 24 2016 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |