An example embodiment of the present invention provides processes relating to a connection/communication protocol and a memory-addressing scheme for a distributed shared memory system. In the example embodiment, a logical node identifier comprises bits in the physical memory addresses used by the distributed shared memory system. Processes in the embodiment include logical node identifiers in packets which conform to the protocol and which are stored in a connection control block in local memory. By matching the logical node identifiers in a packet against the logical node identifiers in the connection control block, the processes ensure reliable delivery of packet data. Further, in the example embodiment, the logical node identifiers are used to create a virtual server consisting of multiple nodes in the distributed shared memory system.
16. A method comprising:
receiving, at a first node in a distributed shared memory system, a message from a second node in the distributed shared memory system, the distributed shared memory system comprising a plurality of interconnected nodes each having a unique logical node identifier, wherein the message indicates a memory operation related to a local memory of the first node and identifies a memory address;
if a first plurality of contiguous bits of the memory address equal a logical node identifier of the first node, changing the first plurality of contiguous bits to a predetermined value;
if the first plurality of contiguous bits of the memory address equal the predetermined value, changing the first plurality of contiguous bits to the logical node identifier of the first node; and
forwarding the message to a processor of the first node for processing.
21. A method comprising:
receiving, at a first node in a distributed shared memory system, a message from a processor of the first node identifying a memory operation related to a local memory of a second node in the distributed shared memory system, the distributed shared memory system comprising a plurality of nodes each having a unique logical node identifier, the plurality of nodes being interconnected by a switch fabric, wherein the message identifies a memory address;
if a first plurality of contiguous bits of the memory address equal a logical node identifier of the first node, changing the first plurality of contiguous bits to a predetermined value;
if the first plurality of contiguous bits of the memory address equal the predetermined value, changing the first plurality of contiguous bits to the logical node identifier of the first node; and
forwarding the message to the second node for processing.
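For illustration, the address remapping recited in claims 16 and 21 can be sketched in a few lines of C. This is a minimal sketch only: the 4-bit LNID field, its position in the top bits of a 40-bit address, and the use of zero as the predetermined value are assumptions drawn from the examples later in this document, not requirements of the claims.

```c
#include <stdint.h>

#define LNID_SHIFT 36                       /* assumed: LNID occupies the top 4 bits of a 40-bit address */
#define LNID_MASK  ((uint64_t)0xF << LNID_SHIFT)
#define PREDETERMINED_VALUE ((uint64_t)0)   /* assumed predetermined value (claim 17 suggests zero) */

/* Swap the LNID field of an address between the local node's LNID and the
 * predetermined value, mirroring the two conditional steps of claims 16 and 21. */
static uint64_t remap_address(uint64_t addr, uint64_t my_lnid)
{
    uint64_t field = (addr & LNID_MASK) >> LNID_SHIFT;

    if (field == my_lnid)
        field = PREDETERMINED_VALUE;        /* local LNID -> predetermined value */
    else if (field == PREDETERMINED_VALUE)
        field = my_lnid;                    /* predetermined value -> local LNID */

    return (addr & ~LNID_MASK) | (field << LNID_SHIFT);
}
```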
4. A method, comprising:
receiving, at a distributed memory logic circuit of a first node, a packet from a distributed memory logic circuit of a second node, wherein the packet includes a source logical node identifier and wherein the first and second nodes are connected by a network switch fabric and are parts of a distributed shared memory system;
determining whether a destination switch fabric address included in the packet matches a switch fabric address for the first node;
using the source logical node identifier as an index into a connection control block to locate an entry for a connection between the first and second nodes, resulting in a located entry of the connection control block, wherein the connection control block is stored in a local memory on the first node;
determining whether a destination logical node identifier included in the packet matches a logical node identifier for the first node, wherein the logical node identifier for the first node is identified in the located entry of the connection control block; and
accepting data in the packet for further processing by the first node.
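A hedged C sketch of the receive-side checks recited in claim 4 follows. Every identifier name is hypothetical, and the connection control block is reduced to a single array indexed by the peer's LNID; the patent does not prescribe this layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether to accept an incoming packet's data.
 * ccb_local_lnid[] holds, per source LNID, the logical node identifier assigned
 * to this node on that connection (the "located entry" of claim 4). */
static bool accept_packet(uint16_t dst_fabric_addr,       /* destination fabric address in the packet */
                          uint16_t src_lnid,              /* source LNID carried in the packet */
                          uint16_t dst_lnid,              /* destination LNID carried in the packet */
                          uint16_t my_fabric_addr,        /* this node's switch fabric address */
                          const uint16_t ccb_local_lnid[])
{
    if (dst_fabric_addr != my_fabric_addr)
        return false;                                     /* packet not addressed to this node */

    uint16_t expected = ccb_local_lnid[src_lnid];         /* source LNID indexes the CCB */

    return dst_lnid == expected;                          /* accept only if the LNIDs match */
}
```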
11. A distributed memory logic circuit encoded with executable logic, the logic when executed operable to:
receive, at the distributed memory logic circuit of a first node, a packet from a distributed memory logic circuit of a second node, wherein the packet includes a source logical node identifier and wherein the first and second nodes are connected by a network switch fabric and are parts of a distributed shared memory system;
determine whether a destination switch fabric address included in the packet matches a switch fabric address for the first node;
use the source logical node identifier as an index into a connection control block to locate an entry for a connection between the first and second nodes, resulting in a located entry of the connection control block, wherein the connection control block is stored in a local memory on the first node;
determine whether a destination logical node identifier included in the packet matches a logical node identifier for the first node, wherein the logical node identifier for the first node is identified in the located entry of the connection control block; and
accept data in the packet for further processing by the first node.
22. A distributed shared memory system, comprising:
a network switch fabric; and
a plurality of nodes interconnected by the network switch fabric, each given node of the plurality of nodes comprising:
a logical node identifier of a plurality of contiguous bits;
a local memory;
a distributed shared memory management chip operative to share the local memory of the given node with others of the plurality of nodes in the distributed shared memory system to create a shared memory accessible using binary addresses comprising a plurality of bits, wherein a set of contiguous most-significant bits of the binary addresses collectively represent a logical node identifier of a node of the plurality of nodes; and
one or more processors each operative to access the local memory of the given node, the local memory accessed using binary addresses having the set of contiguous most-significant bits collectively set to a predetermined value,
wherein the distributed shared memory management chip is further operative to map the predetermined value to the logical node identifier of the given node in memory management traffic transmitted between the plurality of nodes that include one or more binary addresses of the shared memory.
1. A method, comprising:
receiving, at a distributed memory logic circuit of a first node, data for a packet destined to a distributed memory logic circuit of a second node, wherein the first and second nodes are connected by a network switch fabric and are parts of a distributed shared memory system, and wherein the data for the packet includes a physical memory address in which one or more bits in the physical memory address comprise a destination logical node identifier for the second node;
using the destination logical node identifier as an index into a connection control block to locate an entry for a connection between the first and second nodes, resulting in a located entry of the connection control block, wherein the connection control block is stored in a local memory on the first node;
building the packet in a format of a connection and communication protocol using the data, the destination logical node identifier, and a logical node identifier for the first node, wherein the logical node identifier for the first node is included in the located entry of the connection control block;
adding, to the packet, a header that includes a switch fabric address for the second node, wherein the switch fabric address is identified in the located entry of the connection control block; and
transmitting the packet on a link to the switch fabric.
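The transmit side recited in claim 1 can be sketched in the same hedged style; the header layout, the LNID field position, and the array representation of the connection control block are all assumptions for illustration.

```c
#include <stdint.h>

#define LNID_SHIFT 36                       /* assumed: destination LNID in the top bits of the address */

struct rdp_header {                         /* hypothetical header for the connection protocol */
    uint16_t dst_fabric_addr;
    uint16_t src_lnid;
    uint16_t dst_lnid;
};

/* Build the header for a packet carrying a remote memory operation on phys_addr.
 * ccb_local_lnid[] and ccb_fabric_addr[] stand in for fields of the CCB entry
 * located by the destination LNID. */
static struct rdp_header build_header(uint64_t phys_addr,
                                      const uint16_t ccb_local_lnid[],
                                      const uint16_t ccb_fabric_addr[])
{
    uint16_t dst_lnid = (uint16_t)((phys_addr >> LNID_SHIFT) & 0xF);

    struct rdp_header hdr = {
        .dst_fabric_addr = ccb_fabric_addr[dst_lnid],     /* fabric address from the located entry */
        .src_lnid        = ccb_local_lnid[dst_lnid],      /* this node's LNID on the connection */
        .dst_lnid        = dst_lnid,
    };
    return hdr;                             /* payload assembly and transmission on the link are omitted */
}
```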
8. A distributed memory logic circuit encoded with executable logic, the logic when executed operable to:
receive, at the distributed memory logic circuit of a first node, data for a packet destined to a distributed memory logic circuit of a second node, wherein the first and second nodes are connected by a network switch fabric and are parts of a distributed shared memory system, and wherein the data for the packet includes a physical memory address in which one or more bits in the physical memory address comprise a destination logical node identifier for the second node;
use the destination logical node identifier as an index into a connection control block to locate an entry for a connection between the first and second nodes, resulting in a located entry of the connection control block, wherein the connection control block is stored in a local memory on the first node;
build the packet in a format of a connection and communication protocol using the data, the destination logical node identifier, and a logical node identifier for the first node, wherein the logical node identifier for the first node is included in the located entry of the connection control block;
add, to the packet, a header that includes a switch fabric address for the second node, wherein the switch fabric address is identified in the located entry of the connection control block; and
transmit the packet on a link to the switch fabric.
15. A distributed shared memory system comprising:
a network switch fabric;
two or more nodes in a distributed shared memory system connected by the network switch fabric, each of the two or more nodes comprising:
one or more processors;
local memory; and
a distributed shared memory logic circuit,
wherein the distributed memory logic circuit is encoded with executable logic, the logic when executed operable to:
receive, at the distributed memory logic circuit of a local node, data for a packet destined to a distributed memory logic circuit of a remote node of the two or more nodes in the distributed shared memory system, wherein the data for the packet includes a physical memory address in which one or more bits in the physical memory address comprise a destination logical node identifier for the remote node,
use the destination logical node identifier as an index into a connection control block to locate an entry for a connection between the local node and the remote node, resulting in a located entry of the connection control block, wherein the connection control block is stored in local memory on the local node,
build the packet in a format of a connection and communication protocol using the data, the destination logical node identifier, and a logical node identifier for the local node, wherein the logical node identifier for the local node is included in the located entry of the connection control block,
add, to the packet, a header that includes a switch fabric address for the remote node, wherein the switch fabric address is identified in the located entry of the connection control block,
transmit the packet on a link to the network switch fabric,
receive, at the distributed memory logic circuit of the local node, a second packet from a distributed memory logic circuit of the remote node or another remote node of the two or more nodes in the distributed shared memory system, wherein the second packet includes a source logical node identifier,
determine whether a destination switch fabric address included in the second packet matches a switch fabric address for the local node,
use the source logical node identifier as an index into the connection control block to locate an entry for a connection between the local node and the remote node, resulting in a second located entry of the connection control block,
determine whether a destination logical node identifier included in the second packet matches the logical node identifier for the local node, wherein the logical node identifier for the local node is identified in the second located entry of the connection control block, and
accept data in the second packet for further processing by the local node.
2. A method as in
3. A method as in
5. The method of
6. The method of
7. The method of
9. The distributed memory logic circuit of
10. The distributed memory logic circuit of
12. The distributed memory logic circuit of
13. The distributed memory logic circuit of
14. The distributed memory logic circuit of
17. The method of claim 16, wherein the predetermined value is zero.
18. The method of claim 16, wherein each node of the plurality of interconnected nodes internally accesses a respective local memory having memory addresses with a first plurality of contiguous bits set to the predetermined value.
19. The method of claim 16, wherein a given node of the plurality of interconnected nodes accesses a local memory of another node of the plurality of interconnected nodes that has a logical node identifier equal to the predetermined value using the given node's own respective logical node identifier for the another node.
20. The method of claim 16, wherein the memory operation is one of a read command, a write command, or a probe.
23. The distributed shared memory system of claim 22, wherein the distributed shared memory management chip of each node of the plurality of nodes is further operative to:
if the set of contiguous most-significant bits of a given binary address equal the logical node identifier of the given node, change the set of contiguous most-significant bits of the given binary address to the predetermined value; and
if the set of contiguous most-significant bits of the given binary address equal the predetermined value, change the set of contiguous most-significant bits of the given binary address to the logical node identifier of the given node.
The DstLNID field identifies the packet's destination node. This is the connection identifier (i.e., remote LNID) at the source node. This field is 16 bits wide.
In particular embodiments, the DSM system uses a software data structure called the connection control block (CCB), stored in local memory such as the local main memory shown in
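Although the figure referenced above is not reproduced here, the claims indicate that a CCB entry records at least the local node's LNID for the connection and the peer's switch fabric address. A minimal sketch of such a table, with all names and the validity flag purely assumed, might look like this:

```c
#include <stdint.h>

#define MAX_LNID 65536                      /* the LNID fields are 16 bits wide */

struct ccb_entry {
    uint16_t local_lnid;                    /* LNID assigned to this node on the connection */
    uint16_t peer_fabric_addr;              /* switch fabric address of the peer node */
    uint8_t  valid;                         /* assumed: set once the connection is established */
};

/* One table per node, held in local memory and indexed by the peer's LNID. */
static struct ccb_entry ccb_table[MAX_LNID];
```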
For an RDP connection between a pair of nodes, the node at each end uses an LNID to refer to the node at the other end. Within a multi-node virtual server (VS), every node is assigned a unique LNID, possibly by some management entity for the DSM system. For example, within a three-node VS, the LNID values might be 0, 1, and 2, or 1, 3, and 4, i.e., they need not increment sequentially from 0. In addition, every server (multi-node virtual server or standalone server) assigns a unique LNID to each node that communicates with it. For example, a standalone server node that communicates with the virtual server described above might be assigned an LNID value of 16 by the VS. If that same node communicates with another server, it may be assigned the same LNID or a different LNID by that server. Therefore, LNID assignments are unique from the standpoint of a given server, but they are not unique across servers.
An example of LNID assignments is shown in
Table 7.2 shows the SrcLNID and DstLNID values used in the headers of RDP packets exchanged between different node pairs. For example, VS nodes A0 and A1 both belong to virtual server A, so a packet from A0 to A1 will have a SrcLNID value of 0 (LNID assigned to A0 by VS A), and a DstLNID value of 1 (LNID assigned to A1 by VS A). As another example, a packet from A1 to I/O server D will have a SrcLNID value of 2 (LNID assigned to A1 by I/O server D) and a DstLNID value of 16 (LNID assigned by VS A to I/O server D).
As indicated earlier, the DSM system also uses LNIDs in its memory-addressing scheme. In particular embodiments, the physical memory address width is 40 bits (e.g., in DSM systems that use the present generation of Opteron CPUs), though it will be appreciated that there are numerous other suitable widths.
In particular embodiments of the DSM system, the physical address space for a virtual server is arranged so that the local node's memory always starts at address 0 (zero). One reason for using this arrangement is compatibility with legacy system software, in particular embodiments. Specifically, with local memory starting at address 0, system software (e.g., boot code) accesses local memory the same way that it does on a standard server. Another reason for using this arrangement is that it simplifies the address lookup in the CMM. For a memory read/write request from a local processor, an address in the lower 1/16th or 1/256th segment of the 40-bit address space is always local and all other addresses map to memory in other nodes.
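As a rough illustration of that layout (the 40-bit width comes from the passage above; a 16-node system with a 4-bit LNID field is assumed, and a 256-node system would use 8 bits instead), the address decomposition might be expressed as:

```c
#include <stdint.h>

#define ADDR_BITS   40
#define LNID_BITS   4                               /* assumed: 16-node system */
#define LNID_SHIFT  (ADDR_BITS - LNID_BITS)         /* LNID sits in the most-significant bits */

static inline uint64_t addr_lnid(uint64_t addr)     /* which node's segment the address falls in */
{
    return (addr >> LNID_SHIFT) & (((uint64_t)1 << LNID_BITS) - 1);
}

static inline uint64_t addr_offset(uint64_t addr)   /* offset within that node's 64 GB segment */
{
    return addr & (((uint64_t)1 << LNID_SHIFT) - 1);
}

/* Example: 0x1000001000 decomposes into LNID 1 and offset 0x1000, i.e. an address
 * in node 1's segment; any address with LNID 0 is local under this arrangement. */
```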
To see how the arrangement works, consider the example of a virtual server consisting of three nodes: 0, 1, and 2. In a 16-node DSM system, the total addressable memory space for this virtual server would be 1 terabyte (2^40 bytes) and each node would be allocated a segment which is 1/16 of that space (64 GB, or 2^36 bytes). From a global view, the first 64 GB segment of the physical address space starting at address 0 would be allocated to node 0 (i.e., the node whose LNID equals 0), the next 64 GB segment to node 1, and the following segment to node 2. The remaining 13 segments would be unused since LNIDs 3-15 are not used.
It will be appreciated that in order to accomplish this arrangement, the locations of the local segment and the node 0 segment are swapped in the address map. And since MY_LNID, as defined above, is the LNID assigned to the local node, this is equivalent to swapping MY_LNID with LNID 0 in the address map. However, such a swapping would create confusion in the DSM system if it were applied to memory traffic leaving the node over the switched fabric. Therefore, the node's CMM reverses the swapping for traffic leaving the node.
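Continuing the hedged remap_address sketch given after claim 21, with MY_LNID equal to 2 and the predetermined value equal to zero (both assumptions), the swap and its reversal behave as follows:

```c
/* From node 2's point of view (my_lnid == 2, predetermined value == 0):
 *
 *   remap_address(0x2000001000, 2) == 0x0000001000   node 2's global segment is presented
 *                                                    locally as the segment starting at 0
 *   remap_address(0x0000001000, 2) == 0x2000001000   and the reverse mapping restores the
 *                                                    global LNID before traffic leaves the node
 *   remap_address(0x1000001000, 2) == 0x1000001000   other nodes' segments pass through unchanged
 */
```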
Particular embodiments of the above-described processes might be comprised of instructions that are stored on storage media. The instructions might be retrieved and executed by a processing system. The instructions are operational when executed by the processing system to direct the processing system to operate in accord with the present invention. Some examples of instructions are software, program code, firmware, and microcode. Some examples of storage media are memory devices, tape, disks, integrated circuits, and servers. The term “processing system” refers to a single processing device or a group of inter-operational processing devices. Some examples of processing devices are integrated circuits and logic circuitry. Those skilled in the art are familiar with instructions, storage media, and processing systems.
Those skilled in the art will appreciate variations of the above-described embodiments that fall within the scope of the invention. In this regard, it will be appreciated that there are many other possible orderings of the steps in the processes described above and many other possible modularizations of those orderings. Also, it will be appreciated that the above processes relating to memory-addressing will work with physical memory addresses that exceed 40 bits in width and DSM systems that have more than 256 nodes. Further, it will be appreciated that the DSM system will work with nodes whose CPUs are not Opterons having a ccHT bus. As a result, the invention is not limited to the specific examples and illustrations discussed above, but only by the following claims and their equivalents.
Inventors: Akkawi, Isam; Krakirian, Shahe Hagop