data is collected by an active node from passive nodes. A source node extracts the data format, a remote memory blade identification (ID), a remote memory blade address, and ranges of a remote machine memory address (RMMA) space, and composes and sends metadata to receiving nodes and receiving racks.
1. In an optically-connected memory (OCM) system, a method for a memory switching protocol, comprising:
collecting data by an active node from passive nodes and storing data according to receiving nodes,
extracting, by a source node, a data format and a remote memory blade identification (ID), a remote memory blade address, and ranges of a remote machine memory address (RMMA) space, and composing and sending metadata to the receiving nodes and receiving racks,
unmapping the RMMA space by the source node, thereby removing all entries in page tables that map a linear address to the remote RMMA space, wherein system memory address (SMA) space that is associated with alternative RMMA space is freed,
allocating memory from one of memory blades upon receiving a memory request from the one of a plurality of processors, wherein a circuit is established with the memory for the one of the plurality of processors, an address space of the memory sent to the one of the plurality of processors,
mapping the address space of the memory to the SMA space upon the one of the plurality of processors receiving the address space of the memory, wherein entries of the page table corresponding to the address space are created,
retaining a remote memory superpage of the memory in a memory blade when reading the remote memory superpage of the memory into the one of the plurality of processors, and
transferring a physical memory address space from the one of the plurality of processors to an alternative one of the plurality of processors.
12. An optically-connected memory (OCM) system for a memory switching protocol, comprising:
a plurality of nodes including at least a source node, passive nodes, and receiving nodes,
at least one processor device in communication with each of the plurality of nodes and operable in a computing storage environment, wherein the at least one processor device performs each of:
collects data by an active node from passive nodes and stores data according to receiving nodes,
extracts, by the source node, a data format and a remote memory blade identification (ID), a remote memory blade address, and ranges of a remote machine memory address (RMMA) space, and composes and sends metadata to the receiving nodes and receiving racks,
unmaps the RMMA space by the source node, thereby removing all entries in page tables that map a linear address to a remote RMMA space, wherein system memory address (SMA) space that is associated with alternative RMMA space is freed,
allocates memory from one of memory blades upon receiving a memory request from the one of a plurality of processors, wherein a circuit is established with the memory for the one of the plurality of processors, the address space of the memory sent to the one of the plurality of processors,
maps the address space of the memory to the SMA space upon the one of the plurality of processors receiving the address space of the memory, wherein entries of the page table corresponding to the address space are created,
retains a remote memory superpage of the memory in the memory blade when reading the remote memory superpage of the memory into the one of the plurality of processors, and
transfers a physical memory address space from the one of the plurality of processors to an alternative one of the plurality of processors.
23. In an optically-connected memory (OCM) system, for a memory switching protocol, a computer program product in a computing environment using a processor device, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
a first executable portion that collects data by an active node from passive nodes and stores data according to receiving nodes; and
a second executable portion that extracts, by a source node, a data format and a remote memory blade identification (ID), a remote memory blade address, and ranges of a remote machine memory address (RMMA) space, and composes and sends metadata to the receiving nodes and receiving racks,
unmaps the RMMA space by the source node, thereby removing all entries in page tables that map a linear address to a remote RMMA space, wherein system memory address (SMA) space that is associated with alternative RMMA space is freed,
allocates memory from one of memory blades upon receiving a memory request from the one of a plurality of processors, wherein a circuit is established with the memory for the one of the plurality of processors, the address space of the memory sent to the one of the plurality of processors,
maps the address space of the memory to the SMA space upon the one of the plurality of processors receiving the address space of the memory, wherein entries of the page table corresponding to the address space are created,
retains a remote memory superpage of the memory in the memory blade when reading the remote memory superpage of the memory into the one of the plurality of processors, and
transfers a physical memory address space from the one of the plurality of processors to an alternative one of the plurality of processors.
2. The method of
3. The method of
4. The method of
5. The method of
receiving the metadata sent by the source node,
grafting the RMMA space onto an available system memory address (SMA) space of the one of the plurality of processors and entering the RMMA space into mapping tables,
if a circuit with a remote memory blade does not already exist, setting up a circuit with the remote memory blade,
reading the data by the active node and sending the data to passive processor nodes via an intra-rack edge switch, wherein the data at the RMMA space retains routing information for different portions of the data, and
upon transmission of all the data in the RMMA space, performing, by the active node, one of relinquishing the memory to a memory manager at the memory blades and reusing the memory with new data arriving for a subsequent operation for dynamically switching the memory.
6. The method of
7. The method of
8. The method of
9. The method of
10. The method of
11. The method of
sending the metadata and a switching request to the one of the plurality of processors,
adjusting a remote machine memory address (RMMA) space and flushing a translation look-aside buffer (TLB),
performing an unmapping operation for removing entries in the page tables that map a linear address to the RMMA space,
grafting the RMMA space onto an available system memory address (SMA) space of the one of the plurality of processors and regenerating the mapping table,
dynamically switching the memory from an active node of the one of the plurality of processors to a remote active node of the alternative one of the plurality of processors, and
using the page tables by a receiving one of the plurality of processors for instantaneous access to the data.
13. The system of
14. The system of
at least one processor blade,
at least one memory blade, remotely separated from the at least one processor blade,
an optical plane,
the at least one processor device arranged in one of the at least one processor blade and the at least one memory blade and in communication with the optical plane,
a translation look-aside buffer (TLB) in communication with the at least one processor device and the optical plane,
at least one memory in the at least one memory blade, and
an optical switching fabric communicatively coupled between the at least one processor blade and the at least one memory blade and in communication with the at least one processor device.
15. The system of
16. The system of
receiving the metadata sent by the source node,
grafting the RMMA space onto an available system memory address (SMA) space of the one of the plurality of processors and entering the RMMA space into mapping tables,
if a circuit with a remote memory blade does not already exist, setting up a circuit with the remote memory blade,
reading the data by the active node and sending the data to passive processor nodes via an intra-rack edge switch, wherein the data at the RMMA space retains routing information for different portions of the data, and
upon transmission of all the data in the RMMA space, performing, by the active node, one of relinquishing the memory to a memory manager at the memory blades and reusing the memory with new data arriving for a subsequent operation for dynamically switching the memory.
17. The system of
18. The system of
19. The system of
20. The system of
21. The system of
22. The system of
sending the metadata and a switching request to the one of the plurality of processors,
adjusting a remote machine memory address (RMMA) space and flushing the translation look-aside buffer (TLB),
performing an unmapping operation for removing entries in the page tables that map a linear address to the RMMA space,
grafting the RMMA space onto an available system memory address (SMA) space of the one of the plurality of processors and regenerating the mapping table,
dynamically switching the memory from an active node of the one of the plurality of processors to a remote active node of the alternative one of the plurality of processors, and
using the page tables by a receiving one of the plurality of processors for instantaneous access to the data.
24. The computer program product of
25. The computer program product of
26. The computer program product of
27. The computer program product of
receiving the metadata sent by the source node,
grafting the RMMA space onto an available system memory address (SMA) space of the one of the plurality of processors and entering the RMMA space into mapping tables,
if a circuit with a remote memory blade does not already exist, setting up a circuit with the remote memory blade,
reading the data by the active node and sending the data to passive processor nodes via an intra-rack edge switch, wherein the data at the RMMA space retains routing information for different portions of the data, and
upon transmission of all the data in the RMMA space, performing, by the active node, one of relinquishing the memory to a memory manager at the memory blades and reusing the memory with new data arriving for a subsequent operation for dynamically switching the memory.
28. The computer program product of
29. The computer program product of
30. The computer program product of
31. The computer program product of
32. The computer program product of
33. The computer program product of
sending the metadata and a switching request to the one of the plurality of processors,
adjusting a remote machine memory address (RMMA) space and flushing a translation look-aside buffer (TLB),
performing an unmapping operation for removing entries in the page tables that map a linear address to the RMMA space,
grafting the RMMA space onto an available system memory address (SMA) space of the one of the plurality of processors and regenerating the mapping table,
dynamically switching the memory from an active node of the one of the plurality of processors to a remote active node of the alternative one of the plurality of processors, and
using the page tables by a receiving one of the plurality of processors for instantaneous access to the data.
This Application is a Continuation of U.S. patent application Ser. No. 14/822,615, filed on Aug. 10, 2015, which is a Continuation of U.S. patent application Ser. No. 13/760,942, filed on Feb. 6, 2013, now U.S. Pat. No. 9,110,818, which is a Continuation of U.S. patent application Ser. No. 13/446,931, filed on Apr. 13, 2012, now U.S. Pat. No. 8,954,698.
The present invention relates generally to computer systems, and more particularly to a memory switching protocol when switching optically-connected memory.
In today's society, computer systems are commonplace. Computer systems may be found in the workplace, at home, or at school. Computer systems may include data storage systems, or disk storage systems, to process and store data. Recent trends in hardware and software systems introduce a memory capacity wall. With the continual increase in the number of central processing unit (CPU) cores within a chip, the increased processing capacity per socket demands an increase in memory size to support an increased OS footprint, high data volume, an increased number of virtual machines (VMs), etc. The rate of growth of per-socket memory capacity reveals that the supply of memory capacity fails to remain at par with the demand, leading to a loss of efficiency within the computing environment.
Recent trends in processor and memory systems in large-scale computing systems reveal a new “memory wall” that prompts investigation of alternate main memory organizations separating main memory from processors and arranging them in separate ensembles. Multi-core trends in processor configurations incorporate an increasing number of central processing unit (CPU) cores within a chip, thus increasing the compute capacity per socket. Such an increase in processing capacity demands a proportional increase in memory capacity. Also, operating systems and emerging applications (in-memory databases, stream processing, search engines, etc.) require increasing volumes of memory due to increased operating system (OS) footprint and application data volume, respectively. In a virtualized system, an increase in per-chip core counts implies the placement of an increasing number of virtual machines (VMs) within a processor chip. Each of these factors demands an increase in memory supplies at the chip level. However, projections on the rate of growth of memory capacity per socket reveal that the supply of memory capacity fails to remain at par with the demand. Therefore, a need exists for an optical interconnection fabric that acts as a bridge between processors and memory using a memory-switching protocol that transfers data across processors without physically moving (e.g., copying) the data across electrical switches. A need exists for allowing large-scale data communication across processors through transfer of a few tiny blocks of meta-data while supporting communication patterns prevalent in large-scale scientific and data management applications.
Accordingly, and in view of the foregoing, various exemplary method, system, and computer program product embodiments for a memory switching protocol when switching optically-connected memory in a computing environment are provided. In one embodiment, by way of example only, in an optically-connected memory (OCM) system, data is collected by an active node from passive nodes. A source node extracts the data format, a remote memory blade identification (ID), a remote memory blade address, and ranges of a remote machine memory address (RMMA) space, and composes and sends metadata to receiving nodes and receiving racks.
In addition to the foregoing exemplary method embodiment, other exemplary system and computer product embodiments are provided and supply related advantages. The foregoing summary has been provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
In order that the advantages of the invention will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings. Understanding that these drawings depict embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:
In modern computer systems, memory modules may be tightly coupled with central processing unit (CPU) cores through the system bus. Such a co-location of memory and processors precludes a few performance optimizations, for example, memory consolidation among a group of servers (processors), decoupling of processor and memory failures, etc. Also, temporal variations may exist in the usage of CPU and memory resources within and across applications. Such variations may be attributed to the data characteristics, variations in workload and traffic patterns, and component (e.g., core) failures. Provisioning the system for the worst-case memory consumption might result in underutilization of the resources, as the peak memory consumption is an order of magnitude higher than the average or low-baseline usage. Therefore, a new architectural structure and solution for allowing transparent memory capacity expansion and shrinking across the servers is provided.
In traditional systems, failure of either a processor or a memory module connected to the processor renders the resources (CPU and memory) within the ensemble unavailable. This situation increases the downtime of both the processor and the memories. With the increase in memory capacity, the server blades contain a higher number of memory modules, which results in higher Failure-in-Time (FIT) rates. Such frequent outages of the server ensemble limit the utilization of the system resources. Other challenges exist in large-scale data-center-like systems, including issues such as maintaining large bisection bandwidth, scaling to a large number of nodes, energy efficiency, etc. In one embodiment, a large-scale system based on a separation of processors and memory is provided. In one embodiment, memory is optically connected, enabling a processor blade to allocate memory within a memory blade residing in any memory rack; such an allocation is feasible through creating a circuit-switched connection, through optical communication fabrics, between the processor and memory blades. Designing a large system using existing high-radix switches while ensuring high bisection bandwidth (i.e., with a limit on the over-subscription ratio) is challenging, yet no system currently provides a solution that scales to a large number of nodes given the existing commodity switches and a large bisection-bandwidth requirement.
Moreover, in a modern large-scale data center, many data-intensive applications require large-volume data exchange among the nodes. These applications exhibit various data-transfer patterns: one-to-one, one-to-all (broadcasting), all-to-all (e.g., MapReduce, database joins, Fast Fourier Transform “FFT”), etc. The MapReduce communication pattern needs to shuffle a large volume of data, while in a stream processing application large volumes of data are collected at the stream sources that are dispersed within the data center and sent to the nodes carrying out the actual processing. In a virtualized data center, various management workloads (e.g., VM patching, after-hours maintenance, automated load balancing through live migration, etc.) demand significant network bandwidth in addition to that of the traditional applications running on the system. The management workload increases as the data center scales up and new features (e.g., high availability, recovery) become commonplace. Therefore, in a large-scale data center, the communication bandwidth among the nodes becomes the primary bottleneck.
Therefore, the mechanisms of the illustrated embodiments seek to address these factors, as listed above, by providing a large-scale system based on a separation of processors and memory. In one embodiment, memory is optically connected for enabling a processor blade to allocate memory within a memory blade residing in any memory rack; such an allocation is feasible through creating a circuit switched connection, through optical communication fabrics, between the processor and memory blades. An optically attached memory system segregates memory from the processors and connects the two subsystems through optical fabrics. In addition to consolidating memory, improving the capacity per core, and decoupling processor and memory failures, the illustrated embodiments provide for transferring a large volume of data among the processors through memory switching via a transparent data transfer across the nodes and provide protocols for switching memory across the processors.
In one embodiment, various communication patterns with memory switching in a large-scale system are supported. Such patterns (i.e., all-to-all communication and broadcast) are performance critical for supporting a wide range of applications (e.g., Fast Fourier Transform “FFT”, database joins, image correlation, video monitoring, etc.) in a massively parallel system. Considering the spatial distribution of data involved in switching, the illustrated embodiments offer at least two types of memory switching: gathering and non-gathering. Gathering-type memory switching involves active nodes, which are the end points of memory switching activities and directly transfer (switch) a chunk of data to a remote active node. Non-gathering-type memory switching includes each processor within an active node contributing a small fraction of the total data, and thus the data should be stored in a different memory space before switching to another active node.
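The two switching types described above can be sketched in code; the function and field names below are illustrative assumptions, not terminology from the specification, and Python dictionaries merely stand in for optically switched memory state.

```python
# Illustrative sketch of gathering vs. non-gathering memory switching.
# All names are hypothetical; the real protocol operates on optically
# switched memory blades, not Python dictionaries.

def switch_memory(active_node, remote_node, contributions, gathering):
    """Deliver a chunk of data to a remote active node.

    gathering=True: the active node is the end point and directly
    switches the chunk it already holds. gathering=False: each processor
    in the active node contributes a small fraction of the total data,
    so the fractions are first stored in a different memory space
    (staging) before switching to the other active node.
    """
    if gathering:
        chunk = list(active_node["chunk"])    # full chunk held locally
    else:
        chunk = []                            # separate staging space
        for fraction in contributions:        # one share per processor
            chunk.extend(fraction)
    remote_node["received"] = chunk           # the switched data
    return chunk
```

The staging step is what makes non-gathering switching slower for large data, which motivates the specialized multi-transceiver active nodes discussed later.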
For a certain class of applications without any barrier synchronization requirements, optically switching memory reduces communication overhead by allowing the receiving node to resume the processing within minimal wait time. This is due to the avoidance of explicit data movement over the network, and of memory allocation or storage overhead at the receiving node.
In one embodiment, a solution is depicted for accessing memory in a less complex and more efficient (e.g., faster) implementation than using an Ethernet network/protocol to access or change remote memory via a data-center network. In other words, a more efficient implementation architecture is provided for a memory controller to access a dual in-line memory module (DIMM) in a processor blade and a memory blade. In one embodiment, for example, the protocol for the memory controller (part of the CPU/processor chip) to communicate with the memory module (a DIMM, which may be made of several memory “chips” and control mechanisms) is more efficient than using an Ethernet network/protocol to access or change remote memory via a data-center network. (A DIMM-to-memory-controller connection may be referred to as point-to-point.) The Ethernet network/protocol is more complex, since a generic network needs to handle the multiple needs of general communication, as compared to the simple point-to-point connection between memory and processor with a memory controller. Thus, in one embodiment, the need for unification/generalization of doing everything with one network is reduced and/or eliminated; in the alternative, components of the computing system are optically (and/or electrically) connected as close as possible to the memory controller that is connected to the DIMMs via the optically connected memory.
In one embodiment, a circuit switch is used (e.g., an optical switch, though it may be electrical in principle) to reconfigure/change/adapt the memory as needed on a longer time scale. Moreover, additional functionality is added while maintaining the simplicity of the point-to-point connection by acting on a larger scale (a “memory bank”). The memory bank may connect to more than one processor in the data center at a time. Because the granularity of managing multiple access to remote memory is coarse (e.g., a granularity size of a gigabyte rather than per byte), once a memory section is associated with some remote processor through the optical circuit switch (or an electrical switch), finer granularity is achieved because a particular memory section may be accessed by only one processor (or processor complex). The optically connected memory switching only allows the specific processor to access the particular memory section. Other resources trying to gain access must use the specific processor that has access to the memory section, if a need arises for using the memory space by another resource. The granularity may be fine if the memory DIMM is located and connected to the memory controller of that particular processor. In one embodiment, all the functionality of the optically connected memory (including coherency, such as in symmetric multiprocessing “SMP”) will be managed by a particular processor. As such, low latency is maintained, and latencies are kept as close as possible to the speed of light traveling the distance between the memory blade and the processors (versus the typically much larger latency of, say, Ethernet switches, with round trips of a few microseconds). For example, at 5 nanoseconds per meter of fiber, a 30-meter data-center distance yields a 300-nanosecond round trip.
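The latency figure at the end of the paragraph can be checked with a one-line calculation; the 5 ns/m constant comes from the text above, while the helper name is ours:

```python
# Fiber propagation delay sketch: ~5 ns per meter, per the text above.
NS_PER_METER = 5

def round_trip_ns(one_way_meters):
    """Round-trip propagation delay over fiber, in nanoseconds."""
    return 2 * one_way_meters * NS_PER_METER

# A 30 m processor-to-memory-blade distance gives a 60 m round trip,
# i.e. 300 ns, versus a few microseconds for an Ethernet-switch round trip.
```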
In one embodiment, the data transfer overhead is decoupled from the volume of data that is to be transferred. Data is retained in the memory blades, and instead of sending the data to the remote processor blade, meta-data is sent to the remote processor blade. The meta-data contains the information necessary to regenerate the mapping/page tables at the remote processor side. The remote processor instantly starts to access the data, thereby eliminating a long wait time. Thus, at the remote processor side, it appears as if a large volume of remote memory space has just been allocated. In such a memory switching process, the overhead of transferring meta-data and regenerating the mapping tables is low. An all-to-all communication pattern across a group of processors is supported, and such communication patterns are critical to attaining high performance for a wide range of applications in a large-scale system.
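A minimal sketch of this metadata-only transfer might look like the following; the field names are hypothetical, since the specification does not fix a wire format:

```python
# Hypothetical sketch: the source sends a few tiny blocks of meta-data,
# and the receiver regenerates its mapping/page tables from them.
from dataclasses import dataclass

@dataclass
class SwitchMetadata:
    data_format: str
    blade_id: int                 # remote memory blade ID
    blade_address: int            # remote memory blade address
    rmma_ranges: list             # [(start, length), ...] in RMMA space

def regenerate_mapping(metadata, free_sma_base):
    """Map each RMMA range into the receiver's free SMA space.

    Returns page-table-like entries; the remote processor can then
    access the data immediately, without any bulk copy.
    """
    page_table = {}
    sma = free_sma_base
    for start, length in metadata.rmma_ranges:
        page_table[sma] = (metadata.blade_id, start, length)
        sma += length
    return page_table
```

The cost of the switch here is proportional to the number of RMMA ranges, not to the volume of data they describe, which is the point of the paragraph above.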
It should be noted that the dynamic switching of memories allows for achieving high-volume data transfer and memory consolidation across the processors, blades, racks, etc. In one embodiment, the processors access remote memories through independent circuits (i.e., circuit-switched networks) established between the processors and the remote memory blades. A processor can have multiple remote channels through which it may access multiple remote memories. The processor(s) may tear down (e.g., disconnect) a circuit to a memory blade (module) and then signal another processor to establish a channel (i.e., circuit) with the memory blade. Thus, the latter processor gains access to the data and physical memory space available in the remote memory blade. Each processor may also have, in addition, a local memory pool. The granularity (size) of a remote memory chunk allows for balancing cost and efficiency.
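The teardown-then-signal hand-off described above can be sketched as follows; the class and its methods are illustrative assumptions, not an interface from the specification:

```python
# Hypothetical sketch of circuit hand-off between processors: the owner
# tears down its circuit to a memory blade, then another processor
# establishes its own circuit and gains access to the data in place.
class CircuitSwitch:
    def __init__(self):
        self.circuits = {}            # blade_id -> owning processor_id

    def establish(self, processor_id, blade_id):
        if blade_id in self.circuits:
            raise RuntimeError("memory blade already circuit-connected")
        self.circuits[blade_id] = processor_id

    def teardown(self, processor_id, blade_id):
        if self.circuits.get(blade_id) != processor_id:
            raise RuntimeError("processor does not own this circuit")
        del self.circuits[blade_id]

    def hand_off(self, src_processor, dst_processor, blade_id):
        """Disconnect src, then signal dst to establish a channel."""
        self.teardown(src_processor, blade_id)
        self.establish(dst_processor, blade_id)
```

The single-owner invariant enforced in `establish` mirrors the text's point that a memory section, once circuit-connected, is accessible only through the owning processor.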
Moreover, as mentioned previously, the illustrated embodiments may be applied in a large system, where allowing full all-to-all switching capability would increase the complexity of the optical switches. Only a small number of nodes (active nodes) within a rack will switch memory; the rest of the nodes (passive nodes) send data to the active nodes, which store the data in memory and switch it to the remote node (in a remote rack). The passive nodes communicate with the active ones through local or low-overhead switches. Active nodes bypass the electrical router (or switches) by exchanging data through switchable memories using circuit switching. In the case of memory gathering (before switching memories), the delay increases with data size, and thus a set of specialized active nodes with multiple transceivers or channels may be used for transferring parallel data to the remote memory blade. In an intra-memory-blade data transfer, the address space across the processor blades that share the same memory blade is adjusted. It should be noted that in one embodiment, the dynamic switching of memory may be performed through an electrical switching fabric (and/or network) using a communication pattern to transfer memory space in the memory blades from one processor to an alternative processor in the processor blades without physically copying data in the memory to the processors.
Turning now to
Furthermore, the physical properties of these external links may require the use of multiple optical wavelengths in a WDM (wavelength division multiplexer), which are all coupled into one fiber or one external link, but are separable at both ends. The mirror-based micro electro mechanical system “MEMS” optical circuit switch “OCS” will deflect in the optics domain, the light beams within these external links, regardless of their number of wavelength, protocol, and signaling speed. These external links are common to all memory blades and processor blades.
It should be noted that at least one optical circuit switch is shared between the optical external links. Also, several independent circuits may be established between the processors and the memory blades sharing the optical circuit switch. These external links are made for optimizing a point-to-point connection at very high bandwidth. This optimization may be in the physical implementation used or in the protocol chosen to facilitate such high bandwidth, and has the ability to support aggregation of multiple streams within one physical link, or multiple physical links made to look like one high-bandwidth physical link. Since these external links are circuit switched, via an all-optical switch that will not be aware of the protocol, data, or content, they should use a very lightweight communication protocol. These external links are common to all processors, blades, memory, and independent circuits, such that any memory blade/processor blade may pass information on one or all of these external links, either directly or by passing through the interconnected processor blades. In one exemplary embodiment, circuit-switching switches are used. Circuit-switching switches do not need to switch frequently, and thus may be much simpler to build, and can use different technologies (e.g., all-optical, MEMS-mirror-based) to dynamically connect between the circuits, memory, and processor blades.
These types of external links (not shown) and dynamic switching enable very high-throughput (high-bandwidth) connectivity that dynamically changes as needed. As multi-core processing chips require very high-bandwidth networks to interconnect them to other such physical processing nodes or memory subsystems, the exemplary optically-connected memory architecture plays a vital role in providing a solution that is functionally enabled by the memory switching operations.
The optically connected memory architecture 200 engenders numerous benefits: (a) transparent memory capacity changes across the system nodes, (b) elimination of the notion of worst-case provisioning of memories, allowing the applications to vary memory footprints depending on the workloads, and (c) decoupling of the CPU downtime from the memory module failure, thus increasing the CPU availability. As will be described below in other embodiments, an architecture for memory management techniques is provided. As shown in
Turning now to
In an optically connected memory system (see
The processor blade (as shown with components 306, 308, and 310a-n in
The block size for the remote memory (for example a remote memory page) is an order of magnitude larger than that for the local memory. Therefore the table (e.g., a remote memory page table) mapping the SMA (shown in 302 and 304 of
In an optically connected memory (OCM)-based system (as seen in
To further illustrate the memory switching protocol, consider the following. At the source rack/node side, the metadata and a switching request are sent to the destination rack/node. A remote machine memory address (RMMA) space is adjusted and a translation look-aside buffer (TLB) is flushed. The circuit may be disconnected (e.g., torn down) if necessary. At the destination side, the metadata is received and a circuit may be set up (if a circuit does not already exist on the destination side). The RMMA space is grafted (e.g., joined) onto the available SMA space and the mapping table (for the remote memory) is regenerated. For switching the memory data, one or more of the following options may be employed: 1) switch the memory data, 2) gather or collect the memory data at a different module and then switch, and/or 3) move the redundant data to a different memory module.
In one embodiment, address space management is employed. While switching memories by signaling a remote processor node, the remote processor node should receive the address space of the remote memory blade (i.e., RMMA space) that contains the data to be transferred. Upon receiving the RMMA space, the receiving processor should map the RMMA space to the free portion of its SMA space, and create the page table entries corresponding to the received RMMA space. Thus, from the receiving processor's perspective, the process is similar to physical (remote) memory allocation. Here, the optical plane supplies a fixed set of remote memory superpages, and the processor node assimilates the extra superpages by grafting the extra superpages into its SMA space, creating the necessary mapping within the page tables. Therefore, at the receiving processor side, the applications can transparently access the remote memory superpages. The memory controller at the remote memory blade observes, for the same data, the same RMMA as in the source processor node. A processor device (e.g., CPU) may be used to assist the memory controller in performing any of the required features of the illustrated embodiments relating to the memory on the memory blades.
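The grafting of received RMMA superpages into free SMA space, and the corresponding page-table maintenance, can be sketched as follows. This is a minimal illustration under assumptions of this sketch: the superpage size, the dict-based page table, and the free-list bookkeeping are invented here and are not the patented implementation.

```python
# Hypothetical sketch of RMMA-to-SMA grafting at a receiving processor node.
SUPERPAGE = 16 * 1024 * 1024          # assumed remote superpage size (16 MB)

class ProcessorNode:
    def __init__(self, sma_size):
        self.sma_size = sma_size
        self.page_table = {}          # SMA superpage slot -> (blade_id, rmma)
        self.free_sma = list(range(sma_size // SUPERPAGE))  # free SMA slots

    def graft(self, blade_id, rmma_ranges):
        """Map received RMMA superpages into free SMA space (like allocation)."""
        mapped = []
        for rmma in rmma_ranges:
            sma_slot = self.free_sma.pop(0)       # take a free SMA superpage
            self.page_table[sma_slot] = (blade_id, rmma)
            mapped.append(sma_slot)
        return mapped

    def unmap(self, sma_slots):
        """Remove page-table entries; the SMA slots become free again."""
        for slot in sma_slots:
            del self.page_table[slot]
            self.free_sma.append(slot)

node = ProcessorNode(sma_size=1024 * 1024 * 1024)  # 1 GB SMA space
slots = node.graft(blade_id=7, rmma_ranges=[0x0000000, 0x1000000])
# slots now index the SMA superpages through which applications can
# transparently access the remote superpages held on blade 7
```

As in the text, only the mapping is maintained at the processor side; the physical allocation remains with the memory blade, so `unmap` merely frees SMA slots for association with different RMMA space.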
Turning now to
Having outlined the details of the system, a switching protocol that transfers a volume of remote memory address space from one processor to another processor is used. At the source (i.e., the sending active node) side, the active node collects or gathers data from the passive nodes and arranges and stores the data according to the destination nodes and racks. To switch memory space to a remote active node, the sender node extracts the data format and the RMMA details (ranges and remote memory blade ID/address), composes the metadata, and sends the metadata to the destination active node. The source node unmaps the remote RMMA space. Such an unmapping operation removes all the entries in the page tables mapping the linear address to the remote RMMA space; thus, the relevant SMA space, which can be associated with different RMMA space, is freed. Such operation sequences are similar to those of memory deallocation. The difference is that such mapping and unmapping operations only maintain the page tables and data structures (e.g., a buddy tree) to do the SMA-RMMA mapping, and manage the free (unmapped) segments within the SMA space; the actual physical allocation/deallocation is managed at the memory blade/rack side. The unmapping operation invalidates the cache and also cleans up the TLB. The source node tears down (e.g., disconnects) the circuit to the remote memory blade if the circuit is no longer necessary. At the receiving side, the active node receives the metadata and grafts the extra memory (i.e., the supplied RMMA) into the mapping tables, emulating the process of memory allocation at the processor side. If a circuit with the remote memory blade does not already exist, the active node then sets up the circuit with the remote memory blade. The active node now reads the remote data and sends the data to the passive processor nodes via an intra-rack/edge switch.
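The source-side and destination-side steps of the protocol above can be sketched end to end. The Metadata fields mirror those named in the text (data format, blade ID/address, RMMA ranges), while the class layout and the circuit/mapping bookkeeping are assumptions made only for this illustration.

```python
# Illustrative end-to-end memory-switch sequence between two active nodes.
from dataclasses import dataclass, field

@dataclass
class Metadata:
    data_format: str
    blade_id: int
    blade_addr: int
    rmma_ranges: list

@dataclass
class ActiveNode:
    name: str
    circuits: set = field(default_factory=set)   # blade IDs with live circuits
    mapping: dict = field(default_factory=dict)  # SMA slot -> RMMA range

    def send_switch(self, meta, dest):
        # Source side: send metadata, unmap the RMMA space (TLB flush implied),
        # and tear down the circuit if no longer needed.
        dest.receive_switch(meta)
        self.mapping = {s: r for s, r in self.mapping.items()
                        if r not in meta.rmma_ranges}
        self.circuits.discard(meta.blade_id)

    def receive_switch(self, meta):
        # Destination side: graft supplied RMMA into the mapping tables and
        # set up the circuit with the remote memory blade if absent.
        base = len(self.mapping)
        for i, rng in enumerate(meta.rmma_ranges):
            self.mapping[base + i] = rng
        self.circuits.add(meta.blade_id)

src = ActiveNode("src")
src.mapping = {0: (0x0000, 0x1000), 1: (0x1000, 0x2000)}
src.circuits = {3}
meta = Metadata("rows", blade_id=3, blade_addr=0x0,
                rmma_ranges=[(0x0000, 0x1000), (0x1000, 0x2000)])
dst = ActiveNode("dst")
src.send_switch(meta, dst)
```

After the call, the source's mapping and circuit are released while the destination holds both, emulating memory allocation at the receiver without physically copying the data.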
The remote data at the RMMA space keeps within itself the routing information (e.g., destination node) for different portions of the data. Upon transmission of all the data in the received RMMA space, the active node either deallocates the memory (relinquishing it to the memory manager at the memory blades) or reuses it (filling it with newly arriving data for subsequent memory switching).
In one embodiment, at least two communication patterns—all-to-all personalized communication and all-to-all/one-to-all broadcast—which are the performance-critical patterns in any large-scale parallel system, are supported. The illustrated embodiments support these communication patterns using optical memory switching in a large-scale system. Such communication patterns are widely observed in a large number of scientific and database applications.
In all-to-all personalized (AAP) communication, each of the processors sends distinct data to each of the remaining processors in the system. Such patterns (e.g., as used in torus and mesh networks) are often used in matrix or array transposition or parallel Fast Fourier Transform (FFT) operations. Support for more than one of the communication pattern operations (e.g., the all-to-all communication pattern) with memory switching is provided.
As illustrated above in
Turning now to
T_AAP(N_a) ≈ N_a·max(T_MS-I + T_R&S, T_MS-O + T_G&W) (1)
Here, an upper bound N′_p/N_i on the data imbalance across the participating processor nodes (N′_p) is assumed; data imbalance is expressed as the ratio of the maximum and minimum data volume transferred to/from a participating processor node (edge link) in the rack. T_X is the total time for operation X (X = MS-I, MS-O, R&S, G&W), and B_oc and B_e are the bandwidths of an optical channel and of a link in the edge switch, respectively. N_i is the total number of links (ports) connected to an active node i. The read-and-send and gather-and-write times are given as:
T_R&S ≈ S_data/min(n_ms-i·B_oc, N_i·B_e) (2)
T_G&W ≈ S_data/min(n_ms-o·B_oc, N_i·B_e) (3)
Here, n_ms-i and n_ms-o are the total numbers of switch-in and switch-out channels, respectively. S_data is the total data transfer volume (in a phase of AAP/AAB). n_proc is the total number of processors corresponding to a processor blade/node. B_inter is the total aggregate bandwidth of the inter-rack interconnection (core and/or aggregation switches).
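Equations (1)-(3) can be evaluated numerically as a rough illustration. All parameter values below (channel counts, bandwidths, switching times, data volume) are invented for this example and are not figures from the text.

```python
# Numeric sketch of the AAP timing model, Equations (1)-(3).

def t_read_send(s_data, n_ms_i, b_oc, n_i, b_e):
    """T_R&S, Eq. (2): read-and-send time."""
    return s_data / min(n_ms_i * b_oc, n_i * b_e)

def t_gather_write(s_data, n_ms_o, b_oc, n_i, b_e):
    """T_G&W, Eq. (3): gather-and-write time."""
    return s_data / min(n_ms_o * b_oc, n_i * b_e)

def t_aap(n_a, t_ms_i, t_ms_o, t_rs, t_gw):
    """T_AAP, Eq. (1): total time of an AAP phase over n_a active nodes."""
    return n_a * max(t_ms_i + t_rs, t_ms_o + t_gw)

# Example: 32 GB per phase, 4 switch-in and 4 switch-out optical channels.
s_data = 32 * 2**30                  # bytes
b_oc = 40e9 / 8                      # assumed 40 Gbps optical channel (bytes/s)
b_e = 10e9 / 8                       # 10 Gbps edge-switch link (bytes/s)
rs = t_read_send(s_data, 4, b_oc, 16, b_e)
gw = t_gather_write(s_data, 4, b_oc, 16, b_e)
total = t_aap(8, 0.01, 0.01, rs, gw) # assumed 10 ms switch-in/out latency
```

With these example values the optical channels and edge links supply equal aggregate bandwidth, so the read-and-send and gather-and-write times coincide.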
Turning now to
As mentioned above,
The previously discussed embodiments are illustrated in
In one embodiment, an all-to-all broadcast (AAB) communication pattern is supported. In the AAB communication pattern, each processor sends the same data to all other processors. Large-scale parallelization of applications, such as correlation detection in multimedia data and non-equijoins among large-volume database tables, requires broadcasting data across all the participating processors in the system. For example, computing correlation across two video or image streams requires, as with any high-dimensional data, each incoming frame or image within a stream to be compared against almost all the frames or images in the other stream. Thus, the incoming data should be broadcast to all the participating processors. The protocol for broadcast communication with memory switching is similar to that of all-to-all communication. In a broadcast communication phase, as shown in
T_AAB(N_a) = N_a·max(T_MS-I + T_R&S + T_MS-O) (4)
(See Table 1 in
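Since the max in Equation (4) is taken over a single expression, it reduces to the plain sum of the three terms; a minimal sketch, with invented example values, is:

```python
# Minimal sketch of the AAB broadcast time, Equation (4).

def t_aab(n_a, t_ms_i, t_rs, t_ms_o):
    """Total AAB time for n_a active nodes: N_a·(T_MS-I + T_R&S + T_MS-O)."""
    return n_a * (t_ms_i + t_rs + t_ms_o)

# Example: 4 active nodes, 10 ms switch-in/out latency, 1.5 s read-and-send.
total = t_aab(4, 0.01, 1.5, 0.01)
```

Unlike the AAP case, the switch-in, read-and-send, and switch-out phases here accumulate serially per active node rather than overlapping.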
Based upon the foregoing embodiments, two representative model applications, one from the scientific domain and one from the database domain, that use the communication patterns described above are presented. Upon execution of the two representative applications, the execution details in the two-level interconnection systems, as described in
In one embodiment, the three-dimensional Fast Fourier Transform (FFT) is applied, where the data is represented as points in a three-dimensional cube with data size N = n×n×n points. The three-dimensional FFT is computed by computing n² one-dimensional (1D) FFTs, one for each row of n elements, along each of the three dimensions. Due to scalability bottlenecks, a pencil method that divides two of the three dimensions of the data cube and allocates the elements across the processors (P = p×p) is used, as shown in
N_AAP = ⌈n_proc·N_p⌉ (5)
S_reorg = n_proc·N_p·S_pencil (6)
(See Table 1 in
S_switch ≈ S_reorg/N_AAP (7)
(See Table 1 in
T_comm ≈ 2⌈S_reorg/(N_AAP·M_max)⌉·T_AAP(N_AAP) (8)
T_comp ≈ 3K·t_c·N·log₂N (9)
Here, T_AAP(N_AAP) is the total time of an AAP communication phase with group size N_AAP, as given in Equation 1, t_c is the execution time per instruction, and K is a constant determined by observing the processor time of the FFT computation with varying data size and using a curve-fitting operation.
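The cost model of Equations (5)-(9) can be sketched as a small function. The brackets in Equations (5) and (8) are read here as ceilings, and every parameter value in the example call is an invented placeholder, not a figure from the text.

```python
import math

# Sketch of the 3D-FFT cost model, Equations (5)-(9).

def fft_cost_model(n, n_proc, n_p, s_pencil, m_max, t_aap_phase, k, t_c):
    big_n = n ** 3                                   # total data points N
    n_aap = math.ceil(n_proc * n_p)                  # Eq. (5): AAP group size
    s_reorg = n_proc * n_p * s_pencil                # Eq. (6): reorg volume
    s_switch = s_reorg / n_aap                       # Eq. (7): per-switch volume
    phases = math.ceil(s_reorg / (n_aap * m_max))    # Eq. (8): phase count
    t_comm = 2 * phases * t_aap_phase                # Eq. (8): comm time
    t_comp = 3 * k * t_c * big_n * math.log2(big_n)  # Eq. (9): compute time
    return t_comm, t_comp, s_switch

# Example: 64^3 points; t_aap_phase stands in for T_AAP(N_AAP) from Eq. (1).
t_comm, t_comp, s_switch = fft_cost_model(
    n=64, n_proc=32, n_p=4, s_pencil=2**20,
    m_max=2**30, t_aap_phase=0.5, k=1e-3, t_c=1e-9)
```

The model makes the trade-off visible: T_comm scales with the number of AAP phases needed to move S_reorg through the switchable memory window M_max, while T_comp depends only on the data size.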
When processing videos, images, and/or other data over a large number of processing nodes, the data elements (video frames, images, tuples, etc.) are collected at a number of sinks and distributed across the processing nodes. The whole data set is partitioned across the nodes, each of which stores one or more partitions of the data. In one embodiment, an application for joining two large tables, R and S, over a number of compute nodes may be utilized, where each compute node hosts 32 processor cores. Such an application is suitable for memory switching without data gathering. Here, the two tables are equally partitioned among the compute nodes, and each compute node stores one partition from each table; one of the two tables (say, R) is stationary, as its partitions are pinned to the respective compute nodes, while the other table (say, S) is mobile, as its partitions circulate across the compute nodes. At the onset of the join processing, each node locally processes the join over its own partitions. Upon processing the local data partitions, each compute node transfers its partition of table S to the adjacent compute node, the nodes being arranged in a ring structure; thus, each compute node receives a new partition of table S and joins it with the locally stored partition of table R. This data transfer and local computation are carried out in stages until each partition of table S has circulated among all the compute nodes; the number of such stages equals the total number of compute nodes participating in the join processing task. In such an organization, the communication pattern is the all-to-all broadcast (AAB) described above.
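A toy version of this ring circulation can be sketched as follows; the partition layout, the number of nodes, and the equijoin predicate on integer keys are all invented for the example.

```python
# Sketch of the ring-structured join: R partitions stay pinned to their
# nodes, S partitions circulate one hop per stage until every S partition
# has visited every node.

def ring_join(r_parts, s_parts):
    """Equijoin equal-length partition lists over a logical ring of nodes."""
    n = len(r_parts)
    results = [[] for _ in range(n)]
    s_current = list(s_parts)                  # each node's current S partition
    for _stage in range(n):                    # n stages: every S partition
        for node in range(n):                  # visits every node once
            results[node].extend(
                (r, s)
                for r in r_parts[node] for s in s_current[node] if r == s)
        # shift S partitions to the adjacent node in the ring
        s_current = [s_current[(i - 1) % n] for i in range(n)]
    return results

# Tiny example: 3 nodes, joining on integer keys.
out = ring_join([[1, 2], [3], [4]], [[3], [4], [1]])
```

Each stage only requires a node to receive one partition from one neighbor, which is why, as the text notes, no multi-node synchronization barrier is needed.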
As illustrated below, the switching memory is applied and experimental results on switching memory are demonstrated. The feasibility of the switching memory (with optical fabric) is illustrated in two scenarios—gathering and non-gathering. In the gathering case, data is written to a different memory space through a separate optical channel before being switched to a remote node. In the non-gathering scenario, data is transferred to a remote node without using any intermediate memory space or optical channels. The illustrated embodiments may apply two applications: joining tables and 3D-FFT. The former uses non-gathering-based switching of memory, whereas the 3D-FFT uses gathering-based data transfer across the processor nodes. For each of the applications, the system equipped with optically switched memory is compared with a system where the compute nodes communicate through an electrical switch. The default link bandwidth of such an electrical switch is 10 Gbps.
The following parameters are applied for an optical switch: optical-to-electrical or electrical-to-optical conversion time of 15 nanoseconds (ns), optical switching latency of 10 milliseconds (ms), and optical fiber length of 40 meters. As for electrical communication, an Ethernet switch with a per-switch latency of 41 μs (microseconds) may be used. The optical switch connects a number of compute nodes (blades) with memory blades, and each compute node hosts 32 processor cores. Turning now to
In processing joins over large tables, each compute node receives data from only one compute node. Each of the nodes can proceed with the computation, without participating in any synchronization barrier involving multiple nodes, as soon as it receives data. As a node does not need to store the received data in a separate memory space, the node can do on-the-fly, pipelined processing of the data. For join processing, the system consists of a number of compute blades, each hosting 32 processor cores. This corresponds to a rack-level topology, where each compute blade is an active node, and hence can switch memory directly to another compute blade. In one embodiment, by way of example only, the maximum switchable memory space (Mmax) may be set to 32 GB. The size of each element in a table is taken as 64 bytes.
In this subsection, the 3D-FFT application, which uses the AAP communication pattern, is considered. Here, each processor receives data from a number of other processors during a reorganization stage (each of the reorganization stages consists of multiple AAP phases). Data from the sending processors is gathered at an active node and switched to the receiving active node, which sends the data to the respective processors. Therefore, in a system with switchable memories, such a scenario corresponds to memory switching with data gathering, and the data transfer involves a synchronization barrier at each of the processing nodes. In one embodiment, by way of example only, a hierarchical model is configured with racks and blades: each blade hosts 32 processing cores, each rack contains 128 blades (passive nodes), and the whole system consists of 512 racks. In one embodiment, the default value of the maximum switchable memory space (Mmax) is 32 GB. As a baseline, a system with optical (crossbar) interconnection among the top-of-rack (ToR) switches may be used, and the memory switching performance across racks may be compared with the performance of that system.
The foregoing embodiments seek to provide a solution for memory switching via optically connected processors in processor racks with memory in memory racks. The decrease in memory capacity per core, due to the growing imbalance between the rate of growth in cores per socket and in memory density, motivates a redesign of the memory subsystem by organizing processors and memories in separate ensembles and connecting them through optical interconnection fabrics. In one embodiment, an architectural approach that exploits the switching fabrics to transfer large volumes of data across processor blades in a transparent fashion is used. Memory allocation at the receiving side is eliminated, and the processing startup delay involved in large-volume data transfer is reduced, as the communication avoids physically moving (for example, copying) data over electrical links. Two communication patterns are supported: all-to-all personalized (AAP) communication and all-to-all broadcast (AAB) communication. In one example using model-based simulation, the performance metrics (e.g., communication delay, bandwidth) in a large-scale system are analyzed, and the feasibility of supporting different communication patterns with optically switching memory is illustrated. While the performance gain of memory switching with data gathering is dependent on the bandwidth of the electrical switch, optically switching memory without data gathering is effective in applications requiring large-volume data transfer across processor blades without being locked into any synchronization barrier involving a number of compute blades. The performance data demonstrates the effectiveness of switching memory in transparent data sharing and communication within a rack.
It should be noted that the illustrated embodiments may be applied to cloud computing for enabling virtualization. For example, by using the optically switching memory operations, most of the dataset of virtual machine (VM) jobs can still be used from remote memory access, with cache layers and smaller local memories at the processors for the memory data. Within the cloud-computing environment, the flexibility to move a dataset around to better load-balance the VM jobs is allowed. For example, because VM data migration requires a copy of very large datasets from one local memory to another, the remote memory is instead optically switched for connecting to another processor (e.g., a less busy processor) to avoid the copying of the very large datasets.
In addition, the illustrated embodiments provide for increased efficiency and productivity relating to resource enablement for provisioning, deployment, elasticity, and workload management by adding memory capacity as needed for a group of nodes and may simplify workload management (requires organization of blades and data). Moreover, within a system management of a computing system, the illustrated embodiments, as described above, provide a solution to a memory wall (in part) because data, as well as memory, become a dynamic resource that can be switched around the system.
In addition, the illustrated embodiments may be applied to databases (including streaming), where data is processed by a group of processors, intermediate results are recorded, the data is moved to another group, and a partial state is recorded in the local memory for each processor. Also, the latency due to distant access (speed of light in fiber, about 5 ns/m) does not affect performance. The cache and prefetching (streams) reduce the latency as well.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While one or more embodiments of the present invention have been illustrated in detail, the skilled artisan will appreciate that modifications and adaptations to those embodiments may be made without departing from the scope of the present invention as set forth in the following claims.
Inventors: Eugen Schenfeld; Abhirup Chakraborty
Assignee: International Business Machines Corporation (assignment of assignors' interest executed by Abhirup Chakraborty, Jan. 25, 2013, and by Eugen Schenfeld, Feb. 5, 2013; reel/frame 037657/0306; filed Feb. 3, 2016).
Maintenance fee reminder mailed Mar. 2, 2020; patent expired Aug. 17, 2020 for failure to pay maintenance fees.