Provided herein are methods and systems for improved storage strategies for use of collections of storage resources, such as solid state drives, including in connection with a converged networking and storage node that may be used for virtualization of a collection of physically attached and/or network-connected storage resources.
|
1. A converged networking and storage system, comprising:
a data access management controller that interfaces to an operating system, a collection of storage resources comprising local data storage resources and at least one network-distributed data storage resource, wherein the controller responds to a data access request by the operating system for data managed by the data access management controller as if the operating system were accessing the local data storage resources independent of the requested data being located in the local data storage resources or in the at least one network-distributed data storage resource without requiring modification of the operating system;
a plurality of solid-state drives that are grouped into a plurality of sub-groups, the collection of storage resources comprising the plurality of solid-state drives; and
an interface via which an operator of the system designates different sub-groups for performing garbage collection at different times, the data access management controller further tracking sub-groups performing garbage collection and controlling writing data to the collection by directing write operations to sub-groups not performing garbage collection based at least in part on the tracking.
20. A system comprising:
a data access management system that interfaces to an operating system, a collection of storage resources comprising at least one local data storage resource and at least one network-distributed data storage resource, wherein the system responds to a data access request by the operating system for data managed by the data access management system as if the operating system were accessing the at least one local data storage resource independent of requested data being located in the at least one local data storage resource or in the at least one network-distributed data storage resource;
a system for writing data to a collection of solid state drives in the collection of storage resources, wherein the solid state drives are defined as a single logical storage resource for the operating system;
an operator interface; and
wherein the data access request includes write operations of the operating system, the write operations are managed by the system to occur in stripes across the blocks of the collection of solid state drives and wherein the solid state drives are grouped into a plurality of sub-groups and wherein an operator of the system can designate via the operator interface different sub-groups at different times for performing garbage collection, wherein throughput of the write operations to the collection is independent of which sub-groups are performing garbage collection.
2. A system of
3. A system of
4. A system of
an application programming interface of a solid state drive of the plurality of solid state drives by which the controller instructs the solid state drive when to perform a garbage collection process of the solid state drive in response to at least two of the operator's garbage collection designations, dirtiness of blocks of the solid state drive, and an indication of at least one portion of the collection designated to be written next.
5. A system of
6. A system of
7. A system of
8. A system of
9. A system of
10. A system of
12. A system of
13. A system of
14. A system of
15. A system of
16. A system of
18. A system of
19. A system of
21. The system of
22. The system of
23. The system of
|
This application claims the benefit of U.S. provisional patent application Ser. No. 62/301,743 filed Mar. 1, 2016, titled: METHODS AND SYSTEMS FOR DATA STORAGE USING SOLID STATE DRIVES. This application is a continuation in part of U.S. patent application Ser. No. 14/640,717 filed Mar. 6, 2015, titled: METHODS AND SYSTEMS FOR CONVERGED NETWORKING AND STORAGE, which claims the benefit of U.S. provisional patent application Ser. No. 62/017,257, filed Jun. 26, 2014, titled: AN APPARATUS FOR VIRTUALIZED CLUSTER IO, and U.S. provisional patent application Ser. No. 61/950,036, filed Mar. 8, 2014, titled: METHOD AND APPARATUS FOR APPLICATION DRIVEN STORAGE ACCESS. Each of the patent applications mentioned above is incorporated herein by reference in its entirety.
This application relates to the fields of networking and data storage, and more particularly to the field of converged networking and storage systems.
Storage protocols have been designed in the past to provide reliable delivery of data. Examples include Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), and iSCSI, including RDMA-capable transports (e.g., Infiniband™, etc.). NVMe is a relatively recent storage protocol that is designed for a new class of storage media, such as NAND Flash™, and the like. As the name NVMe (Non-Volatile Memory Express) suggests, NVMe is a protocol highly optimized for media that is close to the speeds of DRAM, as opposed to that of Hard Disk Drives (HDDs). NVMe is typically accessed on a host system via a driver over the PCIe interface of the host. However, as noted above, methods and systems disclosed herein provide for accessing NVMe over a network. Since the latency of DRAM and similar media is orders of magnitude lower than that of HDDs, the approach for accessing NVMe over a network may preferably entail minimal overhead (in terms of latency). As such, there is a need to design a lightweight protocol to access NVMe devices over the network.
Also, NVMe is designed to operate over a PCIe interface, where there are hardly any packet drops. So, the error recovery mechanisms built into conventional NVMe are based primarily on large I/O timeouts implemented in the host driver. To enable use of NVMe over a network, a need exists to account for errors that result from packet drops.
The proliferation of scale-out applications has led to very significant challenges for enterprises that use such applications. Enterprises typically choose between solutions like virtual machines (involving software components like hypervisors and premium hardware components) and so-called “bare metal” solutions (typically involving use of an operating system like Linux™ and commodity hardware). At large scale, virtual machine solutions typically have poor input-output (IO) performance, inadequate memory, inconsistent performance, and high infrastructure cost. Bare metal solutions typically have static resource allocation (making changes in resources difficult and resulting in inefficient use of the hardware), challenges in planning capacity, inconsistent performance, and operational complexity. In both cases, inconsistent performance characterizes the existing solutions. A need exists for solutions that provide high performance in multi-tenant deployments, that can handle dynamic resource allocation, and that can use commodity hardware with a high degree of utilization.
As an alternative to hypervisors (which provide a separate operating system for each virtual machine that they manage), technologies such as Linux™ containers have been developed (which enable a single operating system to manage multiple application containers). Also, tools such as Docker have been developed, which provide provisioning for packaging applications with libraries. Among many other innovations described throughout this disclosure, an opportunity exists for leveraging the capabilities of these emerging technologies to provide improved methods and systems for scaleout applications.
Another area in which current approaches are problematic is in the strategies used to write data to individual solid state drives (SSDs) and to groups of SSDs over time, where current “garbage collection” processes typically require moving significant amounts of data through a series of copying and pasting operations (entailing large numbers of I/O operations in conventional systems), such as to copy and paste all of the valid data from an old block that contains some invalid data into a new block, so that the old block can be erased in its entirety to make it available for writing of new data. For an application, this “garbage collection” period results in an unpredictable response time. A need exists for more efficient storage strategies that reduce the number of operations required to write data to collections of SSDs and that minimize the response time variation for the application.
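By way of non-limiting illustration, the conventional garbage collection step described above may be sketched as follows. The model is an assumption made only for this sketch: a block is a list of pages, an invalid page is represented as None, and the function name garbage_collect is hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical model: a page holds bytes, or None once its data is invalid.
Page = Optional[bytes]


def garbage_collect(old_block: List[Page]) -> Tuple[List[bytes], int, int]:
    """Copy the valid pages of old_block into a new block.

    Returns (new_block, pages_copied, pages_reclaimed). pages_copied is the
    extra I/O performed only to free the old block; pages_reclaimed is the
    number of invalid pages whose space becomes writable once the old block
    is erased in its entirety.
    """
    valid = [page for page in old_block if page is not None]
    new_block = list(valid)  # valid data rewritten into a fresh block
    pages_copied = len(valid)
    pages_reclaimed = len(old_block) - len(valid)
    return new_block, pages_copied, pages_reclaimed
```

In this sketch, a block with two valid pages out of four costs two page copies to reclaim two pages, which illustrates why the operation adds I/O that the application did not request.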
Methods and systems are provided herein for enabling converged networking and storage, such methods and systems including, without limitation, methods and systems for managing a collection of physically attached and network-distributed data storage resources as a virtualized cluster of storage resources. In embodiments, the virtualized cluster behaves in response to an operating system as if the virtualized cluster of storage resources were entirely composed of physically attached storage resources, without requiring modification of the operating system.
Methods and systems involving converged networking and storage may employ various strategies for writing data to a collection of resources, such as in a virtualized cluster, for garbage collection and the like. Such methods and systems may include writing of data to a collection of solid state drives in the virtualized cluster, wherein the solid state drives are defined as a single logical storage resource for an operating system. In embodiments, write operations of the operating system are managed by the converged networking and storage system to occur in stripes across the blocks of the collection of solid state drives. In embodiments, the solid state drives are grouped into a plurality of sub-groups, and an operator of the converged networking and storage system can designate different sub-groups at different times for performing garbage collection.
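By way of non-limiting illustration, the sub-group designation and striping described above may be sketched as follows. The class name GcAwareStriper, its method names, and the round-robin placement are assumptions made for this sketch and are not taken from the disclosure.

```python
from itertools import cycle
from typing import Dict, List, Set, Tuple


class GcAwareStriper:
    """Tracks which sub-groups are collecting garbage and stripes around them."""

    def __init__(self, sub_groups: Dict[str, List[str]]) -> None:
        self.sub_groups = sub_groups       # sub-group name -> drive names
        self.collecting: Set[str] = set()  # sub-groups currently in garbage collection

    def designate_gc(self, names: Set[str]) -> None:
        """Operator-facing call: mark the named sub-groups as collecting."""
        unknown = names - set(self.sub_groups)
        if unknown:
            raise KeyError(f"unknown sub-groups: {sorted(unknown)}")
        if names >= set(self.sub_groups):
            raise ValueError("at least one sub-group must remain writable")
        self.collecting = set(names)

    def writable_drives(self) -> List[str]:
        return [
            drive
            for name, drives in self.sub_groups.items()
            if name not in self.collecting
            for drive in drives
        ]

    def stripe(self, chunks: List[bytes]) -> List[Tuple[str, bytes]]:
        """Place chunks round-robin on drives that are not collecting."""
        drives = cycle(self.writable_drives())
        return [(next(drives), chunk) for chunk in chunks]
```

For example, with sub-groups A (ssd0, ssd1) and B (ssd2, ssd3) and sub-group A designated for garbage collection, three chunks would be placed on ssd2, ssd3, and ssd2 in that order.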
Such methods and systems may further include a solid state drive within the virtualized cluster of resources and an application programming interface of the solid state drive by which the converged networking and storage system can instruct the solid state drive when to perform a garbage collection process of the solid state drive. In embodiments, a collection of solid state drives in the virtualized cluster have varying drive writes per day (DWPD) capabilities and the virtualized cluster is configured to operate as a unified logical storage resource to satisfy a DWPD requirement of an application that uses the virtualized cluster.
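By way of non-limiting illustration, one way to reason about a pool of drives with varying DWPD ratings is to compare the pool's aggregate daily write budget with the application's requirement. The capacity-weighted formula below is an assumption made for this sketch (it presumes writes can be distributed in proportion to each drive's endurance); the names Drive, pool_dwpd, and satisfies are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Drive:
    capacity_tb: float  # usable capacity in terabytes
    dwpd: float         # rated drive writes per day


def pool_dwpd(drives: List[Drive]) -> float:
    """Capacity-weighted DWPD of the pool treated as one logical resource."""
    total_capacity = sum(d.capacity_tb for d in drives)
    daily_write_budget = sum(d.capacity_tb * d.dwpd for d in drives)
    return daily_write_budget / total_capacity


def satisfies(drives: List[Drive], required_dwpd: float) -> bool:
    """True if the pooled endurance meets the application's DWPD requirement."""
    return pool_dwpd(drives) >= required_dwpd
```

Under this assumption, a 1 TB drive rated at 1 DWPD pooled with a 1 TB drive rated at 3 DWPD yields a 2 TB logical resource with a 4 TB daily write budget, which is an effective 2 DWPD.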
Such methods and systems may further include a system for providing dual-level encryption relating to data stored on a solid state drive (SSD) in the collection of storage resources, wherein encryption is provided on the SSD of the data that is stored on the SSD and encryption is provided in a converged networking and storage controller of the converged networking and storage system. In embodiments, a different encryption key may be used at the converged networking and storage controller for two different sets of data that are stored on the same SSD. In embodiments, the system includes an interface for allocating the different keys to different tenants that can use the SSD in a multi-tenant configuration.
Such methods and systems may further include writing data to a solid state drive (SSD) in the collection of storage resources, wherein the system writes data to the SSD sequentially to selected pages of at least one block of the SSD, provides gaps between the sequentially written pages of the block and maintains a map of the locations to which the pages are written. In embodiments, locations to which the pages are written are randomly allocated. In embodiments, the pages are written using an elevator algorithm.
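By way of non-limiting illustration, the gapped sequential writing, the location map, and an elevator ordering may be sketched as follows. The fixed stride used to create gaps, and the names elevator_order and GappedBlockWriter, are assumptions made for this sketch; the disclosure does not specify how gaps are sized.

```python
from typing import Dict, List


def elevator_order(pending: List[int], head: int) -> List[int]:
    """Elevator (SCAN) ordering: sweep upward from head, then back down."""
    upward = sorted(p for p in pending if p >= head)
    downward = sorted((p for p in pending if p < head), reverse=True)
    return upward + downward


class GappedBlockWriter:
    """Writes pages sequentially into one block, leaving gaps between them."""

    def __init__(self, pages_per_block: int, stride: int = 2) -> None:
        self.pages_per_block = pages_per_block
        self.stride = stride                # stride - 1 pages are skipped per write
        self.next_page = 0
        self.page_map: Dict[int, int] = {}  # logical page -> physical page

    def write(self, logical_page: int) -> int:
        if self.next_page >= self.pages_per_block:
            raise RuntimeError("no page left in this block at the current stride")
        physical_page = self.next_page
        self.page_map[logical_page] = physical_page
        self.next_page += self.stride
        return physical_page
```

With a stride of 2, three successive writes land on physical pages 0, 2, and 4, and the map records where each logical page was placed so that it can be found again.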
In embodiments, the system provides a job de-duplication capability for networking and storage jobs. In embodiments, the system has a capability for global de-duplication and erasure encoding across a plurality of storage resources in the collection. In embodiments, the system uses a hash-based system for locating data on a storage resource within the collection of storage resources. In embodiments, the system provides in-line hashing and routing of data in a network without requiring writing of data to memory in order to perform a hash calculation. In embodiments, the system has in-line erasure encoding in a network without requiring the writing of data to memory in order to perform erasure encoding. In embodiments, the system has in-line de-duplication of redundant blocks.
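By way of non-limiting illustration, hash-based location of data combined with de-duplication of redundant blocks may be sketched as follows. The use of SHA-256, the modulo placement, and the names content_key, locate, and DedupStore are assumptions made for this sketch; the disclosure does not name a hash function or placement rule, and this software sketch does not model the in-line, no-memory-write aspect.

```python
import hashlib
from typing import Dict, List, Tuple


def content_key(block: bytes) -> str:
    """Content hash used both as the block's identity and to locate it."""
    return hashlib.sha256(block).hexdigest()


def locate(key: str, resources: List[str]) -> str:
    """Pick a storage resource from the key (simple modulo placement)."""
    return resources[int(key[:16], 16) % len(resources)]


class DedupStore:
    """Stores each distinct block once, on the resource its hash selects."""

    def __init__(self, resources: List[str]) -> None:
        self.resources = list(resources)
        self.blocks: Dict[str, Dict[str, bytes]] = {r: {} for r in self.resources}

    def put(self, block: bytes) -> Tuple[str, bool]:
        """Return (key, is_new); a redundant block is not stored again."""
        key = content_key(block)
        shard = self.blocks[locate(key, self.resources)]
        is_new = key not in shard
        if is_new:
            shard[key] = block
        return key, is_new

    def get(self, key: str) -> bytes:
        return self.blocks[locate(key, self.resources)][key]
```

Because identical blocks hash to the same key and therefore to the same resource, a duplicate is detected by a single lookup on that resource. Modulo placement is chosen here only for brevity.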
In embodiments, the collection of storage resources includes disk attached solid state drives and network-attached storage resources. In embodiments, addition of additional storage resources to the cluster does not require the user of the cluster to rebalance the allocation of data storage across the cluster.
The NVMEoN protocol enabled herein is designed with no assumption made about the underlying network being Layer 2 or Layer 3. The endpoints may be defined generically, without constraint as to the type of host. Various options for network encapsulation for implementation and standardization are described below.
Among other characteristics, the NVMEoN protocol may fit into the generic NVME architecture and be standardizable; work independent of other lossless protocols in the network, including with built-in error detection and recovery; minimize overhead introduced in the network; dynamically carve receiver's resources (buffers) across multiple senders; and be easily implementable through a combination of hardware and software modules (e.g., to achieve minimal latency overhead and to use hardware functions where beneficial).
Elements of the methods and systems disclosed herein may include various components, processes, features and the like, which are described in more detail below. These may include an NVMEoN Exchange Layer, a layer in NVMEoN that maintains exchanges for every NVME command. Also provided below is a Burst Transmission Protocol (BTP) layer in NVMEoN that provides guaranteed delivery. Also provided is a proxy NVME controller, an NVME controller that is used to terminate PCIe level transactions of NVME commands and transport them over a network. Also, one or more remote NVME controllers may include virtual NVME controllers that can handle NVME commands received over a network.
As noted elsewhere throughout this disclosure, a “node” may refer to any host computer on a network, such as any server. An initiator may comprise a node that initiates a command (such as an NVME command), while a target may comprise a node that is a destination of an NVME command. A node may include an NVME driver, which may be a conventional NVME driver that runs on a Linux or Windows server. The host may include a host CPU, a processor on which applications run. A host may have an embedded CPU, a processor on which NVMEoN-specific control agents run.
As described below, NVMEoN may involve exchanges. Each NVME command may be translated by the NVMEoN exchange layer, such as at an initiator, into a unique exchange for purposes of tracking the exchanges over a network. An Exchange Status Block (ESB) may comprise a table of open exchanges and their state information.
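By way of non-limiting illustration, an Exchange Status Block may be sketched as a table keyed by exchange identifier. The field names, the two-state model, and the class names below are assumptions made for this sketch and do not reflect any particular wire format.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count
from typing import Dict


class ExchangeState(Enum):
    OPEN = "open"
    COMPLETE = "complete"


@dataclass
class Exchange:
    exchange_id: int  # unique per exchange, assigned by the exchange layer
    command_id: int   # identifier of the NVME command being tracked
    state: ExchangeState = ExchangeState.OPEN


class ExchangeStatusBlock:
    """Table of open exchanges and their state information."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self.open: Dict[int, Exchange] = {}

    def open_exchange(self, command_id: int) -> Exchange:
        """Translate one NVME command into a uniquely numbered exchange."""
        exchange = Exchange(next(self._next_id), command_id)
        self.open[exchange.exchange_id] = exchange
        return exchange

    def complete(self, exchange_id: int) -> Exchange:
        """Mark the exchange complete and remove it from the open table."""
        exchange = self.open.pop(exchange_id)
        exchange.state = ExchangeState.COMPLETE
        return exchange
```

In this sketch, two commands that happen to reuse the same command identifier still receive distinct exchange identifiers, which is what allows them to be tracked separately over the network.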
The conventional NVME protocol on a host typically runs with an NVME Driver (e.g., in the Linux kernel) accessing an NVME controller over PCIe. The NVME controller translates the NVME I/O commands into actual reads/writes, such as to a NAND Flash drive. NVMEoN, as disclosed herein, extends this NVME protocol over a network with no assumptions as to the absence of losses in the network.
Provided herein are methods and systems that include a converged storage and network controller in hardware that combines initiator, target storage functions and network functions into a single data and control path, which allows a “cut-through” path between the network and storage, without requiring intervention by a host CPU. For ease of reference, this is referred to variously throughout this disclosure as a converged hardware solution, a converged device, a converged adaptor, a converged IO controller, a “datawise” controller, or the like, and such terms should be understood to encompass, except where context indicates otherwise, a converged storage and network controller in hardware that combines target storage functions and network functions into a single data and control path.
Among other benefits, the converged solution will increase raw performance of a cluster of computing and/or storage resources; enforce service level agreements (SLAs) across the cluster and help guarantee predictable performance; provide a multi-tenant environment where a tenant will not affect its neighbor; provide a denser cluster with higher utilization of the hardware resulting in smaller data center footprint, less power, fewer systems to manage; provide a more scalable cluster; and pool storage resources across the cluster without loss of performance.
The various methods and systems disclosed herein provide high-density consolidation of resources required for scaleout applications and high performance multi-node pooling. These methods and systems provide a number of customer benefits, including dynamic cluster-wide resource provisioning, the ability to guarantee quality-of-service (QoS), Security, Isolation etc. on network and storage functions, and the ability to use shared infrastructure for production and testing/development.
Also provided herein are methods and systems to perform storage functions through the network and to virtualize storage and network devices for high performance and deterministic performance in single or multi-tenant environments.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having the capability to handle multi-tenant operations.
Also provided herein are methods and systems for virtualization of storage devices, such as those using NVMe and similar protocols, and the translation of those virtual devices to different physical devices, such as ones using SATA.
The methods and systems disclosed herein also include methods and systems for end-to-end congestion control involving only the hardware on the host (as opposed to the network fabric) that includes remote credit management and a distributed scheduling algorithm at the box level.
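By way of non-limiting illustration, remote credit management may be sketched as a receiver that grants buffer credits to senders, each of which may transmit only while it holds a credit. The class CreditPool and its methods are hypothetical names, and the first-come allocation shown is an assumption made for this sketch rather than the distributed scheduling algorithm of the disclosure.

```python
from typing import Dict


class CreditPool:
    """Receiver-side pool of buffers handed out to senders as credits."""

    def __init__(self, total_buffers: int) -> None:
        self.free = total_buffers         # buffers not yet promised to any sender
        self.granted: Dict[str, int] = {}  # sender -> unused credits

    def grant(self, sender: str, requested: int) -> int:
        """Grant up to `requested` credits, limited by the free buffers."""
        credits = min(requested, self.free)
        self.free -= credits
        self.granted[sender] = self.granted.get(sender, 0) + credits
        return credits

    def consume(self, sender: str) -> bool:
        """Sender spends one credit to transmit; False means it must wait."""
        if self.granted.get(sender, 0) == 0:
            return False
        self.granted[sender] -= 1
        return True

    def release(self, buffers: int = 1) -> None:
        """Receiver has drained buffers, so they can be granted again."""
        self.free += buffers
```

Because a sender without credits does not transmit, congestion is controlled at the endpoints without relying on the network fabric to drop or pause traffic.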
Also provided herein are various methods and systems that are enabled by the converged network/storage controller, including methods and systems for virtualization of a storage cluster or of other elements that enable a cluster, such as a storage adaptor, a network adaptor, a container (e.g., a Linux container), a Solaris zone or the like. Among advantages, one aspect of virtualizing a cluster is that containers can become location-independent in the physical cluster. Among other advantages, this allows movement of containers among machines in a vastly simplified process described below.
Provided herein are methods and systems for virtualizing direct-attached storage (DAS), so that the operating system stack 108 still sees a local, persistent device, even if the physical storage is moved and is remotely located; that is, provided herein are methods and systems for virtualization of DAS. In embodiments this may include virtualizing DAS over a fabric, that is, taking a DAS storage system and moving it outside the box and putting it on the network. In embodiments this may include carving DAS into arbitrary name spaces. In embodiments the virtualized DAS is made accessible as if it were actual DAS to the operating system, such as being accessible by the OS 108 over a PCIe bus via NVMe. Thus, provided herein is the ability to virtualize storage (including DAS) so that the OS 108 sees it as DAS, even if the storage is actually accessed over a network protocol such as Ethernet, and the OS 108 is not required to do anything different than would be required with local physical storage.
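By way of non-limiting illustration, the location independence described above may be sketched as a namespace table whose entries are either local or remote, while the caller uses one read and write interface for both. The classes LocalDrive, RemoteDrive, and VirtualDas are hypothetical, and the network is only simulated by a round-trip counter.

```python
from typing import Dict


class LocalDrive:
    """Physically attached drive, modeled as a block map."""

    def __init__(self) -> None:
        self._blocks: Dict[int, bytes] = {}

    def write(self, lba: int, data: bytes) -> None:
        self._blocks[lba] = data

    def read(self, lba: int) -> bytes:
        return self._blocks[lba]


class RemoteDrive:
    """Stands in for a drive reached over the network (simulated only)."""

    def __init__(self, target: LocalDrive) -> None:
        self._target = target
        self.round_trips = 0  # counts simulated network round trips

    def write(self, lba: int, data: bytes) -> None:
        self.round_trips += 1
        self._target.write(lba, data)

    def read(self, lba: int) -> bytes:
        self.round_trips += 1
        return self._target.read(lba)


class VirtualDas:
    """Presents every namespace through the same local-looking interface."""

    def __init__(self, namespaces: Dict[int, object]) -> None:
        self._namespaces = namespaces  # namespace id -> local or remote drive

    def write(self, nsid: int, lba: int, data: bytes) -> None:
        self._namespaces[nsid].write(lba, data)

    def read(self, nsid: int, lba: int) -> bytes:
        return self._namespaces[nsid].read(lba)
```

The caller in this sketch issues identical calls for namespace 1 (local) and namespace 2 (remote); only the controller-side table knows which is which.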
Provided herein are methods and systems for providing DAS across a fabric, including exposing virtualized DAS to the OS 108 without requiring any modification of the OS 108.
Also provided herein are methods and systems for virtualization of a storage adaptor (referring to a target storage system).
Provided herein are methods and systems for combining storage initiation and storage targeting in a single hardware system. In embodiments, these may be attached by a PCIe bus 110. A single root virtualization function (SR-IOV) may be applied to take any standard device and have it act as if it is hundreds of such devices. Embodiments disclosed herein include using SR-IOV to give multiple virtual instances of a physical storage adaptor. SR-IOV is a PCIe standard that virtualizes I/O functions, and while it has been used for network interfaces, the methods and systems disclosed herein extend it to use for storage devices. Thus, provided herein is a virtual target storage system.
Embodiments may include a switch form factor or network interface controller, wherein the methods and systems disclosed herein may include a host agent (either in software or hardware). Embodiments may include breaking up virtualization between a front end and a back end.
Embodiments may include various points of deployment for a converged network and target storage controller. While some embodiments locate the converged device on a host computing system 102, in other cases the disk can be moved to another box (e.g., connected by Ethernet to a switch that switches among various boxes). While a layer may be needed to virtualize, the storage can be separated, so that one can scale storage and computing resources separately. Also, one can then enable blade servers (i.e., stateless servers). Installations that would have formerly involved expensive blade servers attached to storage area networks (SANs) can instead attach to the switch. In embodiments this comprises a “rackscale” architecture where resources are disaggregated at the rack level.
Methods and systems disclosed herein include methods and systems for virtualizing various types of non-DAS storage as DAS in a converged networking/target storage appliance. In embodiments, one may virtualize whatever storage is desired as DAS, using various front end protocols to the storage systems while exposing storage as DAS to the OS stack 108.
Methods and systems disclosed herein include virtualization of a converged network/storage adaptor. From a traffic perspective, one may combine systems into one. Combining the storage and network adaptors, and adding in virtualization, gives significant advantages. Say there is a single host 102 with two PCIe buses 110. To route from the PCIe 110, you can use a system like RDMA to get to another machine/host 102. If one were to do this separately, one has to configure the storage and the network RDMA system separately. One has to join each one and configure them at two different places. In the converged scenario, the whole step of setting up QoS, seeing that this is RDMA and that there is another fabric elsewhere is a zero touch process, because with combined storage and networking the two can be configured in a single step. That is, once one knows the storage, one doesn't need to set up the QoS on the network separately.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having virtualization of a converged network/storage adaptor.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a combination of a network adaptor and a storage adaptor with target storage in a converged network/storage appliance. In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having virtualization of a storage adaptor that refers to target storage resources.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a software system for handling combined traffic streams in a converged networking and target storage adaptor.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having the capability to allow a user to set a desired QoS independent of the need to configure QoS for a network or a fabric.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a capability for single-step and single entity configuration of QoS for storage and networking resources.
Methods and systems disclosed herein include virtualization and/or indirection of networking and storage functions, embodied in the hardware, optionally in a converged network adaptor/storage adaptor appliance. While virtualization is a level of indirection, protocol is another level of indirection. The methods and systems disclosed herein may convert a protocol suitable for use by most operating systems to deal with local storage, such as NVMe, to another protocol, such as SAS, SATA, or the like. One may expose a consistent interface to the OS 108, such as NVMe, and in the back end one may convert to whatever storage media is cost-effective. This gives a user a price/performance advantage. If components are cheaper/faster, one can connect any one of them. The back end could be anything, including NVMe.
Provided herein are methods and systems that include a converged data path for network and storage functions in an appliance. Alternative embodiments may provide a converged data path for network and storage functions in a switch.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a converged data path for network functions and storage functions in a networking and storage system.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a software system for unified handling of networking functions and storage initiation and management.
In embodiments, methods and systems disclosed herein include storage/network tunneling, wherein the tunneling path between storage systems over a network does not involve the operating system of a source or target computer. In conventional systems, one had separate storage and network paths, so accessing storage remotely required extensive copying to and from memory, I/O buses, etc. Merging the two paths means that storage traffic is going straight onto the network. The OS 108 of each computer sees only a local disk. Another advantage is simplicity of programming. A user does not need to separately program a SAN, meaning that the methods disclosed herein include a one-step programmable SAN. Rather than requiring discovery and specification of zones, and the like, encryption, attachment, detachment and the like may be done centrally and programmatically.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having storage-network tunneling where the tunneling is independent of the operating system.
Embodiments disclosed herein may include virtualizing the storage to the OS 108 so that the OS 108 sees storage as a local disk. The level of indirection involved in the methods and systems disclosed herein allows the converged system to hide not only the location, but the media type, of storage media. All the OS sees is that there is a local disk, even if the actual storage is located remotely and/or is of a different type, such as a SAN. Thus, virtualization of storage is provided, where the OS 108 and applications do not have to change. One can hide behind this virtualization all of the management, policies of tiering, policies of backup, policies of protection and the like that are normally needed to configure complex storage types.
Methods and systems are provided for selecting where indirection occurs in the virtualization of storage. Virtualization of certain functions may occur in hardware (e.g., in an adaptor on a host, in a switch, and in varying form factors, such as FPGAs or ASICs) and in software. Different topologies are available, such as where the methods and systems disclosed herein are deployed on a host machine, on a top of the rack switch, or in a combination thereof. Factors that go into the selection include ease of use. Users who want to run stateless servers may prefer a top of rack switch. Ones who don't care about that approach might prefer the controller on the host.
Methods and systems disclosed herein include providing NVMe over Ethernet. These approaches can be the basis for the tunneling protocol that is used between devices. NVMe is a suitable DAS protocol that is intended conventionally to go to a local PCIe. Embodiments disclosed herein may tunnel the NVMe protocol traffic over Ethernet. NVMe (non-volatile memory express) is a protocol that in Linux and Windows provides access to PCIe-based Flash Storage. This provides high performance by by-passing the software stacks used in conventional systems.
Embodiments disclosed herein may include providing an NVMe device that is virtualized and dynamically allocated. In embodiments one may piggy back NVMe, but carve up and virtualize and dynamically allocate an NVMe device. In embodiments there is no footprint in the software. The operating system stays the same (just a small driver that sees the converged network/storage card). This results in virtual storage presented like a direct attached disk, but the difference is that now we can pool such devices across the network.
Provided herein are methods and systems for providing the simplicity of direct attached storage (DAS) with the advantages of sharing like in a storage area network (SAN). Each converged appliance in various embodiments disclosed herein may be a host, and any storage drives may be local to a particular host but seen by the other hosts (as in a SAN or other network-accessible storage). The drives in each box enabled by a network/storage controller of the present disclosure behave like a SAN (that is, are available on the network), but the management methods are much simpler. When a storage administrator sets up a SAN, a typical enterprise may have a whole department setting up zones for a SAN (e.g., a fiber channel switch), such as setting up “who sees what.” That knowledge is pre-loaded and a user has to ask the SAN administrator to do the work to set it up. There is no programmability in a typical legacy SAN architecture. The methods and systems disclosed herein provide local units that are on the network, but the local units can still access their storage without having to go through complex management steps like zone definition, etc. These devices can do what a SAN does just by having both network and storage awareness. As such, they represent the first programmatic SAN.
Methods and systems disclosed herein may include persistent, stateful, disaggregated storage enabled by a hardware appliance that provides converged network and storage data management.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having persistent, stateful, disaggregated storage enabled by a system that provides converged network and storage data management.
Methods and systems disclosed herein may also include convergence of network and storage data management in a single appliance, adapted to support use of containers for virtualization. Such methods and systems are compatible with the container ecosystem that is emerging, but offering certain additional advantages.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having the capability to use containers.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a capability for providing visibility across a plurality of containers, such that containers can access information with respect to other containers and can be operated as a cluster.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a container as a first class network end point.
Methods and systems are disclosed herein for implementing virtualization of NVMe. Regardless of how many sources send to how many destinations, as long as the data from the sources is serialized before going into the hub, the hub distributes the data to the designated destinations sequentially. In that case, data transport resources such as the DMA engine can be reduced to a single copy. This may include various use scenarios. In one scenario, for NVMe virtual functions (VFs), if they are all connected to the same PCIe bus, then regardless of how many VFs are configured, the data comes into this pool of VFs serially, so only one DMA engine and only one storage block (for control information) are needed. In another use scenario, for a disk storage system with a pool of discrete disks/controllers, if the data originates from the physical bus, i.e., PCIe, then because the data comes serially into this pool of disks, regardless of how many disks/controllers are in the pool, the transport resources such as the DMA engine can be reduced to only one instead of one per controller.
In accordance with various exemplary and non-limiting embodiments, a device comprises a converged input/output controller that includes a physical target storage media controller; a physical network interface controller; and a gateway between the storage media controller and the network interface controller, wherein the gateway provides a direct connection for storage traffic and network traffic between the storage media controller and the network interface controller.
In accordance with various exemplary and non-limiting embodiments, a method of virtualization of a storage device comprises accessing a physical storage device that responds to instructions in a first storage protocol, translating instructions between the first storage protocol and a second storage protocol and using the second protocol, presenting the physical storage device to an operating system, such that the storage of the physical storage device can be dynamically provisioned, whether the physical storage device is local or remote to a host computing system that uses the operating system.
In accordance with various exemplary and non-limiting embodiments, a method of facilitating migration of at least one of an application and a container comprises providing a converged storage and networking controller, wherein a gateway provides a connection for network and storage traffic between a storage component and a networking component of the device without intervention of the operating system of a host computer and mapping the at least one application or container to a target physical storage device that is controlled by the converged storage and networking controller, such that the application or container can access the target physical storage, without intervention of the operating system of the host system to which the target physical storage is attached, when the application or container is moved to another computing system.
In accordance with various exemplary and non-limiting embodiments, a method of providing quality of service (QoS) for a network, comprises providing a converged storage and networking controller, wherein a gateway provides a connection for network and storage traffic between a storage component and a networking component of the device without intervention of the operating system, a hypervisor, or other software running on the CPU of a host computer and, also without intervention of the operating system, hypervisor, or other software running on the CPU of a host computer, managing at least one quality of service (QoS) parameter related to a network in the data path of which the storage and networking controller is deployed, such managing being based on at least one of the storage traffic and the network traffic that is handled by the converged storage and networking controller.
QoS may be based on various parameters, such as one or more of a bandwidth parameter, a network latency parameter, an IO performance parameter, a throughput parameter, a storage type parameter and a storage latency parameter. QoS may be maintained automatically when at least one of an application and a container that is serviced by storage through the converged storage and network controller is migrated from a host computer to another computer. Similarly, QoS may be maintained automatically when at least one target storage device that services at least one of an application and a container through the converged storage and network controller is migrated from a first location to another location or multiple locations. For example, storage may be scaled, or different storage media types may be selected, to meet storage needs as requirements are increased. In embodiments, a security feature may be provided, such as encryption of network traffic data, encryption of data in storage, or both. Various storage features may be provided as well, such as compression, protection levels (e.g., RAID levels), use of different storage media types, global de-duplication, and snapshot intervals for achieving at least one of a recovery point objective (RPO) and a recovery time objective (RTO).
In embodiments, the methods and systems described herein include storage strategies that provide improved efficiencies in the use of SSDs, including collections of SSDs, such as to reduce the number of operations required to write and modify data on the SSDs. These methods and systems include system level write strategies, such as write strategies where writes are striped across different sets of solid state drives (“SSDs”), with certain SSDs performing garbage collection at identified points in time, where the groupings of the particular SSDs that are used for writes and for garbage collection are varied from time period to time period. These methods and systems also include methods and systems for drive arrangement optimization. These methods and systems also provide additional capabilities, such as providing system level encryption strategies. Also, these methods and systems include providing novel writing strategies for SSDs, including write strategies that leave unwritten pages within a block of data during a series of write operations, so that new data can be written to the unwritten pages on subsequent passes through the SSD. The arrangement of written and unwritten pages may be random, or may be arranged according to a defined pattern. A series of write operations may be ordered across multiple blocks of an SSD and/or across blocks distributed across multiple SSDs. A map may be maintained at the system level to keep track of what data has been written at what time to what pages, blocks, and SSDs. Such write strategies may be used to avoid many of the difficulties of garbage collection processes and to provide much more efficient usage of storage resources, requiring far fewer operations than current garbage collection processes. Such storage strategies may be used in combination with the various other capabilities of the embodiments of the converged storage and networking solution described throughout this disclosure.
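By way of non-limiting illustration, the system level write strategy described above, in which writes are directed to SSD groupings that are not currently performing garbage collection and a system level map records where data was written, may be sketched as follows. The names SubGroup, WriteDirector, set_collecting and write_map are hypothetical and are introduced only for this sketch; a simple round robin choice among available sub-groups is assumed and is not required by the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class SubGroup:
    """A hypothetical sub-group of SSDs that share one garbage-collection window."""
    name: str
    drives: list
    collecting: bool = False  # True while the sub-group performs garbage collection


@dataclass
class WriteDirector:
    """Tracks which sub-groups are collecting and steers writes away from them."""
    groups: list
    _next: int = 0
    write_map: dict = field(default_factory=dict)  # block id -> (group name, drive)

    def set_collecting(self, name, collecting):
        # Record that a sub-group has entered or left its garbage-collection window.
        for group in self.groups:
            if group.name == name:
                group.collecting = collecting

    def write(self, block_id):
        # Only sub-groups that are not collecting are eligible for new writes.
        candidates = [g for g in self.groups if not g.collecting]
        if not candidates:
            raise RuntimeError("all sub-groups are performing garbage collection")
        group = candidates[self._next % len(candidates)]
        drive = group.drives[self._next % len(group.drives)]
        self._next += 1
        self.write_map[block_id] = (group.name, drive)  # system level map
        return group.name, drive
```

In this sketch, varying which sub-group is marked as collecting from time period to time period changes the set of drives that receive writes, without any change to the caller.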
The accompanying figures where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the systems and methods disclosed herein.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the systems and methods disclosed herein.
The present disclosure will now be described in detail by describing various illustrative, non-limiting embodiments thereof with reference to the accompanying drawings and exhibits. The disclosure may, however, be embodied in many different forms and should not be construed as being limited to the illustrative embodiments set forth herein. Rather, the embodiments are provided so that this disclosure will be thorough and will fully convey the concept of the disclosure to those skilled in the art. The claims should be consulted to ascertain the true scope of the disclosure.
Before describing in detail embodiments that are in accordance with the systems and methods disclosed herein, it should be observed that the embodiments reside primarily in combinations of method steps and/or system components related to converged networking and storage. Accordingly, the system components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the systems and methods disclosed herein so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art.
Referring to
As noted above, one benefit of the converged solution 300 is that the operating system stack 108 connects to the converged solution 300 over a conventional PCIe 110 or a similar bus, so that the OS stack 108 sees the converged solution 300, and any storage that it controls through the cut-through to storage devices 302, as one or more local, persistent devices, even if the physical storage is remotely located. Among other things, this comprises the capability for virtualization of DAS 308, which may include virtualizing DAS 308 over a fabric, that is, taking a DAS 308 storage system and moving it outside the computing system 102 and putting it on the network. The storage controller 112 of the converged solution 300 may connect to and control DAS 308 on the network 122 via various known protocols, such as SAS, SATA, or NVMe. In embodiments, virtualization may include carving DAS 308 into arbitrary name spaces. In embodiments, the virtualized DAS 308 is made accessible as if it were actual, local, physical DAS to the operating system, such as being accessible by the OS 108 over a PCIe bus 110 to the storage controller 112 of the converged solution 300 via a standard protocol such as NVMe. Again, the OS 108 sees the entire solution 300 as a local, physical device, such as DAS. Thus, provided herein is the ability to virtualize storage (including DAS and other storage types, such as SAN 310) so that the OS 108 sees any storage type as DAS, even if the storage is actually accessed over a network 122, and the OS 108 is not required to do anything different than would be required with local physical storage. In the case where the storage devices 302 are SAN 310 storage, the storage controller 112 of the converged solution may control the SAN 310 through an appropriate protocol used for storage area networks, such as the Internet Small Computer System Interface (iSCSI), Fibre Channel (FC), or Fibre Channel over Ethernet (FCoE).
Thus, the converged solution 300 provides a translation for the OS stack 108 from any of the other protocols used in storage, such as Ethernet, SAS, SATA, NVMe, iSCSI, FC or FCoE, among others, to a simple protocol like NVMe that makes the disparate storage types and protocols appear as local storage accessible over PCIe 110. This translation in turn enables virtualization of a storage adaptor (referring to any kind of target storage system). Thus, methods and systems disclosed herein include methods and systems for virtualizing various types of non-DAS storage as DAS in a converged networking/target storage appliance 300. In embodiments, one may virtualize whatever storage is desired as DAS, using various protocols to the storage systems while exposing storage as DAS to the OS stack 108. Thus, provided herein are methods and systems for virtualization of storage devices, such as those using NVMe and similar protocols, and the translation of those virtual devices to different physical devices, such as ones using SATA.
Storage/network tunneling 304, where the tunneling path between storage systems over the network 122 does not involve the operating system of a source or target computer, enables a number of benefits. In conventional systems, one has separate storage and network paths, so accessing storage remotely requires extensive copying to and from memory, I/O buses, etc. Merging the two paths means that storage traffic goes straight onto the network. An advantage is simplicity of programming. A user does not need to separately program a SAN 310, meaning that the methods disclosed herein enable a one-step programmable SAN 310. Rather than requiring discovery and specification of zones, and the like, configuration, encryption, attachment, detachment and the like may be done centrally and programmatically. As an example, a typical SAN is composed of “initiators,” “targets,” and a switch fabric, which connects the initiators and targets. Typically, which initiators see which targets is defined and controlled by the fabric switches through what are called “zones.” Therefore, if an initiator moves or a target moves, zones need to be updated. The second control portion of a SAN typically lies with the “targets.” They can control which initiator port can see which logical unit numbers (LUNs) (storage units exposed by the target). This is typically referred to as LUN masking and LUN mapping. Again, if an initiator moves locations, one has to re-program the target. Consider now that in such an environment, if an application moves from one host to another (such as due to a failover, load re-balancing, or the like), the zoning and LUN masking/mapping need to be updated. Alternatively, one could pre-program the SAN so that every initiator sees every target. However, doing so results in an unscalable and insecure SAN.
In the alternate solution described throughout this disclosure, such a movement of an application, a container, or a storage device does NOT require any SAN re-programming, resulting in a zero touch solution. The mapping maintained and executed by the converged solution 300 allows an application or a container, the target storage media, or both, to be moved (including to multiple locations) and scaled independently, without intervention by the OS, a hypervisor, or other software running on the host CPU.
The fact that the OS 108 sees storage as a local disk allows simplified virtualization of storage. The level of indirection involved in the methods and systems disclosed herein allows the converged system 300 to hide not only the location, but the media type, of storage media. All the OS 108 sees is that there is a local disk, even if the actual storage is located remotely and/or is of a different type, such as a SAN 310. Thus, virtualization of storage is provided through the converged solution 300, where the OS 108 and applications do not have to change. One can hide behind the converged solution 300 all of the management, policies of tiering, policies of backup, policies of protection and the like that are normally needed to configure complex storage types.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having virtualization of storage to an operating system, such that the operating system sees various types of storage as a local disk.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a facility for selectively managing where indirection occurs in a system for virtualization of storage.
The converged solution 300 enables the simplicity of direct attached storage (DAS) with the advantages of a storage area network (SAN). Each converged appliance 300 in various embodiments disclosed herein may act as a host, and any storage devices 302 may be local to a particular host but seen by the other hosts (as is the case in a SAN 310 or other network-accessible storage). The drives in each box enabled by a network/storage controller of the present disclosure behave like a SAN 310 (e.g., are available on the network), but the management methods are much simpler. When a storage administrator normally sets up a SAN 310, a typical enterprise may have a whole department setting up zones for a SAN 310 (e.g., a fiber channel switch), such as setting up “who sees what.” That knowledge must be pre-loaded, and a user has to ask the SAN 310 administrator to do the work to set it up. There is no programmability in a typical legacy SAN 310 architecture. The methods and systems disclosed herein provide local units that are on the network, but the local units can still access their storage without having to go through complex management steps like zone definition, etc. These devices can do what a SAN does just by having both network and storage awareness. As such, they represent the first programmatic SAN.
The solution 300 can be described as a “Converged IO Controller” that controls both the storage media 302 and the network 122. This converged controller 300 is not just a simple integration of the storage controller 112 and the network controller (NIC) 118. The actual functions of the storage and network are merged such that storage functions are performed as the data traverses to and from the network interface. The functions may be provided in a hardware solution, such as an FPGA (one or more) or ASIC (one or more) as detailed below.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a field programmable hardware device that provides converged network and storage data path management.
Referring to
In embodiments, the converged solution 300 may be included on a host computing system 102, with the various components of a conventional computing system as depicted in
Referring to
Embodiments disclosed herein may thus include a switch form factor or a network interface controller, or both which may include a host agent (either in software or hardware). These varying deployments allow breaking up virtualization capabilities, such as on a host and/or on a switch and/or between a front end and a back end. While a layer may be needed to virtualize certain functions, the storage can be separated, so that one can scale storage and computing resources separately. Also, one can then enable blade servers (i.e., stateless servers). Installations that would have formerly involved expensive blade servers and attached storage area networks (SANs) can instead attach to the storage enabled switch 500. In embodiments this comprises a “rackscale” architecture, where resources are disaggregated at the rack level.
Methods and systems are provided for selecting where indirection occurs in the virtualization of storage. Virtualization of certain functions may occur in hardware (e.g., in a converged adaptor 300 on a host 102, in a storage enabled switch 500, or in varying hardware form factors such as FPGAs or ASICs) and in software. Different topologies are available, such as where the methods and systems disclosed herein are deployed on a host machine 102, on a top of the rack switch 500, or in a combination thereof. Factors that go into the selection of where virtualization should occur include ease of use. Users who want to run stateless servers may prefer a top of rack storage enabled switch 500. Those who do not need that approach might prefer the converged controller 300 on the host 102.
Methods and systems disclosed herein include virtualization and/or indirection of networking and storage functions, embodied in the hardware converged controller 300, optionally in a converged network adaptor/storage adaptor appliance 300. While virtualization is a level of indirection, protocol is another level of indirection. The methods and systems disclosed herein may convert a protocol suitable for use by most operating systems to deal with local storage, such as NVMe, to another protocol, such as SAS, SATA, or the like. One may expose a consistent interface to the OS 108, such as NVMe, and on the other side of the converged controller 300 one may convert to whatever storage media 302 is cost-effective. This gives a user a price/performance advantage. If components are cheaper or faster, one can connect any one of them. The storage-facing side of the converged controller 300 could face any kind of storage, including NVMe. Furthermore, the storage media type may be any of the following, including, but not limited to, HDD, SSD (based on SLC, MLC, or TLC Flash), RAM, etc., or a combination thereof.
In embodiments, a converged controller may be adapted to virtualize NVMe virtual functions, and to provide access to remote storage devices 302, such as ones connected to a storage-enabled switch 500, via NVMe over an Ethernet switch 402. Thus, the converged solution 300 enables the use of NVMe over Ethernet 700, or NVMeoE. Thus, methods and systems disclosed herein include providing NVMe over Ethernet. These approaches can be the basis for the tunneling protocol that is used between devices, such as the host computing system 102 enabled by a converged controller 300 and/or a storage enabled switch 500. NVMe is a suitable DAS protocol that is intended conventionally to go to a local PCIe 110. Embodiments disclosed herein may tunnel the NVMe protocol traffic over Ethernet. NVMe (non-volatile memory express) is a protocol that in Linux and Windows provides access to PCIe-based Flash. This provides high performance by by-passing the software stacks used in conventional systems, while avoiding the need to translate between NVMe (as used by the OS stack 108) and the traffic tunneled over Ethernet to other devices.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a capability for using a Non-Volatile Memory Express protocol over an Ethernet.
The embodiment of the FPGA 800 of
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having an interface that allows an operator to handle storage area network resources with an interface that is used for disk attached storage.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a pool of virtualized, converged networking/storage devices that appear to an operating system as disk attached storage.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having disk attached storage across a network fabric.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having an integrated framework for network management and storage management, including controlling target storage functions and handling network fabric capabilities.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having network policies for containers exposed with network management function in a unified network and storage management interface.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a network and storage management interface that allows separate handling of storage functions and network functions for a unified networking and storage system.
The internal functions of the FPGA 800 may include a number of enabling features for the converged solution 300 and other aspects of the present disclosure noted throughout. A set of virtual endpoints (vNVMe) 802 may be provided for the host. Analogous to the SR-IOV protocol that is used for the network interface, this presents virtual storage targets to the host. In this embodiment of the FPGA 800, NVMe has the benefit of low software overhead, which in turn provides high performance. A virtual NVMe device 802 can be dynamically allocated, de-allocated, moved and resized. As with SR-IOV, there is one physical function (PF) 806 that interfaces with a PCIe driver 110 (see below), and multiple virtual functions (VF) 807, each of which appears as an NVMe device.
Also provided in the FPGA 800 functions are one or more read and write direct memory access (DMA) queues 804, referred to in some cases herein as a DMA engine 804. These may include interrupt queues, doorbells, and other standard functions to perform DMA to and from the host computing system 102.
A device mapping facility 808 on the FPGA 800 may determine the location of the virtual NVMe devices 802. The location options would be local (i.e., attached to one of the storage media interfaces 824 shown), or remote on another host 102 of a storage controller 300. Access to a remote vNVMe device requires going through a tunnel 828 to the network 122.
An NVMe virtualization facility 810 may translate NVMe protocol instructions and operations to the corresponding protocol and operations of the backend storage media 302, such as SAS or SATA where DAS 308 is used (in the case of use of NVMe on the backend storage medium 302, no translation may be needed), or such as iSCSI, FC or FCoE in the case where SAN 310 storage is used in the backend. References to the backend here refer to the other side of the converged controller 300 from the host 102.
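As a minimal, non-limiting sketch of the lookup performed by such a device mapping facility, the following assumes a simple table keyed by virtual device identifier. The names Location, DeviceMap, bind and route, and the string identifiers used for remote nodes, are hypothetical and are not drawn from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Where a virtual NVMe device is actually backed (hypothetical structure)."""
    local: bool
    media_port: Optional[int] = None   # index of a local storage media interface
    remote_node: Optional[str] = None  # assumed identifier of a remote controller


class DeviceMap:
    """Maps virtual NVMe device ids to a local media interface or a tunnel."""

    def __init__(self):
        self._table = {}

    def bind(self, vnvme_id, location):
        # Rebinding an id models moving the backing storage; the host-visible
        # virtual device id does not change.
        self._table[vnvme_id] = location

    def route(self, vnvme_id):
        loc = self._table[vnvme_id]
        if loc.local:
            return ("media", loc.media_port)
        return ("tunnel", loc.remote_node)
```

Because the host addresses only the virtual device identifier, rebinding an identifier from a local interface to a remote node in this sketch changes the data path without any change visible to the host.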
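For illustration only, the translation from NVMe operations to a SATA backend might be sketched as below. The opcode values are those of the NVMe NVM command set (Flush, Write, Read) and of the ATA 48-bit commands (READ DMA EXT, WRITE DMA EXT, FLUSH CACHE EXT); the dictionary layout of a command and the function name translate are assumptions made for this sketch, and SAN protocols are deliberately left out.

```python
# NVMe NVM command set opcodes.
NVME_FLUSH, NVME_WRITE, NVME_READ = 0x00, 0x01, 0x02
# ATA 48-bit command codes.
ATA_READ_DMA_EXT, ATA_WRITE_DMA_EXT, ATA_FLUSH_CACHE_EXT = 0x25, 0x35, 0xEA

_NVME_TO_ATA = {
    NVME_READ: ATA_READ_DMA_EXT,
    NVME_WRITE: ATA_WRITE_DMA_EXT,
    NVME_FLUSH: ATA_FLUSH_CACHE_EXT,
}


def translate(nvme_cmd, backend):
    """Translate a simplified NVMe command dict for the given backend type.

    nvme_cmd is assumed to hold 'opcode', and for reads and writes 'slba'
    (starting LBA) and 'nlb' (number of logical blocks, zero-based as in NVMe).
    """
    if backend == "nvme":
        return dict(nvme_cmd)  # NVMe backend: no translation needed
    if backend != "sata":
        raise NotImplementedError(f"no translation sketched for {backend}")
    ata_cmd = {"command": _NVME_TO_ATA[nvme_cmd["opcode"]]}
    if nvme_cmd["opcode"] != NVME_FLUSH:
        ata_cmd["lba"] = nvme_cmd["slba"]
        # Logical sector count, not the raw ATA register encoding.
        ata_cmd["count"] = nvme_cmd["nlb"] + 1
    return ata_cmd
```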
A data transformation function 812 may format the data as it is stored onto the storage media 302. These operations could include re-writes, transformation, compression, protection (such as RAID), encryption and other functions that involve changing the format of the data in any way as necessary to allow it to be handled by the applicable type of target storage medium 308. In some embodiments, storage medium 308 may be remote.
In embodiments, storage read and write queues 814 may include data structures or buffering for staging data during a transfer. In embodiments, temporary memory, such as DRAM or NVRAM (which may be located off the FPGA 800), may be used for temporary storage of data.
A local storage scheduler and shaper 818 may prioritize and control access to the storage media 302. Any applicable SLA policies for local storage may be enforced in the scheduler and shaper 818, which may include strict priorities, weighted round robin scheduling, IOP shapers, and policers, which may apply on a per queue, per initiator, per target, or per c-group basis, and the like.
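One of the scheduling policies named above, weighted round robin on a per queue basis, may be sketched as follows. This is a simplified illustration that assumes each queue is a list of pending requests and that a weight is the number of requests a queue may issue per round; the function name and queue names are hypothetical.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Yield (queue_name, request) pairs.

    In each round, every queue is served up to weights[name] requests, so a
    queue with a larger weight receives a proportionally larger share of
    access to the storage media while no non-empty queue is starved.
    """
    pending = {name: deque(reqs) for name, reqs in queues.items()}
    while any(pending.values()):
        for name, queue in pending.items():
            for _ in range(weights[name]):
                if not queue:
                    break
                yield name, queue.popleft()
```

For example, with a hypothetical queue "gold" of weight 2 and a queue "bronze" of weight 1, two "gold" requests are dispatched for each "bronze" request while both queues have work pending.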
A data placement facility 820 may implement an algorithm that determines how the data is laid out on the storage media 302. That may involve various placement schemes known to those of skill in the art, such as striping across the media, localizing to a single device 302, using a subset of the devices 302, or localizing to particular blocks on a device 302.
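As a non-limiting example of one such placement scheme, striping across the media may be expressed as a mapping from a logical block address to a device and an offset on that device. The function name and parameters below are hypothetical; a fixed stripe size and equally sized devices are assumed.

```python
def stripe_location(lba, stripe_blocks, num_devices):
    """Map a logical block address to (device index, block offset on that device).

    Consecutive stripes of stripe_blocks blocks are placed on consecutive
    devices, wrapping around after num_devices stripes.
    """
    stripe_index, within = divmod(lba, stripe_blocks)
    device = stripe_index % num_devices
    row = stripe_index // num_devices  # how many full passes over the devices
    return device, row * stripe_blocks + within
```

With a stripe of 4 blocks over 3 devices, logical blocks 0 to 3 land on device 0, blocks 4 to 7 on device 1, blocks 8 to 11 on device 2, and block 12 returns to device 0 at offset 4.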
A storage metadata management facility 822 may include data structures for data placement, block and object i-nodes, compression, deduplication, and protection. Metadata may be stored either in off-FPGA 800 NVRAM/DRAM or in the storage media 302.
A plurality of control blocks 824 may provide the interface to the storage media. These may include SAS, SATA, NVMe, PCIe, iSCSI, FC and/or FCoE, among other possible control blocks, in each case as needed for the appropriate type of target storage media 302.
A storage network tunnel 828 of the FPGA 800 may provide the tunneling/cut-through capabilities described throughout this disclosure in connection with the converged solution 300. Among other things, the tunnel 828 provides the gateway between storage traffic and network traffic. It includes encapsulation/de-encapsulation of the storage traffic, rewrite and formatting of the data, and end-to-end coordination of the transfer of data. The coordination may be between FPGAs 800 across nodes within a host computing system 102 or in more than one computing system 102, such as for the point-to-point path 400 described in connection with
A virtual network interface card facility 830 may include a plurality of SR-IOV endpoints to the host 102, presented as virtual network interface cards. One physical function (PF) 836 may interface with a PCIe driver 110 (see software description below), along with multiple virtual functions (VF) 837, each of which appears as a network interface card (NIC) 118.
A set of receive/transmit DMA queues 832 may include interrupt queues, doorbells, and other standard functions to perform DMA to and from the host 102.
A classifier and flow management facility 834 may perform standard network traffic classification, typically to IEEE standard 802.1Q class of service (COS) mappings or other priority levels.
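A minimal sketch of classification by IEEE 802.1Q priority follows. In an 802.1Q tagged Ethernet frame, the tag protocol identifier 0x8100 follows the two MAC addresses, and the top three bits of the next 16-bit field carry the priority code point. The function name and the default priority for untagged frames are assumptions made for this sketch.

```python
import struct

TPID_8021Q = 0x8100  # tag protocol identifier of an 802.1Q VLAN tag


def vlan_priority(frame, default=0):
    """Return the 3-bit 802.1Q priority code point of an Ethernet frame.

    Untagged or truncated frames are given the assumed default priority.
    """
    if len(frame) < 16:
        return default
    # Bytes 0-11 hold the destination and source MAC addresses.
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return default
    return tci >> 13  # priority is the top 3 bits of the tag control field
```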
An access control and rewrite facility 838 may handle access control lists (ACLs) and rewrite policies, including access control lists typically operating on Ethernet tuples (MAC SA/DA, IP SA/DA, TCP ports, etc.) to reclassify or rewrite packets.
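The matching of packets against access control lists operating on Ethernet tuples, with an optional rewrite, may be illustrated as follows. This sketch assumes first-match semantics with an implicit deny, represents a packet as a dictionary, and uses hypothetical field names; none of these choices is mandated by the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    action: str                         # "permit" or "deny"
    src_ip: Optional[str] = None        # None acts as a wildcard
    dst_ip: Optional[str] = None
    dst_port: Optional[int] = None
    set_priority: Optional[int] = None  # optional rewrite applied on permit


def apply_acl(rules, packet):
    """Return the (possibly rewritten) packet, or None if it is denied."""
    for rule in rules:
        wanted = (("src_ip", rule.src_ip), ("dst_ip", rule.dst_ip),
                  ("dst_port", rule.dst_port))
        if all(value is None or packet[name] == value for name, value in wanted):
            if rule.action == "deny":
                return None
            rewritten = dict(packet)  # leave the caller's packet untouched
            if rule.set_priority is not None:
                rewritten["priority"] = rule.set_priority
            return rewritten
    return None  # assumed implicit deny when no rule matches
```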
A forwarding function 840 may determine the destination of the packet, such as through layer 2 (L2) or layer 3 (L3) mechanisms.
A set of network receive and transmit queues 842 may handle data structures or buffering to the network interface. Off-FPGA 800 DRAM may be used for packet data.
A network/remote storage scheduler and policer 844 may provide priorities and control access to the network interface. SLA policies for remote storage and network traffic may be enforced here, which may include strict priorities, weighted round robin, IOP and bandwidth shapers, and policers on a per queue, per initiator, per target, per c-group, or per network flow basis, and the like.
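By way of non-limiting illustration, the weighted round robin policy mentioned above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the scheduler and policer 844; the function name weighted_round_robin, the queue names and the weights are hypothetical and are chosen only to show how a per-queue weight bounds the share of each service round.

```python
from collections import deque


def weighted_round_robin(queues, weights, budget):
    """Serve up to `budget` items, taking at most weights[name] items
    from each queue per round (hypothetical helper, illustration only)."""
    if any(weights[name] <= 0 for name in queues):
        raise ValueError("every queue needs a positive weight")
    served = []
    # Keep cycling over the queues until the budget is spent or all are empty.
    while len(served) < budget and any(queues.values()):
        for name, queue in queues.items():
            for _ in range(weights[name]):
                if not queue or len(served) >= budget:
                    break
                served.append(queue.popleft())
    return served
```

In such a sketch, a remote storage queue given a weight of two and a network queue given a weight of one would be served in a two-to-one ratio whenever both queues are backlogged.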
A local network switch 848 may forward packets between queues in the FPGA, so that traffic does not need to exit the FPGA 800 to the network fabric 122 if the destination is local to the FPGA 800 or the host 102.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a storage initialization and target functions in a network switch with attached disks.
An end-to-end congestion control/credit facility 850 may prevent network congestion. This is accomplished with two algorithms. First, there may be an end-to-end reservation/credit mechanism with a remote FPGA 800. This may be analogous to a SCSI transfer ready function, where the remote FPGA 800 permits the storage transfer if it can immediately accept the data. Similarly, the local FPGA 800 allocates credits to remote FPGAs 800 as they request a transfer. SLA policies for remote storage may also be enforced here. Second, there may be a distributed scheduling algorithm, such as an iterative round-robin algorithm, for example the iSLIP algorithm for input-queued switches proposed in the publication “The iSLIP Scheduling Algorithm for Input-Queued Switches”, by Nick McKeown, IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, APRIL 1999. The algorithm may be performed cluster wide using the intermediate network fabric as the crossbar.
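By way of non-limiting illustration, a single request/grant/accept iteration of an iSLIP-style round-robin match may be sketched as follows. The sketch assumes a square arrangement in which N inputs contend for N outputs; the function name islip_iteration and its arguments are hypothetical, and the sketch does not represent how the facility 850 performs the algorithm cluster wide.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request/grant/accept iteration (illustrative sketch only).

    requests[i]   : set of outputs for which input i has queued data
    grant_ptr[j]  : round-robin pointer of output j (updated in place)
    accept_ptr[i] : round-robin pointer of input i (updated in place)
    Returns a dict mapping each matched input to its output.
    """
    n = len(requests)
    # Grant step: each output grants to the first requesting input
    # found at or after its grant pointer.
    grants = {}
    for out in range(n):
        for step in range(n):
            inp = (grant_ptr[out] + step) % n
            if out in requests[inp]:
                grants.setdefault(inp, []).append(out)
                break
    # Accept step: each input accepts the first granting output found
    # at or after its accept pointer. Pointers advance only on an accept.
    matches = {}
    for inp, outs in grants.items():
        for step in range(n):
            out = (accept_ptr[inp] + step) % n
            if out in outs:
                matches[inp] = out
                accept_ptr[inp] = (out + 1) % n
                grant_ptr[out] = (inp + 1) % n
                break
    return matches
```

Because a pointer advances only when a grant is accepted, outputs that granted to the same input in one iteration tend to grant to different inputs in the next, which is the desynchronization property for which the round-robin pointers are used.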
A rewrite, tag, and CRC facility 852 may encapsulate/de-encapsulate the packet with the appropriate tags and CRC protection.
A set of interfaces 854, such as MAC interfaces, may provide an interface to Ethernet.
A set of embedded CPU and cache complexes 858 may implement a process control plane, exception handling, and other communication to and from the local host and network remote FPGAs 800.
A memory controller 860, such as a DDR controller, may act as a controller for the external DRAM/NVRAM.
As a result of the integration of functions provided by the converged solution 300, as embodied in one example by the FPGA 800, provided herein are methods and systems for combining storage initiation and storage targeting in a single hardware system. In embodiments, these may be attached by a PCIe bus 110. A single root virtualization function (SR-IOV) or the like may be applied to take any standard device (e.g., any storage media 302 device) and have it act as if it is hundreds of such devices. Embodiments disclosed herein include using a protocol like SR-IOV to give multiple virtual instances of a physical storage adaptor. SR-IOV is a PCIe standard that virtualizes I/O functions, and while it has been used for network interfaces, the methods and systems disclosed herein extend it to use for storage devices. Thus, provided herein is a virtualized target storage system. In embodiments the virtual target storage system may handle disparate media as if the media are a disk or disks, such as DAS 310.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having the capability for virtualization in the input/output data path of a storage resource.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a network device with storage initiation and a storage target on the device.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a software system for managing a converged networking and target storage initiation and handling system.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having virtualization embodied in hardware in a converged network and storage system.
Enabled by embodiments like the FPGA 800, embodiments of the methods and systems disclosed herein may also include providing an NVMe device that is virtualized and dynamically allocated. In embodiments one may piggyback the normal NVMe protocol, but carve up, virtualize and dynamically allocate the NVMe device. In embodiments there is no footprint in the software. The operating system 108 stays the same or nearly the same (possibly having a small driver that sees the converged network/storage card 300). This results in virtual storage that looks like a direct attached disk, but the difference is that now we can pool such storage devices 302 across the network 122.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having pooled hardware storage resources that are virtualized to an operating system, such that what appears to be a physical disk expands in capacity without requiring a copying or rebalancing operation by the operating system that accesses the storage.
Methods and systems are disclosed herein for implementing virtualization of NVMe. Regardless of how many sources are related to how many destinations, as long as the data from the sources is serialized before going into the hub, the hub distributes the data to the designated destinations sequentially. If so, then data transport resources such as the DMA queues 804, 832 can be reduced to only one copy. This may include various use scenarios. In one scenario, for NVMe virtual functions (VFs), if they are all connected to the same PCIe bus 110, then regardless of how many VFs 807 are configured, the data would be coming into this pool of VFs 807 serially, so only one DMA engine 804 and only one storage block (for control information) are needed.
In another use scenario, for a disk storage system with a pool of discrete disks/controllers, if the data originates from the physical bus, i.e., PCIe 110, then because the data is coming into this pool of disks serially, regardless of how many disks/controllers are in the pool, the transport resources such as the DMA engine 804 can be reduced to only one instead of one per controller.
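By way of non-limiting illustration, the reduction to a single transport resource described in the two scenarios above may be sketched as follows. The class name SerializedHub and the identifiers used for sources and destinations are hypothetical; the single deque stands in for the one DMA queue that remains once the incoming data is serialized.

```python
from collections import deque


class SerializedHub:
    """One shared queue feeding many destinations (illustrative sketch)."""

    def __init__(self):
        self.dma_queue = deque()  # the single shared transport queue
        self.delivered = {}       # destination id -> payloads in arrival order

    def submit(self, source, destination, payload):
        # Arrivals are already serialized by the shared bus, so one queue suffices.
        self.dma_queue.append((source, destination, payload))

    def drain(self):
        # Distribute queued payloads to their destinations sequentially.
        count = 0
        while self.dma_queue:
            _source, destination, payload = self.dma_queue.popleft()
            self.delivered.setdefault(destination, []).append(payload)
            count += 1
        return count
```

In this sketch, adding further destinations adds entries to the delivery map but does not add queues, which mirrors the statement that the number of virtual functions or controllers does not change the number of DMA engines required.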
Methods and systems disclosed herein may also include virtualization of a converged network/storage adaptor 300. From a traffic perspective, one may combine systems into one. Combining the storage and network adaptors, and adding in virtualization, gives significant advantages. Say there is a single host 102 with two PCIe buses 110. To route from the PCIe 110, you can use a system like remote direct memory access (RDMA) to get to another machine/host 102. If one were to do this separately, one has to configure the storage and the network RDMA systems separately. One has to join each one and configure them at two different places. In the converged solution 300, the whole step of setting up QoS, seeing that this is RDMA and that there is another fabric elsewhere is a zero touch process, because with combined storage and networking the two can be configured in a single step. That is, once one knows the storage, one doesn't need to set up the QoS on the network separately. Thus, single-step configuration of network and storage for RDMA solutions is enabled by the converged solution 300.
Referring again to
Referring to
The controller card 902 may be used as an add-on card on a commodity chassis, such as a 2 RU, 4 node chassis. Each node of the chassis (called a sled) is typically 1 RU and 6.76″ wide. The motherboard typically may provide a PCIe Gen3 x16 connector near the back. A riser card may be used to allow the controller card 902 to be installed on top of the motherboard; thus, the clearance between the card and the motherboard may be limited to roughly one slot width.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having the capability to provide dynamic resource allocation and guaranteed performance in deployments using commodity networking and storage hardware.
In embodiments, the maximum power supplied by the PCIe connector is 75 W. The controller card 902 may consume about 60 W or less.
The chassis may provide good airflow, but the card should expect a 10 C rise in ambient temperature, because in this example the air will be warmed by dual Xeon processors and 16 DIMMs. The maximum ambient temperature for most servers is 35 C, so the air temperature at the controller card 902 will likely be 45 C or higher in some situations. Custom heat sinks and baffles may be considered as part of the thermal solution.
There are two FPGAs in the embodiment of the controller card 902 depicted in
The datapath chip 904 provides connectivity to the host computer 102 over the PCIe connector 110. From the host processor's point of view, the controller card 902 looks like multiple NVMe devices. The datapath chip 904 bridges NVMe to standard SATA/SAS protocol and in this embodiment controls up to six external disk drives over SATA/SAS links. Note that SATA supports up to 6.0 Gbps, while SAS supports up to 12.0 Gbps.
The networking chip 908 switches the two 10G Ethernet ports of the NIC device 118 and the eCPU 1018 to two external 10G Ethernet ports. It also contains a large number of data structures for use in virtualization.
The motherboard of the host 102 typically provides a PCIe Gen3 x16 interface that can be divided into two separate PCIe Gen3 x8 busses in the Intel chipset. One of the PCIe Gen3 x8 busses 110 is connected to the Intel NIC device 118. The second PCIe Gen3 x8 bus 110 is connected to a PLX PCIe switch chip 1010. The downstream ports of the switch chip 1010 are configured as two PCIe Gen3 x8 busses 110. One of the busses 110 is connected to the eCPU while the second is connected to the datapath chip 904.
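By way of non-limiting illustration, the relationship between the aggregate drive link rate and the PCIe Gen3 x8 bus serving the datapath chip 904 may be estimated as follows. The line rates for SATA and SAS and the count of six drives are taken from the description above; the encoding efficiencies (8b/10b for SATA and 12 Gbps SAS, 128b/130b for PCIe Gen3 at 8 GT/s per lane) are taken from the respective interface standards rather than from this disclosure, and protocol overhead is ignored. The function name payload_gbps is hypothetical.

```python
# Line rates from the description; encoding efficiencies are assumptions
# drawn from the interface standards, not from the disclosure.
SATA_GBPS = 6.0
SAS_GBPS = 12.0
DRIVE_LINKS = 6
PCIE_GEN3_GTPS_PER_LANE = 8.0
PCIE_LANES = 8
ENCODING_8B10B = 8 / 10
ENCODING_128B130B = 128 / 130


def payload_gbps(line_rate_gbps, links, encoding_efficiency):
    """Aggregate payload rate after line encoding, ignoring protocol overhead."""
    return line_rate_gbps * links * encoding_efficiency


sata_total = payload_gbps(SATA_GBPS, DRIVE_LINKS, ENCODING_8B10B)
sas_total = payload_gbps(SAS_GBPS, DRIVE_LINKS, ENCODING_8B10B)
pcie_total = payload_gbps(PCIE_GEN3_GTPS_PER_LANE, PCIE_LANES, ENCODING_128B130B)
```

Under these assumptions, six SATA links carry roughly 28.8 Gbps and six SAS links roughly 57.6 Gbps of payload, both of which are below the roughly 63 Gbps available on a PCIe Gen3 x8 bus.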
The datapath chip 904 uses external memory for data storage. A single x72 DDR3 channel 1012 should provide sufficient bandwidth for most situations. The networking chip 908 also uses external memory for data storage, and a single x72 DDR3 channel is likely to be sufficient for most situations. In addition, the data structures require the use of non-volatile memory, such as one that provides high performance and sufficient density, such as a non-volatile DIMM (NVDIMM), which typically has a built-in power switching circuit and super-capacitors as energy storage elements for data retention.
The eCPU 1018 communicates with the networking chip 908 using two sets of interfaces. It has a PCIe Gen2 x4 interface for NVMe-like communication. The eCPU 1018 also has two 10G Ethernet interfaces that connect to the networking chip 908, such as through its L2 switch.
An AXI bus 1020 (a bus specification of the ARM chipset) will be used throughout the internal design of the two chips 904, 908. To allow seamless communication between the datapath chip 904 and the networking chip 908, the AXI bus 1020 is used for chip-to-chip connection. The Xilinx Aurora™ protocol, a serial interface, may be used as the physical layer.
The key requirements for FPGA configuration are that (1) the datapath chip 904 must be ready before PCIe configuration starts (QSPI Flash memory (serial flash memory with quad SPI bus interface) may be fast enough) and (2) the chips are preferably field upgradeable. The Flash memory for configuration is preferably large enough to store at least 3 copies of the configuration bitstream. The bitstream refers to the configuration memory pattern used by Xilinx™ FPGAs. The bitstream is typically stored in non-volatile memory and is used to configure the FPGA during initial power-on. The eCPU 1018 may be provided with a facility to read and write the configuration Flash memories. New bitstreams may reside with the processor of the host 102. Security and authentication may be handled by the eCPU 1018 before attempting to upgrade the Flash memories.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having hardware-level storage security in a cluster of storage resources.
In a networking subsystem, the Controller card 902 may handle all network traffic between the host processor and the outside world. The Networking chip 908 may intercept all network traffic from the NIC 118 and from external sources.
The Intel NIC 118 in this embodiment connects two 10GigE XFI interfaces 1022 to the Networking chip 908. The embedded processor will do the same. The Networking chip 908 will perform an L2 switching function and route Ethernet traffic out to the two external 10GigE ports. Similarly, incoming 10GigE traffic will be directed to either the NIC 118, the eCPU 1018, or internal logic of the Networking chip 908.
The controller card 902 may use SFP+ optical connectors for the two external 10G Ethernet ports. In other embodiments, the card may support 10GBASE-T using an external PHY and RJ45 connectors; but a separate card may be needed, or a custom paddle card arrangement may be needed to allow switching between SFP+ and RJ45.
All the management of the external ports and optics, including the operation of the LEDs, may be controlled by the Networking chip 908. Thus, signals such as PRST, I2C/MDIO, etc. may be connected to the Networking chip 908 instead of the NIC 118.
In a storage subsystem, the Datapath chip 904 may drive the mini-SAS HD connectors directly. In embodiments such as depicted in
To provide efficient use of board space, two x4 mini-SAS HD connectors may be used. All eight sets of signals may be connected to the Datapath chip 904, even though only six sets of signals might be used at any one time.
On the chassis, high-speed copper cables may be used to connect the mini-SAS HD connectors to the motherboard. The placement of the mini-SAS HD connectors may take into account the various chassis' physical space and routing of the cables.
The power to the controller card 902 may be supplied by the PCIe x16 connector. No external power connection needs to be used. Per PCIe specification, the PCIe x16 connector may supply only up to 25 W of power after power up. The controller card 902 may be designed such that it draws less than 25 W until after PCIe configuration. Thus, a number of interfaces and components may need to be held in reset after initial power up. The connector may supply up to 75 W of power after configuration, which may be arranged such that the 75 W is split between the 3.3V and 12V rails.
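By way of non-limiting illustration, the selection of components to be held in reset until PCIe configuration completes may be sketched as follows. The 25 W and 75 W limits are taken from the description above; the block names and per-block wattages used with the sketch are hypothetical, as is the function name plan_power_up, and the greedy ordering is only one possible policy.

```python
PRE_CONFIG_LIMIT_W = 25.0   # limit before PCIe configuration (from the text)
POST_CONFIG_LIMIT_W = 75.0  # limit after PCIe configuration (from the text)


def plan_power_up(block_power_w, essential, limit_w=PRE_CONFIG_LIMIT_W):
    """Return (active, held_in_reset, total_watts) for initial power up.

    block_power_w maps hypothetical block names to estimated draw in watts.
    Essential blocks are always powered; other blocks are enabled in the
    given order only while the total stays strictly below the limit.
    """
    active = [name for name in block_power_w if name in essential]
    total = sum(block_power_w[name] for name in active)
    if total >= limit_w:
        raise ValueError("essential blocks alone reach the pre-configuration limit")
    held = []
    for name, watts in block_power_w.items():
        if name in essential:
            continue
        if total + watts < limit_w:
            active.append(name)
            total += watts
        else:
            held.append(name)  # stays in reset until configuration completes
    return active, held, total
```

Blocks returned in the held list would be released from reset only after configuration, at which point the sum over all blocks would be checked against the 75 W limit instead.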
Referring to
Typically, it is easier to accomplish the movement within a reasonable amount of time as long as the application states and the storage are reasonable in terms of size. Typically storage-intense applications may use large amounts (e.g., multiple terabytes) of storage, in which case, it may not be practical to move the storage 302 within an acceptable amount of time. In that case, storage may continue to stay where it was and software-level shunting/tunneling would be undertaken to access the storage remotely, as shown in
As shown in
As shown in
Consider a similar scenario when a converged controller 300 is applied as shown in the
Thus, provided herein is a novel way of bypassing the main CPU where a storage device is located, which in turn (a) allows one to reduce latency and bandwidth significantly in accessing a storage across multiple computer systems and (b) vastly simplifies and improves situations in which an application needs to be moved away from a machine on which its storage is located.
Ethernet networks operate on a best-effort basis and hence are lossy as well as bursty in nature. Any packet could be lost forever, or buffered and delivered in a bursty and delayed manner along with other packets. Because typical storage-centric applications are sensitive to losses and bursts, these behaviors are important to address when storage traffic is sent over Ethernet networks.
Conventional storage accesses over their buses/networks involve reliable and predictable methods. For example, Fibre Channel networks employ credit-based flow control to limit the number of accesses made by end systems. The number of credits given to an end system is based on whether the storage device has enough command buffers to receive and fulfill storage requests in a predictable amount of time, fulfilling required latency and bandwidth needs. The figure below shows some credit schemes adopted by different types of buses such as SATA, Fibre Channel (FC), SCSI, SAS, etc.
Referring to
As one can see, for example, an FC controller 1610 may have its own buffering up to a limit of ‘N’ storage commands before sending them to an FC-based storage device 1612, but the FC device 1612 might have a different buffer limit, say ‘M’ in this example, which could be greater than, equal to, or less than ‘N’. A typical credit-based scheme uses target level (in this example, one of the storage devices 302, such as the FC Device 1602, is the target) aggregate credits, information about which is propagated to various sources (in this example, the controller, such as the FC Controller 1610, is the source) which are trying to access the target 302. For example, if two sources are accessing a target that has a queue depth of ‘N,’ then the sum of the credits given to the sources would not exceed ‘N,’ so that at any given time the target will not receive more than ‘N’ commands. The distribution of credits among the sources may be arbitrary, or it may be based on various types of policies (e.g., priorities based on cost/pricing, SLAs, or the like). When the queue is serviced, by fulfilling the command requests, credits may be replenished at the sources as appropriate. By adhering to this kind of credit-based storage access, losses that would result from queues at the target being overwhelmed can be avoided.
Typical storage accesses over Ethernet, such as FCoE, iSCSI, and the like, may extend the target-oriented, credit-based command fulfillment to transfers over Ethernet links. In such cases, they may be target device-oriented, rather than being source-oriented. Provided herein are new credit-based schemes that can instead be based on which or what kind of source should get how many credits. For example, the converged solution 300 described above, which directly interfaces the network to the storage, may employ a multiplexer to map a source-oriented, credit-based scheduling scheme to a target device-oriented, credit-based scheme, as shown in
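By way of non-limiting illustration, the target-level aggregate credit scheme described above may be sketched as follows. The class name TargetCredits and its methods are hypothetical, and the first-come distribution of credits is only one of the arbitrary or policy-based distributions contemplated; the sketch shows only the invariant that the credits outstanding across all sources never exceed the queue depth of the target.

```python
class TargetCredits:
    """Aggregate credit accounting for one target (illustrative sketch)."""

    def __init__(self, queue_depth):
        self.queue_depth = queue_depth  # 'N' commands the target can hold
        self.granted = {}               # source -> credits currently outstanding

    def available(self):
        return self.queue_depth - sum(self.granted.values())

    def grant(self, source, requested):
        # Never hand out more credits than the target has free slots for.
        count = min(requested, self.available())
        if count > 0:
            self.granted[source] = self.granted.get(source, 0) + count
        return count

    def complete(self, source, completed):
        # Servicing commands frees slots, so credits can be replenished.
        held = self.granted.get(source, 0)
        returned = min(completed, held)
        self.granted[source] = held - returned
        return returned
```

In this sketch a source that requests more credits than remain simply receives fewer, so the target queue cannot be overwhelmed regardless of how many sources contend for it.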
As shown in
In embodiments, methods and systems to provide access to blocks of data from a storage device 302 are described. In particular, a novel approach to allowing an application to access its data, fulfilling a specific set of access requirements, is described.
As used herein, the term “application-driven data storage” (ADS) encompasses storage that provides transparency to any application in terms of how the application's data is stored, accessed, transferred, cached and delivered to the application. ADS may allow applications to control these individual phases to address the specific needs of the particular application. As an example, an application might be comprised of multiple instances of itself, as well as multiple processes spread across multiple Linux nodes across the network. These processes might access multiple files in shared or exclusive manners among them. Based on how the application wants to handle these files, these processes may want to access different portions of the files more frequently, may need quick accesses or use once and throw away. Based on these criteria, it might want to prefetch and/or retain specific portions of a file in different tiers of cache and/or storage for immediate access on per session or per file basis as it wishes. These application specific requirements cannot be fulfilled in a generic manner such as disk striping of entire file system, prefetching of read-ahead sequential blocks, reserving physical memory in the server or LRU or FIFO based caching of file contents.
Application-driven data storage I/O is not simply applicable to the storage entities alone. It impacts the entire storage stack in several ways. First, it impacts the storage I/O stack in the computing node where the application is running comprising the Linux paging system, buffering, underlying File system client, TCP/IP stack, classification, QoS treatment and packet queuing provided by the networking hardware and software. Second, it impacts the networking infrastructure that interconnects the application node and its storage, comprising Ethernet segments, optimal path selections, buffering in switches, classification and QoS treatment of latency-sensitive storage traffic as well as implosion issues related to storage I/O. Also, it impacts the storage infrastructure which stores and maintains the data in terms of files comprising the underlying file layout, redundancy, access time, tiering between various types of storage as well as remote repositories.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a capability for coordination of management of storage infrastructure.
Methods and systems disclosed herein include ones relating to the elements affecting a typical application within an application node and how a converged solution 300 may change the status quo to address certain critical requirements of applications.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having application-driven storage access.
Conventional Linux stacks may consist of simple building blocks, such as generic memory allocation, process scheduling, file access, memory mapping, page caching, etc. Although these are essential for an application to run on Linux, they are not optimal for certain categories of applications that are input/output (IO) intensive, such as NoSQL. NoSQL applications are very IO intensive, and it is harder to predict their data access in a generic manner. If applications have to be deployed in a utility-computing environment, it is not ideal for Linux to provide generic minimal implementations of these building blocks. It is preferred for these building blocks to be highly flexible and have application-relevant features that can be controlled from the application(s).
Although every application has its own specific requirements, in an exemplary embodiment, the NoSQL class of applications has the following requirements which, when addressed by the Linux stack, could greatly improve the performance of NoSQL applications and other IO intensive applications. The first requirement is the use of file level priority. The Linux file system should, at a minimum, provide access level priority between different files. For example, an application process (consisting of multiple threads) may access two different files, with one file given higher priority over the other (such as one database/table/index over the other). This would enable the precious storage I/O resources to be preferentially utilized based on the data being accessed. One could argue that this could be indirectly addressed by running one thread/process at a higher or lower priority, but those process level priorities are not communicated to the file system or storage components. Process or thread level priorities are meant only for utilizing CPU resources. Moreover, it is possible that the same thread might be accessing these two files and hence will be utilizing the storage resources at two different levels based on what data (file) is being accessed. Second, there may be a requirement for access level preferences. A Linux file system may provide various preferences (primarily SLA) during a session of a file (opened file), such as priority between file sessions, the amount of buffering of blocks, the retention/lifetime preferences for various blocks, alerts for resource thresholds and contentions, and performance statistics. As an example, when a NoSQL application such as MongoDB or Cassandra has two or more threads for writes and reads, and writes have to be given preference over reads, a file session for write may have to be given preference over a file session for read for the same file.
This capability enables two sessions of the same file to have two different priorities.
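By way of non-limiting illustration, dispatching storage I/O according to a priority attached to each file session, rather than to the issuing thread or process, may be sketched as follows. The class name SessionIOScheduler, its methods and the session identifiers are hypothetical and do not correspond to any existing Linux file system interface.

```python
import heapq
import itertools


class SessionIOScheduler:
    """Orders requests by file-session priority (illustrative sketch)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves arrival order within a priority
        self._priority = {}

    def open_session(self, session_id, priority):
        # Lower number means higher priority (a convention assumed here).
        self._priority[session_id] = priority

    def submit(self, session_id, request):
        entry = (self._priority[session_id], next(self._seq), session_id, request)
        heapq.heappush(self._heap, entry)

    def dispatch(self):
        # Return the next (session_id, request) pair, or None when idle.
        if not self._heap:
            return None
        _priority, _seq, session_id, request = heapq.heappop(self._heap)
        return session_id, request
```

In this sketch a write session and a read session opened on the same file may carry different priorities, so queued writes are dispatched ahead of queued reads even when the reads arrived first.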
Many of the NoSQL applications store different types of data into the same file; for example, MongoDB stores user collections as well as (b-tree) index collections in the same set of database files. MongoDB may want to keep the index pages (btree and collections) in memory in preference over user collection pages. When these files are opened, MongoDB may want to influence the Linux, File and storage systems to treat the pages according to MongoDB policies as opposed to treating these pages in a generic FIFO or LRU basis agnostic of the application's requirements.
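By way of non-limiting illustration, a page cache that retains pages according to an application-supplied ranking of page classes, instead of a generic FIFO or LRU basis, may be sketched as follows. The class name ClassAwarePageCache and the class labels "index" and "user" are hypothetical, and the sketch is a simplification in that an arriving page always displaces some resident page when the cache is full.

```python
class ClassAwarePageCache:
    """Evicts the lowest-ranked page class first (illustrative sketch)."""

    def __init__(self, capacity, class_rank):
        self.capacity = capacity
        self.class_rank = class_rank  # page class -> rank; higher is kept longer
        self.pages = {}               # page id -> (page class, last-use tick)
        self.tick = 0

    def access(self, page_id, page_class):
        """Touch a page and return the id of any page evicted to make room."""
        self.tick += 1
        evicted = None
        if page_id not in self.pages and len(self.pages) >= self.capacity:
            # Victim: lowest class rank, then least recently used within it.
            evicted = min(
                self.pages,
                key=lambda p: (self.class_rank[self.pages[p][0]], self.pages[p][1]),
            )
            del self.pages[evicted]
        self.pages[page_id] = (page_class, self.tick)
        return evicted
```

With index pages ranked above user collection pages, such a cache discards user pages before any index page, whereas a plain LRU policy would discard the oldest page regardless of its class.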
Resource alerts and performance statistics enable a NoSQL database to understand the behavior of the underlying File and storage system, so that it could service its database queries accordingly or trigger actions to be carried out, such as sharding of the database or reducing/increasing of File I/O preference for other jobs running in the same host (such as backup, sharding, number of read/write queries serviced, etc.). For example, performance stats about the min, max and average number of IOPs and latencies, as well as the top ten candidate pages thrashed in and out of host memory in a given period of time, would enable an application to fine tune itself by dynamically adjusting the parameters noted above.
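By way of non-limiting illustration, the collection of the latency statistics and the list of most frequently thrashed pages described above may be sketched as follows. The class name IOStats, its methods and the microsecond unit are hypothetical; the sketch reports latency only and leaves IOP counting aside.

```python
from collections import Counter


class IOStats:
    """Per-period I/O statistics for an application (illustrative sketch)."""

    def __init__(self):
        self.latencies_us = []
        self.evictions = Counter()  # page id -> times evicted in this period

    def record_io(self, latency_us):
        self.latencies_us.append(latency_us)

    def record_eviction(self, page_id):
        self.evictions[page_id] += 1

    def summary(self, top_n=10):
        # Returns None when no I/O has been recorded in the period.
        if not self.latencies_us:
            return None
        return {
            "min_us": min(self.latencies_us),
            "max_us": max(self.latencies_us),
            "avg_us": sum(self.latencies_us) / len(self.latencies_us),
            "top_thrashed": [page for page, _ in self.evictions.most_common(top_n)],
        }
```

An application could poll such a summary once per period and, for example, raise the retention preference of the pages that appear in the thrashed list.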
A requirement may also exist for caching and tiering preferences. A Linux file system may need to have a dynamically configurable caching policy while applications are accessing their files. Currently, Linux file systems typically pre-fetch contiguous blocks of a file, hoping that applications are reading the file in a sequential manner like a stream. Although it is true for many legacy applications like web servers and video streamers, emerging NoSQL applications do not follow sequential reads strictly. These applications read blocks randomly. As an example, MongoDB stores the document keys in index tables in b-tree, laid out flat on a portion of a file, which, when a key is searched, accesses the blocks randomly until it locates the key. Moreover, these files are not dedicated to such b-tree based index tables alone. These files are shared among various types of tables (collections) such as user documents and system index files. Because of this, a Linux file system cannot predict what portions of the file need to be cached, read ahead, swapped out for efficient memory usage, etc.
In embodiments of the methods and systems described herein, there is a common thread across various applications in their requirements for storage. In particular, latency and IOPs for specific types of data at specific times and places of need are very impactful on performance of these applications.
For example, to address the host level requirements listed above, disclosed herein are methods and systems for a finely tuned file-system client that enables applications to completely influence and control the storing, retrieving, retaining and tiering of data according to preference within the host and elsewhere.
As shown in
Methods and systems disclosed herein may provide extensive tiering services for data retrieval across network and hosts. As one can see in
The methods and systems disclosed herein also provide extensive caching service, wherein an application container in the High Performance DFS 1902 can proactively retrieve specific pages of a file from local storage and/or remote locations and push these pages to specific places for fast retrieval later when needed. For instance, the methods and systems may monitor local memory and SSD usage of the hosts running the application and proactively push pages of an application's interest into any of these hosts' local memory/SSD. The methods and systems may use the local tiers of memory, SSD and HDD provisioned for this purpose in the DFS platform 1904 for very low latency retrieval by the application at a later time of its need.
The benefit of extending the cache across hosts of the applications is immense. For example, in MongoDB, when the working set temporarily grows beyond its local host's memory, thrashing happens, and it significantly reduces the query handling performance. This is because, when a needed file data page is discarded in order to bring in a new page to satisfy a query and the original page subsequently has to be brought back, the system has to reread the page afresh from the disk subsystem, thereby incurring huge latency in completing a query. Application-driven storage access helps in these kinds of scenarios by keeping a cache of the discarded page elsewhere in the network (in another application host's memory/SSD or in local tiers of storage of the High Performance DFS system 1902) temporarily until MongoDB requires the page again, thereby significantly reducing the latency in completing the query.
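By way of non-limiting illustration, keeping a discarded page elsewhere in the network so that it need not be reread from the disk subsystem may be sketched as follows. The class name HostCache is hypothetical, and the plain dictionaries passed in as victim_store and disk merely stand in for another host's memory/SSD and for the disk subsystem, respectively.

```python
from collections import OrderedDict


class HostCache:
    """Local LRU cache that parks evicted pages remotely (illustrative sketch)."""

    def __init__(self, capacity, victim_store, disk):
        self.capacity = capacity
        self.local = OrderedDict()        # page id -> data, oldest first
        self.victim_store = victim_store  # stand-in for a remote memory/SSD tier
        self.disk = disk                  # stand-in for the disk subsystem

    def read(self, page_id):
        """Return (data, source), where source is 'local', 'remote' or 'disk'."""
        if page_id in self.local:
            self.local.move_to_end(page_id)
            return self.local[page_id], "local"
        if page_id in self.victim_store:
            data, source = self.victim_store.pop(page_id), "remote"
        else:
            data, source = self.disk[page_id], "disk"
        self._insert(page_id, data)
        return data, source

    def _insert(self, page_id, data):
        if len(self.local) >= self.capacity:
            # Park the least recently used page remotely instead of dropping it.
            victim_id, victim_data = self.local.popitem(last=False)
            self.victim_store[victim_id] = victim_data
        self.local[page_id] = data
```

In this sketch a page that was discarded to make room for another is later served from the remote tier, so the disk is read only the first time each page is needed.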
Referring to
A system comprising a set of hosts (H1 through HN), a file or block server 2102 and a storage subsystem 2104 is disclosed herein as shown in the
Storage 2104 may be a collection of entities capable of retaining a piece of data temporarily or permanently. This is typically comprised of static or dynamic random access memory (RAM), solid state storage (SSD), hard disk drive (HDD) or a combination of all of these. Storage could be an independent physical entity connected to a File or volume server 2102 through a link or a network. It could also be integrated with file or volume server 2102 in a single physical entity. Hence, hosts H1-HN, file or volume server 2102 and storage 2104 could be physically collocated in a single hardware entity.
A host is typically comprised of multiple logical entities as shown in
An example scenario depicting an application 2202 accessing a block of data from storage 2212 is shown in
In the methods and systems disclosed herein, in order to address performance requirements related to data access by most newer class of applications in the area of NoSQL and BigData, it is proposed that the components in the data block access comprising operating system 2204, file system client 2206, memory 2208, block server 2210 and storage 2212 be controlled by any application 2202. Namely, we propose the following. First, enable operating system 2204 to provide additional API to allow applications to control file system client 2206. Second, enhance file system client 2206 to support the following: (a) allow application 2202 to create a dedicated pool of memory in memory 2208 for a particular file or volume, in the sense, a file or volume will have a dedicated pool of memory buffers to hold data specific to it which are not shared or removed for the purposes of other files or volumes; (b) allow application 2202 to create a dedicated pool of memory in memory 2208 for a particular session with a file or volume such that two independent sessions with a file or volume will have independent memory buffers to hold their data. As an example, a critically important file session may have large number of memory buffers in memory 2208, so that the session can take advantage of more data being present for quicker and frequent access, whereas a second session with the same file may be assigned with very few buffers and hence it might have to incur more delay and reuse of its buffers to access various parts of the file; (c) allow application 2202 to create an extended pool of buffers beyond memory 2208 across other hosts or block server 2210 for quicker access. This enables blocks of data be kept in memory 2208 of other hosts as well as any memory 2402 present in the file or block server 2210; (d) allow application 2202 to make any block of data to be more persistent in memory 2208 relative to other blocks of data for a file, volume or a session. 
This allows an application to pick and choose a block of data to be always available for immediate access and not let operating system 2204 or file system client 2206 evict it based on their own eviction policies; and (e) allow application 2202 to make any block of data less persistent in memory 2208 relative to other blocks of data for a file, volume or session. This allows an application to let operating system 2204 and file system client 2206 know that they may evict and reuse the buffer of the data block as and when they choose. This helps in retaining other normal blocks of data for a longer period of time. Third, enable block server 2210 to host application specific modules in terms of application container 2400 as shown in the
The application driven feature of (2)(c) above needs further explanation. There are two scenarios. The first one involves block of data being retrieved from the memory of block server 2210. The other scenario involves retrieving the same from another host. Assuming the exact same block data has been read from storage 2212 by two hosts (H1) and (H2), the methods and systems disclosed herein provide a system such as depicted in
In embodiments, if file system client 2206 decides to evict a block of data from (D1) because of storing a more important block of data in its place, file system client 2206 could send the evicted block of data to file system client 2206′ to be stored in memory 2208′ on its behalf.
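The eviction hand-off described above can be illustrated with a minimal sketch. The class name `HostCache`, the least-recently-used eviction order and the `held_for_peer` dictionary are assumptions made for illustration only; the disclosure does not prescribe a particular eviction policy or data structure.

```python
from collections import OrderedDict


class HostCache:
    """Hypothetical per-host block cache that spills evicted blocks to a peer host."""

    def __init__(self, capacity, peer=None):
        self.capacity = capacity      # number of blocks kept in local memory
        self.peer = peer              # another HostCache standing in for memory 2208'
        self.blocks = OrderedDict()   # block id -> data, least recently used first
        self.held_for_peer = {}       # blocks stored on behalf of another host

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            victim_id, victim = self.blocks.popitem(last=False)
            if self.peer is not None:
                # Instead of dropping the block, park it in the peer's memory.
                self.peer.held_for_peer[victim_id] = victim

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        if self.peer is not None and block_id in self.peer.held_for_peer:
            # Fetching from peer memory avoids a read from storage.
            data = self.peer.held_for_peer.pop(block_id)
            self.put(block_id, data)
            return data
        return None  # caller would fall back to the block server or storage
```

In this sketch a block evicted from one host remains retrievable from the peer's memory, which is the latency benefit the paragraph describes.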
It should be noted that the abovementioned techniques can be applied to achieve fast failover in case of failure(s) of hosts. Furthermore, the caching techniques described above, especially those pertaining to RAM, can be used to achieve failover with a warm cache.
Provided herein is a system and method with a processor and a file server with an application specific module to control the storage access according to the application's needs.
Also provided herein is a system and method with a processor and a data (constituting blocks of fixed size bytes, similar or different objects with variable number of bytes) storage enabling an application specific module to control the storage access according to the application's needs.
Also provided herein is a system and method which retrieves a stale file or storage data block, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element and stores it in another host's memory and/or its temporary or permanent storage element, for the purposes of use by the application at a later time.
Also provided herein is a system and method which retrieves any file or storage data block, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element and stores it in another host's memory and/or its temporary or permanent storage element, for the purposes of use by the application at a later time.
Also provided herein is a system and method which utilizes memory and/or its temporary or permanent storage element of a host to store any file or storage data block which would be subsequently accessed by an application running in another host for the purposes of reducing latency of data access.
File or storage data blocks, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element, may be stored in another host's memory and/or its temporary or permanent storage element, for the purposes of use by the application at a later time.
Also provided herein is a mechanism for transferring a file or storage data block, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element to another host over a network.
In accordance with various exemplary and non-limiting embodiments, there is disclosed a device comprising a converged input/output controller that includes a physical target storage media controller, a physical network interface controller and a gateway between the storage media controller and the network interface controller, wherein the gateway provides a direct connection for storage traffic and network traffic between the storage media controller and the network interface controller.
In accordance with some embodiments, the device may further comprise a virtual storage interface that presents storage media controlled by the storage media controller as locally attached storage, regardless of the location of the storage media. In accordance with yet other embodiments, the device may further comprise a virtual storage interface that presents storage media controlled by the storage media controller as locally attached storage, regardless of the type of the storage media. In accordance with yet other embodiments, the device may further comprise a virtual storage interface that facilitates dynamic provisioning of the storage media, wherein the physical storage may be either local or remote.
In accordance with yet other embodiments, the device may further comprise a virtual network interface that facilitates dynamic provisioning of the storage media, wherein the physical storage may be either local or remote. In accordance with yet other embodiments, the device may be adapted to be installed as a controller card on a host computing system, in particular, wherein the gateway operates without intervention by the operating system of the host computing system.
In accordance with yet other embodiments, the device may include at least one field programmable gate array providing at least one of the storage functions and the network functions of the device. In accordance with yet other embodiments, the device may be configured as a network-deployed switch. In accordance with yet other embodiments, the device may further comprise a functional component of the device for translating storage media instructions between a first protocol and at least one other protocol.
With reference to
In accordance with various embodiments, the first protocol is at least one of a SATA protocol, an NVMe protocol, a SAS protocol, an iSCSI protocol, a fiber channel protocol and a fiber channel over Ethernet protocol. In other embodiments, the second protocol is an NVMe protocol.
In some embodiments, the method may further comprise providing an interface between an operating system and a device that performs the translation of instructions between the first and second storage protocols and/or providing an NVMe over Ethernet connection between the device that performs the translation of instructions and a remote, network-deployed storage device.
With reference to
In accordance with various embodiments, the migration is of a Linux container or a scaleout application.
In accordance with yet other embodiments, the target physical storage is a network-deployed storage device that uses at least one of an iSCSI protocol, a fiber channel protocol and a fiber channel over Ethernet protocol. In yet other embodiments, the target physical storage is a disk attached storage device that uses at least one of a SAS protocol, a SATA protocol and an NVMe protocol.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having virtualization of at least one type of non-disk-attached storage such that it is handled as if it is disk attached storage in a converged networking/storage.
With reference to
Referring to the architecture 2900 depicted in
Referring to the architecture 3000 of
In embodiments, consideration may be given to DMA versus data transmission on a network. Any NVME I/O typically involves DMA to/from the host memory, such as using PRP/SGL lists. In embodiments, one way of architecting the protocol could be to pass the PRP/SGL lists over the network. The drawbacks associated with this approach are the need to reconcile various host OS page sizes and destination device page sizes, resulting in inefficiency and needless complexity. Also, passing host memory address pointers over a network is potentially insecure and may require protection, such as with digital signatures, against incorrect accesses.
In embodiments, these problems may be mitigated or avoided by using an architecture 3100 as depicted in
Referring to
The Burst Transmission Protocol (BTP) layer 3204 provides guaranteed delivery semantics for the NVMEoN protocol to run between a pair of nodes. The BTP layer 3204 may: provide guaranteed delivery of NVME command and data packets; reserve buffers at the receiver for the NVME command and data packets; avoid delivery of duplicate packets to upper layers; minimize control packet overhead by aggregating NVME flows across proxy controllers and queues (by transmitting multiple packets in one burst); and leave the order of delivery of packets to the upper layer (such as a choice of in-order delivery) as an implementation choice for the designer.
In the context of the description of BTP, a BTP sender should be understood from the point of view of a node that sends packets to another node (the BTP receiver). This is distinct from NVME command initiators and targets. For instance, the NVME target, when processing a write command, becomes a BTP sender when it sends transfer ready (referred to herein in some cases as “Xfer Rdy”) packets and data packets. A given node can be both a BTP sender and a receiver at any point of time.
In embodiments, four types of packets are supported by the BTP. First, BTP Command packets are small (e.g., default max 256 bytes) packets that are used for sending NVME command, status and NVMEoN control packets (like Xfer Rdy and Exchange Cleanup). Second, BTP Batched Command packets allow for multiple NVME commands to be packed into one packet (default max, e.g., 1500 bytes). Third, BTP Data packets may be large and may depend on the typical MTU configuration in the network (e.g., default max of 1500 bytes). Such packets may be used for sending actual NVME data packets. Fourth, BTP Control packets may exchange BTP control information. BTP Command and Data Packets may be stored in buffers in some implementations.
In embodiments, a burst window may be understood to comprise a window in which a BTP sender can request credits and send a number of packets. Expected and received burst windows may be used by a BTP receiver to track packets received, such as in a sliding window of sorts. A Request Credit may signify the start of a burst. An ACK may signify the end of a burst. See below for a further explanation of these.
In embodiments, a burst ID may comprise a number, such as a 24-bit (rather than 32-bit) number, identifying each burst window uniquely. A BTP sender may start with a random number and increment this for every new burst window.
In embodiments, a sequence ID may comprise a number, such as a 30-bit number, uniquely identifying every BTP packet (e.g., command, data, control) across burst windows. The only requirement is that the sequence id is preferably unique across burst windows and should only be reused by the sender after an ACK from the receiver indicating that it has been successfully delivered. It need not be monotonically increasing, but if it is implemented that way, the starting sequence id is preferably randomized.
Between a pair of nodes, there can be multiple BTP channels. All BTP state information may be maintained per BTP channel. The BTP protocol (described below) may run within the scope of a BTP channel. The BTP channel may be identified, such as by using an 8-bit channel ID in the header (along with the 24-bit burst ID). By default, at least one channel (with channel ID 1) should preferably be supported between a pair of nodes. Setting up the BTP channels between a pair of nodes may be implemented as a design choice. In embodiments, multiple BTP channels may be used in order to achieve high throughput and link utilization or to provide multiple classes of service.
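As a sketch of how the identifiers above might be carried, the 8-bit channel ID and the 24-bit burst ID fit together in one 32-bit word. The field order (channel ID in the high byte), the wraparound of the burst ID at 2^24 and the function names are assumptions for illustration; the header layout itself is not specified in this passage.

```python
import random

BURST_ID_BITS = 24    # burst ID width stated in the text
CHANNEL_ID_BITS = 8   # channel ID width stated in the text
SEQ_ID_BITS = 30      # sequence ID width stated in the text


def pack_channel_burst(channel_id, burst_id):
    """Pack channel ID and burst ID into one 32-bit word (assumed layout)."""
    if not 0 <= channel_id < (1 << CHANNEL_ID_BITS):
        raise ValueError("channel id out of range")
    if not 0 <= burst_id < (1 << BURST_ID_BITS):
        raise ValueError("burst id out of range")
    return (channel_id << BURST_ID_BITS) | burst_id


def unpack_channel_burst(word):
    """Inverse of pack_channel_burst: returns (channel_id, burst_id)."""
    return word >> BURST_ID_BITS, word & ((1 << BURST_ID_BITS) - 1)


def initial_burst_id(rng=random):
    """A sender starts from a random burst ID."""
    return rng.randrange(1 << BURST_ID_BITS)


def next_burst_id(current):
    """Increment for every new burst window, wrapping at 2**24 (assumption)."""
    return (current + 1) % (1 << BURST_ID_BITS)
```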
In embodiments, multiple burst windows may overlap, taking care of pipelining requirements. A burst of transfers may secure credits, use the credits, and close. In the case of errors, granularity at the per packet and the per window basis allows for efficient recovery. Overlapping windows, among other benefits, take advantage of available bandwidth at a receiver during the time that acknowledgements are being exchanged with a source. Thus, a burst protocol may use multiple, parallel burst windows to maximize use of the network bandwidth and the bandwidth/capability of the receiver.
In embodiments, priorities can be handled, such as by having a higher priority packet initiate closing of a window so that the packet can be sent with priority. Handling priorities may also allow high priority commands to be scheduled to a BTP window ahead of low priority commands. A burst window may be configured based on the type of data, the type of network, network conditions and the like. Thus, a configurable burst window may be provided for error recovery and reliable transmission in an NVMEoN approach.
Referring to
When a BTP receiver gets a Request Credit 3302 with a new burst id, it may compare the sequence ids for which credits are requested with those received in the received window. In the Grant Credit message 3304, the receiver may specify two lists of sequence ids: a list of sequence ids “already received” for those packets in the received window and a list of sequence ids for which “credits are granted” (for packets not in the received window).
When the BTP sender receives a Grant Credit message 3304, it may first remove the packets whose sequence ids have been marked as “already received” from re-transmission queues. Next, it may send packets for which “credits are granted” in this burst window. Then, in the case that the sender has fewer packets to send than the number for which credits were granted (e.g., if the upper layers performed some cleanup between the request and grant operations of the window), the sender can send a “Close,” specifying the list of sequence ids that were sent in the burst window. The Close message is optional in the case where the sender can send all packets for which credits are granted.
An ACK message may be sent by the BTP receiver when all packets expected within a burst window are received by the BTP receiver or if the ERROR_TOV timer expires after the Grant Credit 3304 was sent by the receiver. The ACK may specify which packets have been received and which ones have not been received in two separate lists. A sender may use the ACK to determine which packets were delivered to the receiver. It may queue the packets that were not delivered to the receiver for retransmission. A receiver may drop any command/data packet with a burst id that is not in the current burst window.
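The receiver side of the Request Credit, Grant Credit and ACK exchange described above can be sketched as follows. The class `BtpReceiver`, its method names and the `max_credits` cap are hypothetical; timers such as ERROR_TOV and buffer accounting are omitted, so this is only an illustration of the two-list bookkeeping, not a complete protocol implementation.

```python
class BtpReceiver:
    """Hypothetical receiver-side bookkeeping for one BTP channel."""

    def __init__(self, max_credits=32):
        self.max_credits = max_credits  # cap on credits per burst (assumption)
        self.received = set()           # sequence ids already delivered
        self.expected = set()           # sequence ids granted in the current burst
        self.burst_id = None

    def on_request_credit(self, burst_id, seq_ids):
        """Return the two lists carried by a Grant Credit message."""
        already_received = [s for s in seq_ids if s in self.received]
        pending = [s for s in seq_ids if s not in self.received]
        granted = pending[: self.max_credits]
        self.burst_id = burst_id
        self.expected = set(granted)
        return already_received, granted

    def on_packet(self, burst_id, seq_id):
        """Accept a packet only if it belongs to the current burst window."""
        if burst_id != self.burst_id or seq_id not in self.expected:
            return False  # dropped: not part of the current burst window
        if seq_id in self.received:
            return False  # duplicate: not delivered to upper layers again
        self.received.add(seq_id)
        return True

    def build_ack(self):
        """Return (received, not_received) lists for the current burst."""
        got = sorted(self.expected & self.received)
        missing = sorted(self.expected - self.received)
        return got, missing
```

A sender would requeue the `not_received` list for retransmission and, in its next Request Credit, learn through the “already received” list which of those packets had in fact arrived.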
In embodiments the size of the burst window may be provided with a maximum value of 32 packets, which is chosen to provide a balance between two objectives: minimizing control packets overhead (3 packets for every burst of 32) while, in the event of a complete burst failure (which requires retransmission of the entire set), providing an acceptable (not too high) retransmission overhead.
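The trade-off in choosing the burst size can be made concrete with a small calculation. The sketch assumes that the three control packets per burst are the Request Credit, Grant Credit and ACK messages, and the function name is illustrative.

```python
CONTROL_PACKETS_PER_BURST = 3  # assumed: Request Credit, Grant Credit, ACK


def control_overhead(burst_size):
    """Fraction of all packets in a burst that are control packets."""
    return CONTROL_PACKETS_PER_BURST / (burst_size + CONTROL_PACKETS_PER_BURST)


# Larger bursts amortize the control packets, but a complete burst failure
# forces retransmission of burst_size packets.
for size in (1, 8, 32):
    print(size, round(control_overhead(size), 3))
```

With a burst of one packet, three of every four packets are control packets; with a burst of 32, the share falls below nine percent while a complete burst failure still costs at most 32 retransmissions.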
Certain choices of algorithms may be implementation specific, with embodiments provided below. For example, methods of distributing credit, which relates to the ability to assure quality of service (QoS), may be addressed by a credit distribution algorithm, which may be used by a BTP receiver to distribute its buffers among various senders for fairness. In embodiments, one may implement a default minimum of one command and one data buffer per BTP sender. Also, one may implement some form for maximum value for each of the command and data buffers that each BTP sender can use.
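One possible credit distribution algorithm is sketched below under stated assumptions: every sender is guaranteed one buffer, no sender exceeds a configured maximum, and the remainder is handed out round-robin. The function name and the round-robin policy are illustrative choices, not the algorithm required by the protocol; the function would be applied separately to command buffers and to data buffers.

```python
def distribute_credits(total_buffers, senders, max_per_sender):
    """Split receiver buffers among senders: at least 1 each, at most max_per_sender."""
    if total_buffers < len(senders):
        raise ValueError("need at least one buffer per sender")
    grants = {sender: 1 for sender in senders}   # default minimum of one
    remaining = total_buffers - len(senders)
    while remaining > 0:
        progressed = False
        for sender in senders:                   # round-robin for fairness
            if remaining == 0:
                break
            if grants[sender] < max_per_sender:
                grants[sender] += 1
                remaining -= 1
                progressed = True
        if not progressed:                       # every sender is at its cap
            break
    return grants
```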
A backoff algorithm may be used by a BTP sender to factor in congestion at the BTP receiver using Grant Credit responses 3304.
An algorithm may be used to prevent duplicate retransmission of delivered packets. Referring to
Packet loss detection and recovery may be addressed by introducing BTP control packets to request/grant credits and provide ACKs for packets sent in a burst window. There are several possible different packet drop scenarios that need to be accounted for and recovered from. Such scenarios are presented as flows in
Referring to
The NVMEoN exchange layer 3202 works on top of the BTP layer 3204 to provide framing and exchange level semantics for executing an NVME command (both admin and I/O) between an initiator and target. The fundamental building block for encapsulating NVME commands over the network is the introduction of the notion of an exchange for each NVME command. The NVMEoN exchange layer at the initiator may allocate a unique exchange for every NVME command that it sends to a given target. An NVME command may result in multiple exchanges. For example, if an NVME command is divided into multiple sub-commands, there may be multiple exchanges associated with the NVME command. The initiator may maintain state information about this exchange in an exchange status block (ESB) until the exchange is completed. In embodiments, the initiator may ensure that the exchange is unique to cover NVME commands across the proxy controller ID (the ID of the proxy controller 2902 at the initiator), the queue ID (the ID of the queue within the proxy controller 2902), and the command ID (the ID of the NVME command within the given queue). Translating these parameters to a unique exchange at the initiator means that the network and the target can be agnostic to these parameters. The NVMEoN exchange layer at the target may allocate an ESB entry upon receipt of the first network packet in the exchange, which will be the NVME command that initiated the exchange.
Referring to
The exchange id may be, for example, a 32-bit value divided into two components: an initiator's component of the exchange ID (IXID), which may be allocated by the initiator when the first packet of the exchange (the NVME command) is sent to the target and a target's component of the exchange ID (TXID), which may be allocated by the target when it receives the first packet of the exchange (i.e., the NVME command).
The following guidelines may govern the usage of exchange ids. First, the NVME command packet may signify the start of the exchange. The NVME status packet may also signify the end of an exchange. When the Initiator sends the first packet of the exchange, it may set the TXID, such as to 0xFFFF. The target may allocate a TXID upon receipt of this first packet. In the next packet that the target sends to the initiator, it may set the TXID for that packet to the allocated TXID. IXIDs and TXIDs are only required to be unique between an initiator and target pair. There is no necessity for this to be monotonically increasing, but that is an option.
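The exchange ID handling above can be sketched as follows. The sketch assumes the 32-bit exchange ID is split into a 16-bit IXID in the high half and a 16-bit TXID in the low half; the 16-bit width is inferred from the 0xFFFF placeholder, and the ordering of the halves and the function names are assumptions.

```python
TXID_UNASSIGNED = 0xFFFF  # value the initiator sets in the first packet


def make_exchange_id(ixid, txid=TXID_UNASSIGNED):
    """Combine the initiator and target components into one 32-bit exchange ID."""
    if not (0 <= ixid <= 0xFFFF and 0 <= txid <= 0xFFFF):
        raise ValueError("IXID and TXID must each fit in 16 bits")
    return (ixid << 16) | txid


def split_exchange_id(exchange_id):
    """Return (ixid, txid) from a 32-bit exchange ID."""
    return exchange_id >> 16, exchange_id & 0xFFFF


# Initiator opens the exchange; the target has not yet allocated a TXID.
first_packet_id = make_exchange_id(0x0012)
# Target allocates TXID 0x0034 and uses it in its next packet.
reply_id = make_exchange_id(0x0012, 0x0034)
```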
As to the total number of concurrent exchanges, the initiator and target should support the same total number of concurrent exchanges in a given direction. In one example, the minimum value may be one, while the maximum value may be much larger, such as sixteen thousand. The actual value can be determined by an upper level entity, but should preferably be configured to be the same at the NVMEoN exchange layer 3202 at the initiator and at the target. If the target receives more than the number of concurrent exchanges that it supports with an initiator, it can drop an exchange, allowing the initiator to time out.
A state machine may be implemented. Both initiator and target may follow a simple lock step mechanism in order to complete an I/O. There may be four states an exchange can be in: OPEN, DATA XFER, CLEANUP, or CLOSED. The triggers for the initiator and the target to place the exchange into the appropriate state are described in the table 4300 of
The NVMEoN exchange layer 3202 may be responsible for breaking down NVME command and data packets to adhere to the network MTU expected by the BTP layer 3204. In the depicted embodiment, the minimum workable size of MTU is 512 (as NVMEoN control packets are preferably not fragmented). However, in other examples, other minimum workable sizes may be set. The maximum size may be uncapped to allow operating in networks with very large MTU enabled. Actual path MTU discovery may be implemented in various ways as would be understood by those of ordinary skill in the art. It may be statically configured or discovered using agents running a standard discovery protocol on every node. The path MTU may or may not remain uniform in the network. For example, the path MTU may be different in each direction between a given pair of nodes. In accordance with the illustrated example of the protocol, the MTU may be configured by some entity per initiator, per target, or per pair.
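The fragmentation responsibility described above can be sketched as a simple split of a payload into MTU-sized pieces. The function name and the `header_len` parameter (space reserved for protocol headers) are hypothetical; the 512-byte floor is the minimum workable MTU of the depicted embodiment.

```python
MIN_MTU = 512  # minimum workable MTU in the depicted embodiment


def fragment(payload, mtu, header_len=0):
    """Split payload into chunks that fit within mtu after header_len bytes."""
    if mtu < MIN_MTU:
        raise ValueError("MTU below the minimum workable size")
    chunk = mtu - header_len
    if chunk <= 0:
        raise ValueError("header leaves no room for data")
    return [payload[i:i + chunk] for i in range(0, len(payload), chunk)]


# A 4K write fragmented at 1K boundaries yields four data packets.
fragments = fragment(bytes(4096), 1024)
```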
In an example, for each Remote NVME controller 2904 at the target, exactly one <Initiator, Proxy NVME Controller 2902> pair may be associated by an entity. The nodes may discover the remote namespaces exposed, and the remote control pairing may be done, in various ways without limitation. The control plane may be architected and proposed for standardization if the Network Path MTU and the Remote Controller Discovery are to be specified as part of this protocol. In embodiments, various techniques used to discover hardware in a network fabric may be used, such as approaches used in connection with iSCSI (such as IQN-based discovery) or fiber channel approaches.
In embodiments, a handshake may be established between hardware and software elements of the burst transmission protocol layer and the NVMEoN exchange layer. In hardware, the handshaking may enable very precise handling of the timing of overlapping burst windows to make optimum use of bandwidth on the network and at the receiver.
The flow diagrams as to how the NVMEoN exchange layer 3202 handles each NVME command are provided herein in conjunction with the building blocks described above. In these flows, “E” represents an exchange as described in the above subsection.
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
The error recovery may provide for handling an NVME abort Command. When an NVME driver detects an NVME command timeout, its recovery action may involve sending an NVME abort command. This command may be handled at various layers. For example, the proxy NVME Controller 2902 at the initiator 5002 may terminate the NVME abort command and use the NVME exchange layer APIs to clean up resources like the ESB and BTP queues. Also, the initiator 5002 may generate an NVMEoN cleanup request to the target 5004 identifying the exchange to clean up. Also, the remote NVME controller 2904 may clean up resources allocated for this exchange (e.g. commands queued to disk drives). Once all cleanup is done, a NVMEoN cleanup response may be passed all the way back to the Proxy NVME controller 2902 which may terminate the original I/O request and complete the Abort command.
The error recovery approach may provide for a reset of the proxy NVME controller 2902. The Proxy NVME controller reset may be handled using exactly the same or a similar flow as for an NVME abort command, but extending the NVMEoN cleanup request to specify multiple exchanges to clean up. The NVMEoN exchange layer at the initiator 5002 may keep track of which exchange IDs correspond to the proxy NVME controller 2902 and hence can do this translation. There may not need to be any resetting of the remote NVME controller 2904, as it is a logical, rather than physical, entity. Various possible failure scenarios during error handling (such as due to prolonged network drops) and recovery mechanisms that clean up resources at both the initiator 5002 and target 5004 are described here without limitation in accordance with various examples.
Referring to
Referring to
In order to determine the efficiency of the NVMEoN protocol, a comparison may be made relative to a protocol that sends every NVME PCIe transaction over the network with error detection boundaries at the NVME command level. Referring to
Referring to
In contrast to the efficient flows of
In order to compare overhead between the protocols, different sized I/Os, normal cases and drop scenarios may be considered. For example, “NVMEoN” may refer to the exemplary approach discussed above, while “Raw NVME” may refer to an approach of sending each NVME command without retransmissions being built into the protocol. The NVME data packets may be assumed to be fragmented at 1K boundaries (network MTU) in both cases for the sake of simple comparison, but may be fragmented differently in other examples.
Without limitation, an example comparison between NVMEoN and raw NVME for a single 4K write I/O command involves 19 total packets for NVMEoN (four request packets, four grant packets, and four ACK packets, i.e., four cycles=12 packets, plus three NVME command packets and four NVME data packets) and 11 total packets for raw NVME (command doorbell, fetch command, write command, Xfer ready, status doorbell, fetch status, status and four data packets). This example provides a theoretical scenario where only one NVME 4K I/O is outstanding between a pair of nodes, with no aggregation of flows. However, more typically there may be many flows between an initiator 4402 and a target 4404, allowing for a more efficient usage of every burst window in various examples. As there are additional flows, NVMEoN performs much better than raw NVME. An example comparison for 16 parallel 4K Write I/Os with no drops involves 127 total packets for NVMEoN (five request, grant, and ACK cycles=15 packets; 48 NVME command packets; and 64 NVME data packets) and at least 176 and up to 192 total packets for Raw NVME (11 packets for sending each 4K Write I/O, including Command Doorbell, Fetch Command, Write Command, Xfer Ready, Status Doorbell, Fetch Status, and Status, plus four data packets, times 16 I/Os results in 176 total packets).
In an example for a 32K Write I/O with no drops, NVMEoN requires 47 packets (four request, grant and ACK cycles=12 packets, three NVME command packets and 32 NVME data packets), while Raw NVME requires 88 total packets (11 for each 4K write I/O as noted above, times 8 cycles).
In an example involving a 128K Write I/O with no drops, NVMEoN would require 152 total packets (7 Request, Grant, ACK cycles=21 packets, plus 3 NVME command packets and 128 NVME data packets), while Raw NVME would require 352 (the same 11 packets for each 4K write I/O as noted above, sent in 32 cycles for a total of 352 packets).
An example for a 32K Write I/O with a single data packet drop involves 51 packets for NVMEoN (5 Request, Grant, ACK cycles=15 packets, plus 3 NVME command packets, 32 NVME data packets and 1 NVME data packet retransmission), while Raw NVME requires 99 total packets (8 cycles of the same 11 packets needed for a 4K Write I/O as noted above, plus one retransmission of 11 packets, for a total of 99).
An example for a 32K Write I/O with two data packet drops is provided below without limitations. In accordance with this example, the two dropped packets may span 4K segments, but in other examples, the two dropped packets might span differently. Here NVMEoN requires 52 total packets (5 Request, Grant, ACK cycles=15 packets, plus 3 NVME command packets, 32 NVME data packets and 2 NVME data packets retransmission), while Raw NVME requires 110 total packets, including 8 cycles and 2 retransmission cycles (a total of 10), for the same 11 packets required for each 4K write I/O.
Thus, as seen in these examples, as complexity increases, drops occur, or parallel flows are involved, NVMEoN, which is comparable in performance to raw NVME for the simplest case, becomes significantly more efficient than raw NVME when sending data over a network.
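The packet totals in the foregoing examples can be reproduced with a short calculation. The function names are illustrative; the cycle counts, command packet counts and data packet counts are taken directly from the examples rather than derived, and raw NVME is modeled as 11 packets per 4K segment as described above.

```python
RAW_PACKETS_PER_4K_IO = 7 + 4  # seven command/status packets plus four 1K data packets


def nvmeon_total(cycles, command_packets, data_packets, retransmitted=0):
    """Each Request/Grant/ACK cycle contributes three control packets."""
    return 3 * cycles + command_packets + data_packets + retransmitted


def raw_nvme_total(io_size_kb, parallel_ios=1, retransmitted_4k_segments=0):
    """Raw NVME repeats the full 11-packet sequence per 4K segment, including retries."""
    segments = (io_size_kb // 4) * parallel_ios
    return RAW_PACKETS_PER_4K_IO * (segments + retransmitted_4k_segments)


print(nvmeon_total(4, 3, 4), raw_nvme_total(4))            # single 4K write
print(nvmeon_total(5, 48, 64), raw_nvme_total(4, 16))      # 16 parallel 4K writes
print(nvmeon_total(4, 3, 32), raw_nvme_total(32))          # 32K write, no drops
print(nvmeon_total(7, 3, 128), raw_nvme_total(128))        # 128K write, no drops
print(nvmeon_total(5, 3, 32, 1), raw_nvme_total(32, 1, 1)) # 32K write, one drop
print(nvmeon_total(5, 3, 32, 2), raw_nvme_total(32, 1, 2)) # 32K write, two drops
```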
Various timer values as used herein in various layers of the protocol may be set according to considerations relating to particular implementations. For example, a timer referred to as ERROR_TOV may be used by the BTP layer 3204 to detect packet losses in the network. An exemplary value may be 100 milliseconds, though other values may be defined in other examples. A timer referred to as EXCH_CLEANUP_TOV may be used by the initiator 4402 to determine a persistent network outage, causing the exchange cleanup to be dropped. An exemplary value may be 60 seconds, though other values may be defined in other examples. A timer referred to as EXCH_TOV may be used by the target 4404 to detect exchange timeouts due to repeated drops in the network and may clean up local resources. An exemplary value may be 90 seconds, but other values may be used.
There may be two possible options for implementing the NVMEoN to span network boundaries. In accordance with a first method, the initiator 4402 and the target 4404 end points may be identified using Ethernet MAC addresses. An exemplary implementation may encapsulate the L2 packet in an overlay mechanism like VXLAN to span L3 segments. A special ethertype may be needed to standardize this. In accordance with a second method, the initiator 4402 and the target 4404 end points may be identified using a special UDP port over a node's IP address. Standardizing on the UDP port number may facilitate the method.
In accordance with different embodiments, various network packet formats used for transporting NVME command/data packets and NVMEoN and BTP control packets may be employed. The packet formats may be defined with the initiator 4402 and the target 4404 as L2 endpoints. This may be seamlessly extended to L3 endpoints since it may not be dependent on the encapsulation in the protocol.
Referring to
In various examples, various choices for implementation of the protocol may be employed. For example,
Referring to
In alternative embodiments, the map 6712 that tracks locations of writes may be (a) statically allocated or (b) dynamically allocated. A statically allocated map has the advantage that it does not require a lot of memory to hold the map; for example it can be a formula by which one can compute the SSD and offset where the logical access lies. Consider the example of a volume layout across four SSDs, where the logical blocks of a volume are simply striped across the four SSDs. In that example, for a volume layout of size 100 GB, the first 25 GB of data for the volume could be placed on first 25 GB of storage locations on the first SSD, the next 25 GB of data for the volume could be placed on the first 25 GB of storage locations on the second SSD, and so on. More complex layouts can be generated, such as using non-sequential storage blocks within particular SSDs, using non-sequential patterns for writing to the various SSDs, and the like, as described in more detail below. As long as the map is retained, the location of the actual data can be determined by reference to it. A disadvantage of this static type of mapping is the fact that when a block gets over-written the write will be directed to the same block on the particular SSD where the block being written to is mapped, making it increasingly difficult to deal with over-writes as they accumulate over time.
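The statically allocated map in the four-SSD example can be expressed as a formula rather than a stored table, which a minimal sketch makes explicit. The function name and the zero-based SSD index are illustrative choices; the layout follows the example of placing each successive 25 GB of a 100 GB volume on the next SSD.

```python
GB = 1 << 30


def static_locate(logical_offset, volume_size=100 * GB, num_ssds=4):
    """Compute (ssd_index, offset_on_ssd) for a logical byte offset.

    ssd_index is zero-based, so the "first SSD" of the example is index 0.
    """
    if not 0 <= logical_offset < volume_size:
        raise ValueError("offset outside the volume")
    region = volume_size // num_ssds   # 25 GB per SSD in the example
    return logical_offset // region, logical_offset % region
```

Because the location is a pure function of the logical offset, an overwrite of the same logical block always lands on the same physical location, which is the disadvantage noted above.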
In the case of dynamic allocation of the map, a disadvantage is that an indirection map has to be kept; however, the advantage the dynamic allocation approach provides is that one can then issue backend writes to the SSD in a linear fashion. This indirection map needs to be kept updated and stored persistently. As an example, suppose the allocation is done in the same manner as in the example of static mapping, i.e., a dynamic volume layout of size 100 GB, in which the first 25 GB is on the first SSD, the next 25 GB is on the second SSD, and so on. When a write comes, say for offset 1 (for a write that is 4K in size, for example), it is written at offset 0 on the first SSD, with this mapping (logical block 1→SSD1, offset 0) stored in the dynamic map. The second write may come for logical block 10, which will then be written to the first SSD at offset 1, yielding the mapping (logical block 10→SSD1, offset 1). The advantage of this approach is that the backend SSD is written in a sequential manner, which results in a gain in garbage collection efficiency.
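A minimal sketch of the dynamically allocated indirection map follows. The class name `DynamicMap`, block-granular addressing, the zero-based SSD index and the `stale` list of invalidated locations are assumptions for illustration; persistence of the map and SSD capacity limits are not modeled.

```python
class DynamicMap:
    """Hypothetical indirection map that appends writes sequentially on each SSD."""

    def __init__(self, volume_blocks, num_ssds=4):
        self.region = volume_blocks // num_ssds  # logical blocks assigned per SSD
        self.next_free = [0] * num_ssds          # next sequential offset on each SSD
        self.mapping = {}                        # logical block -> (ssd, offset)
        self.stale = []                          # locations holding overwritten data

    def write(self, logical_block):
        ssd = logical_block // self.region       # same volume layout as the static example
        offset = self.next_free[ssd]
        self.next_free[ssd] += 1                 # backend writes stay sequential
        if logical_block in self.mapping:
            # An overwrite goes to a new location; the old one becomes invalid.
            self.stale.append(self.mapping[logical_block])
        self.mapping[logical_block] = (ssd, offset)
        return ssd, offset

    def lookup(self, logical_block):
        return self.mapping.get(logical_block)
```

In this sketch, writes to logical blocks 1 and 10 land at offsets 0 and 1 of the first SSD (index 0), matching the example, and a later overwrite of logical block 1 is appended at offset 2 rather than rewritten in place.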
Managing storage across a collection of drives 6702 can provide significant advantages in connection with certain challenges and inefficiencies involved in cleaning up invalid data, known as garbage collection. By way of background, garbage collection is a fundamental process in solid state drives (SSDs).
As an example of the challenges created by garbage collection processes, there may be a new write block for the new data, and there may be an erase block (e.g., 1 MB to 2 MB) corresponding to the old, now invalid data. Each block is typically made up of a plurality of smaller pages. The entire SSD may comprise a much larger storage resource, such as comprising a 100 GB drive, or larger. Also, backup space (e.g., 20 GB) may be retained on the drive. As noted above, on an SSD one cannot write on the same page again unless the block containing that page has been erased, but the “erase” operation is a costly operation. If one has to overwrite a block, the SSD would mark the block invalid, and the new block that one is seeking to write is written to the backup location; that is, the SSD cannot overwrite data until it erases a whole block. The erase operation takes time, so the system typically has an internal log file system that writes serially until the end. Overwrites are written to the backup portion of the drive, and the system keeps marking some of the pages invalid, as new data is written to the backup area. Eventually, garbage collection finds a block that has invalid pages, copies the valid pages of data from that block to backup, erases the block, and makes it available for re-writes. Flash memory has this property: it is not a “write in place” medium. In the first round, Flash memory performs very well for write operations, but as the drive nears being full, the garbage collection process requires many cycles of copying and erasing, so the drive performance diminishes significantly. For example, drive performance may diminish from 100K IOPS to 20K IOPS as the drive gets deep into garbage collection in order to make blocks available for new write operations. Internally the drive is moving large amounts of data in large numbers of operations.
Eventually, user requests get blocked, because the disk is locked as it moves around data; that is, the disk can't write to the new place while the disk is copying and moving the data to make room for a subsequent erase operation. The garbage collection process for a drive could last, for example, from a millisecond to a second, during which the drive is locked for the user.
Some SSD vendors provide a garbage collection API by which third parties may manage garbage collection on the SSDs. In embodiments, such an API may be adapted to accommodate a converged storage solution as described throughout this disclosure in a manner that improves the performance of a pool of SSDs as compared to the diminishing performance normally seen as SSDs become full, due to the burdens of garbage collection. As noted elsewhere in this disclosure, embodiments of a converged solution may employ a set of SAS controllers, which may control a plurality of SSDs as a pool (e.g., six SSDs), such as the collection depicted in
The aforementioned APIs may be provided for various storage protocols, such as SAS, SATA, and NVMe. Such APIs may enable standardization as to how to call a drive and instruct it when to go into garbage collection and for how long garbage collection should take place. In embodiments, a given amount of space (e.g., seven percent) may be left reserved for garbage collection, to avoid problems that may occur with running completely out of space.
With the ability to manage garbage collection across a pool of SSDs, the manager of a pool can monitor the SSDs, such as knowing if the SSDs are of different sizes, manufacturers, or performance characteristics, so that garbage collection can be based on such awareness. At the system level, the user has control over when to ask given SSDs to do garbage collection.
Some SSD vendors also have APIs to indicate how many free blocks are available. Awareness of this information may allow a user of the converged solution described throughout this disclosure to perform garbage collection selectively, such as on the drives that are dirtier.
Also, as SSDs can be of different sizes, one can arrange the garbage collection cycle based on sizes, dirtiness or other characteristics of the varying SSDs in a pool.
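Selecting drives for garbage collection by dirtiness can be sketched as a simple ranking. The sketch below assumes that free and total block counts have already been obtained from a vendor API; that API is not modeled, and the function name `pick_for_gc` and the dictionary layout are hypothetical.

```python
def pick_for_gc(drives, count):
    """Return the names of the `count` dirtiest drives.

    drives maps a drive name to (free_blocks, total_blocks). Dirtiness is
    taken to be the fraction of blocks that are not free, so drives of
    different sizes can be compared on the same scale.
    """
    def dirtiness(name):
        free, total = drives[name]
        return 1.0 - free / total

    return sorted(drives, key=dirtiness, reverse=True)[:count]


pool = {"ssd_a": (90, 100), "ssd_b": (10, 100), "ssd_c": (50, 200)}
print(pick_for_gc(pool, 2))   # ['ssd_b', 'ssd_c']
```

Other characteristics named in the text, such as size or performance parameters, could be folded into the ranking key in the same way.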
In embodiments, the system may direct all the SSDs in a pool to undertake garbage collection, if the situation called for it (e.g., during a time period when new writes are very unlikely).
In certain embodiments, provided herein is a storage system with time-varying assignment of sub-sets of SSDs in a pool of SSDs to perform garbage collection, while other SSDs in the pool remain available for writing new data.
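One way to picture the time-varying assignment is a scheduler that places one sub-group of the pool in garbage collection at a time and steers writes to the remaining sub-groups. This is a sketch under assumptions: the round-robin policy, the class name `GcScheduler`, and the notion of a "tick" are illustrative and are not taken from the disclosure.

```python
class GcScheduler:
    """Tracks which sub-groups of SSDs are performing garbage collection."""

    def __init__(self, subgroups):
        self.subgroups = subgroups   # list of lists of drive names
        self.collecting = set()      # indices of sub-groups currently in GC

    def rotate(self, tick):
        # Assumed policy: exactly one sub-group collects per tick, round robin.
        self.collecting = {tick % len(self.subgroups)}

    def writable_drives(self):
        # Writes are directed only to sub-groups that are not collecting.
        return [drive
                for index, group in enumerate(self.subgroups)
                if index not in self.collecting
                for drive in group]


sched = GcScheduler([["s1", "s2"], ["s3", "s4"], ["s5", "s6"]])
sched.rotate(0)
print(sched.writable_drives())   # ['s3', 's4', 's5', 's6']
```

In a real system the rotation would be driven by the operator-designated schedule and by the drives' garbage collection interface rather than by a counter.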
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a storage system with time-varying assignment of subsets of SSDs to garbage collection.
In certain embodiments, provided herein is an application programming interface for configuring an SSD to initiate and close a garbage collection activity according to a schedule determined by a system external to the SSD that uses the SSD.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having an application programming interface for configuring an SSD to initiate and close a garbage collection activity.
In certain embodiments, provided herein is a storage system with log-based file storage that is striped sequentially across a plurality of SSDs, in which the system uses time-varied garbage collection among SSD nodes in the plurality of SSDs.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a storage system with log-based, striped storage, with time-varied garbage collection among SSD nodes.
In certain embodiments, provided herein are methods and systems for arranging the garbage collection cycle for a plurality of SSDs based on sizes, dirtiness, performance parameters, or other characteristics of the SSDs in a pool of SSDs.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having arrangement of a garbage collection cycle based on at least one of the size and cleanliness of the SSDs in a collection of SSDs.
In certain embodiments, provided herein are methods and systems for coordinating the timing of garbage collection in an SSD with a discontinuous write strategy for a plurality of related SSDs with which the SSD is pooled.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having coordination and synchronization of garbage collection in an SSD with a discontinuous write strategy for the SSDs in a collection of SSDs.
In embodiments, methods and systems are provided for arranging sets of optimally sized drives in a collection of SSDs, including to satisfy drive writes per day (DWPD) requirements on a per application basis. A given drive in a collection, such as a 100 GB drive, may be warranted over a given duration (e.g., 3, 5 or 8 years) to provide a minimum number of DWPD, and the DWPD for a drive relates to its ability to handle the write requirements of one or more applications. In SSDs that use media, such as NAND Flash, that requires erasure before re-writing to a block, there can be limits to the number of times the media can be erased (e.g., 10,000 to 30,000 times). This is due to the limited life of the physical substrate used in the media. The number of drive writes per day allows determination of the duration of a warranty, and higher intensity (higher DWPD) drives are more expensive. In the field, it can be very difficult to determine the appropriate drive for a given application, because the number of writes may be somewhat unpredictable.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a facility for arranging sets of optimally sized drives in a collection of SSDs.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a hardware device providing erasure encoding for an array of redundant disks that are treated as one logical unit across multiple storage boxes.
As noted above in the example of
With this capability to tune the DWPD over a set of drives, one can club/group applications intelligently to make good use of the purchased and warranted level of DWPD. This operation is a major advancement in practical situations. Today, applications may use a drive as a cache (often without the user being aware of that fact), and the drive, if not sized with the correct DWPD, may largely be a wasted resource. Also, today, if one buys three DWPD, but an application is doing five DWPD, then the system must throttle the application back or risk violating the warranty terms. On the other hand, if the user buys five DWPD and uses three, then the money for the additional DWPD is entirely wasted. Either situation is sub-optimal as compared to tuning to the correct DWPD needs of the application.
Internally, the system may enable a write or a set of writes. As the system sees drives taking more writes than warrantied, the system may allocate the write load to another drive in the collection to balance the writes with the warranted DWPD. If the system is initially misconfigured, one can request more writes per day to accommodate. This is possible because the converged solution controls multiple SSDs in a set, even though the set may be seen by the operating system/file system as a single drive.
In one example, if all six drives are 100 GB drives, warranted at one DWPD, the storage node has a total of 600 GB at one DWPD. This much can be written across these six drives. One may define an allocation policy for writing 100 GB per day across each of the six.
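The six-drive example can be worked through as a per-drive daily write budget, with each write steered to the drive that has the most budget remaining. The policy and the names `daily_budget_gb` and `pick_drive` are assumptions made for illustration; the disclosure does not prescribe a particular balancing rule.

```python
def daily_budget_gb(size_gb, dwpd):
    """Daily write budget of one drive: capacity times warranted DWPD."""
    return size_gb * dwpd


def pick_drive(budget_gb, written_gb, size_gb):
    """Return the drive with the most remaining budget, or None if the
    write of size_gb would exceed every drive's remaining budget."""
    remaining = {d: budget_gb[d] - written_gb.get(d, 0) for d in budget_gb}
    best = max(remaining, key=remaining.get)
    return best if remaining[best] >= size_gb else None


# Six 100 GB drives warranted at one DWPD: 600 GB of writes per day in total.
budget = {f"ssd{i}": daily_budget_gb(100, 1) for i in range(1, 7)}
print(sum(budget.values()))                                  # 600
print(pick_drive(budget, {"ssd1": 95, "ssd2": 40}, 10))      # ssd3
```

When ties occur, the first drive in insertion order is chosen, which is why `ssd3` is returned in the usage above.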
Thus, in embodiments, methods and systems are provided for arranging sets of optimally sized drives in a collection of SSDs, which may include arrangements that are based on DWPD requirements for one or more given applications.
Also, in embodiments, given a variety of combinations of DWPD parameters and life of one or more drives, methods and systems are provided for mixing and matching, virtualizing and providing the equivalent of what each application needs in a group of drives that are managed as a group.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a facility for selective mixing and virtualization of SSDs of varying DWPD parameters and life expectancy to satisfy needs of at least one application by a group of heterogeneous SSDs.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having hardware encryption at a second level of virtualization of an SSD.
In embodiments, the methods and systems disclosed herein may further employ compression (e.g., LZIP), de-duplication (e.g., MBHash), thin provisioning, load balancing, and other techniques to further optimize the use of a collection of drives.
Consider de-duplication as an example. An SSD capable of doing de-duplication can be optimized at the system level in the following way. Taking an example of a system with six SSDs, and using a dynamic volume layout, the six SSDs can be divided into six ranges. For example, if the SSD uses a Secure Hash Algorithm 1 (SHA-1) technique to fingerprint a block of 4 k in size, the output of the SHA-1 algorithm is 20 bytes, or 160 bits. That means the range is {2^0, 2^160}. This range can be divided into six sub-ranges, say {r1, r2, r3, r4, r5, r6}, and each SSD may be assigned to a sub-range. The dynamic volume map on each write operation may compute the SHA-1 for the data and re-direct the write to an SSD that falls under the assigned sub-range. The writing of the dynamic volume map may be implemented as explained elsewhere in this disclosure. With this approach one is able to achieve system-level, global de-duplication. An added advantage of this technique is the fact that no lookup or a database of SHA-1 blocks needs to be maintained at the system level.
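The fingerprint-based routing can be sketched with Python's standard `hashlib` module. Equal-width sub-ranges are an assumption of this sketch (the text only requires that the range be divided into six sub-ranges), and the function name `route_by_fingerprint` is hypothetical.

```python
import hashlib


def route_by_fingerprint(block: bytes, num_ssds: int = 6) -> int:
    """Return the index (0-based) of the SSD whose sub-range contains the
    SHA-1 fingerprint of the block."""
    digest = hashlib.sha1(block).digest()       # 20 bytes, i.e. 160 bits
    value = int.from_bytes(digest, "big")
    # Equal-width sub-ranges that together cover the whole 160-bit space.
    width = (1 << 160) // num_ssds + 1
    return value // width


block = b"x" * 4096
# Identical blocks always route to the same SSD, so that drive's own
# de-duplication sees every copy without a system-level fingerprint table.
print(route_by_fingerprint(block) == route_by_fingerprint(bytes(block)))
```

Since the target drive is computed from the data itself, no system-wide lookup structure is consulted on the write path.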
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a job de-duplication capability for networking and storage jobs.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a capability for global de-duplication and erasure encoding across a plurality of redundant storage resources.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a capability for adding nodes in a system having a capability for global de-duplication and erasure encoding across a plurality of redundant storage resources.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a hash-based system for locating data on a target storage box.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having in-line hashing and routing of the data in a network without requiring the writing of data to memory in order to perform a hash calculation.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having in-line erasure encoding in a network, without requiring the writing of data to memory in order to perform erasure encoding.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having in-line de-duplication of redundant blocks in a networking and target storage system.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having in-line de-duplication and erasure encoding in a networking and storage system without requiring writing of data to temporary memory in order to perform calculations.
To support data security, for certain storage resources, including hard drives and SSDs, there is a class of drives referred to as “self-encrypting drives” (SED). Performing data encryption in software is very expensive. One problem is that if there are multiple users of the drive, a system would preferably have more than one key (e.g., one for each user), but in conventional SEDs, there is only a single key. As a result, sharing a drive, such as in cases of multi-tenancy, causes a problem. At a system level, embodiments of a converged solution may employ certain techniques to provide encryption capability across a set of drives. First, the system can help manage the keys centrally, and do so in software in a way that takes advantage of the self-encrypting nature of the drives. Second, one can produce a key in the hardware of the converged solution for each one of the virtual volumes that is carved out of a set of drives. In this case, instead of (or in addition to) performing encryption in the drive or SSD, because the converged solution is the controller over each of a set of drives, one can generate a key per user, a key per application, or both, and still carve out suitable storage across a set of drives (e.g., six or eight drives). This can be done with hardware-assisted encryption, such as with generation of keys, as well as management of keys, being performed by hardware in the converged solution.
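The idea of a key per user, per application, or both can be pictured as a central table keyed by scope. In this sketch, software key generation with Python's `secrets` module merely stands in for the hardware key generation described above; the class name `VolumeKeyStore` and the 32-byte key length are assumptions.

```python
import secrets


class VolumeKeyStore:
    """Central key table: one key per (user, application) scope."""

    def __init__(self, key_bytes=32):
        self._key_bytes = key_bytes   # assumed key length, not from the text
        self._keys = {}

    def key_for(self, user, application=None):
        # application=None yields a per-user key; otherwise the key is
        # specific to that user and application together.
        scope = (user, application)
        if scope not in self._keys:
            self._keys[scope] = secrets.token_bytes(self._key_bytes)
        return self._keys[scope]


store = VolumeKeyStore()
user_key = store.key_for("tenant-a")
app_key = store.key_for("tenant-a", "database")
print(user_key != app_key)
```

Data belonging to different scopes can then reside on the same physical drives while remaining encrypted under different keys.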
Such a solution offers benefits to users; for example, if data is encrypted, then users may not be obligated to report situations where third parties have obtained access to it. Similarly, parties who wish to share data with others (such as customers with service providers, or vice versa) can allow parties access into an account, because data associated with the account is encrypted, except for specific data that is shared, such as by providing a key.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having encryption with different keys applicable to data on the same disk drive.
Thus, embodiments of this disclosure include providing hardware level encryption at a level of virtualization of a data storage resource, such as an SSD.
Also, provided herein is a solution that allows encryption of data on a drive with different keys that are applicable to data on the same drive.
Also, provided herein is a solution that allows double level hardware encryption of data on a drive, including hardware-level encryption on the drive itself (SED) and hardware level encryption at a virtualization layer above the drive (such as in an FPGA-enabled converged solution as described herein).
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having storage encryption both at the level of an SSD and at the level of a Field Programmable Gate Array.
In embodiments, a strategy is provided for writing to SSDs, wherein the writes to the SSD are sequential, but with gaps between the written blocks of data.
As an alternative, as seen in
In the case of a discontinuous, sequential write strategy, when the system goes back to a block as it continues its sequential path through the drive, it knows, by virtue of keeping a map 6712, what blocks were invalidated (even if the blocks were not erased). The system knows, for example, that some other block was over-written by new data that superseded a block that was written at this location. By knowing what blocks were invalidated, one can write directly into free pages in the invalid regions. The system operates not strictly as a log, but like an elevator. The system continues to find invalid blocks and keeps writing serially, keeping track of which blocks are valid or invalid. In such embodiments, garbage collection occurs somewhat differently from a conventional process. In the first write cycle, the SSD will see that all writes are valid. In the second round it will see some invalid blocks, and the writes will go somewhere else in the gaps to perform writes. At any given time, the SSD doesn't have very many blocks to copy and re-write elsewhere. The result is that this approach makes the SSD's garbage collection process more efficient. The system can tell the SSD what the next blocks are that the system will be using, so that garbage collection can focus on cleaning up blocks that are not going to be in use in the next cycle. In embodiments, a converged solution can signal a write strategy to an SSD, so that an SSD provider can choose an order that allows the write strategy to work in sync with the garbage collection of the SSD.
The elevator write algorithm approach is different from a log-write implementation in which the drive is written like a sequential log. In the case of log-write implementations, valid pages from a block are moved to a different block in order to create an empty block for new writes. However, the elevator write algorithm picks up only invalid pages for new writes, continuing in the same direction (as it passes from the start to the end of the drive) like an elevator that stops only at certain floors. For the first round of writes, when the entire drive is free, all the pages are picked up for writes in order, just like in a log-write implementation. The elevator write algorithm keeps track of all invalidated pages, but it may refrain from using the invalidated page information until the whole drive is written once. Thereafter, for the next round of writes, the elevator write algorithm starts picking invalid pages for new writes. This algorithm can leave the garbage collection mechanism to be performed by the drive, but the sequential write pattern (with holes) as described in this disclosure enhances the drive's garbage collection efficiency, as seen in experimental results. Among other things, the elevator write algorithm avoids the overhead of reading/writing pages to create large contiguous free space, as is required for log-write implementations.
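The elevator behavior described above can be sketched as follows. This is a simplified model under assumptions: pages are the unit of allocation, the drive's own garbage collection is not simulated, and the class name `ElevatorWriter` is hypothetical.

```python
class ElevatorWriter:
    """Writes sweep the drive in one direction, reusing only invalid pages
    after the first full pass; valid pages are never moved."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.cursor = 0          # next position of the sweep
        self.first_pass = True
        self.invalid = set()     # pages superseded by later writes
        self.location = {}       # logical block -> current page

    def write(self, logical_block):
        page = self._next_page()
        old = self.location.get(logical_block)
        if old is not None:
            self.invalid.add(old)    # the old copy becomes a hole
        self.location[logical_block] = page
        return page

    def _next_page(self):
        if self.first_pass:
            if self.cursor < self.num_pages:
                page = self.cursor
                self.cursor += 1
                return page
            # Whole drive written once: start using invalidated pages.
            self.first_pass = False
            self.cursor = 0
        if not self.invalid:
            raise RuntimeError("no invalidated pages available for reuse")
        # Keep moving in the same direction; wrap at the end of the drive.
        ahead = [p for p in self.invalid if p >= self.cursor]
        page = min(ahead) if ahead else min(self.invalid)
        self.invalid.discard(page)
        self.cursor = page + 1
        return page


w = ElevatorWriter(num_pages=4)
print([w.write(b) for b in "abac"])   # [0, 1, 2, 3]: first pass is log-like
print(w.write("b"))                   # 0: reuses the page invalidated by the second "a"
```

In the first pass the behavior matches a log-write implementation; afterwards each new write lands in the next hole along the sweep.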
In embodiments, if higher-level software keeps track of where the last writes were done, then write strategies can be optimized, including based on the performance characteristics, garbage collection approaches, or other capabilities of particular SSDs.
Thus, provided herein are methods and systems for coordinating and/or synchronizing garbage collection in an SSD with a sequential, discontinuous write strategy for an SSD.
In embodiments, a networking and storage system is provided having a capability for handling a collection of physically attached or network-distributed storage resources as a virtualized cluster of storage resources and having a sequential, gapped write strategy to an SSD.
Thus, an embodiment of a write algorithm with coordinated garbage collection is provided. Among other objectives, the write algorithm provides a consistent number of IOPS and a consistent level of latency for IO operations, resulting in a predictable overall system performance. Also, an objective is to avoid adverse effects of garbage collection by keeping track of garbage collection statistics and making appropriate adjustments based on them. The write algorithm may work in coordination with garbage collection on drives, such as SSDs, such as to help the garbage collection process on the drives. Another objective is to avoid the need for data movement to create contiguous space on drives, since data movement is expensive and can be avoided when the system helps the garbage collection process on the drives.
In this example, a segment process may be undertaken, including creating a list of free drive regions, which may or may not be physically contiguous. One may create a segment of required size by selecting free regions from a specific offset in circular fashion, such as tracked in a table, referred to as the FreeRegionTable. The FreeRegionTable may be initialized with all free regions of a drive, when it is formatted. In embodiments, regions may be given to segments in a circular fashion from the FreeRegionTable. Region entries may be inserted into this table whenever there is an over-write to an existing region. The FreeRegionTable entries are ordered based on offset of regions. When a region is deleted, the deleted region is inserted into the FreeRegionTable.
Basically this gives the garbage collection process on the drives enough time to clean up or run the garbage collection process on regions in the FreeRegionTable by the time those regions are allocated to segments.
The exemplary write algorithm may treat an SSD as a circular log. The system may have a number of SSDs, and each SSD may have its own circular logging file system. This logging file system may use the characteristics of the SSD to improve the overall performance of the file system. The logging file system differs from a conventional log-structured logging file system (referred to here as LFS), as the present system does not require any data movement, as required by LFS, for segment cleaning in order to create contiguous free space. The segment regions of the present approach can be scattered instead of being contiguous. The segments may be of various types, such as MetaData Segments, Data Segments, FreeSegments, and the like. In embodiments, a segment can consist of dis-contiguous blocks or dis-contiguous pages of the SSD.
The actual garbage collection process may be the conventional process used by a given SSD, but the system provided herein may help the SSD execute a better garbage collection process. For the sake of speed and other benefits, the system may, for example, keep certain types of segments, such as a MetaData Segment, in non-volatile memory (NVM) or in RAM, such as battery-backed RAM. However, if NVM or battery-backed RAM is not available, the system may host Metadata Segments on another fast medium, or on the same SSD itself.
In the algorithm, SSD writes may all go to new locations on the drive, and corresponding old locations are maintained explicitly by the FreeRegionTable. The data on old pages is invalid. During the execution of the write path, the system issues writes into a Data Segment after writing a Metadata entry in a Metadata Segment. Once the write is completed, the system moves old blocks corresponding to the write into a FreeSegment.
Performance improvements occur for a number of reasons. First, whenever a region is overwritten, the flash translation layer (FTL) of the drive may write that data into a new location, and the old entry will be marked for garbage collection. The old entries are maintained in the FreeRegionTable. The blocks should have completed garbage collection by the time they are allocated to a new Segment. The system may see some improvement by explicitly maintaining these old regions in a temporary Table and moving them to FreeRegionTable after they are done, such as with handling by erase commands, such as TRIM or UNMAP commands. The system may issue garbage collection on old, invalid regions by asking the drive to garbage collect these regions before moving them to FreeSegmentTable. Issuing and waiting for this cleanup takes time, so in embodiments the system may maintain these regions in an InvalidRegionSegmentTable. An asynchronous thread may keep on issuing TRIM commands on these regions, and once these regions are done (completing the “TRIMming”), the system may move them to FreeRegionTable.
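The staging of invalidated regions before they become free can be sketched with two tables. In this sketch the `trim` callable is a hypothetical hook standing in for whatever TRIM or UNMAP mechanism the platform provides, the class name `RegionTables` is an assumption, and `trim_pending` represents the body of the asynchronous worker, run here synchronously for clarity.

```python
import bisect
from collections import deque


class RegionTables:
    def __init__(self, trim):
        self.trim = trim          # callable(offset, length); hypothetical hook
        self.invalid = deque()    # invalid regions awaiting TRIM
        self.free = []            # FreeRegionTable, ordered by offset

    def invalidate(self, offset, length):
        """Record a region superseded by an over-write or delete."""
        self.invalid.append((offset, length))

    def trim_pending(self, limit=None):
        """TRIM staged regions, then publish each one as free."""
        done = 0
        while self.invalid and (limit is None or done < limit):
            region = self.invalid.popleft()
            self.trim(*region)
            bisect.insort(self.free, region)   # keep ordering by offset
            done += 1
        return done


issued = []
tables = RegionTables(trim=lambda offset, length: issued.append((offset, length)))
tables.invalidate(4096, 4096)
tables.invalidate(0, 4096)
tables.trim_pending()
print(tables.free)   # [(0, 4096), (4096, 4096)]
```

Keeping the two tables separate means a region is never handed to a new segment before its cleanup has been requested.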
In embodiments, the DataSegment is allocated from Free Segment Table.
The volume layout for the system may include volumes that contain one or more plexes. In embodiments, more than one plex may be used to maintain data redundancy. Each plex can consist of one or more subvolumes. Each subvolume may be spread across multiple logical disks (LDs). A subvolume may completely reside on a host; however, different subvolumes of a plex can come from different hosts to facilitate growth of a volume across hosts or for dynamic data distribution.
In embodiments, each LD may be from a single drive. Each file may form an LD for the subvolume. In embodiments, the system may use a facility like compressed B+-Tree/leveldb to maintain an Address Translation Table for logical-to-physical mappings for the files. The Address Translation Table can be noted in NVM or MD Segments or can be stored in DataSegments. The file system header may include a pointer to various file inode blocks, which point to these translation tables.
When volumes are created, various metadata of the volume or metadata related to write or other operations, and the like, may be maintained in one or more Metadata Segments. The metadata may be useful during write operations on volumes, during recovery from crash, for handling mirrored volumes, and for compression, snapshot, de-duplication and various other storage features.
An example of a write algorithm may include the following steps:
// Incoming write operation
int write(ld, offset, length, buf)
{
    -> Get the existing entry from the LD for (offset, length)
    -> Make an MD entry to indicate the write operation
       (the entry contains the existing entry info and the new entry info)
    -> Issue the write to the next-available entry in the DataSegment
    -> When the write completes, insert the existing entry into the
       FreeRegionTable and replace the existing entry with the new entry in
       the file address translation table (which may be optimized further by
       maintaining these regions in an InvalidRegionTable before moving them
       to the FreeRegionTable)
    -> Ack the write

    * The MD entry is used as a journal to help in recovering in case of a
      crash at any of the steps above
    * Metadata entries go to NVM
    * The data write happens to the Data Segment
}
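The write steps above can be made concrete with an in-memory model. This is a sketch under assumptions: the class name `LogicalDisk` is hypothetical, a Python list stands in for the Metadata Segment in NVM, a dictionary stands in for the drive media, and crash recovery from the journal is not implemented.

```python
class LogicalDisk:
    def __init__(self, data_segment_slots):
        self.data_segment = list(data_segment_slots)  # slots handed out in order
        self.translation = {}        # logical offset -> physical slot
        self.journal = []            # stand-in for MD entries kept in NVM
        self.free_region_table = []  # superseded physical slots
        self.media = {}              # physical slot -> bytes (simulated drive)

    def write(self, offset, buf):
        if not self.data_segment:
            raise RuntimeError("Data Segment exhausted; allocate a new segment")
        old = self.translation.get(offset)        # existing entry, if any
        new = self.data_segment.pop(0)            # next-available entry
        # The MD entry is recorded before the data is written, so that it
        # can serve as a journal.
        self.journal.append(("write", offset, old, new))
        self.media[new] = bytes(buf)              # data goes to the Data Segment
        if old is not None:
            self.free_region_table.append(old)    # old location is now free
        self.translation[offset] = new
        return True                               # acknowledge the write


ld = LogicalDisk(data_segment_slots=[100, 101, 102])
ld.write(0, b"first")
ld.write(0, b"second")
print(ld.translation[0], ld.free_region_table)    # 101 [100]
```

Every write goes to a new location, and the superseded location is tracked explicitly rather than being overwritten in place.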
An example of a data segment allocation algorithm may include the following:
Segment* AllocateSegment(Size)
{
    allocatedSize := 0
    Segment := { }
    while (allocatedSize < Size) {
        region := FreeRegionTable[freeIndex]
        Segment += region
        allocatedSize = allocatedSize + region.Size
        // wrap around so that regions are handed out in circular fashion
        freeIndex = (freeIndex + 1) mod FreeRegionTable.Length
    }
    return Segment
}
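A runnable version of the allocation loop is sketched below. Regions are modeled as (offset, length) tuples; the up-front check for sufficient free space and the returned next index are additions of this sketch, not part of the pseudocode, and the function name `allocate_segment` is hypothetical.

```python
def allocate_segment(free_region_table, start_index, size):
    """Collect regions, in circular order from start_index, until their
    combined length reaches size. Returns (regions, next_index)."""
    if size <= 0:
        return [], start_index
    if sum(length for _, length in free_region_table) < size:
        raise ValueError("not enough free space for the requested segment")
    segment = []
    allocated = 0
    index = start_index % len(free_region_table)
    while allocated < size:
        region = free_region_table[index]
        segment.append(region)            # regions need not be contiguous
        allocated += region[1]
        index = (index + 1) % len(free_region_table)   # circular advance
    return segment, index


table = [(0, 4), (8, 4), (16, 8)]
print(allocate_segment(table, start_index=2, size=10))
# ([(16, 8), (0, 4)], 1)
```

Because total free space is checked first, the loop finishes within one lap of the table and never hands out the same region twice in a single call.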
While only a few embodiments of the present disclosure have been shown and described, it will be obvious to those skilled in the art that many changes and modifications may be made thereunto without departing from the spirit and scope of the present disclosure as described in the following claims. All patent applications and patents, both foreign and domestic, and all other publications referenced herein are incorporated herein in their entireties to the full extent permitted by law.
The methods and systems described herein may be deployed in part or in whole through a machine that executes computer software, program codes, and/or instructions on a processor. The present disclosure may be implemented as a method on the machine, as a system or apparatus as part of or in relation to the machine, or as a computer program product embodied in a computer readable medium executing on one or more of the machines. In embodiments, the processor may be part of a server, cloud server, client, network infrastructure, mobile computing platform, stationary computing platform, or other computing platform. A processor may be any kind of computational or processing device capable of executing program instructions, codes, binary instructions and the like. The processor may be or may include a signal processor, digital processor, embedded processor, microprocessor or any variant such as a co-processor (math co-processor, graphic co-processor, communication co-processor and the like) and the like that may directly or indirectly facilitate execution of program code or program instructions stored thereon. In addition, the processor may enable execution of multiple programs, threads, and codes. The threads may be executed simultaneously to enhance the performance of the processor and to facilitate simultaneous operations of the application. By way of implementation, methods, program codes, program instructions and the like described herein may be implemented in one or more threads. A thread may spawn other threads that may have assigned priorities associated with them; the processor may execute these threads based on priority or any other order based on instructions provided in the program code. The processor, or any machine utilizing one, may include non-transitory memory that stores methods, codes, instructions and programs as described herein and elsewhere.
The processor may access a non-transitory storage medium through an interface that may store methods, codes, and instructions as described herein and elsewhere. The storage medium associated with the processor for storing methods, programs, codes, program instructions or other type of instructions capable of being executed by the computing or processing device may include but may not be limited to one or more of a CD-ROM, DVD, memory, hard disk, flash drive, RAM, ROM, cache and the like.
A processor may include one or more cores that may enhance speed and performance of a multiprocessor. In embodiments, the processor may be a dual core processor, quad core processor, other chip-level multiprocessor and the like that combine two or more independent cores (called a die).
The methods and systems described herein may be deployed in part or in whole through a machine that executes computer software on a server, client, firewall, gateway, hub, router, or other such computer and/or networking hardware. The software program may be associated with a server that may include a file server, print server, domain server, internet server, intranet server, cloud server, and other variants such as secondary server, host server, distributed server and the like. The server may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other servers, clients, machines, and devices through a wired or a wireless medium, and the like. The methods, programs, or codes as described herein and elsewhere may be executed by the server. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the server.
The server may provide an interface to other devices including, without limitation, clients, other servers, printers, database servers, print servers, file servers, communication servers, distributed servers, social networks, and the like. Additionally, this coupling and/or connection may facilitate remote execution of a program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more locations without deviating from the scope of the disclosure. In addition, any of the devices attached to the server through an interface may include at least one storage medium capable of storing methods, programs, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.
The software program may be associated with a client that may include a file client, print client, domain client, internet client, intranet client and other variants such as secondary client, host client, distributed client and the like. The client may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other clients, servers, machines, and devices through a wired or a wireless medium, and the like. The methods, programs, or codes as described herein and elsewhere may be executed by the client. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the client.
The client may provide an interface to other devices including, without limitation, servers, other clients, printers, database servers, print servers, file servers, communication servers, distributed servers and the like. Additionally, this coupling and/or connection may facilitate remote execution of a program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more locations without deviating from the scope of the disclosure. In addition, any of the devices attached to the client through an interface may include at least one storage medium capable of storing methods, programs, applications, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.
The methods and systems described herein may be deployed in part or in whole through network infrastructures. The network infrastructure may include elements such as computing devices, servers, routers, hubs, firewalls, clients, personal computers, communication devices, routing devices and other active and passive devices, modules and/or components as known in the art. The computing and/or non-computing device(s) associated with the network infrastructure may include, apart from other components, a storage medium such as flash memory, buffer, stack, RAM, ROM and the like. The processes, methods, program codes, instructions described herein and elsewhere may be executed by one or more of the network infrastructural elements. The methods and systems described herein may be adapted for use with any kind of private, community, or hybrid cloud computing network or cloud computing environment, including those which involve features of software as a service (SaaS), platform as a service (PaaS), and/or infrastructure as a service (IaaS).
The methods, program codes, and instructions described herein and elsewhere may be implemented on a cellular network having multiple cells. The cellular network may be either a frequency division multiple access (FDMA) network or a code division multiple access (CDMA) network. The cellular network may include mobile devices, cell sites, base stations, repeaters, antennas, towers, and the like. The cell network may be a GSM, GPRS, 3G, EVDO, mesh, or other network type.
The methods, program codes, and instructions described herein and elsewhere may be implemented on or through mobile devices. The mobile devices may include navigation devices, cell phones, mobile phones, mobile personal digital assistants, laptops, palmtops, netbooks, pagers, electronic book readers, music players and the like. These devices may include, apart from other components, a storage medium such as a flash memory, buffer, RAM, ROM and one or more computing devices. The computing devices associated with mobile devices may be enabled to execute program codes, methods, and instructions stored thereon. Alternatively, the mobile devices may be configured to execute instructions in collaboration with other devices. The mobile devices may communicate with base stations interfaced with servers and configured to execute program codes. The mobile devices may communicate on a peer-to-peer network, mesh network, or other communications network. The program code may be stored on the storage medium associated with the server and executed by a computing device embedded within the server. The base station may include a computing device and a storage medium. The storage device may store program codes and instructions executed by the computing devices associated with the base station.
The computer software, program codes, and/or instructions may be stored and/or accessed on machine readable media that may include: computer components, devices, and recording media that retain digital data used for computing for some interval of time; semiconductor storage known as random access memory (RAM); mass storage typically for more permanent storage, such as optical discs, forms of magnetic storage like hard disks, tapes, drums, cards and other types; processor registers, cache memory, volatile memory, non-volatile memory; optical storage such as CD, DVD; removable media such as flash memory (e.g. USB sticks or keys), floppy disks, magnetic tape, paper tape, punch cards, standalone RAM disks, Zip drives, removable mass storage, off-line, and the like; other computer memory such as dynamic memory, static memory, read/write storage, mutable storage, read only, random access, sequential access, location addressable, file addressable, content addressable, network attached storage, storage area network, bar codes, magnetic ink, and the like.
The methods and systems described herein may transform physical and/or intangible items from one state to another. The methods and systems described herein may also transform data representing physical and/or intangible items from one state to another.
The elements described and depicted herein, including in flow charts and block diagrams throughout the figures, imply logical boundaries between the elements. However, according to software or hardware engineering practices, the depicted elements and the functions thereof may be implemented on machines through computer executable media having a processor capable of executing program instructions stored thereon as a monolithic software structure, as standalone software modules, or as modules that employ external routines, code, services, and so forth, or any combination of these, and all such implementations may be within the scope of the present disclosure. Examples of such machines may include, but may not be limited to, personal digital assistants, laptops, personal computers, mobile phones, other handheld computing devices, medical equipment, wired or wireless communication devices, transducers, chips, calculators, satellites, tablet PCs, electronic books, gadgets, electronic devices, devices having artificial intelligence, computing devices, networking equipment, servers, routers and the like. Furthermore, the elements depicted in the flow chart and block diagrams or any other logical component may be implemented on a machine capable of executing program instructions. Thus, while the foregoing drawings and descriptions set forth functional aspects of the disclosed systems, no particular arrangement of software for implementing these functional aspects should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. Similarly, it will be appreciated that the various steps identified and described above may be varied, and that the order of steps may be adapted to particular applications of the techniques disclosed herein. All such variations and modifications are intended to fall within the scope of this disclosure.
As such, the depiction and/or description of an order for various steps should not be understood to require a particular order of execution for those steps, unless required by a particular application, or explicitly stated or otherwise clear from the context.
The methods and/or processes described above, and steps associated therewith, may be realized in hardware, software or any combination of hardware and software suitable for a particular application. The hardware may include a general-purpose computer and/or dedicated computing device or specific computing device or particular aspect or component of a specific computing device. The processes may be realized in one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors or other programmable device, along with internal and/or external memory. The processes may also, or instead, be embodied in an application specific integrated circuit, a programmable gate array, programmable array logic, or any other device or combination of devices that may be configured to process electronic signals. It will further be appreciated that one or more of the processes may be realized as a computer executable code capable of being executed on a machine-readable medium.
The computer executable code may be created using a structured programming language such as C, an object oriented programming language such as C++, or any other high-level or low-level programming language (including assembly languages, hardware description languages, and database programming languages and technologies) that may be stored, compiled or interpreted to run on one of the above devices, as well as heterogeneous combinations of processors, processor architectures, or combinations of different hardware and software, or any other machine capable of executing program instructions.
Thus, in one aspect, methods described above and combinations thereof may be embodied in computer executable code that, when executing on one or more computing devices, performs the steps thereof. In another aspect, the methods may be embodied in systems that perform the steps thereof, and may be distributed across devices in a number of ways, or all of the functionality may be integrated into a dedicated, standalone device or other hardware. In another aspect, the means for performing the steps associated with the processes described above may include any of the hardware and/or software described above. All such permutations and combinations are intended to fall within the scope of the present disclosure.
While the disclosure has been disclosed in connection with the preferred embodiments shown and described in detail, various modifications and improvements thereon will become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the present disclosure is not to be limited by the foregoing examples, but is to be understood in the broadest sense allowable by law.
The use of the terms “a” and “an” and “the” and similar referents in the context of describing the disclosure (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the disclosure and does not pose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.
While the foregoing written description enables one of ordinary skill to make and use what is considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The disclosure should therefore not be limited by the above described embodiment, method, and examples, but by all embodiments and methods within the scope and spirit of the disclosure.
All documents referenced herein are hereby incorporated by reference.
Sharma, Gopal, Singh, Abhay Kumar, Bandarupalli, Sambasiva Rao, Chou, Jeffrey
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Mar 01 2017 | Diamanti, Inc. | (assignment on the face of the patent) | / | |||
Mar 01 2017 | SINGH, ABHAY KUMAR | DIAMANTI, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 041447 | /0135 | |
Mar 01 2017 | BANDARUPALLI, SAMBASIVA RAO | DIAMANTI, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 041447 | /0135 | |
Mar 01 2017 | SHARMA, GOPAL | DIAMANTI, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 041447 | /0135 | |
Mar 01 2017 | CHOU, JEFFREY | DIAMANTI, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 041447 | /0135 | |
Aug 12 2020 | DIAMANTI, INC | VENTURE LENDING & LEASING IX, INC | SECURITY INTEREST SEE DOCUMENT FOR DETAILS | 053498 | /0832 |
Date | Maintenance Fee Events |
Dec 18 2023 | REM: Maintenance Fee Reminder Mailed. |
Apr 25 2024 | M2551: Payment of Maintenance Fee, 4th Yr, Small Entity. |
Apr 25 2024 | M2554: Surcharge for late Payment, Small Entity. |
Date | Maintenance Schedule |
Apr 28 2023 | 4 years fee payment window open |
Oct 28 2023 | 6 months grace period start (w surcharge) |
Apr 28 2024 | patent expiry (for year 4) |
Apr 28 2026 | 2 years to revive unintentionally abandoned end. (for year 4) |
Apr 28 2027 | 8 years fee payment window open |
Oct 28 2027 | 6 months grace period start (w surcharge) |
Apr 28 2028 | patent expiry (for year 8) |
Apr 28 2030 | 2 years to revive unintentionally abandoned end. (for year 8) |
Apr 28 2031 | 12 years fee payment window open |
Oct 28 2031 | 6 months grace period start (w surcharge) |
Apr 28 2032 | patent expiry (for year 12) |
Apr 28 2034 | 2 years to revive unintentionally abandoned end. (for year 12) |