A computer program product, system, and method for visiting each node of a snapshot tree within a content-based storage system having a plurality of volumes and/or snapshots; for each node, scanning an address-to-hash (A2H) table to calculate one or more resource usage metrics, wherein the A2H tables map logical I/O addresses to chunk hashes; and determining, based on the resource usage metrics, an amount of memory and/or disk capacity that would be freed by deleting one or more of the volumes and/or snapshots.
1. A method comprising:
visiting each node of a snapshot tree within a content-based storage system having a plurality of volumes and snapshots;
for each node, scanning an address-to-hash (A2H) table to calculate one or more resource usage metrics, wherein the A2H tables map logical I/O addresses to chunk hashes;
determining, based on the resource usage metrics, an amount of one or more of memory and disk capacity that would be freed by deleting one or more of the volumes and snapshots, the determining performed on a per-node basis and including determining the amount of memory for each ancestor node of the node that is not also an ancestor node of another node, determining the amount of memory for each leaf node of the node, and summing the resource usage metrics for each of the ancestor nodes of the node and for each of the leaf nodes of the node; and
deleting the one or more of the volumes and snapshots from the content-based storage system determined based on the per-node resource usage metrics.
9. A computer program product tangibly embodied in a non-transitory computer-readable medium, the computer-readable medium storing program instructions that are executable to:
visit each node of a snapshot tree within a content-based storage system having a plurality of volumes and snapshots;
for each node, scan an address-to-hash (A2H) table to calculate one or more resource usage metrics, wherein the A2H tables map logical I/O addresses to chunk hashes;
determine, based on the resource usage metrics, an amount of one or more of memory and disk capacity that would be freed by deleting one or more of the volumes and snapshots, the determining performed on a per-node basis and including determining the amount of memory for each ancestor node of the node that is not also an ancestor node of another node, determining the amount of memory for each leaf node of the node, and summing the resource usage metrics for each of the ancestor nodes of the node and for each of the leaf nodes of the node; and
delete the one or more of the volumes and snapshots from the content-based storage system determined based on the per-node resource usage metrics.
13. A system comprising:
a processor;
a volatile memory; and
a non-volatile memory storing computer program code that when executed on the processor causes the processor to execute a process operable to:
visit each node of a snapshot tree within a content-based storage system having a plurality of volumes and snapshots;
for each node, scan an address-to-hash (A2H) table to calculate one or more resource usage metrics, wherein the A2H tables map logical I/O addresses to chunk hashes;
determine, based on the resource usage metrics, an amount of one or more of memory and disk capacity that would be freed by deleting one or more of the volumes and snapshots, the determining performed on a per-node basis and including determining the amount of memory for each ancestor node of the node that is not also an ancestor node of another node, determining the amount of memory for each leaf node of the node, and summing the resource usage metrics for each of the ancestor nodes of the node and for each of the leaf nodes of the node; and
delete the one or more of the volumes and snapshots from the content-based storage system determined based on the per-node resource usage metrics.
2. The method of
3. The method of
4. The method of
5. The method of
finding one or more unique chunks associated with the node; and
determining a compression ratio for each of the unique chunks associated with the node.
6. The method of
7. The method of
determining a count of chunks associated with the node;
determining a compression ratio for each of the chunks associated with the node; and
determining a reference count for each of the chunks associated with the node.
8. The method of
10. The computer product of
11. The computer product of
12. The computer product of
determining accessible space provided by one or more of the volumes and snapshots.
14. The system of
15. The system of
16. The system of
17. The system of
finding one or more unique chunks associated with the node; and
determining a compression ratio for each of the unique chunks associated with the node.
18. The system of
19. The system of
determining a count of chunks associated with the node;
determining a compression ratio for each of the chunks associated with the node; and
determining a reference count for each of the chunks associated with the node.
20. The system of
Content-based storage (sometimes referred to as content-addressable storage or CAS) stores data based on its content, providing inherent data deduplication and facilitating in-line data compression, among other benefits. Existing content-based storage systems may utilize an array of storage devices such as solid-state drives (SSDs, also known as solid-state disks) to provide high-performance, scale-out storage.
Within a content-based storage system, data may be organized into one or more volumes identified by respective logical unit numbers (LUNs). User applications can read/write data to/from a volume by specifying a LUN and an address (or “offset”) relative to the LUN. Some content-based storage systems allow for volumes to be cloned and for the creation of volume snapshots. To reduce system resource usage, internal data structures may be shared across different volumes and/or snapshots.
It is appreciated herein that it can be challenging to determine system resources (e.g., memory and/or disk capacity) used by individual volumes/snapshots within a content-based storage system. There is a need for new methods of determining volume/snapshot resource usage taking into account deduplication and compression, as well as the internal data structures used to maintain volumes/snapshots. Such information can be presented to a user (e.g., a storage administrator) to allow the user to make decisions about, for example, which volumes/snapshots to delete.
According to one aspect of the disclosure, a method comprises: visiting each node of a snapshot tree within a content-based storage system having a plurality of volumes and/or snapshots; for each node, scanning an address-to-hash (A2H) table to calculate one or more resource usage metrics, wherein the A2H tables map logical I/O addresses to chunk hashes; and determining, based on the resource usage metrics, an amount of memory and/or disk capacity that would be freed by deleting one or more of the volumes and/or snapshots.
In some embodiments, for each node, scanning the A2H table to calculate one or more resource usage metrics includes determining a count of entries in the A2H table. In certain embodiments, determining an amount of memory and/or disk capacity that would be freed by deleting one or more of the volumes and/or snapshots includes determining, for each node, an amount of memory based on the count of entries in the A2H table. In particular embodiments, the method further comprises determining accessible space provided by one or more of the volumes and/or snapshots.
In some embodiments, for each node, scanning the A2H table to calculate one or more resource usage metrics includes: finding one or more unique chunks associated with the node; and determining a compression ratio for each of the unique chunks associated with the node. In certain embodiments, determining an amount of memory and/or disk capacity that would be freed by deleting one or more of the volumes and/or snapshots includes determining, for each node, a minimum disk capacity that would be freed by deleting the node using a count of unique chunks and the compression ratios. In particular embodiments, for each node, scanning the A2H table to calculate one or more resource usage metrics includes: determining a count of chunks associated with the node; determining a compression ratio for each of the chunks associated with the node; and determining a reference count for each of the chunks associated with the node. In some embodiments, determining an amount of memory and/or disk capacity that would be freed by deleting one or more of the volumes and/or snapshots includes determining, for each node, an estimated disk capacity that would be freed by deleting the node based on the count of chunks associated with the node, the compression ratios, and the reference counts.
According to another aspect of the disclosure, a system comprises one or more processors; a volatile memory; and a non-volatile memory storing computer program code that when executed on the processor causes the processor to execute a process operable to perform embodiments of the method described above.
According to yet another aspect of the disclosure, a computer program product is tangibly embodied in a non-transitory computer-readable medium, the computer-readable medium storing program instructions that are executable to perform embodiments of the method described hereinabove.
The foregoing features may be more fully understood from the following description of the drawings.
The drawings are not necessarily to scale, or inclusive of all elements of a system, emphasis instead generally being placed upon illustrating the concepts, structures, and techniques sought to be protected herein.
Before describing embodiments of the structures and techniques sought to be protected herein, some terms are explained. As used herein, the term “storage system” may be broadly construed so as to encompass, for example, private or public cloud computing systems for storing data as well as systems for storing data comprising virtual infrastructure and those not comprising virtual infrastructure. As used herein, the terms “client” and “user” may refer to any person, system, or other entity that uses a storage system to read/write data.
As used herein, the terms “disk” and “storage device” may refer to any non-volatile memory (NVM) device, including hard disk drives (HDDs), flash devices (e.g., NAND flash devices), and next generation NVM devices, any of which can be accessed locally and/or remotely (e.g., via a storage area network (SAN)). The term “storage array” may be used herein to refer to any collection of storage devices. As used herein, the term “memory” may refer to volatile memory used by the storage system, such as dynamic random access memory (DRAM).
As used herein, the terms “I/O read request” and “I/O read” refer to a request to read data. The terms “I/O write request” and “I/O write” refer to a request to write data. The terms “I/O request” and “I/O” refer to a request that may be either an I/O read request or an I/O write request. As used herein, the terms “logical I/O address” and “I/O address” refer to a logical address used by users/clients to read/write data from/to a storage system.
While vendor-specific terminology may be used herein to facilitate understanding, it is understood that the concepts, techniques, and structures sought to be protected herein are not limited to use with any specific commercial products.
In the embodiment shown, the subsystems 102 include a routing subsystem 102a, a control subsystem 102b, a data subsystem 102c, and a system resource subsystem 102d. In one embodiment, the subsystems 102 may be provided as software modules, i.e., computer program code that, when executed on a processor, may cause a computer to perform functionality described herein. In a certain embodiment, the storage system 100 includes an operating system (OS) and one or more of the subsystems 102 may be provided as user space processes executable by the OS. In other embodiments, a subsystem 102 may be provided, at least in part, as hardware such as a digital signal processor (DSP) or an application-specific integrated circuit (ASIC) configured to perform functionality described herein.
The routing subsystem 102a may be configured to receive I/O requests from clients 116 and to translate client requests into internal commands. Each I/O request may be associated with a particular volume and may include one or more I/O addresses (i.e., logical addresses within that volume). The storage system 100 stores data in fixed-size chunks, for example 4 KB chunks, where each chunk is uniquely identified within the system using a “hash” value that is derived from the data/content stored within the chunk. The routing subsystem 102a may be configured to convert an I/O request for an arbitrary amount of data into one or more internal I/O requests each for a chunk-sized amount of data. The internal I/O requests may be sent to one or more available control subsystems 102b for processing. In some embodiments, the routing subsystem 102a is configured to receive Small Computer System Interface (SCSI) commands from clients. In certain embodiments, I/O requests may include one or more logical block addresses (LBAs).
For example, if a client 116 sends a request to write 8 KB of data starting at logical address zero (0), the routing subsystem 102a may split the data into two 4 KB chunks, generate a first internal I/O request to write 4 KB of data to logical address zero (0), and generate a second internal I/O request to write 4 KB of data to logical address one (1). The routing subsystem 102a may calculate hash values for each chunk of data to be written, and send the hashes to the control subsystem(s) 102b. In one embodiment, chunk hashes are calculated using a Secure Hash Algorithm 1 (SHA-1).
As another example, if a client 116 sends a request to read 8 KB of data starting at logical address one (1), the routing subsystem 102a may generate a first internal I/O request to read 4 KB of data from address one (1) and a second internal I/O request to read 4 KB of data from address two (2).
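To make the chunking step concrete, here is a minimal Python sketch of how a routing layer of this kind might split a chunk-aligned write into per-chunk internal requests and derive SHA-1 content hashes. The function name, chunk constant, and return shape are illustrative assumptions, not the actual interface of the routing subsystem 102a.

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # 4 KB chunks, as in the examples above


def split_write(start_address: int, data: bytes):
    """Split a client write into chunk-sized internal requests.

    Returns a list of (logical_address, chunk_hash, chunk_data) tuples,
    one per 4 KB chunk, mirroring the 8 KB write example above.
    """
    assert len(data) % CHUNK_SIZE == 0, "illustration assumes chunk-aligned writes"
    requests = []
    for i in range(len(data) // CHUNK_SIZE):
        chunk = data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
        chunk_hash = hashlib.sha1(chunk).hexdigest()  # content-derived identity
        requests.append((start_address + i, chunk_hash, chunk))
    return requests


# an 8 KB write at logical address 0 becomes two internal 4 KB requests
internal = split_write(0, b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE)
print([(addr, h[:8]) for addr, h, _ in internal])
```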
The control subsystem 102b may also be configured to clone storage volumes and to generate snapshots of storage volumes using techniques known in the art. For each volume/snapshot, the control subsystem 102b may maintain a so-called “address-to-hash” (A2H) table 112 that maps I/O addresses to hash values of the data stored at those logical addresses.
The data subsystem 102c may be configured to maintain one or more so-called “hash-to-physical address” (H2P) tables 114 that map chunk hash values to physical storage addresses (e.g., storage locations within the storage array 106 and/or within individual disks 108). Using the H2P tables 114, the data subsystem 102c handles reading/writing chunk data from/to the storage array 106. The H2P table may also include per-chunk metadata such as a compression ratio and a reference count. A chunk compression ratio indicates how the size of the compressed chunk stored on disk compares to the uncompressed chunk size. For example, a compression ratio of 0.25 may indicate that the compressed chunk on disk is 25% smaller than its original size. A chunk reference count may indicate the number of times that the chunk's hash appears within A2H tables. For example, if the same chunk data is stored at two different logical addresses within the same volume/snapshot (or within two different volumes/snapshots), the H2P table may indicate that the chunk has a reference count of two (2).
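For illustration, the following sketch models the per-chunk H2P metadata described above, assuming (as in the example) that a compression ratio of 0.25 means the chunk is 25% smaller on disk. The class, field names, and per-chunk figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class H2PEntry:
    physical_address: int      # location of the compressed chunk on disk
    compression_ratio: float   # fraction saved, e.g. 0.25 => chunk is 25% smaller on disk
    reference_count: int       # number of A2H entries pointing at this hash


# hypothetical H2P table keyed by chunk hash
h2p = {
    "ab12...": H2PEntry(physical_address=0x1000, compression_ratio=0.25, reference_count=2),
}


def on_disk_size(entry: H2PEntry, chunk_size: int = 4 * 1024) -> float:
    # a chunk that is 25% smaller occupies 75% of the fixed chunk size
    return (1.0 - entry.compression_ratio) * chunk_size


print(on_disk_size(h2p["ab12..."]))  # 3072.0 bytes for a 4 KB chunk
```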
It will be appreciated that combinations of the A2H 112 and H2P 114 tables can provide multiple levels of indirection between the logical (or “I/O”) address a client 116 uses to access data and the physical address where that data is stored. Among other advantages, this may give the storage system 100 freedom to move data within the storage array 106 without affecting a client's 116 access to that data (e.g., if a disk 108 fails). In some embodiments, an A2H 112 table and/or an H2P 114 table may be stored in memory.
The system resource subsystem 102d may be configured to determine system resource usage associated with individual volumes/snapshots. In particular embodiments, the system resource subsystem 102d may be configured to perform at least some of the processing described below in conjunction with
In some embodiments, storage system 100 corresponds to a node within a distributed storage system having a plurality of nodes, each of which may include one or more of the subsystems 102a-102d.
In some embodiments, the system 100 includes features used in EMC® XTREMIO®.
Referring to
An A2H table may be associated with a volume number and/or a snapshot identifier. In the example of
Referring to
As illustrated in
To reduce memory usage, a technique similar to “copy-on-write” (COW) may be used when generating new A2H tables as part of a volume clone/snapshot. In particular, the new A2H tables 202a, 202b may be generated as empty tables that are linked to the existing table 200. In this arrangement, table 200 may be referred to as a “parent table,” and tables 202a, 202b may be referred to as “child tables” (generally denoted 202). If an I/O read is received for a volume/snapshot associated with a child table 202, the control subsystem first checks if the child table 202 includes an entry for the I/O address: if so, the control subsystem uses the hash value from the child table 202; otherwise, the control subsystem uses the hash value (if any) from the parent table 200. If an I/O write is received for a volume/snapshot associated with a child table 202, the control subsystem adds or updates an entry in the child table 202, but does not modify the parent table 200. Referring to the example of
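A minimal sketch of these copy-on-write table semantics, assuming A2H tables are simple dictionaries and each child keeps a reference to its parent; the class and method names are illustrative only and do not reflect the control subsystem's actual interfaces.

```python
class A2HTable:
    """Child table created empty and linked to its parent (copy-on-write)."""

    def __init__(self, parent=None):
        self.entries = {}   # logical address -> chunk hash
        self.parent = parent

    def read_hash(self, address):
        # prefer the child's own entry; otherwise fall back to the parent chain
        if address in self.entries:
            return self.entries[address]
        if self.parent is not None:
            return self.parent.read_hash(address)
        return None

    def write_hash(self, address, chunk_hash):
        # writes never modify the parent table
        self.entries[address] = chunk_hash


parent = A2HTable()
parent.write_hash(0, "h0")
child = A2HTable(parent=parent)                 # clone/snapshot starts empty
print(child.read_hash(0))                       # "h0", inherited from the parent
child.write_hash(0, "h1")                       # overwrite is confined to the child
print(parent.read_hash(0), child.read_hash(0))  # "h0" "h1"
```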
A logical address that exists in a parent table and not in either of its child tables is referred to herein as a “shared address.” For example, in
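The shared-address notion can be checked mechanically, as in the short, self-contained sketch below (hypothetical function name, dictionary-based tables): an address is shared if the parent maps it and no child overrides it.

```python
def shared_addresses(parent_entries: dict, child_entries_list: list) -> set:
    """Addresses present in the parent A2H table but overridden in no child."""
    shared = set()
    for address in parent_entries:
        if not any(address in child for child in child_entries_list):
            shared.add(address)
    return shared


parent = {0: "h0", 1: "h1", 2: "h2"}
child_a = {1: "h1a"}   # overrides address 1
child_b = {2: "h2b"}   # overrides address 2
print(shared_addresses(parent, [child_a, child_b]))  # {0}
```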
As discussed above in conjunction with
Referring to
The volumes/snapshots associated with child nodes 304, 306 can likewise be cloned/snapped, resulting in additional nodes being added to the snapshot tree. For example, as shown by tree 300″ in
Within a snapshot tree, each leaf node represents either a volume or a snapshot. In addition to having its own A2H table, each volume/snapshot leaf node inherits the A2H tables of its ancestors, recursively up to the root node. When processing an I/O read for a given volume/snapshot, the control subsystem searches for the first A2H table containing the I/O address, starting from the volume/snapshot leaf node and terminating at the root node. Thus, the copy-on-write semantics described above may be extended to an arbitrary number of clones/snapshots.
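To make the leaf-to-root search concrete, the following sketch (hypothetical names, not the product's internal API) gives each snapshot-tree node its own A2H dictionary and resolves a read by consulting the leaf first and then each ancestor in turn.

```python
class SnapNode:
    """One node of a snapshot tree; leaves represent volumes or snapshots."""

    def __init__(self, a2h=None, parent=None):
        self.a2h = a2h or {}   # this node's own A2H entries
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def read_hash(self, address):
        # search for the first A2H table containing the address, leaf -> root
        node = self
        while node is not None:
            if address in node.a2h:
                return node.a2h[address]
            node = node.parent
        return None


root = SnapNode(a2h={0: "h0", 1: "h1"})
vol = SnapNode(parent=root)    # production volume after a snapshot is taken
snap = SnapNode(parent=root)   # read-only snapshot
vol.a2h[1] = "h1b"             # a later overwrite lands only in the volume's table
print(vol.read_hash(0), vol.read_hash(1))    # h0 h1b
print(snap.read_hash(0), snap.read_hash(1))  # h0 h1
```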
It is appreciated herein that determining the actual memory and/or disk storage capacity used by an individual volume/snapshot in a content-based storage system may be challenging due to aforementioned copy-on-write table semantics, along with data de-duplication and in-line compression. In particular, there is a need for techniques to determine (or estimate) the amount of memory/disk capacity that would be freed by deleting a particular volume/snapshot from the content-based storage system, taking into account that some (or all) of the chunk content associated with the volume/snapshot may be referenced by other volumes/snapshots. Various techniques for determining/estimating how much memory/disk capacity would be freed by deleting a given volume/snapshot are described below in conjunction with
Alternatively, the processing and decision blocks may represent steps performed by functionally equivalent circuits such as a digital signal processor (DSP) circuit or an application specific integrated circuit (ASIC). The flow diagrams do not depict the syntax of any particular programming language but rather illustrate the functional information one of ordinary skill in the art requires to fabricate circuits or to generate computer software to perform the processing required of the particular apparatus. It should be noted that many routine program elements, such as initialization of loops and variables and the use of temporary variables may be omitted for clarity. The particular sequence of blocks described is illustrative only and can be varied without departing from the spirit of the concepts, structures, and techniques sought to be protected herein. Thus, unless otherwise stated, the blocks described below are unordered meaning that, when possible, the functions represented by the blocks can be performed in any convenient or desirable order.
Referring to
At block 404, for each node of the snapshot tree, an associated A2H table is scanned to calculate one or more resource usage metrics for the node. An illustrative technique for scanning an A2H table is shown in
At block 406, an amount of memory and/or disk capacity that would be freed (or “released”) by deleting one or more of the volumes/snapshots is determined based on the per-node resource usage metrics. In various embodiments, determining memory/disk capacity that would be freed by deleting volumes/snapshots includes summing the resource usage metrics for the corresponding leaf nodes, as well as the usage metrics for any ancestor nodes that are not also ancestors of other volumes/snapshots. For example, referring to
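One plausible way to combine the per-node metrics is sketched below, assuming each node already carries a single numeric usage metric (e.g., bytes of A2H metadata) and exposes a parent pointer; the traversal and names are illustrative rather than the claimed implementation.

```python
def freed_by_deleting(leaves_to_delete, all_leaves, metric):
    """Sum per-node metrics freed by deleting a set of leaf volumes/snapshots.

    `metric` maps a node to its resource usage (e.g., A2H memory in bytes).
    An ancestor's metric counts only if no surviving leaf still reaches it.
    """
    def ancestors(node):
        node = node.parent
        while node is not None:
            yield node
            node = node.parent

    surviving = set(all_leaves) - set(leaves_to_delete)
    still_needed = set()
    for leaf in surviving:
        still_needed.update(ancestors(leaf))

    freed_nodes = set(leaves_to_delete)
    for leaf in leaves_to_delete:
        freed_nodes.update(a for a in ancestors(leaf) if a not in still_needed)
    return sum(metric[n] for n in freed_nodes)


class Node:
    def __init__(self, parent=None):
        self.parent = parent


root = Node()
a, b = Node(root), Node(root)   # two leaves sharing one root

metric = {root: 100, a: 40, b: 30}   # hypothetical per-node usage figures
print(freed_by_deleting([a], [a, b], metric))      # 40: root still serves leaf b
print(freed_by_deleting([a, b], [a, b], metric))   # 170: root is freed as well
```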
In some embodiments, the information determined at block 404 may be presented to a user (e.g., a storage administrator). In other embodiments, the information determined at block 404 may be used to automatically delete volumes/snapshots within the content-based storage system. For example, in the event that memory/disk capacity is exhausted (or nearly exhausted), the storage system may automatically find and delete one or more snapshots that would free up sufficient memory/disk capacity to allow the storage system to continue operating.
Referring to
At block 422, a pointer address is initialized (e.g., P=0). At block 424, one or more table counters are initialized. The specific table counters used depend on the resource usage metric being calculated. For example, when calculating memory usage for a node, a table counter may include a count of the number of entries in the node's A2H table. Other examples are described below.
Blocks 426-430 represent a loop that is performed over the node's logical address space, e.g., starting at zero (0) and ending at the largest possible address within the node's A2H table. In many embodiments, all volumes/snapshots within the same snapshot tree may have the same logical volume size and, thus, the largest possible address for any node is based on the logical volume size. In some embodiments, counters for multiple tables (e.g., all A2H tables associated with a snapshot tree) may be incremented within the loop 426-430. At block 432, one or more resource usage metrics are determined based on the table counters.
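Read as pseudocode for blocks 422-432, the following hypothetical sketch walks a node's logical address space and counts the A2H entries the node owns; the per-entry memory figure is an assumed constant used purely for illustration.

```python
def scan_a2h(a2h_entries: dict, max_address: int, bytes_per_entry: int = 64) -> dict:
    """Scan one node's A2H table over its whole logical address space.

    Returns simple counters from which resource usage metrics can be
    derived; `bytes_per_entry` is an assumed, illustrative constant.
    """
    address = 0                       # block 422: initialize the pointer address
    entry_count = 0                   # block 424: initialize the table counters
    while address <= max_address:     # blocks 426-430: loop over the address space
        if address in a2h_entries:    # block 428: increment table counters
            entry_count += 1
        address += 1
    # block 432: derive a metric from the counters
    return {"entries": entry_count, "memory_bytes": entry_count * bytes_per_entry}


print(scan_a2h({0: "h0", 5: "h5"}, max_address=7))  # {'entries': 2, 'memory_bytes': 128}
```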
In certain embodiments, incrementing table counters (block 428) may include taking into account whether an A2H table entry corresponds to a shared address and/or a shadow address. For example, when scanning an A2H table associated with a non-leaf node (e.g., node 304 in
Referring to
Referring to
At block 444, a compression ratio is determined for each of the unique chunks, for example, using metadata in the H2P table.
At block 446, a minimum disk capacity (dcmin) that would be freed by deleting a snapshot tree node may be determined based on the number of unique chunks (nuniq) and the corresponding chunk compression ratios (ci). In some embodiments, the minimum disk capacity may be calculated as follows:

$$dc_{min} = \sum_{i=1}^{n_{uniq}} (1 - c_i)\,C$$

where C is the fixed chunk size.
For example, if a node includes nuniq=2 unique chunks having respective compression ratios c1=0.25 and c2=0.4, where each chunk is 8 KB, then the minimum disk capacity that would be freed by deleting the node may be determined as (1-0.25)*8 KB+(1-0.4)*8 KB=10.8 KB.
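In code, and under the same assumption that a compression ratio of 0.25 means the chunk is 25% smaller on disk, the worked example can be reproduced as follows (illustrative names):

```python
def min_freed_capacity(compression_ratios, chunk_size_kb=8):
    """dc_min: sum of on-disk sizes of the node's unique chunks.

    Assumes a ratio of 0.25 means "25% smaller", so the chunk occupies
    (1 - 0.25) of the fixed chunk size on disk.
    """
    return round(sum((1.0 - c) * chunk_size_kb for c in compression_ratios), 3)


# the worked example above: two unique 8 KB chunks with ratios 0.25 and 0.4
print(min_freed_capacity([0.25, 0.4]))   # 10.8 (KB)
```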
Referring to
At block 462, a number of chunks associated with the node is determined. In some embodiments, this includes scanning the node's A2H table and counting the total number of entries (including entries that have duplicate hash values). At blocks 464 and 466, a compression ratio and a reference count may be determined for each of the chunks (e.g., using an H2P table).
At block 468, an estimate of disk capacity (dcest) that would be freed by deleting the node is calculated based on the total number of chunks (ntotal) associated with the node, the chunk compression ratios (ci), and the chunk reference counts (ri). In some embodiments, the estimate may be calculated as follows:

$$dc_{est} = \frac{\sum_{i=1}^{n_{total}} (1 - c_i)\,C}{\frac{1}{n_{total}}\sum_{i=1}^{n_{total}} r_i}$$

where C is the fixed chunk size. It will be appreciated that, in the equation above, the outermost numerator corresponds to a compression-adjusted (or "weighted") sum of the chunks and the outermost denominator corresponds to the average number of reference counts per chunk.
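A corresponding sketch of the estimate, again treating the compression ratio as the fraction saved and dividing the compression-adjusted sum by the average reference count (illustrative names):

```python
def estimated_freed_capacity(compression_ratios, reference_counts, chunk_size_kb=8):
    """dc_est: compression-adjusted chunk sum over the average reference count."""
    n_total = len(compression_ratios)
    weighted_sum = sum((1.0 - c) * chunk_size_kb for c in compression_ratios)
    avg_refs = sum(reference_counts) / n_total
    return round(weighted_sum / avg_refs, 3)


# two chunks, one of which is also referenced by another volume/snapshot
print(estimated_freed_capacity([0.25, 0.4], [1, 2]))  # 7.2 (KB)
```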
Referring to
At block 482, a number of user-readable addresses associated with a volume/snapshot is determined, for example, by scanning A2H tables associated with a leaf node and its ancestor nodes. In some embodiments, a storage system includes a process to find the differences between two snapshots, and this process may be used to determine the number of user-readable addresses (e.g., by comparing the volume/snapshot against an empty snapshot). At block 484, accessible space provided by the volume/snapshot is determined based on the number of user-readable addresses. In some embodiments, the accessible space is calculated by multiplying the number of user-readable addresses by a fixed chunk size.
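Expressed as a sketch under the same fixed-chunk assumption (names hypothetical):

```python
def accessible_space_kb(user_readable_addresses: int, chunk_size_kb: int = 4) -> int:
    """Accessible space of a volume/snapshot: readable addresses times the fixed chunk size."""
    return user_readable_addresses * chunk_size_kb


print(accessible_space_kb(1024))  # a volume with 1024 readable 4 KB chunks -> 4096 KB
```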
In the embodiment shown, computer instructions 512 may include routing subsystem instructions 512a that may correspond to an implementation of a routing subsystem 102a (
Processing may be implemented in hardware, software, or a combination of the two. In various embodiments, processing is provided by computer programs executing on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
The system can perform processing, at least in part, via a computer program product, (e.g., in a machine-readable storage device), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate.
Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
All references cited herein are hereby incorporated herein by reference in their entirety.
Having described certain embodiments, which serve to illustrate various concepts, structures, and techniques sought to be protected herein, it will be apparent to those of ordinary skill in the art that other embodiments incorporating these concepts, structures, and techniques may be used. Elements of different embodiments described hereinabove may be combined to form other embodiments not specifically set forth above and, further, elements described in the context of a single embodiment may be provided separately or in any suitable sub-combination. Accordingly, it is submitted that the scope of protection sought herein should not be limited to the described embodiments but rather should be limited only by the spirit and scope of the following claims.
Meiri, David, Kucherov, Anton, Buchman, Ophir