A system and method for dynamic RAID geometries. A computer system comprises client computers and data storage arrays coupled to one another via a network. A data storage array utilizes solid-state drives and Flash memory cells for data storage. A storage controller within a data storage array configures a first subset of the storage devices for use in a first RAID layout, the first RAID layout including a first set of redundant data, and configures a second subset of the storage devices for use in a second RAID layout, the second RAID layout including a second set of redundant data. Additionally, when writing a stripe, the controller may select from any of the plurality of storage devices for one or more of the first RAID layout, the second RAID layout, and storage of redundant data by an additional logical device.

Patent: RE48448
Priority: Oct 01 2010
Filed: Mar 09 2018
Issued: Feb 23 2021
Expiry: Oct 01 2030
Status: Expired (terminal disclaimer filed; large entity)
1. A computer system comprising:
a data storage subsystem comprising a plurality of storage devices in a redundant array of independent disks (RAID) configuration; and
a storage controller configured to:
write a first RAID stripe to the plurality of storage devices including:
for each storage device of a subset of the plurality of storage devices, writing within a page of the storage device, user data and a checksum that validates the user data stored on the storage device;
writing within a page of a particular storage device of the plurality of storage devices,
inter-device protection data, the inter-device protection data protecting the user data stored on each storage device of the subset of the plurality of storage devices;
intra-page protection data, the intra-page protection data protecting the inter-device protection data stored on the particular storage device; and
inter-page protection data, the inter-page protection data protecting the checksums stored on each storage device of the subset of storage devices.
15. A non-transitory computer readable storage medium storing program instructions, wherein the program instructions are executable to:
write a RAID stripe to a plurality of storage devices in a redundant array of independent disks (RAID) configuration, wherein writing the RAID stripe includes:
for each storage device of a subset of the plurality of storage devices, writing within a page of the storage device, user data and a checksum that validates the user data stored on the storage device;
writing within a page of a particular storage device of the plurality of storage devices,
inter-device protection data, the inter-device protection data protecting the user data stored on each storage device of the subset of the plurality of storage devices;
intra-page protection data, the intra-page protection data protecting the inter-device protection data stored on the particular storage device; and
inter-page protection data, the inter-page protection data protecting the checksums stored on each storage device of the subset of storage devices.
19. A computer system comprising:
a data storage subsystem comprising a plurality of storage devices in a redundant array of independent drives (RAID) configuration; and
a storage controller to:
write a first RAID stripe to the plurality of storage devices including:
for each storage device of a subset of the plurality of storage devices, writing within a page of the storage device, user data, and a checksum that validates the user data stored on each storage device of the subset of the plurality of storage devices; and
writing, within a page of a particular storage device of the plurality of storage devices:
inter-device redundancy data, the inter-device redundancy data to protect the user data stored on each storage device of a first subset of the plurality of storage devices,
intra-page error recovery data, the intra-page error recovery data to protect the inter-device redundancy data stored on the particular storage device, and
inter-page protection data, the inter-page protection data to protect the checksums stored on each storage device of the subset of the plurality of storage devices.
26. A method, comprising:
writing, by a storage controller of a data storage subsystem comprising a plurality of storage devices in a redundant array of independent drives (RAID) configuration, a first RAID stripe to the plurality of storage devices, wherein writing the first RAID stripe comprises:
for each storage device of a subset of the plurality of storage devices, writing within a page of the storage device, user data, and a checksum that validates the user data stored on each storage device of the subset of the plurality of storage devices; and
writing, within a page of a particular storage device of the plurality of storage devices:
inter-device redundancy data, the inter-device redundancy data to protect the user data stored on each storage device of a first subset of the plurality of storage devices,
intra-page error recovery data, the intra-page error recovery data to protect the inter-device redundancy data stored on the particular storage device, and
inter-page protection data, the inter-page protection data to protect the checksums stored on each storage device of the subset of the plurality of storage devices.
2. The computer system as recited in claim 1, wherein the storage controller is further configured to write a second RAID stripe to a subset of the plurality of storage devices, the first RAID stripe having a first RAID layout and the second RAID stripe having a second RAID layout.
3. The computer system as recited in claim 2, wherein the first RAID layout is an L+x layout, and the second RAID layout is an M+y layout, wherein L, x, M, and y are integers, and wherein either or both (1) L is not equal to M, and (2) x is not equal to y.
4. The computer system as recited in claim 2, wherein the first RAID layout is selected from a first device group and the second RAID layout is selected from a second device group.
5. The computer system as recited in claim 2, wherein the first RAID layout and the second RAID layout include at least one device that has a larger storage capacity than other devices included in the first RAID layout and the second RAID layout.
6. The computer system as recited in claim 2, wherein the storage controller is further configured to configure an additional logical device not included in either the first RAID layout or the second RAID layout to store redundant data for both the first RAID layout and the second RAID layout.
7. The computer system as recited in claim 1, wherein the plurality of storage devices are solid state storage devices.
8. The computer system as recited in claim 1, wherein the storage controller is configured to store metadata, user data, and protection data in pages, each page including a header with a checksum.
9. A method for use in a computing system, the method comprising:
writing a RAID stripe to a plurality of storage devices in a redundant array of independent disks (RAID) configuration, wherein writing the RAID stripe includes:
for each storage device of a subset of the plurality of storage devices, writing within a page of the storage device, user data and a checksum that validates the user data stored on the storage device;
writing within a page of a particular storage device of the plurality of storage devices,
inter-device protection data, the inter-device protection data protecting the user data stored on each storage device of the subset of the plurality of storage devices;
intra-page protection data, the intra-page protection data protecting the inter-device protection data stored on the particular storage device; and
inter-page protection data, the inter-page protection data protecting the checksums stored on each storage device of the subset of storage devices.
10. The method as recited in claim 9, further comprising writing a second RAID stripe to a subset of the plurality of storage devices, the first RAID stripe having a first RAID layout and the second RAID stripe having a second RAID layout.
11. The method as recited in claim 10, wherein the first RAID layout is an L+x layout, and the second RAID layout is an M+y layout, wherein L, x, M, and y are integers, and wherein either or both (1) L is not equal to M, and (2) x is not equal to y.
12. The method as recited in claim 10, wherein the first RAID layout is selected from a first device group, and the second RAID layout is selected from a second device group.
13. The method as recited in claim 9, wherein the plurality of storage devices are solid state storage devices.
14. The method as recited in claim 9, further comprising storing metadata, user data, and protection data in pages, each page including a header with a checksum.
16. The non-transitory computer readable storage medium as recited in claim 15, wherein the storage controller is further configured to write a second RAID stripe to a subset of the plurality of storage devices, the first RAID stripe having a first RAID layout and the second RAID stripe having a second RAID layout.
17. The non-transitory computer readable storage medium as recited in claim 16, wherein the first RAID layout is an L+x layout, and the second RAID layout is an M+y layout, wherein L, x, M, and y are integers, and wherein either or both (1) L is not equal to M, and (2) x is not equal to y.
18. The non-transitory computer readable storage medium as recited in claim 15, wherein the plurality of storage devices are solid state storage devices.
20. The computer system of claim 19, wherein the storage controller is further configured to write a second RAID stripe to a second subset of the plurality of storage devices, the first RAID stripe having a first RAID layout and the second RAID stripe having a second RAID layout.
21. The computer system of claim 20, wherein the first RAID layout is an L+x layout, and the second RAID layout is an M+y layout, wherein L, x, M, and y are positive integers, and wherein at least one of: (1) L is not equal to M, or (2) x is not equal to y.
22. The computer system of claim 20, wherein the first RAID layout is selected from a first device group and the second RAID layout is selected from a second device group.
23. The computer system of claim 19, wherein the plurality of storage devices are solid state storage devices.
24. The computer system of claim 19, wherein the plurality of storage devices comprise flash memory cells.
25. The computer system of claim 19, wherein the computer system is a flash memory based system.
27. The method of claim 26, further comprising writing a second RAID stripe to a second subset of the plurality of storage devices, the first RAID stripe having a first RAID layout and the second RAID stripe having a second RAID layout.
28. The method of claim 27, wherein the first RAID layout is an L+x layout, and the second RAID layout is an M+y layout, wherein L, x, M, and y are positive integers, and wherein at least one of: (1) L is not equal to M, or (2) x is not equal to y.
29. The method of claim 27, wherein the first RAID layout is selected from a first device group and the second RAID layout is selected from a second device group.
30. The method of claim 26, wherein the plurality of storage devices are solid state storage devices.
31. The method of claim 26, wherein the plurality of storage devices comprise flash memory cells.
32. The method of claim 26, wherein the data storage subsystem is a flash memory based system.
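
The following Python sketch illustrates the page layout recited in claims 1, 15, 19, and 26: each data device holds a page of user data plus a checksum of that data, and one device holds a page containing inter-device parity over the user data, intra-page protection over that parity, and inter-page protection over the checksums. CRC32 checksums, XOR parity, and the Page/build_stripe names are illustrative assumptions rather than the patented implementation.

import zlib
from dataclasses import dataclass
from typing import List

@dataclass
class Page:
    payload: bytes      # user data or parity
    checksum: int       # validates the payload stored on this device
    extra: int = 0      # inter-page protection (used on the parity page only)

def xor_blocks(blocks: List[bytes]) -> bytes:
    # simple XOR stand-in for inter-device protection data
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def build_stripe(user_blocks: List[bytes]) -> List[Page]:
    """Return one page per data device plus one parity page (hypothetical layout)."""
    data_pages = [Page(blk, zlib.crc32(blk)) for blk in user_blocks]
    parity = xor_blocks(user_blocks)                  # inter-device protection data
    intra = zlib.crc32(parity)                        # intra-page protection data
    inter_page = zlib.crc32(b"".join(p.checksum.to_bytes(4, "big")
                                     for p in data_pages))  # inter-page protection data
    return data_pages + [Page(parity, intra, extra=inter_page)]

stripe = build_stripe([b"A" * 16, b"B" * 16, b"C" * 16])
assert zlib.crc32(stripe[-1].payload) == stripe[-1].checksum  # intra-page check holds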
In FIG. 6, state table 522 may correspond to one of state tables 522a-522b of FIG. 5. FIG. 12 illustrates one embodiment of a hybrid RAID data layout 1200 with three partitions, each of which may correspond to a separate device group, such as device groups 173a-173b shown in FIG. 1. Each partition comprises multiple storage devices. In one embodiment, an algorithm such as the CRUSH algorithm may be utilized to select which devices to use in a RAID data layout architecture for data storage.

In the example shown, there is an L+1 RAID array, an M+1 RAID array, and an N+1 RAID array. In various embodiments, L, M, and N may all be different, the same, or a combination thereof. For example, RAID array 1210 is shown in partition 1. The other storage devices 1212 are candidates for other RAID arrays within partition 1. Similarly, RAID array 1220 illustrates a given RAID array in partition 2. The other storage devices 1222 are candidates for other RAID arrays within partition 2. RAID array 1230 illustrates a given RAID array in partition 3. The other storage devices 1232 are candidates for other RAID arrays within partition 3.

Within each of the RAID arrays 1210, 1220 and 1230, a storage device P1 provides RAID single parity protection within a respective RAID array. Storage devices D1-DN store user data within a respective RAID array. Again, the storage of both the user data and the RAID single parity information may rotate among the storage devices D1-DN and P1. However, for ease of illustration and description, the user data is described as being stored in devices D1-DN and the RAID single parity information as being stored in device P1.
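
The rotation of user data and parity among devices D1-DN and P1 can be sketched as follows. The round-robin placement below is a common convention and only an assumption about how such rotation might be realized, not the specific scheme of this disclosure.

def layout_for_stripe(stripe_index, num_devices):
    # for stripe s, device (s mod num_devices) holds parity; the rest hold user data
    parity_dev = stripe_index % num_devices
    data_devs = [d for d in range(num_devices) if d != parity_dev]
    return parity_dev, data_devs

for s in range(4):
    print(s, layout_for_stripe(s, num_devices=4))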

One or more logical storage devices among each of the three partitions may be chosen to provide an additional amount of supported redundancy for one or more given RAID arrays. In various embodiments, a logical storage device may correspond to a single physical storage device. Alternatively, a logical storage device may correspond to multiple physical storage devices. For example, logical storage device Q1 in partition 3 may be combined with each of the RAID arrays 1210, 1220 and 1230. The logical storage device Q1 may provide RAID double parity information for each of the RAID arrays 1210, 1220 and 1230. This additional parity information is generated and stored when a stripe is written to one of the arrays 1210, 1220, or 1230. Further this additional parity information may cover stripes in each of the arrays 1210, 1220, and 1230. Therefore, the ratio of a number of storage devices storing RAID parity information to a total number of storage devices is lower. For example, if each of the partitions used N+2 RAID arrays, then the ratio of a number of storage devices storing RAID parity information to a total number of storage devices is 3(2)/(3(N+2)), or 2/(N+2). In contrast, the ratio for the hybrid RAID layout 1200 is (3+1)/(3(N+1)), or 4/(3(N+1)).
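
The overhead arithmetic above can be checked with a short Python calculation, assuming three partitions of N user-data devices each; the function names are illustrative only.

from fractions import Fraction

def ratio_n_plus_2(n, partitions=3):
    # each partition is an N+2 array: 2 parity devices out of N+2 total
    return Fraction(partitions * 2, partitions * (n + 2))

def ratio_hybrid(n, partitions=3):
    # one parity device per partition plus one shared Q device,
    # expressed against the partitions' N+1 devices as in the text
    return Fraction(partitions + 1, partitions * (n + 1))

for n in (5, 10, 20):
    print(n, ratio_n_plus_2(n), ratio_hybrid(n))
# e.g. for N=10: 2/(N+2) = 1/6, while 4/(3(N+1)) = 4/33, a lower overhead ratio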

It is possible to reduce the above ratio by increasing the number of storage devices used to store user data. For example, rather than utilize storage device Q1, the three partitions may be combined into a single 3N+2 RAID array. In such a case, the ratio of a number of storage devices storing RAID parity information to a total number of storage devices is 2/(3N+2). However, during a reconstruct read operation, (3N+1) storage devices receive a reconstruct read request for a single device failure. In contrast, for the hybrid RAID layout 1200, only N storage devices receive a reconstruct read request for a single device failure.
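
The reconstruct-read trade-off described above can likewise be expressed as a small comparison, under the same assumption of three partitions and the device counts given in the text.

def fanout_single_array(n):
    # a single 3N+2 array: every surviving device is read, i.e. 3N+1 devices
    total = 3 * n + 2
    return total - 1

def fanout_hybrid(n):
    # hybrid layout 1200: only the failed partition's N peer devices are read
    return n

for n in (5, 10, 20):
    print(n, fanout_single_array(n), fanout_hybrid(n))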

It is noted that each of the three partitions may utilize a different RAID data layout architecture. A selection of a given RAID data layout architecture may be based on a given ratio of the number of storage devices storing RAID parity information to the total number of storage devices. In addition, the selection may be based on the number of storage devices that may receive a reconstruct read request during reconstruction. For example, the RAID arrays 1210, 1220 and 1230 may include geometries such as L+a, M+b and N+c, respectively.

In addition, one or more storage devices, such as storage device Q1, may be chosen based on the above or other conditions to provide an additional amount of supported redundancy for one or more of the RAID arrays within the partitions. In an example with three partitions comprising the above RAID arrays and a number Q of storage devices providing extra protection for each of the RAID arrays, a ratio of a number of storage devices storing RAID parity information to a total number of storage devices is (a+b+c+Q)/(L+a+M+b+N+c+Q). For a single device failure, the number of storage devices to receive a reconstruct read request is L, M and N, respectively, for partitions 1 to 3 in the above example. It is noted that the above discussion generally describes 3 distinct partitions in FIG. 12. In such an embodiment, this type of “hard” partitioning, where a given layout is limited to a particular group of devices, may guarantee that reconstruct reads in one partition will not collide with those in another partition. However, in other embodiments the partitions may not be hard as described above. Rather, given a pool of devices, layouts may be selected from any of the devices. For example, treating the devices as one big pool, it is possible to configure layouts such as (L+1, M+1, N+1)+1. Consequently, there is a chance that geometries overlap and reconstruct reads could collide. If L, M, and N are small relative to the size of the pool, then the percentage of reconstruct reads relative to normal reads may be kept low. As noted above, the additional redundancy provided by Q1 may not correspond to a single physical device. Rather, the data corresponding to the logical device Q1 may in fact be distributed among two or more of the devices depicted in FIG. 12. In addition, in various embodiments, the user data (D), parity data (P), and additional data (Q) may all be distributed across a plurality of devices. In such a case, each device may store a mix of user data (D), parity data (P), and additional parity data (Q).
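
A minimal helper for the generalized ratio (a+b+c+Q)/(L+a+M+b+N+c+Q) given above might look as follows; the example geometries are hypothetical.

from fractions import Fraction

def parity_ratio(data_counts, parity_counts, q):
    # data_counts: [L, M, N]; parity_counts: [a, b, c]; q: shared extra-redundancy devices
    parity = sum(parity_counts) + q
    total = sum(data_counts) + sum(parity_counts) + q
    return Fraction(parity, total)

# Example: L+a = 8+1, M+b = 10+1, N+c = 6+2, with one shared Q device.
print(parity_ratio([8, 10, 6], [1, 1, 2], 1))   # -> 5/29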

In addition to the above, in various embodiments, when writing a stripe, the controller may select from any of the plurality of storage devices for one or more of the first RAID layout, the second RAID layout, and storage of redundant data by the additional logical device. In this manner, all of these devices may participate in the RAID groups and for different stripes the additional logical device may be different. In various embodiments, a stripe is a RAID layout on the first subset plus a RAID layout on the second subset plus the additional logical device.
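
As a rough illustration of per-stripe device selection from a single pool, the sketch below draws a first layout, a second layout, and a distinct additional device for each stripe, so the additional logical device can differ from stripe to stripe. The seeded random choice is purely an assumption and not the selection algorithm of this disclosure.

import random

def pick_stripe_devices(pool, l, m, seed):
    rng = random.Random(seed)               # deterministic per-stripe choice
    chosen = rng.sample(pool, l + 1 + m + 1 + 1)
    first = chosen[: l + 1]                 # first RAID layout (L data + 1 parity)
    second = chosen[l + 1 : l + 1 + m + 1]  # second RAID layout (M data + 1 parity)
    extra = chosen[-1]                      # additional logical device for this stripe
    return first, second, extra

pool = [f"dev{i}" for i in range(12)]
for stripe_id in range(3):
    print(pick_stripe_devices(pool, l=3, m=2, seed=stripe_id))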

Referring now to FIG. 13, one embodiment of a method 1300 for selecting alternate RAID geometries in a data storage subsystem is shown. The components embodied in network architecture 100 and data storage arrays 120a-120b described above may generally operate in accordance with method 1300. The steps in this embodiment are shown in sequential order. However, some steps may occur in a different order than shown, some steps may be performed concurrently, some steps may be combined with other steps, and some steps may be absent in another embodiment.

In block 1302, a RAID engine 178 or other logic within a storage controller 174 determines to use a given number of devices to store user data in a RAID array within each partition of a storage subsystem. A RUSH or other algorithm may then be used to select which devices are to be used. In one embodiment, each partition utilizes a same number of storage devices. In other embodiments, each partition may utilize a different, unique number of storage devices to store user data. In block 1304, the storage controller 174 may determine to support a number of storage devices to store corresponding Inter-Device Error Recovery (parity) data within each partition of the subsystem. Again, each partition may utilize a same number or a different, unique number of storage devices for storing RAID parity information.

In block 1306, the storage controller may determine to support a number Q of storage devices to store extra Inter-Device Error Recovery (parity) data across the partitions of the subsystem. In block 1308, both user data and corresponding RAID parity data may be written in selected storage devices. Referring again to FIG. 12, when a given RAID array is written, such as RAID array 1210 in partition 1, one or more bits of parity information may be generated and stored in storage device Q1 in partition 3.
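
Blocks 1302-1308 might be sketched as follows, with XOR standing in for the actual inter-device error-recovery codes and all data structures assumed for illustration: each stripe write stores user data and parity in its partition and folds that parity into shared parity held on a Q device.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def write_stripe(partition, user_blocks, q_device, stripe_index):
    # block 1308: user data and the corresponding RAID parity within the partition
    partition["data"].append(list(user_blocks))
    parity = xor_blocks(user_blocks)
    partition["parity"].append(parity)
    # blocks 1306/1308: fold this partition's parity into the shared Q parity
    # for the same stripe index, so Q covers stripes from every partition
    while len(q_device) <= stripe_index:
        q_device.append(bytes(len(parity)))
    q_device[stripe_index] = xor_blocks([q_device[stripe_index], parity])

p1 = {"data": [], "parity": []}
p2 = {"data": [], "parity": []}
q1 = []
write_stripe(p1, [b"\x01" * 8, b"\x02" * 8], q1, stripe_index=0)
write_stripe(p2, [b"\x04" * 8, b"\x08" * 8], q1, stripe_index=0)
print(q1[0].hex())   # extra parity reflecting writes in both partitions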

If the storage controller 174 detects a condition for performing read reconstruction in a given partition (conditional block 1310), and if the given partition has a sufficient number of storage devices holding RAID parity information to handle a number of unavailable storage devices (conditional block 1312), then in block 1314, the reconstruct read operation(s) is performed with one or more corresponding storage devices within the given partition. The condition may include a storage device within a given RAID array being unavailable due to a device failure, or the device operating below a given performance level. The given RAID array is able to handle a maximum number of unavailable storage devices equal to the number of storage devices storing RAID parity information within the given partition. For example, if RAID array 1210 in partition 1 in the above example is an L+a RAID array, then RAID array 1210 is able to perform read reconstruction utilizing only storage devices within partition 1 when k storage devices are unavailable, where 1<=k<=a.

If the given partition does not have a sufficient number of storage devices holding RAID parity information to handle the number of unavailable storage devices (conditional block 1312), and if there is a sufficient number of Q storage devices to handle the number of unavailable storage devices (conditional block 1316), then in block 1318, the reconstruct read operation(s) is performed with one or more corresponding Q storage devices. One or more storage devices in other partitions, which are storing user data, may be accessed during the read reconstruction. A selection of these storage devices may be based on the manner in which the parity information stored in the one or more Q storage devices was derived. For example, referring again to FIG. 12, storage device D2 in partition 2 may be accessed during the read reconstruction, since this storage device may have been used to generate corresponding RAID parity information stored in storage device Q1. If there is not a sufficient number of Q storage devices to handle the number of unavailable storage devices (conditional block 1316), then in block 1320, the corresponding user data may be read from another source or be considered lost.
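
The decision flow of conditional blocks 1310-1320 can be summarized in a small helper that works purely on device counts; the thresholds mirror the description above, while the actual reconstruction math is omitted.

def plan_reconstruct(failed_in_partition, partition_parity_devices, q_devices):
    if failed_in_partition <= partition_parity_devices:                # conditional block 1312
        return "reconstruct within the partition"                      # block 1314
    if failed_in_partition <= partition_parity_devices + q_devices:    # conditional block 1316
        return "reconstruct using Q devices (reads may span partitions)"  # block 1318
    return "read from another source or consider the data lost"        # block 1320

print(plan_reconstruct(1, 1, 1))   # handled within the partition
print(plan_reconstruct(2, 1, 1))   # requires the shared Q device
print(plan_reconstruct(3, 1, 1))   # beyond the supported redundancy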

It is noted that the above-described embodiments may comprise software. In such an embodiment, the program instructions that implement the methods and/or mechanisms may be conveyed or stored on a computer readable medium. Numerous types of media which are configured to store program instructions are available and include hard disks, floppy disks, CD-ROM, DVD, flash memory, Programmable ROMs (PROM), random access memory (RAM), and various other forms of volatile or non-volatile storage.

In various embodiments, one or more portions of the methods and mechanisms described herein may form part of a cloud-computing environment. In such embodiments, resources may be provided over the Internet as services according to one or more various models. Such models may include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). In IaaS, computer infrastructure is delivered as a service. In such a case, the computing equipment is generally owned and operated by the service provider. In the PaaS model, software tools and underlying equipment used by developers to develop software solutions may be provided as a service and hosted by the service provider. SaaS typically includes a service provider licensing software as a service on demand. The service provider may host the software, or may deploy the software to a customer for a given period of time. Numerous combinations of the above models are possible and are contemplated.

Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Inventors: Colgrove, John; Hayes, John; Miller, Ethan; Hong, Bo

Assignee: Pure Storage, Inc. (assignment on the face of the patent), executed Mar 09 2018