Data re-protection in a distributed replicated data storage system is disclosed. The method may be implemented on a server or controller. A method includes storing first data in a first zone and storing a replica of the first data in a second zone. The zones are at different, separate locations. When an actual or impending failure with the first data in the first zone is detected, the system automatically initiates transitioning to a copy of impacted data at the first zone obtained from the second zone. The transitioning includes creating a remote copy of the impacted data at the second zone within a local area network before transferring the copy to the first zone over a wide area network. The methods allow the system to return to a fully protected state faster than if the impacted data were transferred from the second zone to the first zone without making a copy at the second zone.
11. A storage medium having instructions stored thereon which when executed by a processor cause the processor to perform actions comprising:
storing first data in a first storage zone at a first location of a distributed replicated data storage system
storing a replica of the first data as replicated data in a second storage zone at a remote location as replicated data, wherein the remote location is different from and separate from the first location
detecting an actual or impending failure or problem with a portion of the first data or a storage device included in the first storage zone, including designating an at risk portion of the first data, a missing portion of the first data or an impaired portion of the first data as impacted data
automatically transitioning to a replacement copy of the impacted data at the first storage zone in response to the detecting, the transitioning including
creating a remote copy of the impacted data from the replicated data at the second zone at the remote location within a local area network
transferring the remote copy from the second storage zone at the remote location to the first storage zone at the first location over the wide area network
reconfiguring the first location so that the copy of the impacted data is accessed in place of the impacted data.
1. A method of re-protecting data included in a storage zone of a distributed replicated data storage system, the method comprising:
storing first data in a first storage zone at a first location of the distributed replicated data storage system
storing a replica of the first data as replicated data in a second storage zone at a remote location as replicated data, wherein the remote location is different from and separate from the first location
detecting an actual or impending failure or problem with a portion of the first data or a storage device included in the first storage zone including designating an at risk portion of the first data, a missing portion of the first data or an impaired portion of the first data as impacted data
automatically transitioning to a replacement copy of the impacted data at the first storage zone in response to the detecting, the transitioning including
creating a remote copy of the impacted data from the replicated data at the second storage zone at the remote location within a local area network
transferring the remote copy from the second storage zone at the remote location to the first storage zone at the first location over the wide area network
reconfiguring the first location so that the copy of the impacted data is accessed in place of the impacted data.
21. A computing device to manage a plurality of storage nodes of storage devices arranged as two or more storage zones in a distributed replicated data storage system, the computing device comprising:
a processor;
a memory coupled with the processor;
a computer readable storage medium having instructions stored thereon which when executed cause the computing device to perform actions comprising:
storing first data in a first storage zone at a first location of the distributed replicated data storage system
storing a replica of the first data as replicated data in a second storage zone at a remote location as replicated data, wherein the remote location is different from and separate from the first location
detecting an actual or impending failure or problem with a portion of the first data or a storage device included in the first storage zone, including designating an at risk portion of the first data, a missing portion of the first data or an impaired portion of the first data as impacted data
automatically transitioning to a replacement copy of the impacted data at the first storage zone in response to the detecting, the transitioning including
creating a remote copy of the impacted data from the replicated data at the second zone at the remote location within a local area network
transferring the remote copy from the second storage zone at the remote location to the first storage zone at the first location over the wide area network
reconfiguring the first location so that the copy of the impacted data is accessed in place of the impacted data.
2. The method of
3. The method of
4. The method of
5. The method of
removing the remote copy from the second storage zone at the remote location after the reconfiguring is completed.
6. The method of
7. The method of
8. The method of
9. The method of
10. The method of
12. The storage medium of
13. The storage medium of
14. The storage medium of
15. The storage medium of
removing the remote copy from the second storage zone at the remote location after the reconfiguring is completed.
16. The storage medium of
17. The storage medium of
18. The storage medium of
19. The storage medium of
20. The storage medium of
22. The computing device of
23. The computing device of
24. The computing device of
25. The computing device of
removing the remote copy from the second storage zone at the remote location after the reconfiguring is completed.
26. The computing device of
27. The computing device of
28. The computing device of
29. The computing device of
30. The computing device of
A portion of the disclosure of this patent document contains material which is subject to copyright protection. This patent document may show and/or describe matter which is or may become trade dress of the owner. The copyright and trade dress owner has no objection to the facsimile reproduction by anyone of the patent disclosure as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright and trade dress rights whatsoever.
1. Field
This disclosure relates to data stored in a distributed replicated data storage system and the re-protection of the data.
2. Description of the Related Art
A file system is used to store and organize computer data stored as electronic files. File systems allow files to be found, read, deleted, and otherwise accessed. File systems store files on one or more storage devices. File systems store files on storage media such as hard disk drives and silicon storage devices.
Various applications may store large numbers of documents, images, videos and other data as objects using a distributed replicated data storage system in which data is replicated and stored in at least two locations.
In another embodiment, the storage zones 110 and 120 may be configured such that the first storage zone is a primary storage zone and the second storage zone is a secondary or backup. This is referred to herein as an active-passive storage system. In this embodiment, stored data is accessed from the primary storage zone, and the secondary storage zone may be accessed as needed such as when there is a problem or failure in the primary storage zone. The accessing of data from the secondary storage zone may be based on system rules.
The storage zones 110 and 120 are separated geographically. The storage zones 110 and 120 communicate with each other and share objects over wide area network 130. The wide area network 130 may be or include the Internet. The wide area network 130 may be wired, wireless, or a combination of these. The wide area network 130 may be public or private, may be a segregated network, and may be a combination of these. The wide area network 130 includes networking devices such as routers, hubs, switches and the like.
The term data as used herein includes a bit, byte, word, block, stripe or other unit of information. In one embodiment the data is stored within and by the distributed replicated data storage system as objects. As used herein, the term data is inclusive of entire files or portions of a computer readable file. The computer readable file may include or represent text, numbers, data, images, photographs, graphics, audio, video, computer programs, computer source code, computer object code, executable computer code, and/or a combination of these and similar information. Many data intensive applications store a large quantity of data; these applications include scientific applications, newspaper and magazine websites (for example, nytimes.com and life.com), scientific lab data capturing and analysis programs, video and film creation software, and consumer web based applications such as social networking websites (for example, FACEBOOK), photo sharing websites (for example, FLICKR), video sharing websites (for example, YOUTUBE) and music sharing websites (for example, ITUNES).
Referring again to
The storage zones 110, 120 and 104 may include servers and/or a controller on which software may execute. The server and/or controller may include one or more of logic arrays, memories, analog circuits, digital circuits, software, firmware, and processors such as microprocessors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), programmable logic devices (PLDs) and programmable logic arrays (PLAs). The hardware and firmware components of the servers and/or controller may include various specialized units, circuits, software and interfaces for providing the functionality and features described herein. The processes, functionality and features described herein may be embodied in whole or in part in software which operates on a controller and/or one or more server computers and may be in the form of one or more of firmware, an application program, object code, machine code, an executable file, an applet, a COM object, a dynamic linked library (DLL), a script, one or more subroutines, or an operating system component or service, and other forms of software. The hardware and software and their functions may be distributed such that some components are performed by a controller, server or other computing device, and others by other controllers, servers or other computing devices within a storage zone and/or within the distributed replicated data storage system.
The server may be a computing device. A computing device as used herein refers to any device with a processor, memory and a storage device that may execute instructions such as software including, but not limited to, personal computers, server computers, computing tablets, set top boxes, video game systems, personal video recorders, telephones, personal digital assistants (PDAs), portable computers, and laptop computers. These computing devices may run an operating system, including, for example, versions of the Linux, Unix, MS-DOS, Microsoft Windows, Solaris, Symbian, Android, Chrome, and Apple Mac OS X operating systems. Computing devices may include a network interface in the form of a card, chip or chip set that allows for communication over a wired and/or wireless network. The network interface may allow for communications according to various protocols and standards, including, for example, versions of Ethernet, INFINIBAND® network, Fibre Channel, and others. A computing device with a network interface is considered network capable.
Referring again to
The storage media included in a storage node may be of the same capacity, may have the same physical size, and may conform to the same specification, such as, for example, a hard disk drive specification. Example sizes of storage media include, but are not limited to, 2.5″ and 3.5″. Example hard disk drive capacities include, but are not limited to, 500 Mbytes, 1 terabyte and 2 terabytes. Example hard disk drive specifications include Serial Attached Small Computer System Interface (SAS), Serial Advanced Technology Attachment (SATA), and others. An example storage node may include 16 one terabyte 3.5″ hard disk drives conforming to the SATA standard. In other configurations, the storage nodes 150 may include more or fewer drives, such as, for example, 10, 12, 24, 32, 40, 48, 64, etc. In other configurations, the storage media 160 in a storage node 150 may be hard disk drives, silicon storage devices, magnetic tape devices, or a combination of these. In some embodiments, the physical size of the media in a storage node may differ, and/or the hard disk drive or other storage specification of the media in a storage node may not be uniform among all of the storage devices in a storage node 150.
The storage media 160 in a storage node 150 may, but need not, be included in a single cabinet, rack, shelf or blade. When the storage media in a storage node are included in a single cabinet, rack, shelf or blade, they may be coupled with a backplane. A controller may be included in the cabinet, rack, shelf or blade with the storage devices. The backplane may be coupled with or include the controller. The controller may communicate with and allow for communications with the storage media according to a storage media specification, such as, for example, a hard disk drive specification. The controller may include a processor, volatile memory and non-volatile memory. The controller may be a single computer chip such as an FPGA, ASIC, PLD or PLA. The controller may include or be coupled with a network interface.
In another embodiment, multiple storage nodes 150 are included in a single cabinet or rack. When in a single cabinet or rack, storage nodes and/or constituent storage media may be coupled with a backplane. A controller may be included in the cabinet with the storage media and/or storage nodes. The backplane may be coupled with the controller. The controller may communicate with and allow for communications with the storage media. The controller may include a processor, volatile memory and non-volatile memory. The controller may be a single computer chip such as an FPGA, ASIC, PLD or PLA.
The rack, shelf or cabinet containing a storage node 150 may include a communications interface that allows for connection to a computing device and/or to a network. The communications interface may allow for the transmission of and receipt of information according to one or more of a variety of standards, including, but not limited to, universal serial bus (USB), IEEE 1394 (also known as FIREWIRE® and I.LINK®), Fibre Channel, Ethernet, and WiFi (also known as IEEE 802.11). The backplane or controller in a rack or cabinet containing one or more storage nodes 150 may include a network interface chip, chipset, card or device that allows for communication over a wired and/or wireless network, including Ethernet. In various embodiments, the storage node, controller or backplane may provide for and support 1, 2, 4, 8, 12, 16, etc. network connections and may have an equal number of network interfaces to achieve this.
The techniques discussed herein are described with regard to storage media including, but not limited to, hard disk drives and solid-state drives. The techniques may be implemented with other readable and writable storage media.
As used herein, a storage device is a device that allows for reading from and/or writing to a storage medium. Storage devices include hard disk drives (HDDs), solid-state drives (SSDs), DVD drives, flash memory devices, and others. Storage media include magnetic media such as hard disks and tape, flash memory, and optical disks such as CDs, DVDs and BLU-RAY® discs.
In some embodiments, files and other data may be broken into smaller portions and stored as multiple objects among multiple storage media 160 in a storage node 150. In some embodiments, files and other data may be broken into smaller portions such as objects and stored among multiple storage nodes 150 in a storage zone.
Referring again to
In the distributed replicated data storage system described herein, when writing data to a storage zone, the data may be replicated on one or more additional storage zones to provide for redundancy and/or to allow for ready (that is, quick) access from each of multiple user sites. In various embodiments, replication may be performed synchronously, that is, completed before the write operation is acknowledged; asynchronously, that is, the replicas may be written before, after or during the write of the first copy; or a combination of each.
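The difference between synchronous and asynchronous replication can be illustrated with a minimal in-memory sketch. The class `ReplicatedStore`, its method names and the zone names are hypothetical and chosen only for illustration; the sketch assumes that a synchronous write returns only after every zone holds the data, while an asynchronous write returns after the first copy is stored and queues the remaining replicas for later.

```python
from collections import deque


class ReplicatedStore:
    """Hypothetical in-memory model of zones holding key/value objects."""

    def __init__(self, zone_names):
        self.zones = {name: {} for name in zone_names}
        self.pending = deque()  # replica writes not yet applied

    def write_sync(self, key, value):
        # Synchronous: every zone holds the data before the write is acknowledged.
        for zone in self.zones.values():
            zone[key] = value
        return "ack"

    def write_async(self, key, value, primary):
        # Asynchronous: store the first copy, queue the replicas, acknowledge.
        self.zones[primary][key] = value
        for name in self.zones:
            if name != primary:
                self.pending.append((name, key, value))
        return "ack"

    def drain(self):
        # Apply queued replica writes (in a real system this runs in the background).
        while self.pending:
            name, key, value = self.pending.popleft()
            self.zones[name][key] = value


store = ReplicatedStore(["zone-110", "zone-120"])
store.write_sync("a", b"1")
store.write_async("b", b"2", primary="zone-110")
replicated_before_drain = "b" in store.zones["zone-120"]
store.drain()
replicated_after_drain = "b" in store.zones["zone-120"]
```

In the asynchronous case there is an interval during which only the first copy exists, which is why the two modes trade write latency against the time at which the data becomes fully replicated.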
Because of the cost of data storage devices, it is common to have only one copy of data in each storage zone of the distributed replicated data storage system. In an active-passive system, because of the cost of data storage, it is common to have only one primary copy of data, the primary version at a primary location, and one or more remote replicas of the data at a remote or distant secondary location. In the example shown in
The software running on a controller or server in storage zone 110 may monitor the health of the storage nodes 150 and/or the storage media 160 in the storage zone 110. When an impending or actual problem or failure is detected in the first storage zone, the first storage zone may identify impacted data. As used herein, impacted data is data that is missing, is impaired or is included on a storage device or portion thereof that has been detected as unhealthy or compromised. After identifying impacted data, the first storage zone requests that a copy of the impacted data be transmitted from a geographically remote replica storage zone, the second storage zone. However, if a copy of the impacted data is to be transferred to the first storage zone from the second storage zone located at a different geographical location over a wide area network, the system will be in a precarious and possibly reduced capability state during the time period when the transfer over the wide area network occurs. That is, during the transfer over the wide area network, the distributed replicated data storage system may not operate in a fully replicated manner as the impacted data includes missing and/or impaired data or data stored on an unhealthy or compromised storage device or portion thereof. In this way a customer of the distributed replicated data storage system may not be receiving the performance, reliability or service level desired or required. This is pertinent to both fully active storage systems and active-passive storage systems.
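The definition of impacted data given above can be sketched as a simple classification step. The types `Health` and `StoredObject`, the fields `readable` and `checksum_ok`, and the function `find_impacted` are hypothetical names introduced for illustration; the sketch assumes that the monitoring software already reports a health state per storage device.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    AT_RISK = "at_risk"   # impending failure detected
    FAILED = "failed"     # actual failure detected


@dataclass(frozen=True)
class StoredObject:
    object_id: str
    device_id: str
    readable: bool = True      # False models missing data
    checksum_ok: bool = True   # False models impaired data


def find_impacted(objects, device_health):
    """Return ids of objects that are missing, impaired, or on an unhealthy device."""
    impacted = []
    for obj in objects:
        state = device_health.get(obj.device_id, Health.HEALTHY)
        if not obj.readable or not obj.checksum_ok or state is not Health.HEALTHY:
            impacted.append(obj.object_id)
    return impacted


objects = [
    StoredObject("obj-1", "disk-a"),
    StoredObject("obj-2", "disk-b"),
    StoredObject("obj-3", "disk-a", checksum_ok=False),
    StoredObject("obj-4", "disk-c", readable=False),
]
device_health = {"disk-a": Health.HEALTHY, "disk-b": Health.AT_RISK}
impacted_ids = find_impacted(objects, device_health)
```

Here `obj-2` is impacted only because its device is at risk, `obj-3` because it is impaired, and `obj-4` because it is missing.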
Because the data transfer speed over the wide area network is slower than the data transfer speed within the local area network within a storage zone, according to the methods described herein, the distributed replicated data storage system makes a copy of the replica of the impacted data at the remote location, the second storage zone, before transferring the impacted data to the first storage zone. This is done so that if during the transfer of the impacted data from the second storage zone to the first storage zone an actual or impending problem is detected or an actual problem or failure occurs to the impacted data at the second location, a copy of the replica data will still exist. This averts any data loss that could result from a failure of the replica during transfer of the replica from the second storage zone to the first storage zone. In this way, the impacted data is re-protected before the impacted data is transferred from the second location to the first location.
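The benefit of copying within the local area network first can be illustrated with a rough calculation. The data size and link speeds below are assumed values chosen only for illustration and are not taken from this disclosure; protocol overhead and contention are ignored.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized transfer time: payload bits divided by link rate."""
    return size_bytes * 8 / link_bits_per_second


IMPACTED_BYTES = 10**12   # assumed: 1 terabyte of impacted data
LAN_BPS = 10**10          # assumed: 10 Gbit/s within the second storage zone
WAN_BPS = 10**8           # assumed: 100 Mbit/s between the storage zones

# Without a local copy, only one good copy exists until the WAN transfer ends.
window_without_copy = transfer_seconds(IMPACTED_BYTES, WAN_BPS)

# With a local copy, only one good copy exists until the LAN copy ends.
window_with_copy = transfer_seconds(IMPACTED_BYTES, LAN_BPS)
```

Under these assumptions the interval during which a single good copy exists shrinks from 80,000 seconds (roughly 22 hours) to 800 seconds.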
In one embodiment, where the impacted data is located on one storage device within the second storage zone, the copy is made to a different storage device within the second storage zone (which may be within or external to a particular storage node) to reduce the risk of failure should the storage device in the second storage zone on which the impacted data is stored fail. In another embodiment, where the impacted data is located on one node within the second storage zone, a copy of the impacted data is made to a different node within the second storage zone to reduce the risk of failure should the node on which the impacted data is stored fail. In another embodiment, where the impacted data is located on a first group of two or more nodes within the second storage zone, a copy of the impacted data is made to a second different group of two or more nodes within the second storage zone such that there is no overlap between the nodes included in the first and second groups of storage nodes. The distribution of the data between two different groups of nodes, without shared nodes, is arranged to reduce the risk of failure should the first group of nodes on which the impacted data is stored fail.
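The requirement that the second group of nodes share no node with the first group can be sketched as follows. The function `choose_target_nodes` and the node names are hypothetical; the sketch assumes the target group is the same size as the source group unless stated otherwise, and it simply takes the first eligible nodes rather than applying any load or capacity policy.

```python
def choose_target_nodes(all_nodes, source_nodes, count=None):
    """Pick nodes for the remote copy that do not overlap the source nodes."""
    source = set(source_nodes)
    wanted = len(source) if count is None else count
    candidates = [node for node in all_nodes if node not in source]
    if len(candidates) < wanted:
        raise ValueError("not enough non-overlapping nodes in the storage zone")
    return candidates[:wanted]


zone_nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
first_group = ["node-1", "node-3"]
second_group = choose_target_nodes(zone_nodes, first_group)
```

The same function covers the single-node embodiment when `source_nodes` contains one node, and it raises an error rather than silently reusing a source node when the zone is too small.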
The copy of the impacted data at the second location serves as insurance should a problem occur with the storage node or storage medium on which the impacted data is stored at the second location.
The system then transfers the copy of the impacted data from the remote second location to the first location over a wide area network, such as from the second storage zone at the second location (120) to the first storage zone at the first location (110), as shown in block 360. The system may use either or both the impacted data at the second location or the copy of the impacted data at the second location to transfer the impacted data from the second location to the first location during the reconfiguration of the impacted data at the first location.
The system then reconfigures the first location to use and access the copy of the impacted data received from the remote second location in place of the at risk, impaired or missing impacted data, as shown in block 370. In the example shown in
As an extra precaution, more than one copy of the impacted data may be made at the first location. This may be done as a precautionary measure in the event a storage device or a storage node of the first storage zone proves to be unstable. After the impacted data at the first location has been reconfigured such that there are no actual or impending problems or failures at the first location, the system may remove the remote copy of the impacted data from the remote second location, as shown in block 380. Any additional copies of the original data at the first location may be saved for a period of time in an effort to ensure reliability of the data at the first storage zone and to ensure the ongoing replicated nature of the storage system.
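The sequence of creating the remote copy, transferring it, reconfiguring the first location and removing the remote copy can be sketched end to end with an in-memory model. The class `Zone`, the function `reprotect` and the device and object names are hypothetical and stand in for real storage zones and network transfers; the sketch is an illustration under those assumptions and not the implementation of the claimed method.

```python
class Zone:
    """Hypothetical in-memory stand-in for a storage zone."""

    def __init__(self, name):
        self.name = name
        self.devices = {}  # device_id -> {object_id: bytes}
        self.index = {}    # object_id -> device_id that serves reads

    def put(self, device_id, object_id, data):
        self.devices.setdefault(device_id, {})[object_id] = data

    def read(self, object_id):
        return self.devices[self.index[object_id]][object_id]

    def remove(self, device_id, object_id):
        del self.devices[device_id][object_id]


def reprotect(first, second, object_id, remote_spare, local_target):
    steps = []
    # Copy the replica to a different device inside the second zone (LAN).
    second.put(remote_spare, object_id, second.read(object_id))
    steps.append("remote copy created")
    # Transfer the remote copy to the first zone (WAN), as in block 360.
    first.put(local_target, object_id, second.devices[remote_spare][object_id])
    steps.append("copy transferred")
    # Serve the object from the new copy instead of the impacted data (block 370).
    first.index[object_id] = local_target
    steps.append("first zone reconfigured")
    # Remove the extra remote copy once reconfiguration is complete (block 380).
    second.remove(remote_spare, object_id)
    steps.append("remote copy removed")
    return steps


first = Zone("zone-110")
second = Zone("zone-120")
first.put("disk-a", "obj-1", b"corrupted")   # impacted data in the first zone
first.index["obj-1"] = "disk-a"
second.put("disk-x", "obj-1", b"payload")    # healthy replica in the second zone
second.index["obj-1"] = "disk-x"

steps = reprotect(first, second, "obj-1", remote_spare="disk-y", local_target="disk-b")
```

After the call, reads in the first zone are served from `disk-b`, the original replica on `disk-x` is untouched, and the temporary copy on `disk-y` is gone.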
Although the examples described herein involve a first storage zone and a second storage zone, multiple storage zones may be included in the distributed replicated data storage system 100 and the methods may be implemented equally to all storage zones. Further, although the examples provided herein describe actions that result when actual or impending problems or failures are detected in the first storage zone, the same methods apply when actual or impending problems or failures are detected in the second storage zone or any other storage zones (not shown).
The methods described above and shown in
Throughout this description, the embodiments and examples shown should be considered as exemplars, rather than limitations on the apparatus and procedures disclosed or claimed. Although many of the examples presented herein involve specific combinations of method acts or system elements, it should be understood that those acts and those elements may be combined in other ways to accomplish the same objectives. With regard to flowcharts, additional and fewer steps may be taken, and the steps as shown may be combined or further refined to achieve the methods described herein. Acts, elements and features discussed only in connection with one embodiment are not intended to be excluded from a similar role in other embodiments.
As used herein, “plurality” means two or more.
As used herein, a “set” of items may include one or more of such items.
As used herein, whether in the written description or the claims, the terms “comprising”, “including”, “carrying”, “having”, “containing”, “involving”, and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of”, respectively, are closed or semi-closed transitional phrases with respect to claims.
Use of ordinal terms such as “first”, “second”, “third”, etc., “primary”, “secondary”, “tertiary”, etc. in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
As used herein, “and/or” means that the listed items are alternatives, but the alternatives also include any combination of the listed items.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 18 2012 | DataDirect Networks, Inc. | (assignment on the face of the patent) | / | |||
Sep 26 2012 | OLSTER, DAN | DATA DIRECT NETWORKS, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 029046 | /0043 | |
Jan 12 2015 | DATADIRECT NETWORKS, INC | PREFERRED BANK, AS LENDER | SECURITY INTEREST SEE DOCUMENT FOR DETAILS | 034693 | /0698 | |
Oct 03 2018 | DATADIRECT NETWORKS, INC | Triplepoint Capital LLC | SECURITY INTEREST SEE DOCUMENT FOR DETAILS | 047228 | /0734 |