Exemplary storage network architectures, data architectures, and methods for creating and using snapdifference files in storage networks are described. One exemplary method may be implemented in a processor in a storage network. The method comprises detecting a failure in a source volume, and in response to the failure: terminating communication with one or more applications that generate I/O requests to the source volume; refreshing the source volume; copying a backup data set to the source volume, and while the backup data set is being copied: activating a new snapdifference file; restarting communication with one or more applications that generate I/O requests to the source volume; and recording I/O operations to the source volume in the snapdifference file.
1. A method of computing, comprising:
detecting a failure in a source volume; and
in response to the failure:
terminating communication with one or more applications that generate I/O requests to the source volume;
refreshing the source volume;
copying a backup data set to the source volume, and while the backup data set is being copied:
activating a new snapdifference file;
restarting communication with one or more applications that generate I/O requests to the source volume; and
recording I/O operations to the source volume in the snapdifference file.
10. A data storage system, comprising:
a processor;
one or more storage devices providing mass storage media;
a memory module communicatively connected to the processor;
logic instructions in the memory module which, when executed by the processor, configure the processor to detect a failure in a source volume resident on the one or more storage devices; and in response to the failure:
terminate communication with one or more applications that generate I/O requests to the source volume;
refresh the source volume;
copy a backup data set to the source volume, and while the backup data set is being copied:
activate a new snapdifference file;
restart communication with one or more applications that generate I/O requests to the source volume; and
record I/O operations to the source volume in the snapdifference file.
16. A method of recovering from a failure in a source volume of a data storage system, comprising:
terminating I/O operations to the source volume;
deleting a first snapdifference file;
refreshing the source volume;
copying the data from a validated snapdifference file to the source volume;
restarting I/O operations to the source volume while the data from the validated snapdifference file is being copied to the source volume;
activating a second snapdifference file;
establishing a logical link to the validated snapdifference file; and
recording I/O operations to the source volume in the second snapdifference file;
wherein activating a second snapdifference file comprises:
creating a new logical disk state block;
traversing a logical disk state block pointer until a null successor pointer is encountered;
resetting the null successor pointer to point to the new logical disk state block; and
resetting a predecessor pointer of the new logical disk state block.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
7. The method of
closing the active snapdifference file;
contemporaneously activating a successor snapdifference file; and
recording I/O operations to the source disk volume in the successor snapdifference file.
8. The method of
9. The method of
creating a new logical disk state block;
traversing a logical disk state block pointer until a null successor pointer is encountered;
resetting the null pointer to point to the new logical disk state block; and
resetting a predecessor pointer of the new logical disk state block.
11. The data storage system of
12. The data storage system of
13. The data storage system of
14. The data storage system of
close the active snapdifference file;
contemporaneously activate a successor snapdifference file; and
record I/O operations to the source disk volume in the successor snapdifference file.
15. The data storage system of
create a new logical disk state block;
traverse a logical disk state block pointer until a null successor pointer is encountered;
reset the null pointer to point to the new logical disk state block; and
reset a predecessor pointer of the new logical disk state block.
18. The method of
19. The method of
The described subject matter relates to electronic computing, and more particularly to recovery operations in storage networks.
The ability to duplicate and store the contents of a storage device is an important feature of a storage system. Data may be stored in parallel to safeguard against the failure of a single storage device or medium. Upon a failure of the first storage device or medium, the system may then retrieve a copy of the data contained in a second storage device or medium. The ability to duplicate and store the contents of the storage device also facilitates the creation of a fixed record of contents at the time of duplication. This feature allows users to recover a prior version of inadvertently edited or erased data.
There are space and processing costs associated with copying and storing the contents of a storage device. For example, some storage devices cannot accept input/output (I/O) operations while their contents are being copied. Furthermore, the storage space used to keep the copy cannot be used for other storage needs.
Storage systems and storage software products can provide ways to make point-in-time copies of disk volumes. In some of these products, the copies may be made very quickly, without significantly disturbing applications using the disk volumes. In other products, the copies may be made space efficient by sharing storage instead of copying all the disk volume data.
However, known methodologies for copying data files have limitations. Some of the known disk copy methods do not provide fast copies. Other known disk copy methods are not space-efficient. Still other known disk copy methods provide fast and space-efficient snapshots, but do not do so in a scalable, distributed, table-driven virtual storage system.
Storage systems also present a need for efficient recovery operations in the event of a failure in the hardware, software, or data associated with a primary data set. Thus, there remains a need for improved copy and failure recovery operations in storage devices.
In an exemplary implementation a method of computing may be implemented in a processor in a storage network. The method comprises detecting a failure in a source volume, and in response to the failure: terminating communication with one or more applications that generate I/O requests to the source volume; refreshing the source volume; copying a backup data set to the source volume, and while the backup data set is being copied: activating a new snapdifference file; restarting communication with one or more applications that generate I/O requests to the source volume; and recording I/O operations to the source volume in the snapdifference file.
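The ordering of the recovery steps summarized above can be illustrated with a minimal sketch. The class and function names (SourceVolume, SnapDifference, begin_recovery) are hypothetical and do not appear in the description. The sketch also assumes that a block written by an application during the copy is not overwritten by the backup copy, which the summary does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class SnapDifference:
    """Hypothetical stand-in for a snapdifference file: a log of writes."""
    writes: List[Tuple[int, bytes]] = field(default_factory=list)


@dataclass
class SourceVolume:
    """Hypothetical in-memory source volume keyed by block number."""
    blocks: Dict[int, bytes] = field(default_factory=dict)
    accepting_io: bool = True
    active_snapdiff: Optional[SnapDifference] = None

    def write(self, block: int, data: bytes) -> None:
        if not self.accepting_io:
            raise RuntimeError("I/O to the source volume is suspended")
        self.blocks[block] = data
        # Record the I/O operation in the active snapdifference file.
        if self.active_snapdiff is not None:
            self.active_snapdiff.writes.append((block, data))


def begin_recovery(volume: SourceVolume, backup: Dict[int, bytes]) -> Iterator[int]:
    """Run the recovery steps; return an iterator that copies one block per step."""
    volume.accepting_io = False                 # terminate application communication
    volume.blocks.clear()                       # refresh the source volume
    volume.active_snapdiff = SnapDifference()   # activate a new snapdifference file
    volume.accepting_io = True                  # restart application communication

    def copy_steps() -> Iterator[int]:
        for block, data in backup.items():
            # Assumption: blocks already rewritten by applications are kept.
            volume.blocks.setdefault(block, data)
            yield block

    return copy_steps()
```

Because the copy is exposed as an iterator, a caller can interleave application writes with copy steps, which mimics I/O resuming while the backup data set is still being copied.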
Described herein are exemplary storage network architectures, data architectures, and methods for creating and using difference files in storage networks. The methods described herein may be embodied as logic instructions on a computer-readable medium. When executed on a processor, the logic instructions cause a general purpose computing device to be programmed as a special-purpose machine that implements the described methods. The processor, when configured by the logic instructions to execute the methods recited herein, constitutes structure for performing the described methods.
Exemplary Network Architectures
The subject matter described herein may be implemented in a storage architecture that provides virtualized data storage at a system level, such that virtualization is implemented within a SAN. In the implementations described herein, the computing systems that utilize storage are referred to as hosts. In a typical implementation, a host is any computing system that consumes data storage capacity on its own behalf, or on behalf of systems coupled to the host. For example, a host may be a supercomputer processing large databases, a transaction processing server maintaining transaction records, and the like. Alternatively, the host may be a file server on a local area network (LAN) or wide area network (WAN) that provides storage services for an enterprise.
In a direct-attached storage solution, such a host may include one or more disk controllers or RAID controllers configured to manage multiple directly attached disk drives. By contrast, in a SAN a host connects to the SAN via a high-speed connection technology such as, e.g., a fibre channel (FC) fabric in the particular examples.
A virtualized SAN architecture comprises a group of storage cells, where each storage cell comprises a pool of storage devices called a disk group. Each storage cell comprises parallel storage controllers coupled to the disk group. The storage controllers may be coupled to the storage devices using a fibre channel arbitrated loop connection, or through a network such as a fibre channel fabric or the like. The storage controllers may also be coupled to each other through point-to-point connections to enable them to cooperatively manage the presentation of storage capacity to computers using the storage capacity.
The network architectures described herein represent a distributed computing environment such as an enterprise computing system using a private SAN. However, the network architectures may be readily scaled upwardly or downwardly to meet the needs of a particular application.
A plurality of logical disks (also called logical units or LUNs) 112a, 112b may be allocated within storage pool 110. Each LUN 112a, 112b comprises a contiguous range of logical addresses that can be addressed by host devices 120, 122, 124 and 128 by mapping requests from the connection protocol used by the host device to the uniquely identified LUN 112a, 112b. A host such as server 128 may provide services to other computing or data processing systems or devices. For example, client computer 126 may access storage pool 110 via a host such as server 128. Server 128 may provide file services to client 126, and may provide other services such as transaction processing services, email services, etc. Hence, client device 126 may or may not directly use the storage consumed by host 128.
Devices such as wireless device 120, and computers 122, 124, which also may serve as hosts, may logically couple directly to LUNs 112a, 112b. Hosts 120-128 may couple to multiple LUNs 112a, 112b, and LUNs 112a, 112b may be shared among multiple hosts. Each of the devices shown in
A LUN such as LUN 112a, 112b comprises one or more redundant stores (RStore) which are a fundamental unit of reliable storage. An RStore comprises an ordered set of physical storage segments (PSEGs) with associated redundancy properties and is contained entirely within a single redundant store set (RSS). By analogy to conventional storage systems, PSEGs are analogous to disk drives and each RSS is analogous to a RAID storage set comprising a plurality of drives.
The PSEGs that implement a particular LUN may be spread across any number of physical storage disks. Moreover, the physical storage capacity that a particular LUN 102 represents may be configured to implement a variety of storage types offering varying capacity, reliability and availability features. For example, some LUNs may represent striped, mirrored and/or parity-protected storage. Other LUNs may represent storage capacity that is configured without striping, redundancy or parity protection.
In an exemplary implementation an RSS comprises a subset of physical disks in a Logical Device Allocation Domain (LDAD), and may include from six to eleven physical drives (which can change dynamically). The physical drives may be of disparate capacities. Physical drives within an RSS may be assigned indices (e.g., 0, 1, 2, . . . , 11) for mapping purposes, and may be organized as pairs (i.e., adjacent odd and even indices) for RAID-1 purposes. One problem with large RAID volumes comprising many disks is that the odds of a disk failure increase significantly as more drives are added. A sixteen drive system, for example, will be twice as likely to experience a drive failure (or more critically two simultaneous drive failures), than would an eight drive system. Because data protection is spread within an RSS in accordance with the present invention, and not across multiple RSSs, a disk failure in one RSS has no effect on the availability of any other RSS. Hence, an RSS that implements data protection must suffer two drive failures within the RSS rather than two failures in the entire system. Because of the pairing in RAID-1 implementations, not only must two drives fail within a particular RSS, but a particular one of the drives within the RSS must be the second to fail (i.e. the second-to-fail drive must be paired with the first-to-fail drive). This atomization of storage sets into multiple RSSs where each RSS can be managed independently improves the performance, reliability, and availability of data throughout the system.
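The pairing rule described above (adjacent even and odd indices form a RAID-1 pair, and data is lost only when both members of one pair fail) can be sketched as follows. The function names are hypothetical, and the sketch assumes that pairs are formed as (0, 1), (2, 3), and so on.

```python
from typing import Iterable


def mirror_partner(index: int) -> int:
    """Return the RAID-1 partner of a drive index (0<->1, 2<->3, ...)."""
    # Flipping the lowest bit maps each even index to the next odd index
    # and each odd index to the preceding even index.
    return index ^ 1


def rss_loses_data(failed: Iterable[int]) -> bool:
    """True only if some failed drive's mirror partner has also failed."""
    failed_set = set(failed)
    return any(mirror_partner(i) in failed_set for i in failed_set)
```

For example, failures of drives 0 and 3 leave every pair with one surviving member, whereas failures of drives 2 and 3 remove both copies held by that pair.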
A SAN manager appliance 109 is coupled to a management logical disk set (MLD) 111 which is a metadata container describing the logical structures used to create LUNs 112a, 112b, LDADs 103a, 103b, and other logical structures used by the system. A portion of the physical storage capacity available in storage pool 101 is reserved as quorum space 113 and cannot be allocated to LDADs 103a, 103b, and hence cannot be used to implement LUNs 112a, 112b. In a particular example, each physical disk that participates in storage pool 110 has a reserved amount of capacity (e.g., the first “n” physical sectors) that may be designated as quorum space 113. MLD 111 is mirrored in this quorum space of multiple physical drives and so can be accessed even if a drive fails. In a particular example, at least one physical drive associated with each LDAD 103a, 103b includes a copy of MLD 111 (designated a “quorum drive”). SAN management appliance 109 may wish to associate information such as name strings for LDADs 103a, 103b and LUNs 112a, 112b, and timestamps for object birthdates. To facilitate this behavior, the management agent uses MLD 111 to store this information as metadata. MLD 111 is created implicitly upon creation of each LDAD 103a, 103b.
Quorum space 113 is used to store information including physical store ID (a unique ID for each physical drive), version control information, type (quorum/non-quorum), RSS ID (identifies to which RSS this disk belongs), RSS Offset (identifies this disk's relative position in the RSS), Storage Cell ID (identifies to which storage cell this disk belongs), PSEG size, as well as state information indicating whether the disk is a quorum disk, for example. This metadata PSEG also contains a PSEG free list for the entire physical store, probably in the form of an allocation bitmap. Additionally, quorum space 113 contains the PSEG allocation records (PSARs) for every PSEG on the physical disk. The PSAR comprises a PSAR signature, Metadata version, PSAR usage, and an indication of the RSD to which this PSEG belongs.
CSLD 114 is another type of metadata container comprising logical drives that are allocated out of address space within each LDAD 103a, 103b, but that, unlike LUNs 112a, 112b, may span multiple LDADs 103a, 103b. Preferably, each LDAD 103a, 103b includes space allocated to CSLD 114. CSLD 114 holds metadata describing the logical structure of a given LDAD 103, including a primary logical disk metadata container (PLDMC) that contains an array of descriptors (called RSDMs) that describe every RStore used by each LUN 112a, 112b implemented within the LDAD 103a, 103b. The CSLD 114 implements metadata that is regularly used for tasks such as disk creation, leveling, RSS merging, RSS splitting, and regeneration. This metadata includes state information for each physical disk that indicates whether the physical disk is “Normal” (i.e., operating as expected), “Missing” (i.e., unavailable), “Merging” (i.e., a missing drive that has reappeared and must be normalized before use), “Replace” (i.e., the drive is marked for removal and data must be copied to a distributed spare), and “Regen” (i.e., the drive is unavailable and requires regeneration of its data to a distributed spare).
A logical disk directory (LDDIR) data structure in CSLD 114 is a directory of all LUNs 112a, 112b in any LDAD 103a, 103b. An entry in the LDDIR comprises a universally unique ID (UUID) and an RSD indicating the location of a Primary Logical Disk Metadata Container (PLDMC) for that LUN 102. The RSD is a pointer to the base RSDM or entry point for the corresponding LUN 112a, 112b. In this manner, metadata specific to a particular LUN 112a, 112b can be accessed by indexing into the LDDIR to find the base RSDM of the particular LUN 112a, 112b. The metadata within the PLDMC (e.g., mapping structures described hereinbelow) can be loaded into memory to realize the particular LUN 112a, 112b.
Hence, the storage pool depicted in
Each of the devices shown in
In an exemplary implementation an individual LDAD 103a, 103b may correspond to from as few as four disk drives to as many as several thousand disk drives. In particular examples, a minimum of eight drives per LDAD is required to support RAID-1 within the LDAD 103a, 103b using four paired disks. LUNs 112a, 112b defined within an LDAD 103a, 103b may represent a few megabytes of storage or less, up to 2 TByte of storage or more. Hence, hundreds or thousands of LUNs 112a, 112b may be defined within a given LDAD 103a, 103b, and thus serve a large number of storage needs. In this manner a large enterprise can be served by a single storage pool 1101 providing both individual storage dedicated to each workstation in the enterprise as well as shared storage across the enterprise. Further, an enterprise may implement multiple LDADs 103a, 103b and/or multiple storage pools 1101 to provide a virtually limitless storage capability. Logically, therefore, the virtual storage system in accordance with the present description offers great flexibility in configuration and access.
Client computers 214a, 214b, 214c may access storage cells 210a, 210b, 210c through a host, such as servers 216, 220. Clients 214a, 214b, 214c may be connected to file server 216 directly, or via a network 218 such as a Local Area Network (LAN) or a Wide Area Network (WAN). The number of storage cells 210a, 210b, 210c that can be included in any storage network is limited primarily by the connectivity implemented in the communication network 212. By way of example, a switching fabric comprising a single FC switch can interconnect 256 or more ports, providing a possibility of hundreds of storage cells 210a, 210b, 210c in a single storage network.
Hosts 216, 220 are typically implemented as server computers.
Computing device 330 further includes a hard disk drive 344 for reading from and writing to a hard disk (not shown), and may include a magnetic disk drive 346 for reading from and writing to a removable magnetic disk 348, and an optical disk drive 350 for reading from or writing to a removable optical disk 352 such as a CD ROM or other optical media. The hard disk drive 344, magnetic disk drive 346, and optical disk drive 350 are connected to the bus 336 by a SCSI interface 354 or some other appropriate interface. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for computing device 330. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 348 and a removable optical disk 352, other types of computer-readable media such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROMs), and the like, may also be used in the exemplary operating environment.
A number of program modules may be stored on the hard disk 344, magnetic disk 348, optical disk 352, ROM 338, or RAM 340, including an operating system 358, one or more application programs 360, other program modules 362, and program data 364. A user may enter commands and information into computing device 330 through input devices such as a keyboard 366 and a pointing device 368. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are connected to the processing unit 332 through an interface 370 that is coupled to the bus 336. A monitor 372 or other type of display device is also connected to the bus 336 via an interface, such as a video adapter 374.
Computing device 330 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 376. The remote computer 376 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computing device 330, although only a memory storage device 378 has been illustrated in
When used in a LAN networking environment, computing device 330 is connected to the local network 380 through a network interface or adapter 384. When used in a WAN networking environment, computing device 330 typically includes a modem 386 or other means for establishing communications over the wide area network 382, such as the Internet. The modem 386, which may be internal or external, is connected to the bus 336 via a serial port interface 356. In a networked environment, program modules depicted relative to the computing device 330, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
Hosts 216, 220 may include host adapter hardware and software to enable a connection to communication network 212. The connection to communication network 212 may be through an optical coupling or more conventional conductive cabling depending on the bandwidth requirements. A host adapter may be implemented as a plug-in card on computing device 330. Hosts 216, 220 may implement any number of host adapters to provide as many connections to communication network 212 as the hardware and software support.
Generally, the data processors of computing device 330 are programmed by means of instructions stored at different times in the various computer-readable storage media of the computer. Programs and operating systems may be distributed, for example, on floppy disks, CD-ROMs, or electronically, and are installed or loaded into the secondary memory of a computer. At execution, the programs are loaded at least partially into the computer's primary electronic memory.
Each NSC 410a, 410b further includes a communication port 428a, 428b that enables a communication connection 438 between the NSCs 410a, 410b. The communication connection 438 may be implemented as a FC point-to-point connection, or pursuant to any other suitable communication protocol.
In an exemplary implementation, NSCs 410a, 410b further include a plurality of Fiber Channel Arbitrated Loop (FCAL) ports 420a-426a, 420b-426b that implement an FCAL communication connection with a plurality of storage devices, e.g., arrays of disk drives 440, 442. While the illustrated embodiment implements FCAL connections with the arrays of disk drives 440, 442, it will be understood that the communication connection with arrays of disk drives 440, 442 may be implemented using other communication protocols. For example, rather than an FCAL configuration, a FC switching fabric or a small computer system interface (SCSI) connection may be used.
In operation, the storage capacity provided by the arrays of disk drives 440, 442 may be added to the storage pool 110. When an application requires storage capacity, logic instructions on a host computer 128 establish a LUN from storage capacity available on the arrays of disk drives 440, 442 available in one or more storage sites. It will be appreciated that, because a LUN is a logical unit, not necessarily a physical unit, the physical storage space that constitutes the LUN may be distributed across multiple storage cells. Data for the application is stored on one or more LUNs in the storage network. An application that needs to access the data queries a host computer, which retrieves the data from the LUN and forwards the data to the application.
One or more of the storage cells 210a, 210b, 210c in the storage network 200 may implement RAID-based storage. RAID (Redundant Array of Independent Disks) storage systems are disk array systems in which part of the physical storage capacity is used to store redundant data. RAID systems are typically characterized as one of six architectures, enumerated under the acronym RAID. A RAID 0 architecture is a disk array system that is configured without any redundancy. Since this architecture is really not a redundant architecture, RAID 0 is often omitted from a discussion of RAID systems.
A RAID 1 architecture involves storage disks configured according to mirror redundancy. Original data is stored on one set of disks and a duplicate copy of the data is kept on separate disks. The RAID 2 through RAID 5 architectures all involve parity-type redundant storage. Of particular interest, a RAID 5 system distributes data and parity information across a plurality of the disks. Typically, the disks are divided into equally sized address areas referred to as “blocks”. A set of blocks from each disk that have the same unit address ranges are referred to as “stripes”. In RAID 5, each stripe has N blocks of data and one parity block, which contains redundant information for the data in the N blocks.
In RAID 5, the parity block is cycled across different disks from stripe-to-stripe. For example, in a RAID 5 system having five disks, the parity block for the first stripe might be on the fifth disk; the parity block for the second stripe might be on the fourth disk; the parity block for the third stripe might be on the third disk; and so on. The parity block for succeeding stripes typically “precesses” around the disk drives in a helical pattern (although other patterns are possible). RAID 2 through RAID 4 architectures differ from RAID 5 in how they compute and place the parity block on the disks. The particular RAID class implemented is not important.
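The parity rotation in the five-disk example above, together with the conventional XOR parity computation, can be sketched as follows. Disk and stripe indices are zero-based here (so the "fifth disk" is index 4), the function names are hypothetical, and XOR is the customary RAID 5 parity function rather than something the description specifies.

```python
from functools import reduce
from typing import Sequence


def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk holding parity for a stripe: last disk first, then moving left."""
    return (num_disks - 1 - stripe) % num_disks


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Parity for one stripe with N = 3 data blocks.
data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
parity = xor_blocks(data)

# A lost data block is rebuilt by XOR-ing the survivors with the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

With five disks, stripes 0 through 5 place parity on disks 4, 3, 2, 1, 0 and 4 again, which matches the rotation in the example.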
The memory representation described herein enables each LUN 112a, 112b to implement from 1 Mbyte to 2 TByte in storage capacity. Larger storage capacities per LUN 112a, 112b are contemplated. For purposes of illustration a 2 Terabyte maximum is used in this description. Further, the memory representation enables each LUN 112a, 112b to be defined with any type of RAID data protection, including multi-level RAID protection, as well as supporting no redundancy at all. Moreover, multiple types of RAID data protection may be implemented within a single LUN 112a, 112b such that a first range of logical disk addresses (LDAs) correspond to unprotected data, and a second set of LDAs within the same LUN 112a, 112b implement RAID 5 protection. Hence, the data structures implementing the memory representation must be flexible to handle this variety, yet efficient such that LUNs 112a, 112b do not require excessive data structures.
A persistent copy of the memory representation shown in
A logical disk mapping layer maps a LDA specified in a request to a specific RStore as well as an offset within the RStore. Referring to the embodiment shown in
L2MAP 501 includes a plurality of entries where each entry represents 2 Gbyte of address space. For a 2 Tbyte LUN 112a, 112b, therefore, L2MAP 501 includes 1024 entries to cover the entire address space in the particular example. Each entry may include state information corresponding to the corresponding 2 Gbyte of storage, and a pointer to a corresponding LMAP descriptor 503. The state information and pointer are only valid when the corresponding 2 Gbyte of address space have been allocated; hence, some entries in L2MAP 501 will be empty or invalid in many applications.
The address range represented by each entry in LMAP 503 is referred to as the logical disk address allocation unit (LDAAU). In the particular implementation, the LDAAU is 1 MByte. An entry is created in LMAP 503 for each allocated LDAAU irrespective of the actual utilization of storage within the LDAAU. In other words, a LUN 102 can grow or shrink in size in increments of 1 Mbyte. The LDAAU represents the granularity with which address space within a LUN 112a, 112b can be allocated to a particular storage task.
An LMAP 503 exists only for each 2 Gbyte increment of allocated address space. If less than 2 Gbyte of storage is used in a particular LUN 112a, 112b, only one LMAP 503 is required, whereas, if 2 Tbyte of storage is used, 1024 LMAPs 503 will exist. Each LMAP 503 includes a plurality of entries where each entry optionally corresponds to a redundancy segment (RSEG). An RSEG is an atomic logical unit that is roughly analogous to a PSEG in the physical domain, akin to a logical disk partition of an RStore. In a particular embodiment, an RSEG is a logical unit of storage that spans multiple PSEGs and implements a selected type of data protection. Entire RSEGs within an RStore are bound to contiguous LDAs in a preferred implementation. In order to preserve the underlying physical disk performance for sequential transfers, it is desirable to adjacently locate all RSEGs from an RStore in order, in terms of LDA space, so as to maintain physical contiguity. If, however, physical resources become scarce, it may be necessary to spread RSEGs from RStores across disjoint areas of a LUN 102. The logical disk address specified in a request 501 selects a particular entry within LMAP 503 corresponding to a particular RSEG that in turn corresponds to a 1 Mbyte address space allocated to the particular RSEG#. Each LMAP entry also includes state information about the particular RSEG, and an RSD pointer.
Optionally, the RSEG#s may be omitted, which results in the RStore itself being the smallest atomic logical unit that can be allocated. Omission of the RSEG# decreases the size of the LMAP entries and allows the memory representation of a LUN 102 to demand fewer memory resources per MByte of storage. Alternatively, the RSEG size can be increased, rather than omitting the concept of RSEGs altogether, which also decreases demand for memory resources at the expense of decreased granularity of the atomic logical unit of storage. The RSEG size in proportion to the RStore can, therefore, be changed to meet the needs of a particular application.
The RSD pointer points to a specific RSD 505 that contains metadata describing the RStore in which the corresponding RSEG exists. As shown in
In operation, each request for storage access specifies a LUN 112a, 112b, and an address. A NSC such as NSC 410a, 410b maps the logical drive specified to a particular LUN 112a, 112b, then loads the L2MAP 501 for that LUN 102 into memory if it is not already present in memory. Preferably, all of the LMAPs and RSDs for the LUN 102 are loaded into memory as well. The LDA specified by the request is used to index into L2MAP 501, which in turn points to a specific one of the LMAPs. The address specified in the request is used to determine an offset into the specified LMAP such that a specific RSEG that corresponds to the request-specified address is returned. Once the RSEG# is known, the corresponding RSD is examined to identify specific PSEGs that are members of the redundancy segment, and metadata that enables a NSC 410a, 410b to generate drive specific commands to access the requested data. In this manner, an LDA is readily mapped to a set of PSEGs that must be accessed to implement a given storage request.
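The two-level lookup described above can be illustrated with a short sketch. It assumes, for simplicity, that the LDA is expressed as a byte address (a real controller would more likely address sectors) and it models the L2MAP and LMAPs as sparse dictionaries. The names split_lda and lookup_rsd are hypothetical.

```python
from typing import Dict, Optional, Tuple

MBYTE = 1 << 20
L2_ENTRY_SPAN = 2 << 30                  # each L2MAP entry covers 2 Gbyte
LDAAU = MBYTE                            # each LMAP entry covers 1 Mbyte
LMAP_ENTRIES = L2_ENTRY_SPAN // LDAAU    # 2048 LMAP entries per L2MAP entry


def split_lda(byte_address: int) -> Tuple[int, int, int]:
    """Return (L2MAP index, LMAP index, offset within the 1 Mbyte unit)."""
    l2_index, within_l2 = divmod(byte_address, L2_ENTRY_SPAN)
    lmap_index, offset = divmod(within_l2, LDAAU)
    return l2_index, lmap_index, offset


def lookup_rsd(l2map: Dict[int, Dict[int, str]], byte_address: int) -> Optional[str]:
    """Follow L2MAP -> LMAP -> RSD; None if the address is unallocated."""
    l2_index, lmap_index, _ = split_lda(byte_address)
    lmap = l2map.get(l2_index)           # empty L2MAP entry: no LMAP exists
    if lmap is None:
        return None
    return lmap.get(lmap_index)          # RSD descriptor for the RSEG, if any


# One allocated 1 Mbyte unit at 3 Gbyte + 5 Mbyte, described by a placeholder RSD.
example_l2map = {1: {1029: "rsd-A"}}
```

Only allocated regions carry an LMAP in this model, mirroring the statement that LMAPs exist only for allocated 2 Gbyte increments.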
The L2MAP consumes 4 Kbytes per LUN 112a, 112b regardless of size in an exemplary implementation. In other words, the L2MAP includes entries covering the entire 2 Tbyte maximum address range even where only a fraction of that range is actually allocated to a LUN 112a, 112b. It is contemplated that variable size L2MAPs may be used, however such an implementation would add complexity with little savings in memory. LMAP segments consume 4 bytes per Mbyte of address space while RSDs consume 3 bytes per MB. Unlike the L2MAP, LMAP segments and RSDs exist only for allocated address space.
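The figures given above permit a rough estimate of mapping overhead. The following sketch is illustrative only; the function name `mapping_overhead` is hypothetical, and the calculation assumes that allocation is expressed in whole Mbytes and that one LMAP exists per 2 Gbyte increment of allocated space.

```python
import math

L2MAP_BYTES = 4 * 1024      # fixed per LUN, regardless of size
LMAP_BYTES_PER_MB = 4       # LMAP segments: 4 bytes per Mbyte of address space
RSD_BYTES_PER_MB = 3        # RSDs: 3 bytes per Mbyte
LMAP_SPAN_MB = 2 * 1024     # one LMAP per 2 Gbyte increment


def mapping_overhead(allocated_mb):
    """Estimate mapping-structure memory for a LUN with `allocated_mb` Mbytes allocated."""
    lmap_bytes = allocated_mb * LMAP_BYTES_PER_MB
    rsd_bytes = allocated_mb * RSD_BYTES_PER_MB
    return {
        "lmap_count": math.ceil(allocated_mb / LMAP_SPAN_MB),
        "l2map_bytes": L2MAP_BYTES,
        "lmap_bytes": lmap_bytes,
        "rsd_bytes": rsd_bytes,
        "total_bytes": L2MAP_BYTES + lmap_bytes + rsd_bytes,
    }


one_gbyte = mapping_overhead(1024)               # a 1 Gbyte LUN
two_tbyte = mapping_overhead(2 * 1024 * 1024)    # the 2 Tbyte maximum
```

Under these assumptions a fully allocated 2 Tbyte LUN requires 1024 LMAPs and roughly 14 Mbyte of mapping metadata, while a 1 Gbyte LUN requires a single LMAP and about 11 Kbytes.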
RStores are allocated in their entirety to a specific LUN 102. RStores may be partitioned into 1 Mbyte segments (RSEGs) as shown in
RStores are essentially a fixed quantity (8 MByte in the examples) of virtual address space. RStores consume from four to eight PSEGs in their entirety depending on the data protection level. A striped RStore without redundancy consumes 4 PSEGs (4 × 2048 KByte PSEGs = 8 MB), an RStore with 4+1 parity consumes 5 PSEGs and a mirrored RStore consumes eight PSEGs to implement the 8 Mbyte of virtual address space.
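The PSEG counts above follow directly from the 8 MByte RStore size and the 2048 KByte PSEG size. A minimal sketch of the arithmetic is given below; the protection-level labels and function names are hypothetical and are chosen only for illustration.

```python
PSEG_MB = 2                          # 2048 KByte per PSEG
RSTORE_MB = 8                        # virtual address space per RStore
DATA_PSEGS = RSTORE_MB // PSEG_MB    # PSEGs needed to hold the data itself


def psegs_per_rstore(protection):
    """Number of PSEGs consumed by one RStore at the given protection level."""
    if protection == "striped":
        return DATA_PSEGS            # no redundancy
    if protection == "parity_4_plus_1":
        return DATA_PSEGS + 1        # one additional PSEG for parity
    if protection == "mirrored":
        return DATA_PSEGS * 2        # every data PSEG is duplicated
    raise ValueError(f"unknown protection level: {protection}")


def usable_fraction(protection):
    """Fraction of consumed physical space that holds user data."""
    return DATA_PSEGS / psegs_per_rstore(protection)


counts = {p: psegs_per_rstore(p) for p in ("striped", "parity_4_plus_1", "mirrored")}
```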
An RStore is analogous to a RAID disk set, differing in that it comprises PSEGs rather than physical disks. An RStore is smaller than conventional RAID storage volumes, and so a given LUN 102 will comprise multiple RStores as opposed to a single RAID storage volume in conventional systems.
It is contemplated that drives 405 may be added and removed from an LDAD 103 over time. Adding drives means existing data can be spread out over more drives while removing drives means that existing data must be migrated from the exiting drive to fill capacity on the remaining drives. This migration of data is referred to generally as “leveling”. Leveling attempts to spread data for a given LUN 102 over as many physical drives as possible. The basic purpose of leveling is to distribute the physical allocation of storage represented by each LUN 102 such that the usage for a given logical disk on a given physical disk is proportional to the contribution of that physical volume to the total amount of physical storage available for allocation to a given logical disk.
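The proportionality goal of leveling may be illustrated as follows. This sketch computes only the target distribution, not the data migration itself; the function name, drive labels, and capacities are hypothetical.

```python
def leveling_targets(lun_psegs, drive_capacities):
    """Target number of a LUN's PSEGs on each drive, proportional to the
    drive's share of the total physical capacity available to the LUN."""
    total = sum(drive_capacities.values())
    return {drive: lun_psegs * cap / total for drive, cap in drive_capacities.items()}


# A LUN occupying 1200 PSEGs on three drives, one of which is twice as large.
before = leveling_targets(1200, {"d0": 100, "d1": 100, "d2": 200})

# After a fourth drive is added, the same data is spread over more drives.
after = leveling_targets(1200, {"d0": 100, "d1": 100, "d2": 200, "d3": 200})
```

The difference between the current placement and these targets indicates how many PSEGs would need to be migrated to or from each drive.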
Existing RStores can be modified to use the new PSEGs by copying data from one PSEG to another and then changing the data in the appropriate RSD to indicate the new membership. Subsequent RStores that are created in the RSS will use the new members automatically. Similarly, PSEGs can be removed by copying data from populated PSEGs to empty PSEGs and changing the data in LMAP 502 to reflect the new PSEG constituents of the RSD. In this manner, the relationship between physical storage and logical presentation of the storage can be continuously managed and updated to reflect current storage environment in a manner that is invisible to users.
Snapdifference Files
In one aspect, the system is configured to implement files referred to herein as snapdifference files or snapdifference objects. Snapdifference files are entities designed to combine certain characteristics of snapshots (i.e., capacity efficiency by sharing data with successor and predecessor files when there has been no change to the data during the life of the snapdifference) with time characteristics of log files. Snapdifference files may also be used in combination with a base snapclone and other snapdifferences to provide the ability to view different copies of data through time. Snapdifference files also capture all new data targeted at a LUN starting at a point in time, until it is decided to deactivate the snapdifference and start a new one.
Snapdifference files may be structured similarly to snapshots. Snapdifference files may use metadata structures similar to the metadata structures used in snapshots to enable snapdifference files to share data with a predecessor LUN when appropriate, but to contain unique or different data when the time of data arrival occurs during the active period of a snapdifference. A successor snapdifference can reference data in a predecessor snapdifference or predecessor LUN via the same mechanism.
By way of example, assume LUN A is active until 1:00 pm Sep. 12, 2004. Snapdifference 1 of LUN A is active from 1:00 pm+ until 2:00 pm Sep. 12, 2004. Snapdifference 2 of LUN A is active from 2:00 pm+ until 3:00 pm Sep. 12, 2004. Data in each of LUN A, Snapdifference 1 and Snapdifference 2 may be accessed using the same virtual metadata indexing methods. Snapdifference 1 contains unique data that has changed (at the granularity of the indexing scheme used) from after 1:00 pm to 2:00 pm and shares all other data with LUN A. Snapdifference 2 contains unique data that has changed from after 2:00 pm to 3:00 pm and shares all other data with either Snapdifference 1 or LUN A. This data is accessed using the above-mentioned indexing and sharing-bit scheme, referred to as a snap tree. Changes over time are thereby maintained: LUN A alone provides a view of the data as of 1:00 pm and earlier; Snapdifference 1 together with LUN A provides a view of the data as of 2:00 pm and earlier; and Snapdifference 2 together with Snapdifference 1 and LUN A provides a view of the data as of 3:00 pm and earlier. Alternatively, segmented time views are available: Snapdifference 1 alone provides a view of the data changed from 1:00 pm to 2:00 pm, and Snapdifference 2 alone provides a view of the data changed from 2:00 pm to 3:00 pm.
Hence, snapdifferences share similarities with log files in that snapdifference files associate data with time (i.e., they collect new data from time a to time b), while being structurally similar to a snapshot (i.e., they have characteristics of a snapshot, namely speed of data access and space efficiency, along with the ability to maintain changes over time).
By combining key snapshot characteristics and structure with the log file time model, snapdifferences may be used to provide an always-in-synch mirroring capability, time maintenance for data, straightforward space-efficient incremental backup, and powerful instant recovery mechanisms.
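The time views in the example above may be modeled, for illustration only, as a chain of objects linked by predecessor pointers. In this sketch a block that is absent from an object's `unique` dictionary stands in for a set sharing bit; the class name `Node`, the block indices, and the data values are hypothetical.

```python
class Node:
    """LUN A or a snapdifference; holds only data written during its active period."""

    def __init__(self, name, predecessor=None):
        self.name = name
        self.predecessor = predecessor
        self.unique = {}    # block index -> data; a missing key means "shared"


def read(node, block):
    """Resolve a block by walking predecessor pointers until unique data is found."""
    while node is not None:
        if block in node.unique:
            return node.unique[block]
        node = node.predecessor
    return None


lun_a = Node("LUN A")
lun_a.unique = {0: "a0", 1: "a1", 2: "a2"}     # data as of 1:00 pm

snap1 = Node("Snapdifference 1", predecessor=lun_a)
snap1.unique = {1: "b1"}                       # changed between 1:00 pm and 2:00 pm

snap2 = Node("Snapdifference 2", predecessor=snap1)
snap2.unique = {2: "c2"}                       # changed between 2:00 pm and 3:00 pm

view_1pm = [read(lun_a, b) for b in range(3)]
view_2pm = [read(snap1, b) for b in range(3)]
view_3pm = [read(snap2, b) for b in range(3)]
segment_1_to_2 = dict(snap1.unique)            # segmented view: only what changed
```

Each cumulative view is obtained by starting the walk at a different object, while a segmented view is simply the unique data held by a single snapdifference.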
As used herein, the term prenormalized snapclone refers to a snapclone that synchronizes with the source volume 710 before the snapclone is split from the source volume 710. A prenormalized snapclone represents a point-in-time copy of the source volume at the moment the snapclone is split from the source volume. By contrast, a postnormalized snapclone is created at a specific point in time, but a complete, separate copy of the data in the source volume 710 is not completed until a later point in time.
A snapdifference file is created and activated at a particular point in time, and subsequently all I/O operations that affect data in the source volume 710 are copied contemporaneously to the active snapdifference file. At a desired point in time or when a particular threshold is reached (e.g., when a snapdifference file reaches a predetermined size), the snapdifference file may be closed and another snapdifference file may be activated. After a snapdifference file 730, 732, 734 has been inactivated it may be merged into the snapclone 720. In addition, snapdifference files may be backed up to a tape drive such as tape drive 742, 744, 746.
In one implementation, a snapdifference file is created and activated contemporaneous with the creation of a snapclone such as snapclone 720. I/O operations directed to source volume 710 are copied to the active snapdifference file, such as snapdifference file 730.
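The activation and closing of snapdifference files on a size threshold may be sketched as follows. The sketch is illustrative only: it measures size as a count of recorded I/O operations, and the class name `SnapdifferenceSet` and its methods are hypothetical. A time-based trigger could call `rotate()` directly.

```python
class SnapdifferenceSet:
    """Keeps one active snapdifference file and a list of closed ones."""

    def __init__(self, max_entries):
        self.max_entries = max_entries   # size threshold for closing the active file
        self.closed = []                 # deactivated files, oldest first
        self.active = []                 # I/O operations recorded in the active file

    def record(self, lba, data):
        """Copy a write directed at the source volume into the active file."""
        self.active.append((lba, data))
        if len(self.active) >= self.max_entries:
            self.rotate()

    def rotate(self):
        """Close the active file and contemporaneously activate a new one."""
        self.closed.append(self.active)
        self.active = []


files = SnapdifferenceSet(max_entries=2)
for lba, data in [(10, "x"), (11, "y"), (10, "z")]:
    files.record(lba, data)
```

Closed files in this model are candidates for merging into the snapclone or for backup to tape, as described above.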
Snapdifference files will be explained in greater detail with reference to
Accordingly, blocks of the flowchart illustrations support combinations of means for performing the specified functions and combinations of steps for performing the specified functions. It will also be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
If, at operation 1115, it is determined that the read request is not directed to a snapdifference file, then control passes to operation 1135 and the read request may be executed from the LD identified in the read request pursuant to normal operating procedures. By contrast, if at operation 1115 it is determined that the read request is directed to a snapdifference file, then operations 1120-1130 are executed to traverse the existing snapdifference files to locate the LBA identified in the read request.
At operation 1120 the active snapdifference file is examined to determine whether the sharing bit associated with the LBA identified in the read request is set. If the sharing bit is not set, which indicates that the active snapdifference file includes new data in the identified LBA, then control passes to operation 1135 and the read request may be executed from the LBA in the snapdifference file identified in the read request.
By contrast, if at operation 1120 the sharing bit is set, then control passes to operation 1125, where it is determined whether the active snapdifference file's predecessor is another snapdifference file. In an exemplary implementation this may be determined by analyzing the LDSB identified by the active snapdifference's predecessor pointer, as depicted in
If, at operation 1215, it is determined that the write request is not directed to a snapdifference file, then control passes to operation 1245 and the write request is executed against the LD identified in the write request pursuant to normal operating procedures, and an acknowledgment is returned to the host computer (operation 1255). By contrast, if at operation 1215 it is determined that the write request is directed to a snapdifference file, then operations 1220-1230 are executed to traverse the existing snapdifference files to locate the LBA identified in the write request.
At operation 1220 the active snapdifference file is examined to determine whether the sharing bit associated with the LBA identified in the write request is set. If the sharing bit is not set, which indicates that the active snapdifference file includes new data in the identified LBA, then control passes to operation 1250 and the write request may be executed against the LBA in the snapdifference file identified in the write request. It will be appreciated that the write operation may re-write only the LBAs changed by the write operation, or the entire RSEG(s) containing the LBAs changed by the write operation, depending upon the configuration of the system.
By contrast, if at operation 1220 the sharing bit is set, then control passes to operation 1225, where it is determined whether the active snapdifference file's predecessor is another snapdifference file. In an exemplary implementation this may be determined by analyzing the LDSB identified by the active snapdifference's predecessor pointer, as depicted in
By contrast, if at operation 1225 it is determined that the write request is directed to a snapdifference file, then operations 1225-1230 are executed to traverse the existing snapdifference files until the LBA identified in the write request is located, either in a snapdifference file or in a LD. Operations 1235-1250 are then executed to copy the RSEG changed by the write operation into the active snapdifference file.
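The write path described above amounts to a copy-on-write at RSEG granularity. The following sketch is illustrative only; it assumes a hypothetical RSEG of four blocks, treats an RSEG that is absent from a file's dictionary as shared with its predecessor, and uses names (`Layer`, `locate`, `write`) that do not appear in the described system.

```python
RSEG_BLOCKS = 4   # blocks per RSEG; a hypothetical value chosen for brevity


class Layer:
    """A logical disk or snapdifference file in the predecessor chain."""

    def __init__(self, predecessor=None):
        self.predecessor = predecessor
        self.rsegs = {}   # RSEG index -> list of blocks; a missing key means "shared"


def locate(layer, rseg):
    """Traverse predecessors until the RSEG is found in a file or in the LD."""
    while layer is not None:
        if rseg in layer.rsegs:
            return layer.rsegs[rseg]
        layer = layer.predecessor
    return [None] * RSEG_BLOCKS   # never written anywhere in the chain


def write(active, lba, data):
    """Write one block into the active snapdifference file."""
    rseg, offset = divmod(lba, RSEG_BLOCKS)
    if rseg not in active.rsegs:
        # Shared RSEG: copy it into the active file before modifying it.
        active.rsegs[rseg] = list(locate(active.predecessor, rseg))
    active.rsegs[rseg][offset] = data


base = Layer()
base.rsegs[0] = ["a", "b", "c", "d"]
active = Layer(predecessor=base)

write(active, 1, "B")   # first write to RSEG 0 triggers the copy
write(active, 2, "C")   # second write lands directly in the active file
write(active, 5, "N")   # RSEG 1 exists nowhere yet; starts from an empty RSEG
```

After the first write to a shared RSEG, later writes to the same RSEG no longer require traversal, which corresponds to the case in which the sharing bit is not set.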
As noted above, in one implementation a snapdifference file may be time-bound, i.e., a snapdifference file may be activated at a specific point in time and may be deactivated at a specific point in time.
The process begins at operation 1310, when a request to merge the snapdifference file is received. In an exemplary implementation the merge request may be generated by a host computer and may identify one or more snapdifference files and the snapclone into which the snapdifference file(s) are to be merged.
At operation 1315 the “oldest” snapdifference file is located. In an exemplary implementation the oldest snapdifference may be located by following the predecessor/successor pointer trail of the LDSB maps until an LDSB having a predecessor pointer that maps to the snapclone is located. Referring again to
Operation 1320 initiates an iterative loop through each RSEG in each RSTORE mapped in the snapdifference file. If, at operation 1325 there are no more RSEGs in the RSTORE to analyze, then control passes to operation 1360, which determines whether there are additional RSTORES to analyze.
If at operation 1325 there are additional RSEGs in the RSTORE to analyze, then control passes to operation 1330, where it is determined whether either the successor sharing bit or the predecessor sharing bit is set for the RSEG. If either of these sharing bits is set, then there is no need to merge the data in the RSEG, so control passes to operation 1355.
By contrast, if at operation 1330 the sharing bit is not set, then control passes to operation 1335 and the RSEG is read, and the data in the RSEG is copied (operation 1340) into the corresponding memory location in the predecessor, i.e., the snapclone. At operation 1345 the sharing bit is reset in the RSEG of the snapdifference being merged. If, at operation 1355, there are more RSEGs in the RSTORE to analyze, then control passes back to operation 1330. Operations 1330-1355 are repeated until all RSEGs in the RSTORE have been analyzed, whereupon control passes to operation 1360, which determines whether there are more RSTORES to analyze. If, at operation 1360, there are more RSTORES to analyze, then control passes back to operation 1325, which restarts the loop of operations 1330 through 1355 for the selected RSTORE.
The operations of 1325 through 1360 are repeated until there are no more RSTORES to analyze in operation 1360, in which case control passes to operation 1365 and the successor pointer in the predecessor LDSB (i.e., the LDSB associated with the snapclone) is set to point to the successor of the LDSB that was merged. At operation 1370 the LDSB that was merged is set to NULL, effectively terminating the existence of the merged LDSB. This process may be repeated to successively merge the “oldest” snapdifference files into the snapclone. This also frees up the merged snapdifference LDSB for reuse.
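The merge procedure may be sketched as follows, for illustration only. The sketch flattens the RSTORE/RSEG loops into a single dictionary of unshared RSEGs per LDSB, and it also updates the successor's predecessor pointer, which is an assumption not stated in the text. The class and function names are hypothetical.

```python
class LDSB:
    """Stand-in for a snapclone or snapdifference with its pointer pair."""

    def __init__(self, name, rsegs=None):
        self.name = name
        self.predecessor = None
        self.successor = None
        self.rsegs = dict(rsegs or {})   # only unshared RSEGs are present


def link(predecessor, successor):
    predecessor.successor = successor
    successor.predecessor = predecessor


def find_oldest(active, snapclone):
    """Follow predecessor pointers until the file whose predecessor is the snapclone."""
    node = active
    while node is not None and node.predecessor is not snapclone:
        node = node.predecessor
    return node


def merge_oldest(active, snapclone):
    """Merge the oldest snapdifference into the snapclone; return its name or None."""
    oldest = find_oldest(active, snapclone)
    if oldest is None:
        return None
    for rseg in list(oldest.rsegs):
        # Unshared RSEG: copy into the snapclone and drop it from the merged file.
        snapclone.rsegs[rseg] = oldest.rsegs.pop(rseg)
    snapclone.successor = oldest.successor
    if oldest.successor is not None:
        oldest.successor.predecessor = snapclone   # assumption: keep both pointers consistent
    oldest.predecessor = oldest.successor = None   # the merged LDSB is free for reuse
    return oldest.name


snapclone = LDSB("snapclone", {0: "a0", 1: "a1"})
sd1 = LDSB("sd1", {1: "b1"})
sd2 = LDSB("sd2", {0: "c0"})
link(snapclone, sd1)
link(sd1, sd2)

first = merge_oldest(sd2, snapclone)
after_first = dict(snapclone.rsegs)
second = merge_oldest(sd2, snapclone)
after_second = dict(snapclone.rsegs)
third = merge_oldest(sd2, snapclone)
```

Repeated calls merge the files oldest first, so the snapclone advances through the same sequence of point-in-time states that the snapdifference files recorded.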
Described herein are file structures referred to as snapdifference files, and exemplary methods for creating and using snapdifference files. In one exemplary implementation snapdifference files may be implemented in conjunction with snapclones in remote copy operations. A snapdifference file may be created and activated contemporaneous with the generation of a snapclone. I/O operations that change the data in the source volume associated with the snapclone are recorded in the active snapdifference file. The active snapdifference file may be closed at a specific point in time or when a specific threshold associated with the snapdifference file is satisfied. Another snapdifference file may be activated contemporaneous with closing an existing snapdifference file, and the snapdifference files may be linked using pointers that indicate the temporal relationship between the snapdifference files. After a snapdifference file has been closed, the file may be merged into the snapclone with which it is associated.
Data Recovery Operations
In exemplary implementations, snapdifference files may be used for implementing failure recovery procedures in storage networks and/or storage devices. One such implementation is illustrated with reference to
The operations of
Accordingly, blocks of the flowchart illustrations support combinations of means for performing the specified functions and combinations of steps for performing the specified functions. It will also be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
The data architecture may further include one or more application transaction recovery logs 1440 that record I/O operations. An application transaction recovery log is an artifact of the application and is commonly referred to as a transaction log. This is a secondary backup mechanism provided by the application, during the period when a new snapdifference file is collecting data, but has not yet been validated. This log allows us to ultimately recover to the latest validated recovery set, plus data received by the application since that time, until the corruption was introduced, if the cause of the corruption can be understood and isolated. Application transaction recovery log 1440 need not be directly associated with either source volume 1410, snapclone, or snapdifference files 1430, 1432, 1434. Application transaction recovery log 1440 may be stored in an entirely separate LUN, and may log I/O operations from multiple disparate host computers and/or source applications. The coherency of data in the application transaction recovery log 1440 is managed by the application.
Referring to
As described above, snapdifference files may be activated at predetermined points in time, and may be deactivated at predetermined points in time or in accordance with one or more other thresholds such as, e.g., maximum file size. Snapdifference files are implemented in a sequence, and when one snapdifference file is deactivated a subsequent snapdifference file is contemporaneously activated and configured to receive I/O operations. One or more background processes may be executed by a processor such as, e.g., the array controller, to: (1) validate the data in the snapdifference file; and (2) merge the snapdifference file into the snapclone. Hence, each snapdifference file represents the mirrored data set at the point in time at which the snapdifference file was deactivated. Data validation will be application specific and an artifact of the application.
At operation 1535 a recovery process is initiated using the most current validated snapdifference file. An exemplary recovery process replaces the data in source volume 1410 with a copy of the most current snapdifference file. This may be implemented by making the source volume an active snapclone of the most current snapdifference file and copying the data set represented by the snapdifference tree (the split mirror/snapclone along with the selected and older snapdifference files) into the source volume 1410 using standard snapclone mechanism of background normalization in conjunction with on demand unsharing.
Execution of the snapclone operation restores the data in source volume 1410 to a data set that is validated as of the point in time at which the snapdifference file 1432 was validated. Data from the application transaction recovery log 1440 may be retrieved to restore the data in source volume 1410 to a data set that reflects I/O operations executed in the time period between validation of snapdifference file 1432 and the failure event. Data retrieved from the application transaction recovery log 1440 may be validated before being written to the source volume 1410. This use of the application transaction recovery log is driven by the application and is application specific.
The process of copying the data set into the source volume 1410 consumes time in proportion to the size of the data set. Advantageously, applications which generate I/O operations to the source volume 1410 may be restarted while the data set is being copied to the source volume 1410. The copy operations may be executed in a background process. Similarly, the operations associated with retrieving data from the application transaction recovery log 1440 may be executed in a background process. Accordingly, the operational downtime of the source volume may be minimized, or at least reduced, when compared to conventional recovery processes.
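The interleaving of the background copy with restarted application I/O may be sketched as follows. The sketch is illustrative only: it models the refreshed source volume and the newly activated snapdifference file as dictionaries, copies one block per step, and treats a read of a not-yet-restored block as an on-demand copy. The class name `Recovery` and its methods are hypothetical.

```python
class Recovery:
    """State of a source volume while a validated data set is copied back."""

    def __init__(self, restore_set):
        self.pending = dict(restore_set)   # blocks still to be copied in the background
        self.source = {}                   # refreshed (emptied) source volume
        self.snapdiff = {}                 # new snapdifference file activated for recovery

    def copy_step(self):
        """Background process: copy one pending block; return False when finished."""
        if not self.pending:
            return False
        block, data = self.pending.popitem()
        self.source[block] = data
        return True

    def write(self, block, data):
        """Application write received after restart, while the copy is in progress."""
        self.pending.pop(block, None)      # newer data wins over the stale restore copy
        self.source[block] = data
        self.snapdiff[block] = data        # recorded in the new snapdifference file

    def read(self, block):
        """Application read; restores the block on demand if it is still pending."""
        if block in self.pending:
            self.source[block] = self.pending.pop(block)
        return self.source.get(block)


recovery = Recovery({0: "v0", 1: "v1", 2: "v2"})
recovery.write(1, "new1")          # application restarted before the copy completes
early_read = recovery.read(2)      # served by copying block 2 immediately
while recovery.copy_step():        # background copy drains the remaining blocks
    pass
```

In this model the applications observe a consistent volume from the moment they are restarted, while the duration of the copy affects only background work.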
Although the described arrangements and procedures have been described in language specific to structural features and/or methodological operations, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or operations described. Rather, the specific features and operations are disclosed as preferred forms of implementing the claimed present subject matter.
Nelson, Lee, Daniels, Rodger, Dallmann, Andrew
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Nov 01 2004 | DANIELS, RODGER | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 015955 | /0265 | |
Nov 01 2004 | NELSON, LEE | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 015955 | /0265 | |
Nov 01 2004 | DALLMANN, ANDREW | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 015955 | /0265 | |
Nov 02 2004 | Hewlett-Packard Development Company, L.P. | (assignment on the face of the patent) | / | |||
Oct 27 2015 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | Hewlett Packard Enterprise Development LP | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 037079 | /0001 |