Embodiments may include a consistency measurement component that utilizes memory-efficient sets (e.g., Bloom filters) to generate consistency metrics for read operations performed on different replicated data objects of a distributed storage system. Based on the consistency metrics, the consistency measurement component may identify a subset of replicated data objects associated with low levels of consistency. The consistency measurement component may target this subset for consistency improvement by generating instructions to improve the consistency of the subset. In other cases, the consistency measurement component may notify a consistency improvement component about the targeted subset. In response, the consistency improvement component may generate instructions to improve the consistency of the targeted subset.
10. A computer-readable storage medium, storing program instructions computer-executable on a computer system to:
utilize multiple memory-efficient sets to generate consistency metrics for different replicated data objects stored within a distributed storage system, wherein the consistency metrics are based on a measure of inconsistent read operations performed on the replicated data objects; wherein an inconsistent read operation includes retrieving from a replicated data object, a value that is older than the most recent write operation performed on that replicated data object;
identify a subset of the replicated data objects having consistency metrics that indicate a lower level of consistency than the consistency metrics of other replicated data objects; and
for each given replicated data object of said subset, perform one or more operations to improve the data consistency of that replicated data object.
19. A system, comprising:
a memory; and
one or more processors coupled to the memory, wherein the memory comprises program instructions executable by the one or more processors to:
utilize multiple memory-efficient sets to generate consistency metrics for different replicated data objects stored within a distributed storage system, wherein the consistency metrics are based on a measure of inconsistent read operations performed on the replicated data objects; wherein an inconsistent read operation includes retrieving from a replicated data object, a value that is older than the most recent write operation performed on that replicated data object;
identify a subset of the replicated data objects having consistency metrics that indicate a lower level of consistency than the consistency metrics of other replicated data objects; and
provide an indication of the replicated data objects within said subset to a consistency improvement component configured to improve consistency of replicated data objects.
1. A computer-implemented method for targeted consistency improvement within a distributed storage system, the method comprising:
utilizing multiple memory-efficient sets to generate consistency metrics for different replicated data objects stored within the distributed storage system, wherein the consistency metrics are based on a measure of inconsistent read operations performed on the replicated data objects;
wherein an inconsistent read operation includes retrieving from a replicated data object, a value that is older than the most recent write operation performed on that replicated data object;
identifying a subset of the replicated data objects having consistency metrics that indicate a lower level of consistency than the consistency metrics of other replicated data objects; and
for each given replicated data object of said subset:
identifying multiple individual data objects that represent the given replicated data object within the distributed data store;
identifying the individual data object storing a most recent value relative to values of the other individual data objects; and
writing the most recent value to each of the other individual data objects.
2. The computer-implemented method of
3. The computer-implemented method of
for each given read operation of multiple read operations performed on the given replicated data object: comparing a last-modified time at which a value retrieved during that read operation was last modified to a time period assigned to a memory-efficient set that includes the key of the given replicated data object;
if the last modified time is older than the most recent time that a write operation was performed, determining that the value retrieved is inconsistent with a most recent value written to the given replicated data object; and
generating the consistency metric for the given replicated data object based on a quantity of instances in which retrieved values are determined to be inconsistent.
4. The computer-implemented method of
5. The computer-implemented method of
6. The computer-implemented method of
7. The computer-implemented method of
8. The computer-implemented method of
9. The computer-implemented method of
11. The computer-readable storage medium of
12. The computer-readable storage medium of
for each given read operation of multiple read operations performed on the given replicated data object: compare a last-modified time at which a value retrieved during that read operation was last modified to a time period assigned to a memory-efficient set that includes the key of the given replicated data object;
if the last modified time is older than the most recent time that a write operation was performed, determine that the value retrieved is inconsistent with a most recent value written to the given replicated data object; and
generate the consistency metric for the given replicated data object based on a quantity of instances in which retrieved values are determined to be inconsistent.
13. The computer-readable storage medium of
14. The computer-readable storage medium of
15. The computer-readable storage medium of
16. The computer-readable storage medium of
17. The computer-readable storage medium of
18. The computer-readable storage medium of
20. The system of
21. The system of
for each given read operation of multiple read operations performed on the given replicated data object: compare a last-modified time at which a value retrieved during that read operation was last modified to a time period assigned to a memory-efficient set that includes the key of the given replicated data object;
if the last modified time is older than the most recent time that a write operation was performed, determine that the value retrieved is inconsistent with a most recent value written to the given replicated data object; and
generate the consistency metric for the given replicated data object based on a quantity of instances in which retrieved values are determined to be inconsistent.
22. The system of
24. The system of
25. The system of
26. The system of
27. The system of
Distributed data stores often utilize data redundancy to provide reliability. For instance, multiple instances of the same data object may be stored on different storage devices. If a storage device failure causes the loss of one instance of the data object, other instances of the data object may be stored safely on storage devices that have not failed. In some cases, distributed data stores may provide data resilience by storing multiple instances of the same data object in geographically diverse locations. For instance, data objects may be replicated across different data centers in different regions. In these cases, data objects at one data center should be isolated from a disaster at another data center and vice versa.
Various properties are used to characterize distributed systems, such as the availability of data from the distributed system and the consistency of data within the distributed system. Typically, consistency and availability within a distributed data store are subject to the constraints of the CAP theorem (also known as Brewer's theorem), which generally states that distributed systems can provide one but not both of availability guarantees and consistency guarantees in the presence of network partitioning. With respect to availability and consistency, architects of distributed systems typically choose to prioritize one over the other. A high level of availability can be provided at the expense of full consistency; a high level of consistency can be provided at the expense of complete availability.
While the system and method for targeted consistency improvement in a distributed storage system is described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the system and method for targeted consistency improvement in a distributed storage system is not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit the system and method for targeted consistency improvement in a distributed storage system to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the system and method for targeted consistency improvement in a distributed storage system as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.
Various embodiments of a system and method for targeted consistency improvement in a distributed storage system are described.
Each storage node may include one or more storage devices, such as storage devices 114a-n. Note that while storage devices of a given storage node are illustrated with common numerals, these storage devices need not be the same. In various embodiments, storage devices 114a-n may include hard disk drive devices, such as Small Computer System Interface (SCSI) devices or AT Attachment Packet Interface (ATAPI) devices (which may also be known as Integrated Drive Electronics (IDE) devices).
However, storage devices 114a-n may encompass any type of mass storage device including magnetic- or optical-medium-based devices, solid-state mass storage devices (e.g., nonvolatile- or “Flash”-memory-based devices), magnetic tape, etc. The storage devices may store data objects input into the distributed storage system 106. These data objects may be replicated across the storage devices of different storage nodes, such as to provide data reliability through redundancy.
Client systems may access distributed storage system 106 through storage system interface 108, which may be implemented on one or more computer systems. As illustrated, the storage system interface may process inbound requests, which may be received from one or more clients of the distributed storage system 106. In various embodiments, these clients may be remote computers that rely on distributed storage system 106 as an offsite storage solution. Examples of inbound requests include requests to write data objects to the distributed storage system 106, delete data objects within the distributed storage system 106, and read data objects from the distributed storage system 106. Distributed storage system 106 may perform various operations in response to the inbound requests. For example, in response to a request to write data, distributed storage system 106 may perform a write operation that generates or updates a replicated data object that represents the data sent with the request. In general, a replicated data object may collectively refer to multiple instances of the same data object stored within the distributed storage system 106. In another example, in response to a delete request, distributed storage system 106 may perform a delete operation on a replicated data object within the distributed storage system 106. In some cases, this delete operation may be a special case of the write operation described above. For instance, deleting a replicated data object may include using a write operation to write a certain value to the replicated data object or to set a delete flag associated with that object. In general, it should be understood that any reference herein to a write operation (or a “WRITE”) may include delete operations, modification operations, and/or any type of mutation operation that can affect the data of replicated data objects. Outbound data may include responses to inbound requests, such as data objects read and/or request acknowledgements.
For example, in response to a read request, distributed storage system 106 may locate the respective data object, retrieve a value from that data object, and return the value to the requestor. In various embodiments, inbound requests and/or outbound data may be communicated over one or more electronic networks configured to transport data, an example of which is described below with respect to network 1485 of
In various embodiments, the distributed storage system may include one or more cache layers 109, which may cache (e.g., store) instances of data objects from lower levels of the distributed storage system (e.g., storage nodes 110a-n). In various embodiments, one or more of these caching layers may be utilized to increase performance of the distributed storage system. In some embodiments, cache layers may be implemented on one or more nodes, each node being implemented as a computer (e.g., similar to computer system 1400 of
In various embodiments, each storage node may maintain a respective transaction (“TXN”) log 112a-n. The transaction log of a given storage node may include a record of each operation (e.g., read operations and write operations) performed by the given storage node. As described in more detail below with respect to subsequent Figures, these logs may also include other information, such as timestamps indicating when certain operations were performed.
It should be noted that distributed storage system 106 of
Various embodiments may include a consistency measurement component 100, which may be configured to measure the consistency of replicated data objects stored in distributed storage system 106. Consistency measurement component 100 may be configured to collect log files from the various storage nodes of distributed storage system 106, as illustrated by the “log collection” of
While embodiments are largely described herein within the context of a system that processes batches of logs collected from the distributed storage system, embodiments are not limited to operating in this manner. For instance, in some embodiments, logs may be collected and analyzed by consistency measurement component 100 in real-time or near real-time. In yet other cases, the information described as being collected in the form of logs (e.g., information specifying the operations performed in the distributed storage system) may instead be streamed directly from the distributed storage system to the consistency measurement component. In some embodiments, this streamed data may be evaluated in real-time or near real-time to determine inconsistencies. It should be understood that embodiments are not limited to any of these techniques. In various embodiments, any technique for determining the operations performed in the distributed storage system may be utilized.
In various embodiments, replicated data objects may be identified within the distributed storage system 106 using a key. Generally speaking, keys may include alphanumeric strings or other types of symbols that may be interpreted within the context of the namespace of the distributed storage system 106 as a whole, although keys may be interpreted in different ways. Generally speaking, a key may persist as a valid identifier through which a client may access a corresponding replicated data object as long as that object exists within the distributed storage system 106.
In various embodiments, the logs collected by consistency measurement component 100 may be sampled. These sampled log records are illustrated as sampled log records 116, which may be stored in memory of host system 102. In various embodiments, sampling the log records may reduce the computing resources and/or time needed to generate consistency metrics 104. In various embodiments, a requisite number of log records may be sampled in order to provide statistically significant values for consistency metrics 104. Techniques for sampling are described in more detail below with respect to
Consistency measurement component 100 may utilize memory-efficient sets 118 to store temporal indications that specify when different write operations were performed. In some embodiments, the memory efficiency of a memory-efficient set may stem from the set having limited removal characteristics and probabilistic guarantees of correctness. One example of a memory-efficient set is a Bloom filter. Generally speaking, Bloom filters are data structures that are configured to specify which elements are currently members of a set (with some chance of a false positive, described below). For example, given a key (or other identifier), the Bloom filter may be tested or probed to determine whether that key has already been inserted into the Bloom filter. In this example, the members of the Bloom filter's set include the keys that have already been inserted into the Bloom filter. Typically, once a key is inserted into the Bloom filter, it cannot be removed (although some embodiments may utilize sets that allow key removal). These types of Bloom filters may be utilized to answer a query that asks “is key k1 a member of the set of keys that have been inserted?” but not a query that asks “what are the values of the keys that have been inserted?”. Examples of Bloom filters are described in more detail below. In general, the memory-efficient sets described herein may be configured to provide an indication of which keys have been inserted into that set.
In various embodiments, the memory-efficient sets described herein may be any data structure that tracks the constituency of a set while displacing a data footprint that is less than the data footprint that would be required to store the actual values of the constituents of that set. This may be achieved through a variety of techniques (e.g., hashing into a bit array), examples of which are described with respect to subsequent Figures.
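In one non-limiting illustration, a memory-efficient set of the kind described above may be sketched as a small Bloom filter. The class name `BloomFilter`, the method names, the default sizes, and the use of seeded SHA-256 digests as hash functions are all assumptions made for this sketch and are not taken from the embodiments described herein.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; names and parameters are hypothetical."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Fixed-size bit array: one bit per position, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, key):
        # Set the bit at every position the key hashes to.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly inserted"; False means "definitely not inserted".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

In this sketch, the filter can answer whether a key such as k1 may have been inserted, but it holds only a fixed-size bit array and cannot enumerate the inserted keys.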
As described in more detail below, consistency measurement component 100 may utilize memory-efficient sets 118 to categorize different write operations into different time periods. For instance, consistency measurement component 100 may insert write operations specified by sampled log records 116 into different sets assigned to different time periods. Consistency measurement component 100 may also compare read operations from sampled log records to the sets in order to generate consistency metrics 104. For a given set of read operations from sampled log records 116, consistency metrics 104 may indicate a measurement or estimation of how consistent those read operations were. Generally, for each individual read operation that is sampled, consistency measurement component 100 may determine whether that read operation was consistent. For a given group of read operations, a consistency metric may depend on the quantity of those read operations that resulted in an inconsistent read. Situations in which inconsistent reads may occur are described in more detail below with respect to
In various embodiments, for each replicated data object, distributed storage system 106 may maintain a corresponding key that maps to each stored instance of that data object. In the illustrated embodiment, replicated data object 200 includes three stored instances; key k1 maps to each of the stored instances illustrated as objects 200a-c. In various embodiments, a replicated data object's key may be the primary identifier utilized to identify and/or locate all stored instances of the same replicated data object. In some embodiments, distributed storage system 106 may utilize a key map that maps keys to the respective memory locations at which objects are stored (although this is not required in all embodiments).
In various embodiments, each object may include a value that may be written to or read from the object. In some embodiments, each data object may also include various metadata. In the illustrated embodiment, an example of such metadata is shown as a time stamp (denoted as “last_mod_TS”). In various embodiments, for a given data object, this time stamp may indicate the last time that the value within that data object was modified. For instance, distributed storage system 106 may update such a time stamp within a data object each time a value is written to that data object (or when any other modification is performed to that data object). In other cases, other types of metadata may be stored within the data objects.
In state 1 of
State 2 of
Eventual consistency may mean that replicated data objects may be guaranteed to reach a consistent state without any strong guarantees as to how much time may be required to reach such state. In some cases, even if a maximum inconsistency window can be maintained, the data and results may still be potentially inconsistent when within the window. In this example, state 3 (described below) represents reaching consistency in a system with eventual consistency semantics.
In state 3 (some time after state 2), distributed storage system 106 has completed performing all write operations for replicated data object 200. In this state, all values are consistent across objects 200a-c. For instance, the value of each object is v2 in state 3. Similarly, each last-modified timestamp has been updated to t5 at state 3. In various embodiments, consistency measurement component 100 described herein may be configured to detect and/or quantify the types of inconsistent reads described above with respect to state 2. This process is described in more detail below.
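The progression from an inconsistent state to a consistent state may be illustrated with the following minimal sketch. It assumes one possible convergence mechanism, in which the stored instance holding the most recent value is identified and that value is written to the other instances. The names `StoredInstance` and `repair`, and the earlier value v1 with a timestamp of 1, are hypothetical and chosen only for illustration; the value v2 and timestamp t5 follow the example above.

```python
from dataclasses import dataclass


@dataclass
class StoredInstance:
    value: str
    last_mod_ts: int  # plays the role of last_mod_TS


def repair(instances):
    """Copy the most recently modified value to every other instance."""
    newest = max(instances, key=lambda inst: inst.last_mod_ts)
    for inst in instances:
        if inst is not newest:
            inst.value = newest.value
            inst.last_mod_ts = newest.last_mod_ts
    return newest


# Assumed intermediate state: the write of v2 at t5 has reached one instance only.
replicas = [
    StoredInstance("v2", 5),
    StoredInstance("v1", 1),
    StoredInstance("v1", 1),
]

# A READ served by the second instance returns a value older than the latest WRITE.
stale_read = replicas[1].value

# After repair, every instance holds v2 with timestamp 5 (the consistent state).
repair(replicas)
```

Other mechanisms for reaching consistency may equally be used; the sketch only shows why a READ performed before convergence can return a stale value.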
Note that the method or mechanism by which consistency is reached may vary in different embodiments. While
By sampling log records, the consistency measurement component 100 may reduce the quantity of computing resources (e.g., processor cycles and/or memory utilization) required to measure the consistency of reads performed in a distributed storage system. At block 302, the method may begin with consistency measurement component 100 selecting a group of keys for which log records are to be sampled. As described above, a given key may identify a replicated data object stored within the distributed storage system 106. One example of such a key is described above with respect to k1 of
In some cases, embodiments may utilize composite keys, such as keys including prefixes or suffixes in addition to other information. For instance, in some embodiments, each key of a group of keys associated with a particular customer entity may include a prefix portion and an identifier portion. In this case, the prefix may be the same for each key of the group (e.g., the prefix may be common across all keys associated with the particular customer entity) while each key's identifier portion may be unique. In some embodiments, the identifier portion of each key may be globally unique. In other cases, the identifier portion may be unique only within the domain of keys sharing a common prefix.
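As a brief illustration of composite keys, the following sketch builds keys from a shared prefix and a per-object identifier and then selects the keys belonging to one prefix. The function names, the "/" separator, and the example prefixes are assumptions made for this sketch only.

```python
def make_key(prefix, identifier, sep="/"):
    """Build a composite key from a shared prefix and an identifier portion."""
    return f"{prefix}{sep}{identifier}"


def keys_for_prefix(keys, prefix, sep="/"):
    """Select the keys whose prefix portion matches the given prefix."""
    return [k for k in keys if k.startswith(prefix + sep)]


all_keys = [
    make_key("customerA", "0001"),
    make_key("customerA", "0002"),
    make_key("customerB", "0001"),  # identifier unique only within its prefix
]
selected = keys_for_prefix(all_keys, "customerA")
```

Here the identifier "0001" appears under two prefixes, which corresponds to the case in which identifier portions are unique only within the domain of keys sharing a common prefix.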
At block 304, consistency measurement component 100 may extract from the collected logs, records of read operations (which may be referred to herein as “READs”) and write operations (which may be referred to herein as “WRITEs”) performed on replicated data objects identified by the selected keys. Examples of the records extracted are described in more detail below with respect to
In various other embodiments, techniques other than those described with respect to
For log records pertaining to READs, such as the third row of sampled log file 116, the transaction timestamp “TXN_TS” represents the time that the READ was performed. Log records for READs also indicate the key of the replicated data object from which a value is retrieved. In the third row of the sampled log file, this key is k1. For READs, log records also indicate a last-modified timestamp (denoted as “last_mod_TS”). The last-modified timestamp may specify, for the specific value retrieved during the READ, the time at which that value was last modified.
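The record fields described above may be pictured with the following minimal sketch, which also shows extraction of the records that pertain to a selected group of keys. The type name `LogRecord`, the field names, and the function `extract_records` are hypothetical; only the notions of an operation type, a key, a transaction timestamp, and a last-modified timestamp for READs come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    op: str                            # "READ" or "WRITE"
    key: str                           # key of the replicated data object
    txn_ts: int                        # TXN_TS: when the operation was performed
    last_mod_ts: Optional[int] = None  # last_mod_TS: present for READs only


def extract_records(records, selected_keys):
    """Keep READ and WRITE records whose key is in the selected group."""
    return [
        r for r in records
        if r.op in ("READ", "WRITE") and r.key in selected_keys
    ]


log = [
    LogRecord("WRITE", "k1", txn_ts=1),
    LogRecord("WRITE", "k9", txn_ts=2),
    LogRecord("READ", "k1", txn_ts=3, last_mod_ts=1),
]
sampled = extract_records(log, {"k1"})
```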
In various embodiments, in order to measure consistency, consistency measurement component 100 may also utilize various data structures, such as memory-efficient sets 118 and consistency metric(s) 104 described above.
As illustrated at block 504, the method may also include initializing a consistency metric. In general, the consistency metric may be any metric that is dependent upon the quantity of inconsistent reads detected within distributed storage system 106 for a given period of time (e.g., TW) and a given group of replicated data objects (e.g., the group of objects corresponding to the keys selected at block 302).
If at block 804 consistency measurement component 100 determines that the record is a READ record, consistency measurement component 100 may proceed to block 810. At block 810, consistency measurement component 100 determines whether the key of the replicated data object targeted by the READ has already been inserted into one or more of the memory-efficient sets. In various embodiments, this may include testing or probing each of the memory-efficient sets, a process that is described in more detail below. In various embodiments, due to the nature of the memory-efficient sets, it may be determined that a key has been inserted into a set without that set actually storing the value of the key. Examples of these types of memory-efficient sets are described in more detail below with respect to
As indicated by the negative output of block 810, if that key is not present in the window of memory-efficient sets, consistency measurement component 100 may perform block 808 (described below). As can be seen from the illustrated flowchart, block 810 effectively ensures that the consistency metric is based on READ-after-WRITE operations. For instance, if the READ pertains to a particular key, this portion of the method serves as a test to ensure that the key is present within the time window of the memory-efficient sets, which indicates that the replicated data object identified by that key has already been written to. If the key is not present within the window of multiple memory-efficient sets, consistency measurement component 100 proceeds to block 808, as illustrated by the negative output of block 810.
In various embodiments, if the distributed storage system is configured to provide bounded inconsistency, the window of memory-efficient sets may be sized to have a temporal width that ensures all (or nearly all) inconsistent read operations are detected. For example, if bounded inconsistency is characterized by a temporal width of x (e.g., all write operations are guaranteed to reach consistency within x quantity of time), sizing the window of memory-efficient sets to have a temporal width (e.g., (n) Tf of
If the key in question has been inserted into at least one of the memory-efficient sets, consistency measurement component 100 may proceed to block 812 as illustrated by the positive output of block 810. In this portion of the method, consistency measurement component 100 may identify, from a candidate group of the one or more memory-efficient sets in which the key has been inserted, the memory-efficient set assigned to a time period that is most recent relative to the other memory-efficient sets (if any). At block 814, consistency measurement component 100 determines whether the last-modified timestamp indicated by the read record specifies a time that is older than the time period to which the identified memory-efficient set is assigned. In various embodiments, this portion of the method may serve as a test to determine whether the value read from a replicated data object is older than the most recent WRITE performed on that replicated data object. If the last-modified timestamp is older than the time period of the most recent set that includes the key (i.e., if the value read during the READ is older than the most recent WRITE), consistency measurement component 100 may determine that the READ being evaluated is an inconsistent read. In these cases, consistency measurement component 100 may increment the numerator of the consistency metric, as illustrated at block 816. As described above with respect to
In cases where it is determined that the last modified timestamp is not older than the time period of the most recent set that includes the key (i.e., in cases where the value read during the READ has not been determined to be older than the most recent WRITE), the method may proceed from block 814 to block 818. In this case, the denominator of the consistency metric is incremented to account for a READ-after-WRITE as described above. However, the numerator of the consistency metric is not incremented because the READ was not determined to be inconsistent in this case.
After block 818, consistency measurement component 100 may proceed to block 808 to determine whether there are more sampled records to evaluate. If not, the method may end. If there are more records to evaluate, the method loops back to block 802 and is repeated for the next record.
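The record-evaluation loop described above may be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: ordinary Python sets stand in for the memory-efficient sets, records are plain tuples of (operation, key, transaction timestamp, last-modified timestamp), each time period i covers times greater than i·Tf and up to (i+1)·Tf, a timestamp equal to the start of a period is treated as older than that period, and the denominator is assumed to count every READ-after-WRITE whether or not it is inconsistent. The function names are hypothetical.

```python
import math


def period_index(ts, tf):
    """Index i of the period (i*tf, (i+1)*tf] containing ts (assumes ts > 0)."""
    return math.ceil(ts / tf) - 1


def measure(records, tf):
    """Return (inconsistent_reads, read_after_write_reads) for the records."""
    sets_by_period = {}  # period index -> set of keys written in that period
    numerator = 0        # READs determined to be inconsistent
    denominator = 0      # all READ-after-WRITE operations evaluated

    # Evaluate records from oldest to newest transaction timestamp.
    for op, key, txn_ts, last_mod_ts in sorted(records, key=lambda r: r[2]):
        if op == "WRITE":
            # Insert the key into the set assigned to the WRITE's time period.
            sets_by_period.setdefault(period_index(txn_ts, tf), set()).add(key)
            continue

        # READ: find every set in which the key has been inserted.
        candidates = [i for i, keys in sets_by_period.items() if key in keys]
        if not candidates:
            continue  # no prior WRITE in the window; not a READ-after-WRITE

        # Compare against the most recent period that includes the key.
        latest_start = max(candidates) * tf
        if last_mod_ts <= latest_start:
            numerator += 1  # value read is older than the most recent WRITE
        denominator += 1

    return numerator, denominator


records = [
    ("WRITE", "k1", 5, None),
    ("READ", "k1", 7, 5),    # consistent: value is from the latest period
    ("WRITE", "k1", 15, None),
    ("READ", "k1", 17, 5),   # inconsistent: value predates the latest period
    ("READ", "k2", 18, 3),   # ignored: k2 was never written in the window
]
inconsistent, total = measure(records, tf=10)
```

Because the comparison is made against the start of a time period rather than the exact time of the WRITE, the granularity of detection in this sketch is the period width Tf.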
At 902, consistency measurement component 100 evaluates the oldest record of log file 116 that has not yet been evaluated (e.g., according to the techniques of 802 and 804). In the illustrated embodiment, this record is represented in the first row of the log file. At 904, after consistency measurement component 100 determines that the record is a WRITE record, consistency measurement component 100 inserts the key of that record into the particular memory-efficient set that is assigned to the time period inclusive of the time indicated by the WRITE's transaction timestamp. As illustrated, this time is t1 which occurs between 0 and TF; accordingly, consistency measurement component 100 inserts the key k1 into memory-efficient set 118a, which is assigned to the time period 0<t≦TF in this example. At this point, consistency measurement component 100 determines that there are additional records to evaluate (e.g., similar to 808 described above). Consistency measurement component 100 then evaluates the next log record, as shown in
In
In
In
In
As described above, log records may be sampled from a larger population of log records. In various embodiments, consistency measurement component 100 may be configured to generate consistency metrics expressed in terms of a confidence interval. In one non-limiting example, the population of logs may be considered to have a standard normal distribution. Based on this assumption, consistency measurement component 100 may generate consistency metrics expressed in terms of a confidence interval with some degree of certainty (e.g., 95% certainty).
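One way such a confidence interval might be computed is sketched below. The sketch assumes the common normal-approximation interval for a proportion, applied to the fraction of sampled READ-after-WRITE operations found to be inconsistent; the choice of this particular interval and the function name `inconsistency_interval` are assumptions, not details taken from the embodiments.

```python
import math
from statistics import NormalDist


def inconsistency_interval(inconsistent, total, confidence=0.95):
    """Normal-approximation confidence interval for the inconsistent-read rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = inconsistent / total
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because a rate cannot fall outside that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = inconsistency_interval(inconsistent=50, total=1000)
```

For small samples or rates very close to zero, this approximation is known to be loose, so a larger sample or a different interval may be preferable in practice.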
As described above, a window of memory-efficient sets having a temporal width may be utilized to identify inconsistent read operations performed in the distributed storage system. In some embodiments, this window of memory-efficient sets may be utilized to ensure that all (or nearly all) inconsistent read operations are identified in the distributed storage system. For instance, embodiments may utilize a distributed storage system having bounded inconsistency. In these cases, neither full availability nor full consistency is guaranteed (e.g., a compromise is struck between full availability and full consistency). However, embodiments may ensure that all (or nearly all) inconsistent read operations are detected by sizing the window of memory-efficient sets to be larger than the temporal width of the distributed storage system's bounded inconsistency. In one non-limiting example, bounded inconsistency for the distributed storage system is one hour. (This may mean that, for a given write operation performed on a given replicated data object, the system can ensure that all instances of that replicated data object will be consistent within one hour from the initiation of that write operation.) In this example, if the window of memory-efficient sets is structured such that the temporal width of the window is greater than one hour (e.g., a temporal width of two hours), embodiments may ensure that all (or nearly all) inconsistent read operations will be detected.
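The sizing rule above reduces to simple arithmetic, sketched here under the assumption that every memory-efficient set covers a period of equal width Tf and that times are expressed in whole seconds. The function name `sets_needed` is hypothetical.

```python
def sets_needed(bound_seconds, tf_seconds):
    """Smallest number of sets whose combined width exceeds the bound."""
    # floor(bound / tf) sets cover at most the bound; one more exceeds it.
    return bound_seconds // tf_seconds + 1


# One-hour bounded inconsistency with 15-minute sets.
n = sets_needed(bound_seconds=3600, tf_seconds=900)
window_width = n * 900
```

With a one-hour bound and 15-minute periods, five sets give a 75-minute window, which is strictly wider than the bound.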
In various instances, the memory-efficient sets described herein may be described as “including” various keys. It should be noted that in various embodiments, the memory-efficient sets described herein may be data structures that do not actually store the values of such keys internally. Instead, in some embodiments, the memory-efficient sets may store an indication of which keys have been inserted into the sets without actually storing the values themselves.
It should be noted that the Bloom filters may be constructed such that, in most instances, different keys will map to different bit array positions. For instance, in
Various properties of the Bloom filters (and other memory-efficient sets described herein) may contribute to their memory efficiency. One such property is the fixed-size nature of the data structure. For instance, bit array 1020 of the Bloom filter may be sized based on the memory constraints of the host system in which it is implemented (e.g., the memory of host system 102). In this way, records of the various keys that have been added to the sets may be generated without risk of exceeding the available memory resources. Another property that contributes to the memory efficiency of the Bloom filters is that the data footprint of a bit array may be much smaller than the data footprint of an array of multiple-bit key values.
These memory-conserving properties also introduce certain tradeoffs into some embodiments, namely the chance of a “false positive” when querying a set for the presence of a key.
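The properties discussed above (a fixed-size bit array, no stored key values, and the possibility of false positives) may be illustrated with the following minimal sketch of a Bloom filter. The class name, the default sizes, and the use of SHA-256 to derive bit positions are assumptions made for illustration only; the embodiments are not limited to any particular hash function.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array that records which keys were added, not the keys."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # size never grows

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # True means "possibly added"; False means "definitely not added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bloom = BloomFilter()
bloom.add("object-123")
print("object-123" in bloom)  # an added key is always reported as present

# A deliberately undersized filter saturates quickly, so queries for keys
# that were never added become increasingly likely to return True.
crowded = BloomFilter(num_bits=8, num_hashes=2)
for n in range(100):
    crowded.add(f"key-{n}")
print("never-added" in crowded)
```

Because a query can only err in the "present" direction, a key that was added is never missed; the false positive rate depends on the size of the bit array relative to the number of keys inserted.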
In various embodiments, consistency measurement component 100 described herein may utilize any of the techniques described above with respect to
As time progresses during operation of consistency measurement component 100 and the distributed storage system, additional memory-efficient sets (e.g., Bloom filters) may be needed to store keys corresponding to additional WRITEs performed. In some cases, this may include generating additional memory-efficient sets within memory if available memory space permits. In other cases, the memory-efficient sets associated with the oldest time periods may be discarded (e.g., deleted) to free up memory space in which new sets may be generated. In yet other cases, memory-efficient sets associated with the oldest time periods may be cleared of any keys and reassigned to a more recent time period. In some cases, this may serve to create a sliding window of sets that changes over time as new log data is processed. As described above, ensuring that the window of memory-efficient sets has a temporal width greater than the bounded inconsistency of the distributed storage system may ensure that the consistency measurement component detects all (or nearly all) inconsistencies within the system. As such, the embodiments described above that generate new memory-efficient sets and/or reassign older memory-efficient sets to new time periods may also ensure that the total temporal width of the active window of memory-efficient sets remains larger than the bounded inconsistency of the distributed storage system.
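One way the sliding window described above might be realized is sketched below: new sets are generated while capacity permits, and thereafter the set for the oldest time period is cleared and reassigned to the newest period. The class names, the mapping of timestamps to period indices, and the hash construction are hypothetical choices made for this sketch.

```python
import hashlib


class TinyBloom:
    """Small Bloom filter used as the memory-efficient set for one period."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def clear(self):
        self.bits = bytearray(len(self.bits))


class SlidingWindow:
    """Holds at most num_sets sets, each covering period_s seconds."""

    def __init__(self, num_sets, period_s):
        self.num_sets = num_sets
        self.period_s = period_s
        self.slots = {}  # period index -> TinyBloom

    @property
    def temporal_width_s(self):
        return self.num_sets * self.period_s

    def record_write(self, key, timestamp_s):
        """Record a WRITE key; returns False if it predates the window."""
        period = int(timestamp_s // self.period_s)
        if period not in self.slots:
            if len(self.slots) < self.num_sets:
                self.slots[period] = TinyBloom()  # memory still permits
            else:
                oldest = min(self.slots)
                if period < oldest:
                    return False  # older than every active set; ignored
                recycled = self.slots.pop(oldest)
                recycled.clear()  # clear keys, then reassign to new period
                self.slots[period] = recycled
        self.slots[period].add(key)
        return True

    def written_in_period(self, key, timestamp_s):
        """True if key was possibly written in the period of timestamp_s."""
        period = int(timestamp_s // self.period_s)
        return period in self.slots and key in self.slots[period]


window = SlidingWindow(num_sets=3, period_s=3600)
window.record_write("object-a", 0)
window.record_write("object-b", 3600)
window.record_write("object-c", 7200)
window.record_write("object-d", 10800)  # recycles the set for period 0
print(window.written_in_period("object-a", 10))     # period 0 was evicted
print(window.written_in_period("object-d", 10805))
```

Choosing `num_sets` and `period_s` so that `temporal_width_s` exceeds the bounded inconsistency of the distributed storage system keeps the window wide enough as it slides.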
As illustrated at block 1102, the method may include utilizing multiple memory-efficient sets to generate consistency metrics for different replicated data objects stored in the distributed storage system. In various embodiments, this portion of the method may include any of the techniques described above with respect to the generation of consistency metrics. At 1104, the method may include identifying a subset of the replicated data objects (within the distributed storage system) having a consistency metric lower than that of one or more other replicated data objects. In various embodiments, consistency metrics may be evaluated for different customer entities of the distributed storage system. For instance, in some cases, a common consistency metric may be generated for all of the replicated data objects of a given customer entity. The consistency metrics of different customer entities may be compared. In this example, data objects of the customer entities with the lowest consistency metrics may be targeted for consistency improvement before those of other customers. In other cases, consistency metrics may be segmented across different entities or items. For example, in various embodiments, consistency metrics may be generated for different data centers that include a portion of the distributed storage system, different hardware (e.g., a node, hard drive, or other storage device), or different batches of hardware. In one example, these consistency metrics may be utilized to identify poor consistency performance in a particular data center, a particular hardware device, or a particular batch of hardware devices (e.g., a batch of faulty hard drives).
As illustrated at block 1106, the method may include generating one or more instructions to improve the consistency of the identified subset of replicated data objects. In some cases, this may include generating an instruction that specifies the identified subset of data objects having poor consistency. This instruction may be provided to a consistency improvement component (described below). The consistency improvement component may then perform operations within the distributed storage system to improve the consistency of the identified subset. In other cases, this portion of the method may include providing consistency improvement instructions directly to the distributed storage system.
To improve the consistency of replicated data object 1204, consistency improvement component 1202 may provide consistency improvement instructions to distributed storage system 106. In various embodiments, these instructions may achieve the following, which may be performed by consistency improvement component 1202 and/or nodes of distributed storage system 106. First, the various individual objects of replicated data object 1204 may be identified. These individual objects may be evaluated to determine the age of the values stored within them. For instance, the last-modified time stamp (e.g., as described above with respect to
In various embodiments, consistency improvement techniques other than those described herein may be utilized. For instance, one type of consistency improvement technique implemented by consistency improvement component 1202 may include “anti-entropy” techniques, such as those described in U.S. Pat. No. 7,716,180, which is incorporated herein by reference in its entirety.
In various embodiments, the consistency measurement component 100 may generate reports regarding the consistency of the distributed storage system. Such reports may be helpful in determining the impact of code and configuration changes to the distributed storage system. For instance, if a consistency report includes consistency metrics generated after a configuration change to the distributed storage system, the report may provide useful information as to how the distributed storage system was impacted by the configuration change.
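The per-customer comparison described above can be sketched as follows. The sketch assumes the consistency metric is the fraction of sampled reads that were consistent and that each sampled read has already been classified; the record layout and function names are hypothetical.

```python
from collections import defaultdict


def metrics_by_customer(read_records):
    """Compute a common consistency metric per customer entity.

    read_records: iterable of (customer_id, was_inconsistent) pairs.
    Returns {customer_id: fraction of reads that were consistent}.
    """
    totals = defaultdict(int)
    inconsistent = defaultdict(int)
    for customer_id, was_inconsistent in read_records:
        totals[customer_id] += 1
        if was_inconsistent:
            inconsistent[customer_id] += 1
    return {c: 1.0 - inconsistent[c] / totals[c] for c in totals}


def lowest_consistency(metrics, count):
    """Return the `count` entities with the lowest metrics, worst first."""
    return sorted(metrics, key=lambda c: (metrics[c], c))[:count]


# Hypothetical classified read records for three customer entities.
records = [
    ("customer-a", False), ("customer-a", False),
    ("customer-b", True), ("customer-b", False),
    ("customer-c", True), ("customer-c", True),
]
metrics = metrics_by_customer(records)
print(lowest_consistency(metrics, 2))  # entities to target first
```

The same grouping could be keyed by data center, hardware device, or hardware batch instead of customer entity.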
As illustrated at block 1304, the method may include, based on the generated consistency metrics, generating one or more reports indicating the consistency metrics. In various reports, such consistency metrics may be shown with respect to a period of time. For instance, a time-series plot of consistency metrics could be generated. These types of reports may be useful for monitoring the effects of different system changes over time. In some cases, reports may indicate customer-specific information. For instance, reports may specify different consistency metrics for data objects of different customers. If the data of a particular customer entity is experiencing poor consistency relative to other customer entities, extra resources can be dedicated to improving the consistency of that customer entity's data.
In addition to the reports described above, some embodiments may generate alerts if consistency metrics indicate a level of consistency that is less than a desired level of consistency. For instance, if consistency is exceptionally low for certain replicated data objects, the consistency measurement component may generate one or more alert messages that may be provided to management entities.
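A threshold-based alert of the kind described above might look like the following sketch. It assumes consistency metrics are fractions between 0 and 1 keyed by a replicated data object identifier; the function name, the message format, and the desired level are hypothetical.

```python
def consistency_alerts(metrics, desired_level):
    """Return one alert message per object whose metric is below desired_level.

    metrics: {object_id: consistency metric in the range [0, 1]}.
    """
    return [
        f"ALERT: {object_id} consistency {value:.2%} "
        f"is below desired level {desired_level:.2%}"
        for object_id, value in sorted(metrics.items())
        if value < desired_level
    ]


# Hypothetical metrics for two replicated data objects.
for message in consistency_alerts({"obj-1": 0.999, "obj-2": 0.90}, 0.99):
    print(message)
```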
Example Computer System
Various embodiments of the system and method for targeted consistency improvement in a distributed storage system, as described herein, may be executed on one or more computer systems, which may interact with various other devices. Note that any component, action, or functionality described above with respect to
In various embodiments, computer system 1400 may be a uniprocessor system including one processor 1410, or a multiprocessor system including several processors 1410 (e.g., two, four, eight, or another suitable number). Processors 1410 may be any suitable processor capable of executing instructions. For example, in various embodiments processors 1410 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 1410 may commonly, but not necessarily, implement the same ISA.
System memory 1420 may be configured to store program instructions 1422 and/or data 1432 accessible by processor 1410. In various embodiments, system memory 1420 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions 1422 implementing consistency measurement component 100 are shown stored within memory. Additionally, data 1432 of memory 1420 may store any of the information or data structures described above, such as consistency metrics 104, memory-efficient sets 118, and sampled log records 116. In some embodiments, program instructions and/or data may be received, sent or stored upon different types of computer-accessible media or on similar media separate from system memory 1420 or computer system 1400. While computer system 1400 is described as implementing the functionality of consistency measurement component 100, any of the components or systems illustrated above may be implemented via such a computer system.
In one embodiment, I/O interface 1430 may be configured to coordinate I/O traffic between processor 1410, system memory 1420, and any peripheral devices in the device, including network interface 1440 or other peripheral interfaces, such as input/output devices 1450. In some embodiments, I/O interface 1430 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 1420) into a format suitable for use by another component (e.g., processor 1410). In some embodiments, I/O interface 1430 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 1430 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 1430, such as an interface to system memory 1420, may be incorporated directly into processor 1410.
Network interface 1440 may be configured to allow data to be exchanged between computer system 1400 and other devices (e.g., host system 1200 including consistency improvement component 1202 and distributed storage system 106 including storage nodes 110) attached to a network 1485 or between nodes of computer system 1400. Network 1485 may in various embodiments include one or more networks including but not limited to Local Area Networks (LANs) (e.g., an Ethernet or corporate network), Wide Area Networks (WANs) (e.g., the Internet), wireless data networks, some other electronic data network, or some combination thereof. In various embodiments, network interface 1440 may support communication via wired or wireless general data networks, such as any suitable type of Ethernet network, for example; via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks; via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
Input/output devices 1450 may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or accessing data by one or more computer systems 1400. Multiple input/output devices 1450 may be present in computer system 1400 or may be distributed on various nodes of computer system 1400. In some embodiments, similar input/output devices may be separate from computer system 1400 and may interact with one or more nodes of computer system 1400 through a wired or wireless connection, such as over network interface 1440.
As shown in
Those skilled in the art will appreciate that computer system 1400 is merely illustrative and is not intended to limit the scope of embodiments. In particular, the computer system and devices may include any combination of hardware or software that can perform the indicated functions, including computers, network devices, Internet appliances, PDAs, wireless phones, pagers, etc. Computer system 1400 may also be connected to other devices that are not illustrated, or instead may operate as a stand-alone system. In addition, the functionality provided by the illustrated components may in some embodiments be combined in fewer components or distributed in additional components. Similarly, in some embodiments, the functionality of some of the illustrated components may not be provided and/or other additional functionality may be available.
Those skilled in the art will also appreciate that, while various items are illustrated as being stored in memory or on storage while being used, these items or portions of them may be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other embodiments some or all of the software components may execute in memory on another device and communicate with the illustrated computer system via inter-computer communication. Some or all of the system components or data structures may also be stored (e.g., as instructions or structured data) on a computer-accessible medium or a portable article to be read by an appropriate drive, various examples of which are described above. In some embodiments, instructions stored on a computer-accessible medium separate from computer system 1400 may be transmitted to computer system 1400 via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link. Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Generally speaking, a computer-accessible medium may include a computer-readable storage medium or memory medium such as magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g., SDRAM, DDR, RDRAM, SRAM, etc.), ROM, etc. In some embodiments, a computer-accessible medium may include transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
The methods described herein may be implemented in software, hardware, or a combination thereof, in different embodiments. In addition, the order of the blocks of the methods may be changed, and various elements may be added, reordered, combined, omitted, modified, etc. Various modifications and changes may be made as would be obvious to a person skilled in the art having the benefit of this disclosure. The various embodiments described herein are meant to be illustrative and not limiting. Many variations, modifications, additions, and improvements are possible. Accordingly, plural instances may be provided for components described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of claims that follow. Finally, structures and functionality presented as discrete components in the exemplary configurations may be implemented as a combined structure or component. These and other variations, modifications, additions, and improvements may fall within the scope of embodiments as defined in the claims that follow.
McHugh, Jason G., Henry, Alyssa H., Theriault, Eric Yves, Markle, Seth W., Uhlar, Michael A.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 20 2010 | MCHUGH, JASON G | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025038 | /0104 | |
Sep 20 2010 | THERIAULT, ERIC YVES | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025038 | /0104 | |
Sep 20 2010 | MARKLE, SETH W | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025038 | /0104 | |
Sep 20 2010 | UHLAR, MICHAEL A | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025038 | /0104 | |
Sep 22 2010 | Amazon Technologies, Inc. | (assignment on the face of the patent) | / | |||
Sep 22 2010 | HENRY, ALYSSA H | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025038 | /0104 |
Date | Maintenance Fee Events |
Sep 12 2016 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Nov 02 2020 | REM: Maintenance Fee Reminder Mailed. |
Apr 19 2021 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |