Methods, systems, and computer-readable media for indexing partitions using distributed bloom filters are disclosed. A data indexing system generates a plurality of indices for a plurality of partitions in a distributed object store. The indices comprise a plurality of bloom filters. An individual one of the bloom filters corresponds to one or more fields of an individual one of the partitions. Using the bloom filters, the data indexing system determines a first portion of the partitions that possibly comprise a value and a second portion of the partitions that do not comprise the value. Based (at least in part) on a scan of the first portion of the partitions and not the second portion of the partitions, the data indexing system determines one or more partitions of the first portion of the partitions that comprise the value.
5. A method, comprising:
generating, by a data indexing system, a plurality of indices for a plurality of partitions in a distributed object store, wherein the indices comprise a plurality of probabilistic data structures, and wherein an individual one of the probabilistic data structures corresponds to one or more fields of an individual one of the partitions;
determining, by the data indexing system using the probabilistic data structures, a first portion of the partitions that possibly comprise a value and a second portion of the partitions that do not comprise the value; and
determining, by the data indexing system based at least in part on a scan of the first portion of the partitions and not the second portion of the partitions, one or more partitions of the first portion of the partitions that comprise the value.
13. One or more non-transitory computer-readable storage media storing program instructions that, when executed on or across one or more processors, perform:
generating a plurality of indices for a plurality of partitions in a distributed set of object stores, wherein the indices comprise a plurality of bloom filters, and wherein an individual one of the bloom filters corresponds to one or more fields of an individual one of the partitions;
determining, using the bloom filters, a first portion of the partitions that possibly comprise a value and a second portion of the partitions that do not comprise the value;
performing a scan of the first portion of the partitions and not the second portion of the partitions; and
determining, based at least in part on the scan, one or more records that comprise the value in one or more partitions of the first portion of the partitions.
1. A system, comprising:
a data lake comprising a distributed object store; and
a data indexing system comprising one or more processors and one or more memories to store computer-executable instructions that, when executed, cause the one or more processors to:
generate a plurality of indices for a plurality of partitions in the data lake, wherein the partitions are archived using a plurality of storage resources, wherein the indices comprise a plurality of bloom filters, and wherein an individual one of the bloom filters corresponds to one or more fields of a plurality of records in the partitions;
receive a query indicating a value;
determine, using the bloom filters, a candidate portion of the partitions that possibly comprise the value and a non-candidate portion of the partitions that do not comprise the value; and
determine, using the candidate portion of the partitions and not the non-candidate portion of the partitions, one or more records that comprise the value in one or more partitions of the candidate portion of the partitions.
2. The system as recited in
3. The system as recited in
delete the one or more records from the one or more partitions.
4. The system as recited in
6. The method as recited in
7. The method as recited in
8. The method as recited in
deleting one or more records associated with the value from the one or more partitions that comprise the value.
9. The method as recited in
10. The method as recited in
11. The method as recited in
12. The method as recited in
generating, by the data indexing system, a larger version of the individual one of the probabilistic data structures;
determining, by the data indexing system using the larger version of the individual one of the probabilistic data structures, a third portion of the partitions that possibly comprise an additional value and a fourth portion of the partitions that do not comprise the additional value; and
determining, by the data indexing system based at least in part on a scan of the third portion of the partitions and not the fourth portion of the partitions, one or more partitions of the third portion of the partitions that comprise the additional value.
14. The one or more non-transitory computer-readable storage media as recited in
15. The one or more non-transitory computer-readable storage media as recited in
deleting the one or more records from the one or more partitions of the first portion of the partitions.
16. The one or more non-transitory computer-readable storage media as recited in
17. The one or more non-transitory computer-readable storage media as recited in
18. The one or more non-transitory computer-readable storage media as recited in
19. The one or more non-transitory computer-readable storage media as recited in
20. The one or more non-transitory computer-readable storage media as recited in
generating a smaller version of the individual one of the bloom filters;
determining, using the smaller version of the individual one of the bloom filters, a third portion of the partitions that possibly comprise an additional value and a fourth portion of the partitions that do not comprise the additional value; and
determining, based at least in part on a scan of the third portion of the partitions and not the fourth portion of the partitions, one or more partitions of the third portion of the partitions that comprise the additional value.
Many companies and other organizations operate computer networks that interconnect numerous computing systems to support their operations, such as with the computing systems being co-located (e.g., as part of a local network) or instead located in multiple distinct geographical locations (e.g., connected via one or more private or public intermediate networks). For example, distributed systems housing significant numbers of interconnected computing systems have become commonplace. Such distributed systems may provide back-end services or systems that interact with clients. For example, such distributed systems may provide database systems to clients. As the scale and scope of database systems have increased, the tasks of provisioning, administering, and managing system resources have become increasingly complicated. For example, the costs to search, analyze, and otherwise manage data sets can increase with the size and scale of the data sets.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning “having the potential to”), rather than the mandatory sense (i.e., meaning “must”). Similarly, the words “include,” “including,” and “includes” mean “including, but not limited to.”
Embodiments of methods, systems, and computer-readable media for indexing partitions using distributed Bloom filters are described. Business entities and other organizations are increasingly reliant on very large data sets. For example, an entity that enables Internet-based sales from an electronic catalog of goods and services may maintain tens of thousands of data sets that collectively use petabytes of storage. Databases, data warehouses, and data lakes that are hosted using distributed systems may provide access to such data sets. Such systems may provide clients with access to collections of structured or unstructured data. A data set may include many records, each record having values in a plurality of fields. A data set may be divided into partitions to improve performance, e.g., to improve the performance of queries using query filters to restrict the number of partitions that are accessed. A very large data set may have tens of thousands or hundreds of thousands of partitions. For example, a data set may be partitioned by a field such as “day” such that data timestamped for one day is stored in a different partition than data timestamped for another day.
In some circumstances, queries for very large data sets can be prohibitively time-consuming and expensive. For example, when queries cannot be filtered by a partitioning key to restrict the number of partitions to be accessed, the entire set of partitions may need to be scanned. Some prior approaches for sorting data sets have used traditional indices, e.g., indices built using B+ trees. However, in a big data environment, sorting may not be feasible to perform across a large number of partitions and/or a large volume of data. Some prior approaches to searching data sets have used hash tables as data structures for efficient searches. However, in a big data environment, hashing can produce a very large volume of output such that the resulting indices are too expensive to maintain.
The aforementioned challenges, among others, are addressed by embodiments of the techniques described herein, whereby very large data sets may be searched efficiently using a distributed set of Bloom filters. A data lake comprising a distributed object store may contain a large amount of data that is infrequently updated. For example, an entity that enables Internet-based sales from an electronic catalog of goods and services may maintain one or more object stores to store data for customer orders. Older order information may be archived, e.g., in a data set that is partitioned by order date, such that no additional data is added to a partition after sufficient time has passed. Due to their infrequently changing nature, partitions in such a data set may be scanned once to create indices such as Bloom filters. In some embodiments, a Bloom filter is a space-efficient, probabilistic data structure that indicates whether a value is possibly included in a set or whether the value is definitely not in the set. For a given partition, Bloom filters may be generated for one or more fields to capture the possibility that particular values are found in the field(s). To determine the particular partitions that include a particular value, the Bloom filters may be used to identify candidate partitions that may possibly include the value while excluding non-candidate partitions that definitely do not include the value. By excluding a large number of non-candidate partitions using the Bloom filters, the remaining candidate partitions may be scanned efficiently to identify the relevant partitions that actually include records with the value. Using these techniques, for example, Bloom filters may be used to quickly find user data in a very large data set in order to provide a copy of the user data back to the user or delete the user data according to regulatory requirements (e.g., General Data Protection Regulation [GDPR] requirements).
As one skilled in the art will appreciate in light of this disclosure, embodiments may be capable of achieving certain technical advantages, including some or all of the following: (1) improving input/output (I/O) and network usage in a big data environment by using Bloom filters to restrict queries to a relatively small set of candidate partitions such that a larger set of irrelevant partitions need not be accessed; (2) improving the use of computing resources in a big data environment by using Bloom filters to restrict queries to a relatively small set of candidate partitions such that a larger set of irrelevant partitions need not be accessed; (3) improving the use of storage and memory resources in a big data environment by generating space-efficient Bloom filters to index a large number of partitions instead of using larger indices or hash tables; (4) improving the latency and performance of queries in a big data environment by adaptively sizing newly generated Bloom filters based (at least in part) on metrics such as data volume, number of records, and entropy; (5) improving the latency and performance of queries in a big data environment by replacing or augmenting existing Bloom filters with larger or smaller Bloom filters; (6) improving the latency of queries by using Bloom filters to restrict queries to a relatively small set of candidate partitions such that a larger set of irrelevant partitions need not be accessed; and so on.
The data lake 180 may include a plurality of object stores that are stored in a distributed manner. The object stores may differ in their performance characteristics, application programming interfaces (APIs), storage architectures, and/or other attributes. Objects in one object store in the data lake 180 may represent a different structure and/or different data types than objects in another object store. Objects in the data lake 180 may include object blobs or files. Objects in the data lake 180 may include semi-structured data (e.g., CSV files, logs, XML files, JSON files, and so on). Objects in the data lake 180 may include unstructured data (e.g., e-mails, word processing documents, PDFs, and so on). Objects in the data lake 180 may include binary data (e.g., images, audio, and video). At least some of the objects in the data lake 180 may not be tables that organize data by rows and columns. In some embodiments, at least some of the records may be stored in the data lake 180 without using a schema. A schema may represent a formal definition of the structure or organization of a data store. In some embodiments, at least some of the records may be stored in the data lake 180 according to a partial schema. A partial schema may partially but not completely define the structure or organization of a data store. In some embodiments, some of the records may be stored in one object store according to a partial schema that differs from others of the records in another object store.
At least some of the data lake 180 may be archived, infrequently updated, and/or read-only under normal use, at least after a period of time. For example, an entity that enables Internet-based sales from an electronic catalog of goods and services may maintain one or more data sets to store data for customer orders. Older order information may be archived, e.g., in a data set that is partitioned by order date such that no additional data is added to a partition after sufficient time has passed since the corresponding date of the partition. Due to their infrequently changing nature, the partitions 180A-180Z may be scanned once to create indices that can be used again and again for new queries of the partitions. The data indexing system 100 may include a component 110 for indexing of the data lake 180. The indexing 110 may generate a plurality of Bloom filters 120. In some embodiments, a Bloom filter is a space-efficient, probabilistic data structure that indicates whether a value is possibly included in a set of values or whether the value is definitely not in the set. A query of a Bloom filter may return false positives but not false negatives.
A Bloom filter may be generated by applying one or more hash functions to a set of values. A Bloom filter may include a bit array, and values in the set may be mapped (via the hash function(s)) to positions in the bit array. An empty Bloom filter may represent an array of n bits that are initially set to zero. Each hash function in a set of h hash functions (h ≥ 1) may map some value to one of the n array positions in a uniform random distribution. The size n of the Bloom filter may be selected based (at least in part) on a desired false positive rate and may be proportional to the number of values to be added to the filter. A value may be added to the Bloom filter by providing it to each of the h hash functions to get h array positions. The bits at those array positions may be set to 1. In some embodiments, additional values may be added to a Bloom filter, but values may not be removed from the filter.
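The construction described above can be sketched as follows. This is a minimal illustration, not the claimed embodiments; the class name, the salted-SHA-256 hashing scheme, and the parameter choices are assumptions made for the sketch.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an n-bit array and h hash functions."""

    def __init__(self, n_bits, h_hashes):
        self.n = n_bits
        self.h = h_hashes
        self.bits = bytearray((n_bits + 7) // 8)  # n bits, initially all zero

    def _positions(self, value):
        # Derive h array positions by hashing the value with h distinct salts.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, value):
        # Adding a value sets the bit at each of its h positions to 1.
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def possibly_contains(self, value):
        # "Definite no" if any position holds 0; otherwise "possible yes".
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(value))
```

Consistent with the description above, values can only be added to this structure; there is no removal operation, because clearing a bit could disturb other values that mapped to the same position.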
For a given partition that includes data values in different fields, a plurality of Bloom filters may be generated for one or more fields to capture the possibility that particular values are found in the field(s). As shown in the example of
The data indexing system 100 may include a component 150 for efficiently querying the data lake 180 using the Bloom filters 120. To begin searching a particular data set for a particular value, the querying component 150 may search the Bloom filters corresponding to the data set's partitions to determine the partitions that definitely do not include the value and also determine the partitions that possibly include the value. For example, to search a data set of customer order data for a particular customer ID, the querying component 150 may search the Bloom filters corresponding to the data set's partitions to exclude the partitions that definitely do not include the customer ID from additional scanning. To determine the particular partitions that include a particular value, a component 160 for candidate partition identification may use the Bloom filters 120 to identify candidate partitions 165 that may possibly include the value (false positives and/or true positives) while excluding non-candidate partitions 166 that definitely do not include the value (true negatives).
The querying component 150 may determine whether a value is present in a Bloom filter by providing the value to each of the h hash functions to get h array positions. If any of the bits at these positions is zero, then the querying component 150 may determine that the value is definitely not in the set (and thus definitely not present in the field(s) corresponding to the Bloom filter). However, if all of the bits at these positions are 1, then the querying component 150 may determine that the value is possibly in the set (and thus may or may not be present in the field(s) corresponding to the Bloom filter). The “possible yes” result may represent a false positive if the bits were set to 1 during the insertion of other values. If all of the Bloom filters for a given partition yielded a “definite no” result, then the querying component 150 may assign that partition to the set of non-candidate partitions 166. If any of the Bloom filters for a given partition yielded a “possible yes” result, then the querying component 150 may assign that partition to the set of candidate partitions 165.
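The candidate/non-candidate assignment described above can be illustrated as follows. This is a hypothetical sketch: the function name and the representation of each Bloom filter as a callable returning True for "possible yes" are assumptions, not part of the claimed system.

```python
def classify_partitions(value, filters_by_partition):
    """Split partitions into candidates (some filter said "possible yes")
    and non-candidates (every filter said "definite no")."""
    candidates, non_candidates = [], []
    for partition, bloom_filters in filters_by_partition.items():
        # A single "possible yes" makes the partition a candidate.
        if any(f(value) for f in bloom_filters):
            candidates.append(partition)
        else:
            non_candidates.append(partition)
    return candidates, non_candidates
```

Only the candidate list would then be passed on to partition scanning; the non-candidates are excluded with certainty, since Bloom filters produce no false negatives.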
To determine the particular partitions that actually include a particular value, a component 170 for partition scanning may examine the candidate partitions 165 and not the non-candidate partitions 166 to identify one or more partitions 175 that actually include the value. The partition(s) 175 that actually include the value may be referred to as relevant partitions. By excluding a large number of non-candidate partitions 166 using the Bloom filters 120, the remaining candidate partitions 165 may be scanned efficiently to identify the relevant partitions 175 that actually include the value in one or more records. Even if the query of the Bloom filters 120 yielded a small number of false positives, the resources required to scan these additional partitions may be a small fraction of the resources that would otherwise be required to scan the entire data set.
The data indexing system 100 may use Bloom filters 120 for efficient querying of large data sets for a variety of purposes. For example, the Bloom filters 120 may be used to quickly find user data or customer data in a very large data set. The user data or customer data may be reported back to the user or deleted from the data lake 180 according to regulatory requirements (e.g., General Data Protection Regulation [GDPR] requirements). Without the data indexing system 100 and the use of Bloom filters 120, such a task may consume a prohibitive amount of computing resources (e.g., processors, memory, I/O, etc.) and compute time for a single query. By restricting a scan to only a small set of candidate partitions rather than the entire data set, the data indexing system 100 may significantly reduce the amount of computing resources (e.g., processors, memory, I/O, etc.) and the resulting cost for a query of a very large data set.
In one embodiment, one or more components of the data indexing system 100 and/or the data lake 180 may be implemented using resources of a provider network. The provider network may represent a network set up by an entity such as a private-sector business or a public-sector organization to provide one or more services (such as various types of network-accessible computing or storage) accessible via the Internet and/or other networks to a distributed set of clients. The provider network may include numerous services that collaborate according to a service-oriented architecture to provide the functionality and resources of the data indexing system 100 and/or data lake 180. The provider network may include numerous data centers hosting various resource pools, such as collections of physical and/or virtualized computer servers, storage devices, networking equipment and the like, that are used to implement and distribute the infrastructure and services offered by the provider. Compute resources may be offered by the provider network to clients in units called “instances,” such as virtual or physical compute instances. In one embodiment, a virtual compute instance may, for example, comprise one or more servers with a specified computational capacity (which may be specified by indicating the type and number of CPUs, the main memory size, and so on) and a specified software stack (e.g., a particular version of an operating system, which may in turn run on top of a hypervisor). In various embodiments, one or more aspects of the data indexing system 100 may be implemented as a service of the provider network, the service may be implemented using a plurality of different instances that are distributed throughout one or more networks, and each instance may offer access to the functionality of the service to various clients. 
Because resources of the provider network may be under the control of multiple clients (or tenants) simultaneously, the provider network may be said to offer multi-tenancy and may be termed a multi-tenant provider network. The provider network may be hosted in the cloud and may be termed a cloud provider network. In one embodiment, portions of the functionality of the provider network, such as the data indexing system 100, may be offered to clients in exchange for fees.
In various embodiments, components of the data indexing system 100 and/or data lake 180 may be implemented using any suitable number and configuration of computing devices, any of which may be implemented by the example computing device 3000 illustrated in
Clients of the data indexing system 100 may represent external devices, systems, or entities. Client devices may be managed or owned by one or more clients of the data indexing system 100 and/or data lake 180. In one embodiment, the client devices may be implemented using any suitable number and configuration of computing devices, any of which may be implemented by the example computing device 3000 illustrated in
For a given one of the partitions 180A-180C, a plurality of Bloom filters may be generated for one or more fields to capture the possibility that particular values are found in the field(s). As shown in the example of
To begin searching the partitions 180A-180C of the data set for a particular value Q for a customer ID field, the querying component 150 may search the Bloom filters 120 corresponding to the data set's partitions to determine the partitions that definitely do not include the value Q and also determine the partitions that possibly include the value Q. In some embodiments, the query may be restricted to Bloom filters that correspond to fields that are known to include customer IDs. For example, if the first field in the partitions 180A-180C includes the partitioning field of the order date and the second field includes the customer ID of an order, then the querying 150 may use the Bloom filters corresponding to the second field (e.g., Bloom filters 120A2, 120B2, and 120C2) and not use other Bloom filters (e.g., Bloom filters 120A1 and 120B1). In some embodiments, the query may instead use the Bloom filters that correspond to all of the fields in the partitions 180A-180C.
If all of the Bloom filters for a given partition yielded a “definite no” result for inclusion of the value Q, then the querying component 150 may assign that partition to the set of non-candidate partitions 166. In the example shown in
In some embodiments, other probabilistic data structures or algorithms that may return false positives may be used instead of Bloom filters. In some embodiments, the querying 150 may use one or more machine learning techniques to restrict the set of candidate partitions 165. The one or more machine learning techniques may be used to predict the contents of partitions or fields so that queries can be restricted to a smaller set of partitions. The one or more machine learning techniques may augment the use of Bloom filters using features of the data to be queried. In some embodiments, one or more machine learning techniques may be used to automatically select a particular Bloom filter algorithm according to cost/benefit targets or to replace a Bloom filter algorithm with a different algorithm. The automatically selected algorithm may yield false negatives but may achieve performance and/or cost advantages, e.g., for data sets where “best effort” queries are acceptable. For example, for partitions that are more frequently updated, Bloom filters may require more frequent recalculation, while some machine learning algorithms may be trained once and then used for repeated predictions based on the learned behavior and regardless of updates to the partitions.
To determine the particular partitions that actually include the particular value Q in one or more records, the component 170 for partition scanning may examine the candidate partitions 165 and not the non-candidate partitions 166 to identify one or more partitions 175 with records that actually include the value Q. The partition(s) 175 that actually include the value Q may be referred to as relevant partitions and may include only the partition 180A. By excluding non-candidate partitions 166 using the Bloom filters 120, the remaining candidate partitions 165 may be scanned efficiently to identify the relevant partitions 175 that actually include the value Q. Even if the query of the Bloom filters 120 yielded a small number of false positives such as partition 180C, the resources required to scan these additional partitions may be a small fraction of the resources that would otherwise be required to scan the entire data set.
As shown in the example of
In some embodiments, the adaptive sizing 400 may use one or more machine learning techniques to determine appropriate sizes for Bloom filters for particular partitions. The one or more machine learning techniques may be used to predict the contents of partitions or fields so that features of data to be queried may be used to augment queries. In some embodiments, instead of manually selecting Bloom filter sizes or using heuristics such as a percentage of the partition volume, the one or more machine learning techniques may identify a Bloom filter size according to trade-offs between the costs and benefits of various sizes.
By adaptively sizing Bloom filters, the system 100 may improve the performance of queries. Adaptive sizing 400 may strike a balance between the size of Bloom filters and the false positive rate. Decreasing the size of a Bloom filter may reduce the storage and memory requirements for the filter. Increasing the size of a Bloom filter may reduce the false positive rate, which may then decrease the need to scan candidate partitions that do not actually include a value associated with a query.
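The size/false-positive trade-off described above can be quantified with the standard Bloom filter sizing formulas, n = −m·ln(p)/(ln 2)² bits and h = (n/m)·ln 2 hash functions for m expected values and target false positive rate p. The sketch below applies those well-known formulas; it is offered as background, not as the claimed adaptive sizing logic.

```python
import math


def bloom_parameters(num_items, target_fp_rate):
    """Return (bits n, hash count h) that achieve roughly the target
    false positive rate for the expected number of inserted values."""
    n = math.ceil(-num_items * math.log(target_fp_rate) / (math.log(2) ** 2))
    h = max(1, round((n / num_items) * math.log(2)))
    return n, h
```

For example, indexing one million values at a 1% target false positive rate requires roughly 9.6 million bits (about 1.2 MB) and 7 hash functions, which illustrates why a larger filter (more bits per value) lowers the false positive rate at the cost of storage and memory.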
In some embodiments, Bloom filters and/or corresponding partitions may be monitored for such metrics after one or more Bloom filters have been created for the corresponding partitions. If one or more thresholds are exceeded by the metrics, then a Bloom filter may be replaced with an equivalent Bloom filter of a larger or smaller size. For example, if a partition 180A is experiencing an increased number of searches, or the observed false positive rate is too high, then one or more Bloom filters for that partition may be replaced by larger versions. As shown in
In some embodiments, Bloom filters for a given field may be created in different sizes. As shown in
In some embodiments, the auto-scaling 500 and/or 600 may use one or more machine learning techniques to determine appropriate sizes for Bloom filters for particular partitions. The one or more machine learning techniques may be used to predict the contents of partitions or fields so that features of data to be queried may be used to augment queries. In some embodiments, instead of manually selecting Bloom filter sizes or using heuristics such as a percentage of the partition volume, the one or more machine learning techniques may identify a Bloom filter size according to trade-offs between the costs and benefits of various sizes.
By auto-scaling Bloom filters as shown in
As shown in 710, a query may be received that indicates a value. The partitions may be partitioned according to a field such as the time period of customer orders of goods and services from an Internet-accessible electronic catalog. However, the query of the data set may seek data associated with a different field such as customer ID. To facilitate fast searching of the partitions by a non-partitioning field such as customer ID, the Bloom filters may be used to determine partitions that may include the value for the field.
As shown in 720, using the Bloom filters or other probabilistic data structures, a set of candidate partitions and a set of non-candidate partitions may be determined. To begin querying a particular data set for a particular value indicated in the query, the Bloom filters corresponding to the data set's partitions may be used to determine the non-candidate partitions that definitely do not include the value and also determine the candidate partitions that possibly include the value. For example, to search a data set of customer order data for a particular customer ID, the Bloom filters corresponding to the data set's partitions may be used to exclude the partitions that definitely do not include the customer ID from additional scanning. If all of the Bloom filters for a given partition yield a “definite no” result, then that partition may be assigned to the set of non-candidate partitions. If any of the Bloom filters for a given partition yields a “possible yes” result, then that partition may be assigned to the set of candidate partitions.
As shown in 730, using the set of candidate partitions and not the set of non-candidate partitions, one or more partitions that actually include the value in one or more records may be determined. To determine the particular partitions that actually include a particular value, the candidate partitions and not the non-candidate partitions may be scanned or examined to identify one or more partitions that actually include the value indicated in the query. The partition(s) that actually include the value may be referred to as relevant partitions. By excluding a large number of non-candidate partitions using the Bloom filters, the remaining candidate partitions may be scanned efficiently to identify the relevant partitions that actually include the value in one or more records. Even if the query of the Bloom filters yields a small number of false positives, the resources required to scan these additional partitions may be a small fraction of the resources that would otherwise be required to scan the entire data set.
As shown in 740, one or more actions may be performed with respect to the one or more records associated with the value. The one or more actions may be performed for the one or more partitions that actually include the value and not for other partitions (e.g., non-candidate partitions and partitions that were candidates only because of false positives). The one or more actions may include reading data, e.g., the one or more records that include the value. The one or more actions may include returning data to a query client, e.g., a copy of the one or more records that include the value. The one or more actions may include deleting data, e.g., the one or more records that include the value.
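The read and delete actions restricted to the relevant partitions can be sketched as below. The helper names `read_matching` and `delete_matching` are hypothetical, and matching on whole records equal to the value is a simplification of matching a field within a record.

```python
def read_matching(partitions, relevant, value):
    """Read copies of matching records, only from relevant partitions."""
    return {p: [r for r in partitions[p] if r == value] for p in relevant}

def delete_matching(partitions, relevant, value):
    """Delete matching records, only in relevant partitions."""
    for p in relevant:
        partitions[p] = [r for r in partitions[p] if r != value]

partitions = {"part-0": ["cust-1"], "part-1": ["cust-3", "cust-3"]}
relevant = ["part-1"]  # partitions the scan confirmed to hold the value

copies = read_matching(partitions, relevant, "cust-3")   # for a query client
delete_matching(partitions, relevant, "cust-3")
# Only "part-1" is read and modified; "part-0" is never touched.
```

Restricting the action to `relevant` means non-candidate partitions and false-positive candidates are never read or written during this step.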
Illustrative Computer System
In at least some embodiments, a computer system that implements a portion or all of one or more of the technologies described herein may include a general-purpose computer system, such as computing device 3000, that includes or is configured to access one or more computer-readable media.
In various embodiments, computing device 3000 may be a uniprocessor system including one processor or a multiprocessor system including several processors 3010A-3010N (e.g., two, four, eight, or another suitable number). In one embodiment, processors 3010A-3010N may include any suitable processors capable of executing instructions. For example, in various embodiments, processors 3010A-3010N may be processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In one embodiment, in multiprocessor systems, each of processors 3010A-3010N may commonly, but not necessarily, implement the same ISA.
In one embodiment, system memory 3020 may be configured to store program instructions and data accessible by processor(s) 3010A-3010N. In various embodiments, system memory 3020 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions, such as those methods, techniques, and data described above, are shown stored within system memory 3020 as code (i.e., program instructions) 3025 and data 3026.
In one embodiment, I/O interface 3030 may be configured to coordinate I/O traffic between processors 3010A-3010N, system memory 3020, and any peripheral devices in the device, including network interface 3040 or other peripheral interfaces. In some embodiments, I/O interface 3030 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 3020) into a format suitable for use by another component (e.g., processors 3010A-3010N). In some embodiments, I/O interface 3030 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 3030 may be split into two or more separate components, such as a north bridge and a south bridge, for example. In some embodiments, some or all of the functionality of I/O interface 3030, such as an interface to system memory 3020, may be incorporated directly into processors 3010A-3010N.
In one embodiment, network interface 3040 may be configured to allow data to be exchanged between computing device 3000 and other devices 3060 attached to a network or networks 3050. In various embodiments, network interface 3040 may support communication via any suitable wired or wireless general data networks, such as types of Ethernet network, for example. Additionally, in some embodiments, network interface 3040 may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
In some embodiments, system memory 3020 may be one embodiment of a computer-readable (i.e., computer-accessible) medium configured to store program instructions and data as described above for implementing embodiments of the corresponding methods and apparatus. In some embodiments, program instructions and/or data may be received, sent or stored upon different types of computer-readable media. In some embodiments, a computer-readable medium may include non-transitory storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD coupled to computing device 3000 via I/O interface 3030. In one embodiment, a non-transitory computer-readable storage medium may also include any volatile or nonvolatile media such as RAM (e.g. SDRAM, DDR SDRAM, RDRAM, SRAM, etc.), ROM, etc., that may be included in some embodiments of computing device 3000 as system memory 3020 or another type of memory. In one embodiment, a computer-readable medium may include transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 3040. The described functionality may be implemented using one or more non-transitory computer-readable storage media storing program instructions that are executed on or across one or more processors. Portions or all of multiple computing devices such as computing device 3000 may be used to implement the described functionality in various embodiments.
The various methods as illustrated in the Figures and described herein represent examples of embodiments of methods. In various embodiments, the methods may be implemented in software, hardware, or a combination thereof. In various embodiments, in various ones of the methods, the order of the steps may be changed, and various elements may be added, reordered, combined, omitted, modified, etc. In various embodiments, various ones of the steps may be performed automatically (e.g., without being directly prompted by user input) and/or programmatically (e.g., according to program instructions).
The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.
It will also be understood that, although the terms first, second, etc., may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact could be termed a second contact, and, similarly, a second contact could be termed a first contact, without departing from the scope of the present invention. The first contact and the second contact are both contacts, but they are not the same contact.
Numerous specific details are set forth herein to provide a thorough understanding of claimed subject matter. However, it will be understood by those skilled in the art that claimed subject matter may be practiced without these specific details. In other instances, methods, apparatus, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter. Various modifications and changes may be made as would be obvious to a person skilled in the art having the benefit of this disclosure. It is intended to embrace all such modifications and changes and, accordingly, the above description is to be regarded in an illustrative rather than a restrictive sense.
Inventors: Liverance, Fletcher; Park, Yangbae; Song, Zhuonan; Jalumari, Laxmi Siva Prasad Balaramaraju; Opincariu, Daniel
Cited By

Patent | Priority | Assignee | Title
11914732 | Dec 16 2020 | STRIPE, INC. | Systems and methods for hard deletion of data across systems

References Cited

Patent | Priority | Assignee | Title
10210195 | Feb 12 2016 | International Business Machines Corporation | Locating data in a set with a single index using multiple property values
10394820 | Nov 22 2016 | International Business Machines Corporation | Constructing and querying a bloom filter to detect the absence of data from one or more endpoints
10452676 | Jan 31 2014 | Hewlett Packard Enterprise Development LP | Managing database with counting bloom filters
10503737 | Mar 31 2015 | EMC Corporation | Bloom filter partitioning
10592532 | Oct 25 2017 | International Business Machines Corporation | Database sharding
10678791 | Oct 15 2015 | Oracle International Corporation | Using shared dictionaries on join columns to improve performance of joins in relational databases
10691687 | Apr 26 2016 | International Business Machines Corporation | Pruning of columns in synopsis tables
10698898 | Jan 24 2017 | Microsoft Technology Licensing, LLC | Front end bloom filters in distributed databases
10719512 | Oct 23 2017 | International Business Machines Corporation | Partitioned bloom filter merge for massively parallel processing clustered data management
11210279 | Apr 15 2016 | Apple Inc | Distributed offline indexing
8260909 | Sep 19 2006 | Oracle America, Inc | Method and apparatus for monitoring a data stream
8631028 | Oct 29 2009 | | XPath query processing improvements
8972337 | Feb 21 2013 | Amazon Technologies, Inc. | Efficient query processing in columnar databases using bloom filters
9367574 | Feb 21 2013 | Amazon Technologies, Inc. | Efficient query processing in columnar databases using bloom filters
9501527 | Dec 28 2015 | International Business Machines Corporation | Bloom filter construction method for use in a table join operation portion of processing a query to a distributed database
9535658 | Sep 28 2012 | Alcatel Lucent | Secure private database querying system with content hiding bloom filters
9971809 | Sep 28 2015 | CA, INC | Systems and methods for searching unstructured documents for structured data
20090063396 | | |
20170011073 | | |
20180129691 | | |
Executed on | Assignor | Assignee | Conveyance | Reel/Frame
Aug 18 2020 | JALUMARI, LAXMI SIVA PRASAD BALARAMARAJU | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 055197/0268
Aug 18 2020 | OPINCARIU, DANIEL | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 055197/0268
Aug 19 2020 | LIVERANCE, FLETCHER | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 055197/0268
Aug 19 2020 | SONG, ZHUONAN | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 055197/0268
Aug 20 2020 | Amazon Technologies, Inc. | (assignment on the face of the patent) | |
Aug 20 2020 | PARK, YANGBAE | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 055197/0268
Date | Maintenance Fee Events |
Aug 20 2020 | BIG: Entity status set to Undiscounted (note the period is included in the code). |
Date | Maintenance Schedule |
Dec 20 2025 | 4 years fee payment window open |
Jun 20 2026 | 6 months grace period start (w surcharge) |
Dec 20 2026 | patent expiry (for year 4) |
Dec 20 2028 | 2 years to revive unintentionally abandoned end. (for year 4) |
Dec 20 2029 | 8 years fee payment window open |
Jun 20 2030 | 6 months grace period start (w surcharge) |
Dec 20 2030 | patent expiry (for year 8) |
Dec 20 2032 | 2 years to revive unintentionally abandoned end. (for year 8) |
Dec 20 2033 | 12 years fee payment window open |
Jun 20 2034 | 6 months grace period start (w surcharge) |
Dec 20 2034 | patent expiry (for year 12) |
Dec 20 2036 | 2 years to revive unintentionally abandoned end. (for year 12) |