Provided are techniques for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.
1. A computer-implemented method, comprising:
in response to receiving a new document,
generating an assigned-doc-ID for the new document;
identifying, for the assigned-doc-ID, a virtual-index-epoch from a virtual-index-epoch map that includes virtual-index-epochs that are each assigned a range of assigned-doc-IDs;
applying a first function to a virtual-index-epoch value of the identified virtual-index-epoch to identify a logical partition;
applying a second function to the identified logical partition to identify a physical partition; and
placing the new document into the identified physical partition associated with the identified virtual-index-epoch.
17. A computer program product comprising a tangible computer readable storage medium including a computer readable program, wherein the computer readable program when executed by a processor on a computer causes the computer to perform:
in response to receiving a new document,
generating an assigned-doc-ID for the new document;
identifying, for the assigned-doc-ID, a virtual-index-epoch from a virtual-index-epoch map that includes virtual-index-epochs that are each assigned a range of assigned-doc-IDs;
applying a first function to a virtual-index-epoch value of the identified virtual-index-epoch to identify a logical partition;
applying a second function to the identified logical partition to identify a physical partition; and
placing the new document into the identified physical partition associated with the identified virtual-index-epoch.
9. A system, comprising:
a processor; and
storage coupled to the processor, wherein the storage stores a computer program, and wherein the processor is configured to execute instructions of the computer program to perform operations, the operations comprising:
in response to receiving a new document,
generating an assigned-doc-ID for the new document;
identifying, for the assigned-doc-ID, a virtual-index-epoch from a virtual-index-epoch map that includes virtual-index-epochs that are each assigned a range of assigned-doc-IDs;
applying a first function to a virtual-index-epoch value of the identified virtual-index-epoch to identify a logical partition;
applying a second function to the identified logical partition to identify a physical partition; and
placing the new document into the identified physical partition associated with the identified virtual-index-epoch.
2. The method of
3. The method of
maintaining a persistent, transactionally recoverable structure that stores the virtual-index-epoch map.
4. The method of
dynamically maintaining the virtual-index-epoch map to accommodate changes in the system capacity, modeled performance of a physical partition, and actual performance of a physical partition.
5. The method of
maintaining the virtual-index-epoch map by at least one of creating and deleting virtual-index-epoch numbers from the virtual-index-epoch map.
6. The method of
including in the virtual-index-epoch map rows based on the assigned-doc-ID for the document that triggered maintenance of the virtual-index-epoch map and columns based on a number of physical partitions deemed to be sufficient to meet the performance criteria.
7. The method of
optimizing a total number of the physical partitions by reusing at least some of the physical partitions.
8. The method of
10. The system of
maintaining a persistent, transactionally recoverable structure that stores the virtual-index-epoch map.
11. The system of
dynamically maintaining the virtual-index-epoch map to accommodate changes in the system capacity, modeled performance of a physical partition, and actual performance of a physical partition.
12. The system of
maintaining the virtual-index-epoch map by at least one of creating and deleting virtual-index-epoch numbers from the virtual-index-epoch map.
13. The system of
including in the virtual-index-epoch map rows based on the assigned-doc-ID for the document that triggered maintenance of the virtual-index-epoch map and columns based on a number of physical partitions deemed to be sufficient to meet the performance criteria.
14. The system of
optimizing a total number of the physical partitions by reusing at least some of the physical partitions.
15. The system of
16. The system of
18. The computer program product of
maintaining a persistent, transactionally recoverable structure that stores the virtual-index-epoch map.
19. The computer program product of
dynamically maintaining the virtual-index-epoch map to accommodate changes in the system capacity, modeled performance of a physical partition, and actual performance of a physical partition.
20. The computer program product of
maintaining the virtual-index-epoch map by at least one of creating and deleting virtual-index-epoch numbers from the virtual-index-epoch map.
21. The computer program product of
including in the virtual-index-epoch map rows based on the assigned-doc-ID for the document that triggered maintenance of the virtual-index-epoch map and columns based on a number of physical partitions deemed to be sufficient to meet the performance criteria.
22. The computer program product of
optimizing a total number of the physical partitions by reusing at least some of the physical partitions.
23. The computer program product of
24. The computer program product of
1. Field
Embodiments of the invention relate to index partition maintenance over monotonically addressed document sequences.
2. Description of the Related Art
In the current state of the art, text indexing systems are implemented as inverted lists using standard underlying file system storage. Such text indexing systems typically provide adequate performance for a corpus of about a million documents, depending on factors such as document size (i.e., average number of tokens per document), the distribution of words that typically occur within the document corpus, and a host of other factors. A token may be described as a term (e.g., word, number, sequence of logograms, or other contiguous string of symbols) appearing in a document. When, however, an attempt is made to scale up such text indexing systems to contain a corpus on the order of billions of documents, a series of capacity and performance problems occurs.
First, the text indexing system runs into typical file system limits and capacity problems, where it is virtually impossible to sustain a single text index larger than the underlying file system. Typical low cost file systems are directly implemented over Just a Bunch of Disks (JBOD) or one or more spindles (disks). Transparent storage scalable file systems exist; however, they demand higher costs and more indirect management and, typically, offer limited scalability with respect to the number of participating machines. Also, such a choice may not be feasible in some installations because the added software virtualization layers cause further I/O performance problems: the text indexing implementations in the field involve a high number of file system metadata changes, which such file systems generally handle poorly.
Second, the I/O profile associated with current text indexing systems directly affects the create (i.e., insert or ingest) velocity of the overlying applications using the index at the time when the inverted list implementation within the text index undergoes a hardening operation called an index merge operation. Creation of a document at the text index layers may be described as processing of the document such that the document is inserted or created and indexed within the full text indexing system. Current text indexing systems undergo a serious sequential read and sequential write of almost the entire index, causing serious dips and stalls in the performance of the creation pipeline of the overlying application using the text index. There is another stall in the current product offerings of text indexing systems called the optimize problem, which essentially also stalls the application until the entire inverted list is recreated using the old instance of the inverted lists. This is typically a long duration event that stalls the creation pipeline of the overlying application.
Third, another class of problems includes the term distribution problem. This problem involves the distribution of words within the document corpus being stored within the text index, which is sometimes referred to as the term dictionary of the document corpus. It is altogether possible that simply attempting to activate and open the text index with the current product offerings could consume all the memory resources of the hosting system simply to load in memory the first level term index/dictionary. In some cases, loading could be virtually impossible for indexes that have very large term distributions, demanding that the index be split and yet managed as a single index with a single virtual index view.
Fourth, on the side of search, performance due to very large term dictionaries can degrade.
For example, with reference to a conventional index there are inherent limits to which persistent file structures can actually be hosted in the text indexing systems at runtime. Certain structures, such as the first level term index file, at some point cannot be managed properly in memory due to finite memory that is available to the JAVA™ Virtual Machine (JVM) heap. JAVA is a trademark of Sun Microsystems in the United States and/or other countries. Also, a conventional index may be hosted in a directory and inherently must lie within the storage limits of an underlying physical file system. This implies that the file system storage limits would decide the maximum size of the index. A single conventional index has to lie within certain optimal limits in the posting lists to have reasonable search performance, assuming that the term distribution would reach a certain steady state at some point in the life cycle of the file system. A single conventional index would have a peak creation rate associated with the underlying performance of the file system and storage and available Central Processing Unit (CPU).
Thus, as described, there are a number of problems associated with single very large full text indexes. Operationally, such indexes could exceed the file system capacity limits, which causes problems. Performance and throughput can also be seriously affected with such single very large indexes, both when inserting new documents and when performing a search or query. For example, dips and stalls in response times are known to occur when merge operations or index optimizations are performed internally to compact and maintain the index.
In conclusion, there is a need for transparently and optimally partitioning and managing text indexes with a single virtual view to an application that utilizes the text indexes.
Provided are a method, computer program product, and system for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
In the following description, reference is made to the accompanying drawings which form a part hereof and which illustrate several embodiments of the invention. It is understood that other embodiments may be utilized and structural and operational changes may be made without departing from the scope of the invention.
Thus, irrespective of the problems with the current state of the art, embodiments achieve steady state peak creation and search velocity using a virtual index that comprises a series of autonomically managed underlying physical (i.e., real) indexes.
Embodiments dynamically partition text indexes transparently, while providing a single virtualized view of the index to an overlying application that could use the index with a single interface for create, modify, delete and search operations.
Embodiments provide a two-dimensional dynamic partitioning scheme (in the form of a virtual-index-epoch map) that affords a mechanism to provide a single system view or a virtualized view of multiple underlying physical partitions (e.g., physical indexes). The term “single system view” is analogous to the term “single system image” used in operating systems. The term refers to an underlying system providing a way for some outside consumer application to think that it is dealing with one entity, even though that underlying system is actually manipulating many entities. The term “virtualized view” is also used.
Embodiments provide an internal monotonic sequenced integer called an assigned-doc-ID (i.e., assigned-document-identifier) to be assigned and associated with each document that is created. This permits an integer range cutoff partition scheme based on the assigned-doc-ID, which is used for the first dimension. In addition, a user defined or load defined open partitioning scheme is introduced in the second dimension within each cutoff range. In certain embodiments, a row in this two-dimensional dynamic partitioned scheme represents a virtual-index-epoch that may be triggered autonomically or manually. In embodiments, a virtual-index-epoch may be described as a partitioning state snapshot in the first dimension.
Embodiments provide a Highly Scalable Indexing Platform (HSIP). The HSIP usually starts with a single hand-tooled virtual-index-epoch numbered zero. Subsequently, as triggers in the first dimension occur, embodiments create a new virtual-index-epoch that becomes the new current virtual-index-epoch and cuts off the previous virtual-index-epoch, thereby assigning a monotonic range to the previous virtual-index-epoch. In certain embodiments, the triggers in the first dimension are typically fired on capacity feedback mechanisms of the HSIP. In certain embodiments, the triggers in the second dimension are typically fired based on throughput and response time feedback. In certain embodiments, the virtualized view makes the scaling out or up of the underlying physical partitions in these dimensions transparent to the hosting application.
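The cut-off behavior described above can be illustrated with a minimal sketch. The class and field names (Epoch, EpochMap, transition) are hypothetical and are not taken from the embodiments; the sketch also assumes that assigned-doc-IDs start at 1 and that an open upper bound is represented by None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Epoch:
    number: int
    lower: int               # exclusive lower bound of assigned-doc-IDs
    upper: Optional[int]     # inclusive upper bound; None means still open
    partitions: int          # logical partitions in the second dimension


class EpochMap:
    """Hypothetical in-memory stand-in for the virtual-index-epoch map."""

    def __init__(self, initial_partitions: int = 1) -> None:
        # Epoch zero is set up by hand; IDs are assumed to start at 1.
        self.epochs: List[Epoch] = [Epoch(0, 0, None, initial_partitions)]

    @property
    def current(self) -> Epoch:
        return self.epochs[-1]

    def transition(self, last_assigned_doc_id: int, partitions: int) -> Epoch:
        """Cut off the current epoch and open a new current epoch."""
        old = self.current
        if last_assigned_doc_id < old.lower:
            raise ValueError("cut-off precedes the current epoch's lower bound")
        old.upper = last_assigned_doc_id          # previous epoch gets a closed range
        new = Epoch(old.number + 1, last_assigned_doc_id, None, partitions)
        self.epochs.append(new)
        return new


epoch_map = EpochMap(initial_partitions=1)
epoch_map.transition(last_assigned_doc_id=1000, partitions=3)
epoch_map.transition(last_assigned_doc_id=2500, partitions=2)
for epoch in epoch_map.epochs:
    print(epoch)
```

Each transition only closes the range of the previous row and appends a new open row, so earlier rows never change their bounds once cut off.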
The index controller 140 is coupled to a DataBase Management System (DBMS) 142. The DBMS 142 is coupled, via a database network 150, to one or more databases. In the illustration of
The nodes 110, 120, and 130 are coupled to one or more shared file systems via a shared file system network 160. In some embodiments, this could be a standard shared file system such as NFS, CFS, or GPFS, wherein the file system network 160 is simply an IP based network. In the illustration of
The application 100 issues Create, Read, Update, and Delete (CRUD) operations or query operations via an application network to the index controller 140. The index controller 140 forwards the CRUD operations and/or queries to the appropriate index server 114, 124, 126, 134 to process. The index servers 114, 124, 126, 134 access the appropriate shared file system 162, 164 to process the associated CRUD and/or partial query operations.
The use of ellipses in
In certain embodiments, the index controller 140 is designed to failover to a passive instance of another index controller on some alternate node within the active group of nodes, and, together, the index controllers can be deemed to be operating in an active/passive mode. There may be zero or more passive instances of the index controller.
The trigger generator component 200 receives messages from node managers 110, 120, 130 and/or index servers 114, 124, 126, 134 containing performance metrics.
The shared file system 162, 164 is used to store the underlying persisted forms of the text index typically used by the specific text indexing engine (also called a native indexer in this discussion).
The following definitions are used herein:
1. Document—A document may be described as a logical sequence of words or tokens in any format. A document may be materialized by appropriate tokenization of a MICROSOFT™ word document, a Hypertext Markup Language (HTML) web page, a Portable Document Format (PDF) document, a raw text document, or a document in a host of other formats. MICROSOFT is a trademark of Microsoft Corporation in the United States and/or other countries.
2. Query—A query expresses the characteristics that a user or system is interested in. For instance, the query “John Doe” is considered to represent a user or system's interest in all documents containing the phrase “John Doe”. There are several known syntaxes for expressing queries, including those used by public search engines, indexing software, and standard query languages, such as Extensible Markup Language (XML) Path Language (XPath). In embodiments, the query may be written in any query syntax.
3. Assigned-doc-ID—An assigned-doc-ID is generated for a document during creation. The document is addressed/identified by the assigned-doc-ID, which is a monotonically increasing, non-reusable unique identifier (e.g., a 64-bit integer). For example, monotonically addressable documents occur in the context of content management systems.
4. Physical partition (e.g., physical index)—A physical partition is managed by an index server and stored in a shared file system. Typically, the physical partition is an inverted list and may be used by any search engine.
5. Virtual Index—A virtual index is a single system view of all physical partitions.
6. Each index server is capable of tracking the state of the physical partitions it has been assigned. Each index server has a notion of which (or the highest) assigned-doc-ID has been persisted for each of those physical partitions. In some embodiments, the index controller 140 may query this state and reapply what the index controller 140 thinks was lost in flight due to network partitions, location failure, etc., for a specific physical partition it is attempting to process. In certain embodiments, the recovery system for each physical partition is managed by the local index server, and each physical partition can recover independently by negotiating with the index controller using some form of write ahead logging of the CRUD operations in the index controller.
7. Virtual-index-epoch—A set of indexes that share the same range partition with a well known lower bound and upper bound. In certain embodiments, for the two-dimensional dynamic partitioned scheme, there is one active virtual-index-epoch. This set of indexes uses the same partitioning scheme in the second dimension. The indexes within the virtual-index-epoch are logically numbered starting from 0.
8. Virtual-index-epoch transition—The act of transitioning to a new set of one or more indexes with a new lower bound assigned-doc-ID and infinity upper bound. This meets basic failure scenarios and is transactional with appropriate cleanup/restart recovery. Such a transition usually occurs when a capacity or throughput trigger is generated by the trigger generator.
9. Location—A node/host that has an independent CPU and memory resources and either uses shared file system storage 140 or isolated storage manifested as a file-system mounted at that node.
10. Placement Technique—Placement is a two part process consisting of the act of determining a location (e.g., node 13) to host one or more index server instances subsequent to optimally determining what physical partition subset will be hosted by the individual index servers. Placement techniques (e.g., bin packing or the knapsack problem for combinatorial optimization) are known to optimally place disjoint subsets of physical partitions into an optimal set of index servers that can then be placed across the nodes/locations in the cluster.
11. Placement Map 122—The entire set of index server instances and their locations, together with the associated disjoint set of physical partitions hosted by each index server.
12. Trigger—There exist two types of triggers, either manually driven or autonomically driven by way of feedback and thresholds. Threshold values are derived from independent modeling. The type-1 trigger is associated with storage resources, document corpus limits, and memory availability within a node to sustain an index for a given number of documents, etc. The type-2 trigger is typically a throughput trigger. The throughput trigger may be driven manually or autonomically, where the throughput disposition at an earlier virtual-index-epoch is determined from response behavior/history on inserts.
13. Index controller 140—The index controller 140 is the keeper of all distributed state, and the keeper of the persistent Create, Read, Update, and Delete (CRUD) queue. The index controller 140 also orchestrates the virtual-index-epoch transition.
Embodiments solve the problems with the current state of the art as follows:
a. An index is broken at birth on a continuous basis into a series of managed physical partitions that are more easily hosted within the HSIP 190 limits. The number of the physical partitions is autonomically managed to provide a transparent single virtual index view. That is, an application will believe that it is submitting a query to a single index, rather than a set of physical partitions.
b. A single system view or image is provided of the managed physical partitions so that applications are unchanged. That is, applications interacting with the index controller 140 are not aware of an index being separated into multiple physical partitions that happens transparently on a continuous basis. The management of virtual-index-epochs and the use of the mapping (via the use of the two-dimensional dynamic partitioning scheme, the map function and the group function) provides a mechanism to provide that single virtualized view of the physical partitioned indexes.
c. The size of the physical partitions is kept within a tolerable performance envelope, such that the overall virtual index has a predictable steady performance characteristic in the creation velocity and the search performance. This is achieved with the help of triggering a virtual-index-epoch of type-1 (i.e., capacity). Reasonable feedback mechanisms for size and term distributions are tied to this virtual-index-epoch of type-1.
d. The virtual index is managed autonomically, providing a single system view or image to the applications, using a two-dimensional dynamic partitioning scheme. This provides a single system view or a virtualized index over the multiple underlying physical partitions with an internal monotonic sequenced document ID that is range partitioned in the first dimension and a user defined or load defined open partitioning scheme in the second dimension for each range. A row in the two-dimensional dynamic partitioned scheme represents a virtual-index-epoch, which can be triggered autonomically. The triggers in the first dimension are typically fired on capacity. The triggers in the second dimension are typically fired based on throughput demands. The virtualized view provides the transparency to the hosting application in these dimensions of scaling out or up the physical partitioned indexes. The virtual-index-epoch provides means to evaluate and dynamically reconfigure the number of indexes required to sustain a steady state performance. In certain embodiments, the reconfiguration may add a new set of physical partitions, remove physical partitions, or merge older indexes from earlier virtual-index-epochs, without loss of CRUD or query service at the virtual index level.
In block 402, the index controller 140 determines an action from reviewing statistics such as CPU usage, memory usage, and/or other relevant statistics about the node collected from the node managers and delivered by way of a Remote Procedure Call (RPC). In various embodiments, the shared file system monitor 212 continuously and/or periodically monitors the size and usage of the shared file systems, and other policies expressed as rule sets. The actions may include performing load balancing. For example, for type 1 triggers, the actions may be to add another index server or have an existing index server process two physical partitions instead of three physical partitions. In block 404, the index controller 140 determines whether a virtual-index-epoch transition is in progress. If so, the index controller 140 waits; otherwise, the index controller 140 continues to block 406.
In block 406, the index controller 140 determines whether all query sessions and CRUD sessions to the index controller 140 have completed current operations and the gate can be acquired. This involves acquiring a gate such that no other operation can proceed. If the gate is not available, the index controller 140 continues to block 408, otherwise, the index controller 140 waits till all open sessions rendezvous and wait at the gate. This allows the virtual-index-epoch transition to get exclusive rights to alter the appropriate map structures.
In block 408, the index controller locks the virtual-index-epoch map 224 (i.e., closes the virtual-index-epoch gate). From block 408 (
In block 416, the index controller 140 closes the virtual-index-epoch gate. In block 418, the index controller 140 unlocks the virtual-index-epoch map 224. In block 420, the index controller 140 creates a virtual-index-epoch start time phase marker persistent record for crash purposes. This involves persisting a record into the DBMS 142 to mark a start phase of the virtual-index-epoch transition. This is done so that, in case a crash occurs, the HSIP 190 can recover by detecting the said persisted record, seeing that the virtual-index-epoch did not complete and rolling back the incomplete virtual-index-epoch transition operations that may have occurred partially.
From block 420 (
In block 424, in accordance with certain embodiments, the index controller 140 removes the old logical index number assignments to the physical partitions of the previous virtual-index-epoch in the in memory form of the virtual-index-epoch map 224. In such embodiments, a previous virtual-index-epoch that was current at some point in time in history has physical partitions that have some logical numbers attached to them. When brought forward to the new virtual-index-epoch, all physical partitions, including one or more new physical partitions that may be deemed necessary based on the trigger, are renumbered with new logical index numbers. In block 426, the index controller 140 assigns new logical index numbers by renumbering the physical partitions starting from zero. In block 428, the index controller 140 runs a placement technique (e.g., bin packing) to assign the physical partitions to the index servers 114, 124, 126, 134. In block 430, the index controller 140 deploys a placement map 222 by re-deploying and re-starting the index servers 114, 124, 126, 134 over the M nodes 110, 120, 130 in the N clusters for the N physical partitions.
From block 430 (
In certain embodiments, a virtual-index-epoch transition does not stop create operations at the index controller 140.
From block 510 (
From block 522 (
In
In
In certain embodiments, the index controller 140 computes the virtual-index-epoch using a binary search of table 700, where the virtual-index-epoch of a document ID is based on the sorted unique interval determined from the range minimum and range maximum that contains the assigned-document-ID (max is included, min is excluded). For example, for each function virtual-index-epoch(document ID)=virtual-index-epoch number and with reference to table 700: virtual-index-epoch(96)=1, virtual-index-epoch(3097)=5, virtual-index-epoch(699)=3, virtual-index-epoch(6098)=6.
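As a minimal sketch of this range lookup, a binary search over the sorted inclusive range maximums suffices. The bounds below are hypothetical and are not the contents of table 700; the function name is likewise an assumption made for illustration.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds of closed epochs 0..3 (NOT table 700).
# Any ID above the last bound belongs to the current, still-open epoch 4.
RANGE_MAX = [100, 500, 900, 2000]


def virtual_index_epoch(doc_id: int) -> int:
    """Return the epoch whose (min, max] interval contains doc_id."""
    if doc_id <= 0:
        raise ValueError("assigned-doc-IDs are assumed to start at 1")
    # bisect_left finds the first bound that is >= doc_id, so a bound
    # itself falls into its own epoch (max included, min excluded).
    return bisect_left(RANGE_MAX, doc_id)


print(virtual_index_epoch(100))   # last ID of epoch 0
print(virtual_index_epoch(101))   # first ID of epoch 1
print(virtual_index_epoch(2001))  # current open epoch
```

Because the bounds are sorted and unique, the lookup cost grows logarithmically with the number of epochs.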
Then, using table 700, the index controller 140 computes the map function (e.g., map(document ID)) using the virtual-index-epoch value and a hash function. For example if hash(x)=x, then, for a document ID of 2000:
For the map example of map(2000), the values of 3, 2, and 4 represent the range modulos (“mods”) of prior virtual-index-epochs in table 700, the percent represents modulo, virtual-index-epoch 6 represents the current virtual-index-epoch, and the result (i.e., 11 in this example) is the logical partition.
In certain embodiments, structure 800 is implemented as a lookup table. In certain embodiments, the mapping from logical partition to physical partition is done using the group function. For example, for a document ID of 2000, the group function yields lookup(map(2000))=lookup(11)=3. Thus, the document with document ID 2000 is stored in physical partition 3.
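The two-step mapping can be sketched as follows. This is one reading of the description, not a reproduction of the embodiments: it assumes the logical partition is the sum of the moduli of all prior virtual-index-epochs plus the hash of the document ID modulo the modulus of the document's own epoch. The epoch bounds, moduli, and lookup table are hypothetical and do not reproduce table 700 or structure 800.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical epoch table (NOT table 700).
RANGE_MAX = [100, 500, 900]   # inclusive upper bounds of closed epochs 0..2
MODULUS = [1, 3, 2, 4]        # logical partitions per epoch 0..3 (3 is open)
# OFFSET[e] = total logical partitions in all epochs before epoch e.
OFFSET = [0] + list(accumulate(MODULUS))[:-1]

# Hypothetical group function (NOT structure 800): logical -> physical.
# Physical partitions are reused across epochs.
LOOKUP = {0: 0, 1: 0, 2: 1, 3: 2, 4: 1, 5: 2, 6: 0, 7: 1, 8: 2, 9: 3}


def identity_hash(x: int) -> int:
    return x  # mirrors the hash(x)=x example in the text


def map_fn(doc_id: int, hash_fn=identity_hash) -> int:
    """Map function: assigned-doc-ID -> logical partition (assumed form)."""
    epoch = bisect_left(RANGE_MAX, doc_id)
    return OFFSET[epoch] + hash_fn(doc_id) % MODULUS[epoch]


def group_fn(logical_partition: int) -> int:
    """Group function: logical partition -> physical partition."""
    return LOOKUP[logical_partition]


for doc_id in (50, 250, 900, 2000):
    logical = map_fn(doc_id)
    print(doc_id, logical, group_fn(logical))
```

Keeping the logical-to-physical step as a separate table is what allows physical partitions to be reused or re-placed without changing the map function.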
From block 910 (
In block 916, the index controller 140 transmits the optimized query to the index servers in the set (e.g., some subset of index servers 114, 124, 126, 134).
From block 920 (
From block 1110 (
From block 1122 (
In block 1206, the target index server determines whether this operation is an update operation. If so, processing continues to block 1208, otherwise, processing continues to block 1214. In block 1208, the target index server applies the update operation by first retrieving the associated document, then deleting and re-inserting the document with the appropriate field values updated. From block 1208, processing continues to block 1212. In block 1212, the target index server replies success or failure to the index controller 140. This is the response that the index controller 140 is waiting for in block 1124 (
In block 1214, the target index server performs other processing.
From block 1308 (
In block 1314, the trigger generator component 200 determines whether CPU usage at a node exceeds a node capability. If so, processing continues to block 1316, otherwise, processing continues to block 1318. In block 1316, the trigger generator component 200 performs the placement technique.
In block 1318, the trigger generator component 200 determines whether creation or query response times exceed response thresholds. If so, processing continues to block 1320, otherwise, processing is done. In block 1320, the trigger generator component 200 generates a throughput trigger.
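The decisions of blocks 1314 through 1320 can be sketched as a small function. The names (NodeMetrics, Action, evaluate) and the use of a single response threshold are assumptions made for illustration; the sketch also assumes the response-time check runs only on the "otherwise" branch of the CPU check, as the text reads.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RUN_PLACEMENT = "run placement technique"
    THROUGHPUT_TRIGGER = "generate throughput trigger"
    NONE = "no action"


@dataclass
class NodeMetrics:
    cpu_usage: float            # observed CPU usage at the node
    cpu_capability: float       # usage the node is deemed able to sustain
    create_response_ms: float   # observed creation response time
    query_response_ms: float    # observed query response time


def evaluate(metrics: NodeMetrics, response_threshold_ms: float) -> Action:
    # Block 1314/1316: CPU usage beyond node capability leads to placement.
    if metrics.cpu_usage > metrics.cpu_capability:
        return Action.RUN_PLACEMENT
    # Block 1318/1320: slow creation or query responses raise a throughput trigger.
    slowest = max(metrics.create_response_ms, metrics.query_response_ms)
    if slowest > response_threshold_ms:
        return Action.THROUGHPUT_TRIGGER
    return Action.NONE


print(evaluate(NodeMetrics(0.95, 0.80, 20.0, 30.0), response_threshold_ms=100.0))
print(evaluate(NodeMetrics(0.50, 0.80, 20.0, 250.0), response_threshold_ms=100.0))
```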
In certain embodiments, the index state maintained within the physical partition I[1],I[2] etc in the shared file system 320 in
In some embodiments, the placement technique performs node location and placement of physical partitions to index servers by computing an optimal placement that capitalizes on spare CPU and storage capacity. In certain embodiments, the placement technique may be a typical greedy bucket-assignment technique, with the assumption that actual remote placement infrastructure is available. This includes the notion that an index can be re-placed to a different location, if necessary.
The placement technique does not specifically consider node availability. However, it is assumed that one or more indexes hosted at a location can be recovered independently from any node within the cluster, using the past history in the index controller, which tracks all application operations in a persistent queue structure within the DBMS.
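One way a greedy bucket-assignment placement might look is sketched below in Python. This is not the placement technique of any particular embodiment: the `Node` and `Partition` types, the ordering by storage demand, and the tie-breaking rule are all assumptions made for illustration, and a greedy pass of this kind is a heuristic that is not guaranteed to find an optimal placement.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    name: str
    spare_cpu: float      # spare CPU capacity (arbitrary units)
    spare_storage: float  # spare storage capacity (arbitrary units)


@dataclass(frozen=True)
class Partition:
    name: str
    cpu: float      # estimated CPU demand
    storage: float  # estimated storage demand


def greedy_placement(partitions: List[Partition], nodes: List[Node]) -> Dict[str, str]:
    """Map each physical partition name to the name of the node hosting it."""
    # Track the remaining [cpu, storage] capacity per node as we assign.
    remaining = {n.name: [n.spare_cpu, n.spare_storage] for n in nodes}
    placement: Dict[str, str] = {}
    # Place the largest partitions first (assumed ordering).
    for part in sorted(partitions, key=lambda p: p.storage, reverse=True):
        fits = [name for name, (cpu, storage) in remaining.items()
                if cpu >= part.cpu and storage >= part.storage]
        if not fits:
            raise RuntimeError(f"no node has spare capacity for {part.name}")
        # Choose the node with the most spare storage left; ties go to the
        # node listed first.
        best = max(fits, key=lambda name: remaining[name][1])
        remaining[best][0] -= part.cpu
        remaining[best][1] -= part.storage
        placement[part.name] = best
    return placement
```

With nodes `A` (100 units of spare storage) and `B` (60 units) and partitions of 50, 40, and 30 units, this sketch places the 50-unit and 30-unit partitions on `A` and the 40-unit partition on `B`.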
The index controller 140 is started by hand-tooling the E0 virtual-index-epoch. This involves bringing up the index controller 140 with a fixed number of index instances (e.g., N0=1). That is, there exists a base virtual-index-epoch from which the index controller 140 boots up.
CRUD operations and queries rely on a consistent view of the map. For CRUD operations, this means determining the home location for a given assigned-doc-ID. This is done by first doing a range partition on the cutoff assigned-doc-ID to find the virtual-index-epoch, then applying the general partitioning function F(i) for that virtual-index-epoch. A general partitioning function can be specified in many ways that a person conversant in the art would know. For example, the partitioning function can be a simple modulo of the assigned-doc-ID by the number of logical indexes within the virtual-index-epoch. In other embodiments, it can be a user-defined method that applies a general hash function to a well-known data field from the document, other than the assigned-doc-ID, and takes a modulo by the number of logical indexes within the virtual-index-epoch to arrive at the specific logical index to which the document must be assigned.
The precise assignment may be obtained by way of the virtual-index-epoch map: the assigned-doc-ID is first mapped to a set of indexes, and the function is then applied as needed to obtain the physical partition within the index set.
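The two kinds of partitioning function mentioned above can be sketched as follows. The function names are hypothetical, and the use of CRC-32 as the "general hash function" is an assumption chosen only because it is deterministic across runs; any stable hash would serve the same illustrative purpose.

```python
import zlib


def modulo_partition(assigned_doc_id: int, logical_indexes: int) -> int:
    """Simple modulo of the assigned-doc-ID by the number of logical indexes."""
    if logical_indexes < 1:
        raise ValueError("a virtual-index-epoch needs at least one logical index")
    return assigned_doc_id % logical_indexes


def field_hash_partition(field_value: str, logical_indexes: int) -> int:
    """Hash a document field, then take a modulo by the number of logical indexes.

    CRC-32 stands in for the general hash function (an assumption).
    """
    if logical_indexes < 1:
        raise ValueError("a virtual-index-epoch needs at least one logical index")
    digest = zlib.crc32(field_value.encode("utf-8"))
    return digest % logical_indexes
```

For a virtual-index-epoch with four logical indexes, `modulo_partition(1003, 4)` selects logical index 3, while `field_hash_partition` sends every document that shares the same field value to the same logical index.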
In embodiments, the assigned-doc-ID comprises a non-reusable unique identifier that is a monotonically increasing number of sufficient precision. The HSIP 190 maintains a persistent, transactionally recoverable structure that stores the virtual-index-epoch map. The HSIP 190 dynamically maintains the virtual-index-epoch map to accommodate changes in the system capacity, modeled performance of a physical partition, and actual performance of a physical partition. The HSIP 190 maintains the virtual-index-epoch map by creating or deleting virtual-index-epoch numbers from the virtual-index-epoch map. The HSIP 190 includes rows in the virtual-index-epoch map based on the assigned-doc-ID for the document that triggered maintenance of the virtual-index-epoch map and columns based on a number of physical partitions deemed to be sufficient to meet the performance criteria.
The HSIP 190 optimizes a total number of the physical partitions by reusing the physical partitions. The one or more functions are either system determined or user specified, wherein the system determined functions are based on one of system capacity, modeled performance of a physical partition, and actual performance of the physical partition. Also, the determined cut-off comprises the current assigned-doc-ID plus a cushion, wherein the cushion is specified to provide a means to not block other CRUD and query sessions that can occur while the virtual-index-epoch transition is occurring.
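The cut-off computation can be illustrated with a small Python sketch. The representation of the virtual-index-epoch map as a list of `(first assigned-doc-ID, number of physical partitions)` rows is hypothetical, as are the function names, and the sketch assumes that the document whose assigned-doc-ID equals the cut-off is the first one placed in the new virtual-index-epoch; the description does not say whether the cut-off is inclusive.

```python
from typing import List, Tuple

# Hypothetical map row: (first assigned-doc-ID of the epoch, physical partitions).
EpochMap = List[Tuple[int, int]]


def compute_cutoff(current_doc_id: int, cushion: int) -> int:
    """Cut-off = current assigned-doc-ID plus a cushion."""
    if cushion < 0:
        raise ValueError("cushion must be non-negative")
    return current_doc_id + cushion


def add_epoch(epoch_map: EpochMap, current_doc_id: int,
              cushion: int, partitions: int) -> EpochMap:
    """Return a new map with one more virtual-index-epoch row appended."""
    cutoff = compute_cutoff(current_doc_id, cushion)
    if epoch_map and cutoff <= epoch_map[-1][0]:
        raise ValueError("cut-off must lie beyond the start of the current epoch")
    # IDs below the cut-off keep resolving to the current epoch, so sessions
    # already in flight during the transition are not blocked.
    return epoch_map + [(cutoff, partitions)]
```

Starting from a base map `[(0, 1)]`, a trigger at assigned-doc-ID 5000 with a cushion of 200 would append the row `(5200, 4)` for a new epoch of four physical partitions.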
Embodiments scale in capacity, improve availability, scale creation operations with search throughput, and improve manageability of large full text indexes. In certain embodiments, the index controller 140 partitions an index into one or more physical partitions, assigns a document to a two-dimensional map based on a document ID and a field value, and derives the field value from a set of functions (one or more functions), the set of functions being either system determined or user specified. In certain embodiments, the system determined functions are based on one of: system capacity, modeled performance of a physical partition, and actual performance of a physical partition. In certain embodiments, the index controller 140 assigns each physical partition to a node in a cluster. In certain embodiments, the index controller 140 assigns the document ID, which is a monotonically increasing number of sufficient precision. In certain embodiments, the index controller 140 maintains a persistent, transactionally recoverable structure that maintains the two-dimensional map. In certain embodiments, the index controller 140 dynamically maintains the map to accommodate changes in any of the following: system capacity, modeled performance of a physical partition, and actual performance of a physical partition. In certain embodiments, the index controller 140 maintains the map structure and the group structure by modifying, deleting or expanding the structure. In certain embodiments, the index controller 140 includes in the map structure rows based on the document ID for a current document that triggered maintenance of the map and columns based on a number of indexes deemed to be sufficient to meet the performance criteria.
In certain embodiments, the index controller 140 optimizes a total number of the indexes by reusing the indexes, from one column and one row in the map, in one or more additional columns or one or more additional rows in the map through system defined methods.
Embodiments partition the index from the beginning into a series of virtual-index-epochs, each virtual-index-epoch consisting of a range of documents based on monotonically-increasing document IDs. For each virtual-index-epoch, embodiments break the virtual-index-epoch into a variable number of logical partitions and assign each logical partition to a physical partition. Physical partitions (e.g., a single file system instance) can contain multiple logical partitions. The logical partitions for any one virtual-index-epoch can each be assigned to any arbitrary physical partition. The assignment of a logical partition to a physical partition is done to optimize storage performance, so that virtual-index-epochs could be distributed over multiple physical partitions to allow better concurrency, for instance. More specifically, embodiments optimize the resources used to maintain the virtual-index-epoch map. Each physical partition is a real physical partition taking up storage and operating system resources, such as descriptors, memory, etc. Embodiments provide efficient mapping from document ID to virtual-index-epoch, then to logical partition, then to physical partition, such that the full text search can be presented with a single virtual index.
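The chain from document ID to virtual-index-epoch, then to logical partition, then to physical partition can be sketched as below. The `EpochRow` type, the `route` function, and the partition names are hypothetical, and a simple modulo is assumed as the function that selects the logical partition.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EpochRow:
    first_doc_id: int                     # lowest document ID in this epoch
    logical_to_physical: Tuple[str, ...]  # physical partition per logical partition


def route(rows: List[EpochRow], doc_id: int) -> Tuple[int, int, str]:
    """Return (epoch number, logical partition, physical partition) for doc_id.

    rows must be sorted by first_doc_id.
    """
    # Range lookup: the last epoch whose first document ID is <= doc_id.
    starts = [row.first_doc_id for row in rows]
    epoch = bisect_right(starts, doc_id) - 1
    if epoch < 0:
        raise ValueError("document ID precedes the base virtual-index-epoch")
    row = rows[epoch]
    # Assumed first function: modulo over the epoch's logical partitions.
    logical = doc_id % len(row.logical_to_physical)
    # Second function: table lookup from logical to physical partition.
    return epoch, logical, row.logical_to_physical[logical]


rows = [
    EpochRow(0, ("I1",)),                      # base epoch, one partition
    EpochRow(1000, ("I1", "I2", "I3", "I2")),  # I1 and I2 are reused
]
```

Here `route(rows, 42)` resolves to physical partition `I1` in epoch 0, and `route(rows, 1003)` resolves to logical partition 3 of epoch 1, which shares physical partition `I2` with logical partition 1. The second row also shows a physical partition from an earlier epoch being reused.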
Embodiments provide a single index view for all applications. Embodiments dynamically or manually detect that a reconfiguration of scale is needed to address storage limits or insufficient throughput. Embodiments continue data creation via CRUD operations and sustain query processing while the reconfiguration occurs. Limited manual intervention is required. For example, manual intervention may occur when a manual trigger is needed and/or when additional machines and resources must be supplied. This keeps the total cost of ownership for such a reconfiguration very low.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, solid state memory, magnetic tape or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The code implementing the described operations may further be implemented in hardware logic or circuitry (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.).
Input/Output (I/O) devices 1612, 1614 (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers 1610.
Network adapters 1608 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters 1608.
The computer system 1600 may be coupled to storage 1616 (e.g., a non-volatile storage area, such as magnetic disk drives, optical disk drives, a tape drive, etc.). The storage 1616 may comprise an internal storage device or an attached or network accessible storage. Computer programs 1606 in storage 1616 may be loaded into the memory elements 1604 and executed by a processor 1602 in a manner known in the art.
The computer system 1600 may include fewer components than illustrated, additional components not illustrated herein, or some combination of the components illustrated and additional components. The computer system 1600 may comprise any computing device known in the art, such as a mainframe, server, personal computer, workstation, laptop, handheld computer, telephony device, network appliance, virtualization device, storage controller, etc.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of embodiments of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
The foregoing description of embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the embodiments to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the embodiments be limited not by this detailed description, but rather by the claims appended hereto. The above specification, examples and data provide a complete description of the manufacture and use of the composition of the embodiments. Since many embodiments may be made without departing from the spirit and scope of the embodiments, the embodiments reside in the claims hereinafter appended or any subsequently-filed claims, and their equivalents.
Li, Ning, Lindsay, Bruce Gilbert, Rajagopalan, Sridhar, Raphael, Roger C., Barber, Ronald Jason, Deshmukh, Harish, Shekita, Eugene J., Taylor, Paul Sherwood
Patent | Priority | Assignee | Title |
10686875, | Mar 14 2013 | Microsoft Technology Licensing, LLC | Elastically scalable document-oriented storage services |
Patent | Priority | Assignee | Title |
5745899, | Aug 09 1996 | R2 SOLUTIONS LLC | Method for indexing information of a database |
5778354, | Jun 07 1995 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | Database management system with improved indexed accessing |
7293016, | Jan 22 2004 | Microsoft Technology Licensing, LLC | Index partitioning based on document relevance for document indexes |
20050165750, | |||
20060123062, | |||
20080065596, | |||
20090019038, | |||
20090063396, | |||
20090177757, | |||
20090327312, | |||
20100030773, | |||
20110029524, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Aug 12 2010 | RAPHAEL, ROGER C | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 | |
Aug 13 2010 | TAYLOR, PAUL S | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 | |
Aug 15 2010 | SHEKITA, EUGENE J | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 | |
Aug 16 2010 | BARBER, RONALD J | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 | |
Aug 16 2010 | LINDSAY, BRUCE G | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 | |
Aug 17 2010 | LI, NING | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 | |
Aug 19 2010 | DESHMUKH, HARISH | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 | |
Sep 03 2010 | International Business Machines Corporation | (assignment on the face of the patent) | / | |||
Sep 03 2010 | RAJAGOPALAN, SRIDHAR | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025925 | /0920 |
Date | Maintenance Fee Events |
Oct 17 2017 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Oct 18 2021 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
Date | Maintenance Schedule |
May 27 2017 | 4 years fee payment window open |
Nov 27 2017 | 6 months grace period start (w surcharge) |
May 27 2018 | patent expiry (for year 4) |
May 27 2020 | 2 years to revive unintentionally abandoned end. (for year 4) |
May 27 2021 | 8 years fee payment window open |
Nov 27 2021 | 6 months grace period start (w surcharge) |
May 27 2022 | patent expiry (for year 8) |
May 27 2024 | 2 years to revive unintentionally abandoned end. (for year 8) |
May 27 2025 | 12 years fee payment window open |
Nov 27 2025 | 6 months grace period start (w surcharge) |
May 27 2026 | patent expiry (for year 12) |
May 27 2028 | 2 years to revive unintentionally abandoned end. (for year 12) |