A system and method for modifying or converting predicates involving a comparison with a string constant in a query is provided so that the predicates can be evaluated by an external or remote source that uses a different collating sequence, thereby effectively reducing the amount of data sent over a network. Techniques for predicate simplification are also provided for complex predicates. Although it is desirable that the converted predicate precisely match the original predicate, it is not necessary; an imprecise predicate that selects a superset of the target rows is sufficient. Thus, techniques for imprecise conversion are also provided. For an imprecise conversion, the original query is further evaluated against the obtained superset to obtain the final result, whereas for a precise conversion, the returned set is the required set, and no further evaluation is needed.
22. A heterogeneous database system, comprising:
a first data source using a first collating sequence (I1) having a first plurality of characters; a second data source using a second collating sequence (I2); a conversion table generator using a conversion rule generated by an operation of an operator, wherein said operator is one of less than (<), less than or equal to (≦), greater than (>), and greater than or equal to (≧), on a subject character, said subject character being one of the first plurality of characters; a conversion table generated by said conversion table generator; and a predicate converter for receiving a query including a source predicate (e1) for I1 and, using said conversion table, converting e1 to a target predicate (e2) for I2.
1. A method for use in a heterogeneous database environment including a first data source using a first collating sequence (I1) and a second data source using a second collating sequence (I2), comprising the steps of:
receiving a query including a source predicate (e1) for the first collating sequence I1 having a first plurality of characters; converting said source predicate e1 to a target predicate (e2) for the second collating sequence I2 using a conversion rule generated by an operation of an operator on a subject character, wherein said operator comprises one of less than (<), less than or equal to (≦), greater than (>), and greater than or equal to (≧), said subject character being one of the first plurality of characters; and submitting said target predicate e2 to the second data source.
16. A method for generating a conversion rule for a conversion table between a first collating sequence (I1) having a first plurality of characters, each having a respective weight, and a second collating sequence (I2) having a second plurality of characters, each having a respective weight, comprising the steps of:
for a subject character operated on by an operator, said subject character being one of the first plurality of characters and said operator being one of less than (<), less than or equal to (≦), greater than (>), and greater than or equal to (≧), determining which ones of the first plurality of characters in the first collating sequence have a weight that satisfies said operator operating on said subject character; detecting the characters in the second collating sequence that correspond to the first plurality of characters that have been determined to have a weight that satisfies said operator operating on said subject character; and determining at least one interval of the detected characters, the conversion rule being responsive to said at least one interval.
2. The method as recited in
3. A computer readable storage medium on which is stored a conversion table for use in the method of
4. A conversion table as recited in
5. The conversion table as recited in
6. The method as recited in
7. The method as recited in
8. The method as recited in
obtaining a set of intervals in said target predicate e2; and combining said intervals into at least one interval in accordance with an intersection or union of said intervals.
9. The method as recited in
detecting at least one external constraint at the second data source; and modifying said target predicate e2 responsive to said at least one external constraint.
10. The method as recited in
converting said target predicate e2 to conjunctive normal form; eliminating redundant atomic predicates from said target predicate e2; converting said target predicate e2 to disjunctive normal form; deleting contradictory terms and empty ranges from said target predicate e2; and converting said target predicate e2 to conjunctive normal form.
11. The method as recited in
12. The method as recited in
13. The method as recited in
14. The method as recited in
15. The method as recited in
17. The method as recited in
18. The method as recited in
20. The method as recited in
determining which of said characters in the first collating sequence have the same weight as said subject character; and prior to determining said at least one interval, detecting the characters in the second collating sequence that correspond to the first plurality of characters that have been determined to have the same weight as said subject character.
21. The method as recited in
detecting each character in the second collating sequence that has the same primary weight as the character that corresponds to said subject character; determining sequences of adjacent detected characters at the second collating sequence; determining which of said detected characters has a lowest primary weight; determining which of said detected characters has a highest primary weight; determining at least one interval of the characters of the second collating sequence having said lowest primary weight character and said highest primary weight character as a lowest and a highest endpoint, respectively; determining if there is more than one character in each said interval that has said lowest primary weight, and if so, changing said lowest endpoint to a character having a next lowest primary weight; and determining if there is more than one character in each said interval that has said highest primary weight, and if so, changing said highest endpoint to a character having a next highest primary weight.
23. The database system as recited in
25. The database system as recited in
26. The database system as recited in
This invention relates in general to database management systems performed by computers. More particularly, this invention relates to efficient processing of database queries involving selection against heterogeneous data.
Heterogeneous data access is becoming an important feature in database systems because it facilitates the integration of information from a variety of data sources. The performance of queries involving external data sources is greatly affected by the cost of transporting data over the network. This cost can be reduced by pushing down subqueries, particularly selections and projections, to the external data source. However, if the external data source uses a different collating sequence, predicates consisting of comparison of a column (also known as a field) with a string constant cannot be pushed down unchanged. A "predicate" is a qualifier or condition for a search.
A collating sequence is essentially the computer equivalent of alphabetical order. One typical collating sequence is the order of characters based on their binary values in ASCII (American Standard Code for Information Interchange), and another is the order of characters based on their binary values in EBCDIC (Extended Binary Coded Decimal Interchange Code). Although these two sequences have similar characters or collating elements, their collating sequences (sort orders) are significantly different, as shown in FIG. 1. For example, both collating sequences contain all the capital and lower case letters of the (English) alphabet, but in ASCII capital letters sort before lower case letters, whereas in EBCDIC this is reversed. Also, in ASCII digits sort before letters, whereas in EBCDIC digits sort after letters.
A collating sequence is separate from a character (code) set: two countries or languages may use the same character set but have different collating sequences. A list of different language identifiers having similar character sets but different collating sequences is shown in Table 1. The purpose of a collating sequence is to define a total order on character strings. This is done by means of weights assigned to collating elements, which most often correspond to a single character but may consist of a sequence of characters. For example, in Spanish, the sequence "ll" is treated as a single collating element. In the simplest case, each character is assigned a unique weight. To determine the ordering of two strings, the weights of the characters are compared proceeding from left to right. If the two strings compare equal on the first level weights, they are compared on second level weights (if any), then on third level weights, and so on.
TABLE 1
Language Name              Language String
Czech                      "czech"
Danish                     "danish"
Dutch (Belgian)            "dutch-belgian"
Dutch (default)            "dutch"
English (Australian)       "australian"
English (Canadian)         "canadian"
English (default)          "english"
English (New Zealand)      "english-nz"
English (UK)               "english-uk"
English (USA)              "american-english"
Finnish                    "finnish"
French (Belgian)           "french-belgian"
French (Canadian)          "french-canadian"
French (default)           "french"
French (Swiss)             "french-swiss"
German (Austrian)          "german-austrian"
German (default)           "german"
German (Swiss)             "swiss"
Hungarian                  "hungarian"
Icelandic                  "icelandic"
Italian (default)          "italian"
Italian (Swiss)            "italian-swiss"
Norwegian (Bokmal)         "norwegian-bokmal"
Norwegian (default)        "norwegian"
Norwegian (Nynorsk)        "norwegian-nynorsk"
Polish                     "polish"
Portuguese (Brazilian)     "portuguese-brazilian"
Portuguese (default)       "portuguese"
Spanish (default)          "spanish"
Spanish (Mexican)          "spanish-mexican"
Spanish (Modern)           "spanish-modern"
Swedish                    "swedish"
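The multi-level, weight-based string comparison described above can be sketched as follows. The weight table here is a toy example of our own, not an actual locale definition; each collating element maps to a (primary, secondary) weight pair, and two strings are compared on all primary weights before any secondary weights are consulted:

```python
# Hypothetical two-level weight table: upper and lower case share a
# primary weight and differ only at the secondary level.
WEIGHTS = {
    "a": (1, 0), "A": (1, 1),
    "b": (2, 0), "B": (2, 1),
}

def collation_key(s):
    # Compare all primary weights first, then all secondary weights.
    primaries = tuple(WEIGHTS[c][0] for c in s)
    secondaries = tuple(WEIGHTS[c][1] for c in s)
    return (primaries, secondaries)

def collate_less(s1, s2):
    return collation_key(s1) < collation_key(s2)
```

Under this table, "ab" and "AB" tie on primary weights and are ordered by their secondary weights, mirroring the level-by-level procedure described in the text.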
Many database systems, including SQL Server, support queries over heterogeneous data sources. If the system detects that some tables referenced in a query are managed by external data sources, the system decomposes the query into a set of single-source queries, submits them, and performs any additional processing needed to integrate the data returned from the local queries. External data sources differ greatly in their query capabilities and interfaces, ranging from simple one-table sources with no query capability to full-fledged SQL database systems. Many aspects of heterogeneity in multidatabase systems, such as data model differences, naming differences, format differences, structure differences, and conflicting data, have been studied and are described in Bright, M. W. et al., "A taxonomy and current issues in multidatabase systems", IEEE Computer, March 1992, pages 50-60.
Many data sources have selection (i.e., filtering) and projection capability. A query optimizer decomposes the query and generates an overall, efficient plan. Often this means delegating as much as possible of the processing to the external data sources, taking into account their processing capabilities. In particular, pushing down selections and projections to an external data source is usually desirable because it reduces the amount of data shipped over the network. However, if the external data source uses a different collating sequence, predicates involving string comparisons cannot be pushed down unchanged. In practice, systems do not push down such predicates, thereby resulting in a loss of efficiency.
For example, consider a SQL Server running on a Windows NT system that uses the (binary) ASCII collating sequence (WINDOWS NT® is a registered trademark of Microsoft Corp.). Among other things, it provides access to a course table that is stored in a DB2 database running on a mainframe system that uses the (binary) EBCDIC collating sequence. (DB2 is a trademark of International Business Machines Corp.) Suppose SQL Server receives the query "select CourseNo from course where CourseNo is between `CH020` and `CS499`", that the table has 1000 records, and that the query selects 100 of them. If the selection predicate can be pushed down and evaluated at the source, the data communication cost would be reduced by 90% because only the data satisfying the selection predicate would need to be transferred to SQL Server.
However, the two systems use different collating sequences, and so the query cannot just be pushed unchanged. Doing so would produce the wrong result. As shown in
Many atomic SQL predicates are independent of the collating sequence; this is true for operators such as IS NULL and IS NOT NULL. Similarly, comparisons of numeric data types (integer, float, decimal) and dates are independent of the collating sequence. BETWEEN and NOT BETWEEN are simply shorthand notation for two comparisons. IN and NOT IN followed by a set of string constants can be transformed into a set of predicates with equals (=) and not equal (<>) operators. Therefore, the predicates requiring conversion are mainly string comparisons using one of the operators <, ≦, >, and ≧. Predicates with the operators = and <> may require conversion when some characters have the same weight in the collating sequence.
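The reduction described above, under which BETWEEN and IN are rewritten in terms of the basic comparison operators, can be sketched as follows. The tuple representation and function name are illustrative assumptions of ours, not part of the patent:

```python
# Sketch: rewriting BETWEEN and IN predicates into the atomic comparison
# operators that the conversion rules handle. A predicate is modeled as
# a nested tuple, e.g. ("BETWEEN", column, low, high).
def rewrite(pred):
    op = pred[0]
    if op == "BETWEEN":                      # x BETWEEN lo AND hi
        _, col, lo, hi = pred
        return ("AND", (">=", col, lo), ("<=", col, hi))
    if op == "IN":                           # x IN (c1, c2, ...)
        _, col, consts = pred
        terms = [("=", col, c) for c in consts]
        out = terms[0]
        for t in terms[1:]:
            out = ("OR", out, t)
        return out
    return pred                              # <, <=, >, >=, =, <> pass through
```

For instance, `rewrite(("BETWEEN", "CourseNo", "CH020", "CS499"))` yields a conjunction of a ≧ comparison and a ≦ comparison, each of which can then be converted individually.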
Although the art of database query processing is well developed, there remain some problems inherent in this technology, particularly sending a query between databases having different collating sequences. Therefore, a need exists for compensating for differences in collating sequences using predicate conversion that overcomes the drawbacks of the prior art.
The present invention is directed to a method for use in a heterogeneous database environment including a first data source using a first collating sequence (I1) and a second data source using a second collating sequence (I2). The method comprises the steps of: receiving a query including a source predicate (e1) for the first collating sequence I1; converting the source predicate e1 to a target predicate (e2) for the second collating sequence I2; and submitting the target predicate e2 to the second data source.
According to one aspect of the present invention, the source predicate e1 is converted to the target predicate e2 in accordance with a conversion table.
According to another aspect of the present invention, the method further comprises the step of simplifying the target predicate e2 prior to submitting the converted query to the second data source. The step of simplifying comprises at least one of the steps of combining range values, exploiting domain constraints, and reducing to prime implicants. Combining range values comprises the steps of obtaining a set of intervals defined in the target predicate e2, and combining the set of intervals into at least one larger interval by taking the union of some of the intervals. Exploiting domain constraints comprises the steps of: detecting at least one external constraint at the second data source; and modifying the target predicate e2 taking into account this constraint. Reducing to prime implicants comprises the steps of: converting the target predicate e2 to conjunctive normal form; eliminating redundant atomic predicates from the target predicate e2; converting the target predicate e2 to disjunctive normal form; deleting contradictory terms and empty ranges from the target predicate e2; and converting the target predicate e2 to conjunctive normal form.
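The step of combining range values can be sketched as follows. The interval representation is an assumption of ours, and a real implementation would compare weights in the target collating sequence rather than raw Python string order; this illustrates only the union-of-intervals idea:

```python
# Sketch: merging a set of overlapping intervals into larger intervals
# by taking their union. Intervals are (low, high) pairs, inclusive.
def combine_intervals(intervals):
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```

A fuller version would also merge intervals whose endpoints are adjacent in the target collating sequence, since no character lies between them.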
According to another aspect of the present invention, each of the first and second collating sequences comprises either ASCII or EBCDIC. (This is just an example, not a condition for predicate conversion. The source and target collating sequences are not limited to ASCII and EBCDIC.)
According to a further aspect of the present invention, the step of converting comprises a step of imprecise conversion of the source predicate e1. According to one method of imprecise conversion, the source predicate e1 comprises a plurality of predicates, and the step of imprecise converting comprises discarding at least one of the plurality of predicates. According to another method of imprecise converting, the source predicate e1 comprises a string of characters, and the step of imprecise conversion comprises truncating the string of characters. According to another method of imprecise conversion, the source predicate e1 comprises a plurality of ranges, and the step of imprecise conversion comprises merging at least two of the ranges into a larger range covering the two input ranges.
Another embodiment within the scope of this invention includes a conversion table for use in the method described above. Preferably, the conversion table comprises a plurality of rules for converting characters from the first collating sequence I1 operated on by one of a plurality of operators to the second collating sequence I2, such that the query comprising the source predicate e1 can be converted to the target predicate e2 by applying the rules. The operators comprise less than (<), less than or equal to (≦), greater than (>), and greater than or equal to (≧). Each of the rules is either precise or imprecise.
Another embodiment within the scope of this invention includes a method for generating a conversion rule for a conversion table between a first collating sequence (I1) having a first plurality of characters, each having a respective weight, and a second collating sequence (I2) having a second plurality of characters, each having a respective weight. The method comprises the steps of: for a subject character operated on by an operator, the subject character being one of the first plurality of characters and the operator being one of less than (<), less than or equal to (≦), greater than (>), and greater than or equal to (≧), determining which ones of the first plurality of characters in the first collating sequence have a weight that satisfies the operator operating on the subject character; detecting the characters in the second collating sequence that correspond to the first plurality of characters that have been determined to have a weight that satisfies the operator operating on the subject character; and determining at least one interval of the detected characters, where the conversion rule is responsive to the at least one interval.
According to one aspect of the present invention, the method further comprises the step of OR'ing each interval of the at least one interval together to form the conversion rule.
According to another aspect of the present invention, each respective weight in the first collating sequence is different and each respective weight in the second collating sequence is different.
According to another aspect of the present invention, each respective weight consists of only one level weight.
According to further aspects of the present invention, at least two respective weights in one of the first collating sequence and the second collating sequence are identical, and the method further comprises the steps of determining which of the characters in the first collating sequence have the same weight as the subject character; and prior to determining the at least one interval, detecting the characters in the second collating sequence that correspond to the first plurality of characters that have been determined to have the same weight as the subject character.
According to further aspects of the present invention, each respective weight comprises at least a primary weight and a secondary weight, and the method further comprises the steps of: detecting each character in the second collating sequence that has the same primary weight as the character that corresponds to the subject character; determining sequences of adjacent detected characters at the second collating sequence; determining which of the detected characters has a lowest primary weight; determining which of the detected characters has a highest primary weight; determining at least one interval of the characters of the second collating sequence having the lowest primary weight character and the highest primary weight character as a lowest and a highest endpoint, respectively; determining if there is more than one character in each such interval that has the lowest primary weight, and if so, changing the lowest endpoint to a character having a next lowest primary weight; and determining if there is more than one character in each such interval that has the highest primary weight, and if so, changing the highest endpoint to a character having a next highest primary weight.
Another embodiment within the scope of this invention includes a heterogeneous database system, comprising: a first data source using a first collating sequence (I1); a second data source using a second collating sequence (I2); a conversion table generator; a conversion table generated by the conversion table generator; and a predicate converter for receiving a query including a source predicate (e1) for I1 and, using the conversion table, converting e1 to a target predicate (e2) for I2.
According to one aspect of the present invention, the database system further comprises a predicate simplifier for simplifying the target predicate e2.
According to another aspect of the present invention, the first collating sequence I1 and the second collating sequence I2 each comprise a plurality of characters, each character having a weight, at least two of the characters in at least one of the first and second collating sequences having identical weights.
According to a further aspect of the present invention, the first collating sequence I1 and the second collating sequence I2 each comprise a plurality of characters, each character having multiple weights.
The foregoing and other aspects of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.
When a database system retrieves data from a remote or external data source, pushing down evaluation of selection predicates to the external data source, provided that it has selection capability, can reduce communication costs. However, if the systems use different collating sequences, predicates involving string comparisons cannot be pushed down unchanged. The present invention is directed to a system and method for modifying or converting predicates in a query so that they can be pushed down to an external or remote source, thereby effectively reducing the amount of data sent over the network. More specifically, the present invention is directed to a system and method for converting predicates containing a comparison of a column or field with a string constant. The resulting predicate may be complex, so techniques for predicate simplification are also described herein. Although it is desirable that the converted predicate precisely match the original predicate, it is not necessary; an imprecise predicate that selects a superset of the target rows is sufficient. Thus, techniques for imprecise conversion are also described herein. For an imprecise conversion, the original query is further evaluated against the obtained superset to obtain the final result, whereas for a precise conversion, the returned set is the required set, and no further evaluation is needed. The present invention reduces data communication cost without adding much additional work to the remote database system (also referred to herein as an external data source).
Computer Environment
Database systems are implemented on a computer. FIG. 2 and the following discussion are intended to provide a brief general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described in the general context of computer-executable instructions, such as program modules, being executed by a computer, such as a client workstation or a server. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
As shown in
A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37 and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.
The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 2. The logical connections depicted in
When used in a LAN networking environment, the personal computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
Network Environment
As noted, the computer described above can be deployed as part of a computer network. In general, the above description applies to both server computers and client computers deployed in a network environment.
The exemplary system of
The presently preferred methods for predicate conversion will now be explained in greater depth.
Predicate Conversion for Heterogeneous Database System
Given two collating sequences, the present invention is directed to generating a table that contains a set of conversion rules. Predicates consisting of a comparison between a column or a field and a single character can be converted by directly applying the rules in the conversion table. This technique is extended to convert predicates containing longer character strings.
As described herein, it is assumed that the collating sequence at the remote or external data source includes at least all the characters in the collating sequence at the query site. In the description herein, it is assumed that the query site uses an ASCII collating sequence and the external data source uses an EBCDIC collating sequence, although any collating sequence can be used at the query site and the external data source. Also, for ease of description, it is assumed that each character has a single weight equal to its code point (numeric value) and that source predicates are in conjunctive normal form (CNF).
With respect to the first embodiment described herein, it is assumed that each character has a unique weight in each of the two collating sequences. The present invention achieves a general solution that works for any pair of collating sequences by a rule-based or table-driven design that separates the mechanics of predicate conversion from the conversion rules.
It should be noted that before a conversion takes place, the characters in the remote collating sequence are initialized by removing any marking or flagging they may have received in prior conversions. After the characters in the remote collating sequence are initialized, each character in the source collating sequence that has a weight that is lower than the subject character weight (the weight of the character or sort element being searched) is determined at step 100. The characters in the remote collating sequence that correspond to the characters having a lower weight from step 100 are marked or flagged at step 110. Intervals consisting of adjacent marked characters in the remote collating sequence are generated at step 120. Two characters being adjacent in a collating sequence means there is no character between the two characters. More precisely, there is no character having a weight that is between the weights of the two characters. The intervals generated at step 120 are OR'ed together to make up the rule for the subject character or sort element for the < operator, and are placed into a conversion table.
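The marking procedure of steps 100 through 120 can be sketched as follows, with short toy alphabets standing in for the full ASCII and EBCDIC tables; characters are assumed to have a single weight equal to their position in each list:

```python
# Sketch of rule generation for the < operator: mark the target
# characters whose source counterparts sort below the subject character,
# then collect runs of adjacent marked characters into intervals.
def rule_less_than(source_order, target_order, subject):
    below = set(source_order[:source_order.index(subject)])   # step 100
    marked = [c in below for c in target_order]               # step 110
    intervals, run = [], None                                 # step 120
    for c, m in zip(target_order, marked):
        if m:
            run = (run[0], c) if run else (c, c)
        elif run:
            intervals.append(run)
            run = None
    if run:
        intervals.append(run)
    return intervals   # OR'ed together, these form the rule

# Toy example: the source sorts digits before letters; the target
# reverses that, so the rule splits into two intervals.
print(rule_less_than(list("0123ABC"), list("ABC0123"), "B"))
# → [('A', 'A'), ('0', '3')]
```

The rules for ≦, >, and ≧ follow the same pattern, differing only in which source characters are selected at step 100.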
Similar procedures are used to determine the intervals and rules for the other three operators ≦, >, ≧, and are shown in
Thus, for example with respect to the rules of
For the predicate x<`t`, the intervals are [", `+`] [`&`, `r`] `s` [`[`, `]`] [`A`, `I`] [`J`, "], where [`&`, `r`] represents the characters between `&` and `r` in the EBCDIC sequence, including `&` and `r`; `s` represents the single character `s`; and " represents the beginning or the end of a collating sequence. Thus, for the predicate x<`t`, the characters in the remote collating sequence that satisfy this query (i.e., the characters in the remote collating sequence that correspond to the characters in the query site collating sequence that satisfy the query) are in the intervals (1) between the start of the remote collating sequence and `+`, (2) between `&` and `r`, (3) `s`, (4) between `[`and`]`, (5) between `A` and `I`, and (6) between `J` and the end of the remote collating sequence, inclusive. The characters in these intervals are OR'ed together to make up the rule for the x<`t` predicate, which is placed into a conversion table.
For the predicate x≦`t`, the intervals are [", `+`] [`&`, `r`] [`s`, `t`] [`[`, `]`] [`A`, `I`] [`J`, "].
For the predicate x>`t`, the intervals are `|``∼`[`u`, `z`] `{"}`.
For the predicate x≧`t`, the intervals are `|` `∼` [`t`, `z`] `{"}`.
Similarly, the rules for converting `3` and `B` are given below. "3"
For the predicate x<`3`, the intervals are [``,`.`] [`(`, `+`] [`&`, `)`] [`-`,`%`] `#``" "" [`0`, `2`].
For the predicate x≦`3`, the intervals are [``, `.`] [`(`, `+`] [`&`, `)`] [`-`, `%`] `#` `" "" [`0`, `3`].
For the predicate x>`3`, the intervals are `<` `|` `;` [`_`, `:`] `@` `=` [`a`, `Z`] [`4`, "].
For the predicate x≧`3`, the intervals are `<` `|` `;` [`_`, `:`] `@` `=` [`a`, `Z`] [`3`, "]. "B"
For the predicate x<`B`, the intervals are [", `+`] [`&`, `%`] [`>`, `?`] [`:`, ""] `A` ]`0`, "].
For the predicate x≦`B`, the intervals are [", `+`] [`&`, `%`] [`>`, `?`] [`:`, ""] `A`, `B`] [`0`, "].
For the predicate x>`B`, the intervals are `|` `_` ``` [`a`, ` {`] [`C`, `Z`].
For the predicate x≧`B`, the intervals are `|` `_` ``` [`a`, ` {`] [`C`, `Z`].
Thus, the marking method described above with respect to
Although the above description has been directed to each sorting element or character consisting of only one character, the description can be extended to the more complicated scenario in which some sort elements contain multiple characters, such as "ll" in Spanish. Such an element is preferably treated as a single character in the collation and has its own conversion rules. During conversion, a look-ahead mechanism is preferably implemented to recognize such elements.
If the predicate to be converted is a comparison with a single-character string, the converted predicate can be constructed by the predicate converter 80 directly from the corresponding rule in the conversion table, by OR'ing together a comparison for each interval. For example, the predicate x<`t` is converted into:
(x ≦ `+`) OR
(x ≧ `&` AND x ≦ `r`) OR
(x = `s`) OR
(x ≧ `[` AND x ≦ `]`) OR
(x ≧ `A` AND x ≦ `I`) OR
(x ≧ `J`).
Comparisons with longer string constants can also be converted exactly but the resulting expression may be quite complex. The basic idea is described herein by way of an example. Consider the predicate x<`3Bt`. It is first transformed into a longer but equivalent expression:
(x < `3Bt`) =
(x < `3`) OR
(x ≧ `3` AND x < `3B`) OR
(x ≧ `3B` AND x < `3Bt`).
The first line is a single character comparison that can be converted directly from the conversion table. The second line covers the range [`3`, `3B`), that is, the first character must be equal to `3` and the second character must be in the range [", `B`). This is referred to as a prefixed range and denoted `3`+[", `B`). The third line is similar and specifies the prefixed range `3B`+[", `t`). These two lines can be converted using the conversion rules for `B` and `t` by adding the prefix `3` and `3B`, respectively, in front of each character in the conversion rules. The resulting predicate follows.
(x ≦ `.`) OR
(x ≧ `(` AND x ≦ `+`) OR
(x ≧ `&` AND x ≦ `)`) OR
(x ≧ `-` AND x ≦ `%`) OR
(x = `#`) OR
(x = `'`) OR
(x = `"`) OR
(x ≧ `0` AND x ≦ `2`) OR
(x ≦ `3+`) OR
(x ≧ `3&` AND x ≦ `3%`) OR
(x ≧ `3>` AND x ≦ `3?`) OR
(x ≧ `3:` AND x ≦ `3"`) OR
(x = `3A`) OR
(x ≧ `30`) OR
(x ≦ `3B+`) OR
(x ≧ `3B&` AND x ≦ `3Br`) OR
(x = `3Bs`) OR
(x ≧ `3B[` AND x ≦ `3B]`) OR
(x ≧ `3BA` AND x ≦ `3BI`) OR
(x ≧ `3BJ`).
Note that a prefixed range is converted as a whole and is not converted as two separate string comparisons. In this way, a string comparison is converted into a target predicate for use at the external data source that will produce exactly the same result. Thus, a comparison with a long string can be converted to a comparison with a single-character string and a set of prefixed range predicates.
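The decomposition into a single-character comparison plus a chain of prefixed ranges can be sketched as follows; the function name and the tuple encoding are illustrative assumptions:

```python
def decompose_less_than(s):
    """Split x < s into one single-character comparison plus one
    prefixed range per remaining character, as in the x < `3Bt`
    example: (x < `3`) OR `3`+[", `B`) OR `3B`+[", `t`)."""
    parts = [('<', s[0])]                    # x < first character
    for i in range(1, len(s)):
        # prefix s[:i] is fixed; the next character ranges over [", s[i])
        parts.append(('prefixed_range', s[:i], s[i]))
    return parts

print(decompose_less_than('3Bt'))
```

Each `prefixed_range` part would then be converted with the single-character rule for its upper character, prepending the prefix to every character in that rule, as described above.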
The converted predicates may be very long and complicated. For example, the comparison x<`t` is converted into nine atomic predicates. A comparison with a long string will produce many more atomic predicates. Very complex predicates increase the optimization and processing cost and, in the worst case, may cause the query to fail (by exceeding some internal limit at the remote database system). To avoid this, it is desirable to simplify the predicates via the predicate simplifier 85. Simplification may be applied before the conversion, after the conversion, or both.
Predicate Simplification
Predicate simplification can be applied using any of the known techniques including: combining value ranges as shown in the flowchart of
A first technique for simplification is combining value ranges as shown in the flowchart of FIG. 6A. Conversion of a string comparison generates many range predicates. These predicates are obtained at step 210. If the source predicate contains several predicates comparing the same column with different string constants, the conversion will generate many range predicates that are redundant or conflicting and can be eliminated. For example, converting the predicate x>`3` AND x<`B` results in
(x = `<` OR
x = `|` OR
x = `;` OR
(x ≧ `_` AND x ≦ `:`) OR
x = `@` OR
x = `=` OR
(x ≧ `a` AND x ≦ `Z`) OR
x ≧ `4`)
AND
(x ≦ `+` OR
(x ≧ `&` AND x ≦ `%`) OR
(x ≧ `>` AND x ≦ `?`) OR
(x ≧ `:` AND x ≦ `"`) OR
x = `A` OR
x ≧ `0`)
To determine the value ranges that satisfy both sets of comparisons, for each Boolean factor, a list is built, at step 215, containing its allowable intervals, i.e., each element on the list specifies the lower and upper bound of a value range. For the present example, the first list consists of [`<`, `<`], [`|`, `|`], [`;`, `;`], [`_`, `:`], [`@`, `@`], [`=`, `=`], [`a`, `Z`], [`4`, "].
At step 220, the intersection of the intervals is determined; i.e., "multiply" the two lists by computing the intersection of every pair of intervals, one from each of the two lists. If the resulting interval is not empty, output a new element with the lower and upper bound of the intersection, at step 225.
If the predicate contains more than two Boolean factors, repeatedly multiply the result with the next factor. Applying this method to the predicate above reduces it to
x = `<` OR
x = `;` OR
x = `:` OR
(x ≧ `>` AND x ≦ `?`) OR
x = `=` OR
x = `A` OR
x ≧ `4`
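The list-multiplication step (steps 220 and 225) can be sketched as follows. The function name is illustrative, `ord` stands in for the remote collation weight, and the interval lists are toy examples rather than the EBCDIC lists above:

```python
def intersect_interval_lists(a, b, weight=ord):
    """'Multiply' two lists of closed intervals: keep every non-empty
    pairwise intersection, one interval from each list."""
    out = []
    for lo1, hi1 in a:
        for lo2, hi2 in b:
            lo = max(lo1, lo2, key=weight)   # tighter lower bound
            hi = min(hi1, hi2, key=weight)   # tighter upper bound
            if weight(lo) <= weight(hi):     # step 225: keep non-empty
                out.append((lo, hi))
    return out

# Toy interval lists in plain ASCII order.
print(intersect_interval_lists([('0', '4'), ('8', '9')], [('2', '8')]))
```

With more than two Boolean factors, the result would repeatedly be multiplied with the next factor's list, as described above.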
A second technique for simplification is exploiting domain constraints as shown in the flowchart of FIG. 6B. The external data source may include constraints restricting the domain of the column or field involved in a string comparison. It is determined, at step 230, if there are any constraints at the external data source. If not, this simplification method is not used, and processing exits at step 232. If there are constraints at the external data source, the constraints are added to the converted predicate at step 235, and the method of combining ranges described above is used, at step 240, to further simplify the predicate.
Assume that the external data source has a constraint on column x that restricts its values to the range [`0`,`8`]. If this constraint is added to the converted predicate and simplified as described above, the result shown below is obtained.
(x = `<` OR
x = `;` OR
x = `:` OR
(x ≧ `>` AND x ≦ `?`) OR
x = `=` OR
x = `A` OR
x ≧ `4`)
AND
(x ≧ `0` AND x ≦ `8`)
=
(x ≧ `4` AND x ≦ `8`)
Taking into account that column values are restricted to the range [`0`,`8`], the predicate can be further simplified to x≧`4`.
A third technique for simplification is reducing to prime implicants as shown in the flowchart of
Given an un-normalized predicate, it is first converted to CNF, at step 250. Each Boolean factor now consists of a set of atomic predicates connected by OR. Atomic predicates that are subsumed by other atomic predicates are eliminated in this factor, at step 255. For example, in x>`s` or x>`b`, the term x>`s` is subsumed by x>`b` and can be deleted. The predicate is then converted to DNF, at step 260. In DNF, contradictory terms (empty ranges) are detected and eliminated, at step 265. Finally, the result is converted back to CNF, at step 270.
For example, shown below is a predicate that is initially in CNF. Viewed in this form, the predicate does not contain any obvious redundancies.
(x>`3`) AND (x<`B`) AND
(y=`Part1` OR x>`Part4`)
Once it is converted to DNF, the second term is revealed to contain a contradiction, (x<`B` AND x>`Part4`), and can be eliminated. This results in the simple predicate on the last line (which happens to be in both DNF and CNF).
(x>`3` AND x<`B` AND y=`Part1`) OR
(x>`3` AND x<`B` AND x>`Part4`)
=
(x>`3` AND x<`B` AND y=`Part1`)
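The contradiction test applied to each DNF term can be sketched as follows. The function name and the (operator, value) encoding are illustrative, and plain string order stands in for the collating sequence:

```python
def conjunction_is_empty(atoms):
    """Detect a contradictory DNF term such as
    (x < `B` AND x > `Part4`). Atoms are (op, value) pairs with op in
    {'<', '<=', '>', '>='}, all on the same column."""
    lo = hi = None
    lo_strict = hi_strict = False
    for op, v in atoms:
        if op in ('>', '>='):
            if lo is None or v > lo:
                lo, lo_strict = v, op == '>'
        else:
            if hi is None or v < hi:
                hi, hi_strict = v, op == '<'
    if lo is None or hi is None:
        return False                 # one-sided range: never empty
    if lo > hi:
        return True                  # lower bound above upper bound
    return lo == hi and (lo_strict or hi_strict)

print(conjunction_is_empty([('>', '3'), ('<', 'B'), ('>', 'Part4')]))
```

Applied to the example above, the second DNF term is reported empty, leaving only the first term.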
Imprecise Conversion
The embodiments described thus far have been directed to exact or precise conversion; that is, the converted predicate selects exactly the same rows or elements as the original predicate. Unfortunately, precise conversion may lead to very complex and time-consuming predicates. This may be undesirable in certain systems and situations. To overcome this, imprecise conversion can be used; that is, converting a predicate to a simple predicate that returns a superset of the desired rows (i.e., all the desired rows and possibly some additional rows), then evaluating the original predicate at the query site to filter out the superfluous rows. It should be noted that imprecise conversion here refers to the creation of imprecise predicates that are applied at the remote system. Imprecise conversion keeps the converted predicate simple while still reducing the amount of data shipped. However, it has the drawback that the original predicate still has to be evaluated at the receiving site, and all the columns involved in that predicate have to be shipped. The usefulness of this approach depends on the incremental data reduction caused by the imprecise predicate: the number of rows returned when the imprecise predicate is included, divided by the number of rows returned when it is not included.
There are many ways to construct an imprecise predicate. The objective is to construct a predicate that is simple but still provides high incremental data reduction. Methods for imprecise conversion include: discarding predicates, truncating long strings, and merging ranges.
A first method for imprecise conversion involves discarding predicates. If the predicate to be converted causes little incremental data reduction, one course of action is to simply drop it.
A second method for imprecise conversion involves truncating long strings. Conversion of a comparison with a long string normally produces a predicate with many atomic predicates. To reduce the complexity of the converted predicates, the string can be replaced with a shorter one before conversion. For example, the predicate x<`3w0vksdfj543kjjgn54` can be replaced by x<`3w1` or x<`3x` before conversion. If the comparison operator is > or ≧, the string can be truncated. If the operator is < or ≦, the last character of the truncated string is rounded up, that is, replaced by the next higher character in the source collating sequence.
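The truncation step can be sketched as follows. The function name, the `keep` parameter, and the toy source sequence are illustrative assumptions:

```python
def truncate_for_conversion(op, s, keep, seq):
    """Shorten a long string constant before conversion (imprecise).
    For < and <= the last kept character is rounded up to the next
    character in the source collating sequence `seq` (this sketch
    assumes a next character exists)."""
    t = s[:keep]
    if op in ('<', '<='):
        t = t[:-1] + seq[seq.index(t[-1]) + 1]   # round up last character
    return t

seq = '0123456789abcdefghijklmnopqrstuvwxyz'    # toy source sequence
print(truncate_for_conversion('<', '3w0vksdfj543kjjgn54', 3, seq))
```

With these toy inputs the predicate x<`3w0vksdfj543kjjgn54` becomes x<`3w1`, matching the first replacement suggested in the text.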
A third method for imprecise conversion involves merging ranges. Conversion of a simple string comparison results in a number of range predicates. If two or more ranges are "close", they can be consolidated into a single range, preferably, without significantly affecting the incremental data reduction. For example, conversion of the comparison x<`t` results in
(x ≦ `+`) OR
(x ≧ `&` AND x ≦ `r`) OR
(x = `s`) OR
(x ≧ `[` AND x ≦ `]`) OR
(x ≧ `A` AND x ≦ `I`) OR
(x ≧ `J`)
which consists of five ranges. EBCDIC has only one character between `+` and `&` and also between `I` and `J` so the first two ranges and also the last two may be merged. This results in
(x ≦ `r`) OR
(x = `s`) OR
(x ≧ `[` AND x ≦ `]`) OR
(x ≧ `A`).
If, in this result, the first two ranges and the last two ranges are again merged, this gives (x≦`s`) OR (x≧`[`).
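The merging step can be sketched as follows. The function name, the `gap` threshold, and the toy ranges are illustrative; `weight` stands in for the position of a character in the remote collating sequence:

```python
def merge_close_ranges(ranges, weight, gap=1):
    """Merge ranges (sorted in remote collating order) separated by at
    most `gap` characters. Merging widens the predicate, so the result
    is imprecise but simpler."""
    merged = [list(ranges[0])]
    for lo, hi in ranges[1:]:
        if weight(lo) - weight(merged[-1][1]) <= gap + 1:
            merged[-1][1] = hi           # close enough: extend the range
        else:
            merged.append([lo, hi])
    return [tuple(r) for r in merged]

# Toy ranges in plain ASCII order: only `d` lies between `c` and `e`.
print(merge_close_ranges([('a', 'c'), ('e', 'g'), ('m', 'p')], ord))
```

Here the first two ranges merge because only one character separates them, while the third stays separate, mirroring the EBCDIC example above.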
Whether or not to merge ranges depends on how "close" is defined. The issue is not the distance (in characters) between ranges; what matters is how merging affects the incremental data reduction. As shown in
Handling Elements with Equal Weights
The embodiments described thus far are directed to collating sequences in which each character or collation element has a unique weight in both the source and the target collating sequences. In the embodiment now described, some characters or collating elements in the source, target, or both collating sequences have the same weight. For example, in case insensitive collating sequences, "A" and "a" have the same weight. The methods described above are modified to reflect this change but the basic approach remains the same.
In one embodiment, assume some characters in the source collating sequence have equal weights (e.g. case insensitive order), but each character in the target collating sequence has a unique weight (e.g. case sensitive order). Suppose the two collating sequences are:
Source: 01 . . . 9{aA}{bB}{cC} . . . {zZ}
Target: 01 . . . 9abc . . . zABC . . . Z
The source collating sequence is case insensitive; that is, "a" and "A" have the same weight in the source, as do "b" and "B", and so on. Now consider what happens when the conversion rule is computed for one of the operators. In this case, characters having the same weight will mark the same set of characters. This means that characters having the same weight will have the same rules. This situation can be handled by including in the head of a set of rules not just one character but all characters with the same weight (in the source collating sequence). For example, based on the above collating sequences, the rules for "c" and "C" will be:
=> [", `b`] [`A`, `B`]
=> [", `c`] [`A`, `C`]
=> [`d`, `z`] [`D`, "]
=> [`c`, `z`] [`C`, "]
for the operators <, ≦, >, and ≧, respectively.
This is the only change needed. The method for constructing rules, predicate conversion, and predicate simplification remain the same as described above.
Another embodiment is directed to the case when each character in the source collating sequence has a unique weight, but some characters in the target collating sequence have the same weight. This is illustrated by interchanging the roles of the two example sequences above.
Source: 01 . . . 9abc . . . zABC . . . Z
Target: 01 . . . 9{aA}{bB}{cC} . . . {zZ}
Note that although each character is distinguishable in the source set, they may not be distinguishable in the target set. For example, x<"b" in source collating sequence will include "a" but not "A". However, an exact rule for the target collating sequence that includes "a" but not "A" cannot be obtained because "a" and "A" are not distinguishable at the external data source. Thus, precise conversion cannot be achieved in this case. The best that can be achieved is to compute a set of rules that will convert a predicate into an imprecise predicate that selects a superset of the desired rows. The original predicate is then applied on the result set at the query site.
The same marking technique can be used to construct conversion rules. However, in the target collating sequence, when one character is selected, all characters having the same weight as the selected character are marked. If the marked characters in the target set are a superset of the marked characters in the source set, the conversion rule is an imprecise rule (it should be noted that this is different from the imprecise conversion of predicates described above; here, the rule itself is imprecise). If the two sets are equal, the conversion rule is precise. A rule is denoted herein by ==> if precise or by ∼=> if imprecise, so that it can be determined whether the rule will result in precise conversion or not. For example, based on the above collating sequences, the rules of `c` are:
∼=> [", `B`]
∼=> [", `C`]
∼=> [`a`, "]
∼=> [`a`, "]
The first rule is equivalent to [", "b"]. The third rule has "a" as the lower bound because, in the source collating sequence, "A" to "Z" satisfy the predicate x>"c", and "A" is not distinguishable from "a" at the external data source.
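The precise/imprecise determination can be sketched as follows. The function name, the weight table, and the toy source sequence are illustrative assumptions:

```python
def rule_for_less_than_equal_weights(subject, src_seq, tgt_weight):
    """Marking when the target has equal-weight characters: every
    character sharing a weight with a selected character is marked too.
    Returns the marked set and whether the rule is precise (i.e., no
    extra characters were pulled in)."""
    wanted = set(src_seq[:src_seq.index(subject)])   # x < subject, source order
    weights = {tgt_weight[c] for c in wanted}
    marked = {c for c, w in tgt_weight.items() if w in weights}
    return marked, marked == wanted

# Target is case insensitive: 'a'/'A' share a weight, and so on.
tgt_weight = {}
for i, pair in enumerate(['aA', 'bB', 'cC']):
    for ch in pair:
        tgt_weight[ch] = i

marked, precise = rule_for_less_than_equal_weights('b', 'abcABC', tgt_weight)
print(sorted(marked), precise)
```

For x<`b` the source wants only {a}, but the target cannot exclude `A`, so the marked set is a superset and the rule is imprecise, as in the rules shown above.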
For some characters, there may be precise rules for some operators and imprecise rules for other operators, namely, for example:
`a`
==> [", `9`]
∼=> [", `A`]
∼=> [`a`, "]
==> [`a`, "]
For some characters having unique weights in the target collating sequence, there may be precise rules for all four operators, namely, for example:
`5`
==> [", `4`]
==> [", `5`]
==> [`6`, "]
==> [`5`, "]
Another embodiment of the present invention is directed to the case in which both the source and the target collating sequences have some characters with the same weight. If the two collating sequences have the same character set and the same sort order, e.g., both use case insensitive dictionary order, this will not affect any predicate and the predicates can be pushed down unchanged. The complication arises when they are different. Consider the following source and target collating sequences (characters within braces have the same weight):
Source: 0 . . . 9{abc}{def}ghij . . . z
Target: 0 . . . 9{abcd}{ef}{gh}ij . . . z
This case can be handled by combining the methods described above with respect to the two simpler cases. The marking method can be used to construct conversion rules but, when a character is marked, all characters are marked in its equal-weight set in both the source and the target collating sequences. If the marked characters in the target are a superset of the marked characters in the source, the conversion rule is an imprecise rule. Otherwise, it is a precise rule.
A method for generating the rules of the < operator is described with respect to FIG. 8. The methods for the other three operators are similar. At step 600 it is determined if any character `s` at the query site has the same weight as the subject character and if the rules have been generated for the character `s`. If there is no character `s` in the query site having the same weight as the subject character or if there are no rules generated for character `s`, then all the characters in the source collating sequence and the remote collating sequence are unmarked as an initialization step, at step 610. If there is a character `s` at the query site having the same weight as the subject character and if the rules have been generated for the character `s`, then the subject character is added to the head of the rules for the character `s`, at step 620.
At step 630, each character in the source and remote collating sequences that has a lower weight than the subject character is marked. Each character in the remote collating sequence that has the same weight as the character `s` at the query site is marked, at step 640. Intervals are then created, at step 650, comprising adjacent marked characters in the remote collating sequence.
It is then determined, at step 660, if the marked characters in the remote collating sequence are a superset of the marked characters in the source collating sequence. If so, the intervals are provided as being imprecise, at step 680. If the marked characters in the remote collating sequence are not a superset of the marked characters in the source collating sequence, the intervals are provided as being precise, at step 670.
Applying the method of
`a`, `b`, `c` | `d`, `e`, `f`
==> [", `9`] | ∼=> [", `d`]
∼=> [", `d`] | ==> [", `f`]
∼=> [`a`, "] | ==> [`g`, "]
==> [`a`, "] | ∼=> [`a`, "]
`g` | `h`
==> [", `f`] | ∼=> [", `h`]
∼=> [", `h`] | ==> [", `h`]
∼=> [`g`, "] | ==> [`i`, "]
==> [`g`, "] | ∼=> [`g`, "]
A character (or same-weight character set) may generate any combination of precise and imprecise rules. The rules that are applied during conversion of a predicate determine whether the resulting predicate is precise or imprecise. With the rules available, predicate conversion and simplification can be achieved using the methods described above.
Predicates consisting of an equality comparison between a column or field and a constant are frequently used and typically have high selectivity. Pushing down this type of predicate is particularly beneficial in practice because of the high data reduction. When the characters have unique weights in both the source and target collating sequences, the result does not depend on the sort order; therefore, in that case, equality predicates can be pushed down unchanged. However, if some characters have the same weight, this is not necessarily true. For example, the predicate x="abc" will select "ABC", "Abc", "aBC", etc. in a case insensitive collating sequence, but not in a case sensitive one.
One solution is to convert an equality predicate to a conjunction of two atomic predicates with ≧ and ≦, respectively, because x=String is equivalent to (x ≧ String AND x ≦ String). The predicate can then be converted using the methods described above. However, the conversion and simplification can be tedious and time consuming. Therefore, rules have been developed in accordance with the present invention that are particular to equality predicates to simplify the conversion process.
The rule is a correspondence of same-weight character sets between the source and the target collating sequences. Again, precise rules and imprecise rules exist. The method described above with respect to
CASE 1:
`c`, `C`
==>`c`, `C`
CASE 2:
`c`
∼=>`c`, `C`
CASE 3:
`a`, `b`, `c`
∼=>`a`, `b`, `c`, `d`
`d`, `e`, `f`
∼=>`a`, `b`, `c`, `d`, `e`, `f`
`g`
∼=>`g`, `h`
For single-character constants, conversion is straightforward. For example, in CASE 3, a predicate X="e" is converted into:
X=`a` or X=`b` or X=`c` or
X=`d` or X=`e` or X=`f`
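This expansion can be sketched as follows; the function name and the rule mapping are illustrative, with the mapping taken from the hypothetical CASE 3 example:

```python
def convert_equality(value, rule):
    """Expand X = value into one equality per character in the target
    same-weight set, using a CASE 3 style equality rule."""
    return ' OR '.join(f"X=`{c}`" for c in rule[value])

# {d, e, f} at the source maps imprecisely to {a, ..., f} at the target.
case3_rule = {c: 'abcdef' for c in 'def'}
print(convert_equality('e', case3_rule))
```

Because the rule here is imprecise, the original equality predicate would still be re-evaluated at the query site on the returned superset.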
When a predicate is a comparison between a column or field and a long string, the converted predicates can be very long. For a predicate such as x="abcd", in case insensitive sort order where each character has two choices, a disjunction of 2^4=16 atomic predicates is obtained. This is not practical; therefore, for long strings, the method described above for truncating the string is used. Using this approach, the predicate is changed into x>"abc" AND x<"abd", which is imprecise.
However, in practice, many long strings can be converted precisely because the characters occurring in the string have unique weights in both collating sequences. One example of this is phone numbers. Phone numbers are typically stored as strings even though they are numeric. Digits have unique weights, so an equality comparison with an all-digit string constant need not be converted at all.
Predicates using <>, such as x<>"abc", are similar to = in that x<>"abc" is equivalent to a disjunction of two atomic predicates (x<"abc" OR x>"abc"). However, predicates of this type often have poor reduction power, and their conversion may result in many predicates that increase the evaluation cost at the remote database system. It may therefore be desirable to simply drop these types of predicates and evaluate them at the query site.
Multiple Sets of Weights
The embodiments described thus far have assumed that both the source and the target collating sequences each use only one set of weights. This approach is not sufficient to handle all languages and collating sequences: some use multiple sets of weights. Multiple sets of weights involve a primary set, a secondary set, a tertiary set, etc., up to as many as are desired. When a collating sequence uses multiple sets of weights, imprecise conversion can be achieved. The description below is directed to the case when only the target collating sequence has multiple sets of weights. Logically this can be viewed as creating a multipart sort key: the first part is formed by using the character weights in the primary set, the second part by using the weights in the secondary set, and so on. Comparisons are then done in the standard way for multipart keys: compare the first parts; if they are equal, compare the second parts; and so on.
Table 2 lists the collation weights of visible ASCII characters in English_USA. It shows the hexadecimal value, i.e., the code point of the character, the collation weights of case insensitive dictionary order, and the collation weights of case sensitive dictionary order. For hexadecimal sort order, the collation weight of a character is the hexadecimal value of the character--this is an example of a collating sequence in which each character has a unique (primary) weight. Case insensitive dictionary order is an example of the case when multiple characters may have the same (primary) weight, while case sensitive dictionary order is an example where the collating sequence has multiple sets of weights (which in fact is also a unique weight collating sequence).
A character as shown may have up to five different weights (shown separated by bars in Table 2), although any number of different weights can be used. The first weight is the primary weight (used for the first compare pass); the second weight is the diacritic weight, used for accent comparison, for example; the third weight is the case weight, used for case sensitive/insensitive comparison, for example; the fourth weight is an extra weight mainly used to differentiate between Katakana and Hiragana in Japanese, for example; and the last weight is a special weight, used for position-dependent differentiation, for example. In Table 2, most weights are empty except the primary weights. Only in the case sensitive order do the uppercase letters have a third-level weight.
TABLE 2
Character | Hexadecimal Value | Case Insensitive Weight | Case Sensitive Weight
! | 21 | 071c||||| | 071c||||| |
" | 22 | 071d||||| | 071d||||| |
# | 23 | 071f||||| | 071f||||| |
$ | 24 | 0721||||| | 0721||||| |
% | 25 | 0723||||| | 0723||||| |
& | 26 | 0725||||| | 0725||||| |
' | 27 | 0680||||| | 0680||||| |
( | 28 | 0727||||| | 0727||||| |
) | 29 | 072a||||| | 072a||||| |
* | 2a | 072d||||| | 072d||||| |
+ | 2b | 0803||||| | 0803||||| |
, | 2c | 072f||||| | 072f||||| |
- | 2d | 0682||||| | 0682||||| |
. | 2e | 0733||||| | 0733||||| |
/ | 2f | 0735||||| | 0735||||| |
0 | 30 | 0c03||||| | 0c03||||| |
1 | 31 | 0c21||||| | 0c21||||| |
2 | 32 | 0c33||||| | 0c33||||| |
3 | 33 | 0c46||||| | 0c46||||| |
4 | 34 | 0c58||||| | 0c58||||| |
5 | 35 | 0c6a||||| | 0c6a||||| |
6 | 36 | 0c7d||||| | 0c7d||||| |
7 | 37 | 0c90||||| | 0c90||||| |
8 | 38 | 0ca2||||| | 0ca2||||| |
9 | 39 | 0cb4||||| | 0cb4||||| |
: | 3a | 0737||||| | 0737||||| |
; | 3b | 073a||||| | 073a||||| |
< | 3c | 080e||||| | 080e||||| |
= | 3d | 0812||||| | 0812||||| |
> | 3e | 0814||||| | 0814||||| |
? | 3f | 073c||||| | 073c||||| |
@ | 40 | 073e||||| | 073e||||| |
A | 41 | 0e02||||| | 0e02||12||| |
B | 42 | 0e09||||| | 0e09||12||| |
C | 43 | 0e0a||||| | 0e0a||12||| |
D | 44 | 0e1a||||| | 0e1a||12||| |
E | 45 | 0e21||||| | 0e21||12||| |
F | 46 | 0e23||||| | 0e23||12||| |
G | 47 | 0e25||||| | 0e25||12||| |
H | 48 | 0e2c||||| | 0e2c||12||| |
I | 49 | 0e32||||| | 0e32||12||| |
J | 4a | 0e35||||| | 0e35||12||| |
K | 4b | 0e36||||| | 0e36||12||| |
L | 4c | 0e48||||| | 0e48||12||| |
M | 4d | 0e51||||| | 0e51||12||| |
N | 4e | 0e70||||| | 0e70||12||| |
O | 4f | 0e7c||||| | 0e7c||12||| |
P | 50 | 0e7e||||| | 0e7e||12||| |
Q | 51 | 0e89||||| | 0e89||12||| |
R | 52 | 0e8a||||| | 0e8a||12||| |
S | 53 | 0e91||||| | 0e91||12||| |
T | 54 | 0e99||||| | 0e99||12||| |
U | 55 | 0e9f||||| | 0e9f||12||| |
V | 56 | 0ea2||||| | 0ea2||12||| |
W | 57 | 0ea4||||| | 0ea4||12||| |
X | 58 | 0ea6||||| | 0ea6||12||| |
Y | 59 | 0ea7||||| | 0ea7||12||| |
Z | 5a | 0ea9||||| | 0ea9||12||| |
[ | 5b | 073f||||| | 073f||||| |
\ | 5c | 0741||||| | 0741||||| |
] | 5d | 0742||||| | 0742||||| |
^ | 5e | 0743||||| | 0743|||||
_ | 5f | 0744||||| | 0744|||||
` | 60 | 0748||||| | 0748|||||
a | 61 | 0e02||||| | 0e02||||| |
b | 62 | 0e09||||| | 0e09||||| |
c | 63 | 0e0a||||| | 0e0a||||| |
d | 64 | 0e1a||||| | 0e1a||||| |
e | 65 | 0e21||||| | 0e21||||| |
f | 66 | 0e23||||| | 0e23||||| |
g | 67 | 0e25||||| | 0e25||||| |
h | 68 | 0e2c||||| | 0e2c||||| |
i | 69 | 0e32||||| | 0e32||||| |
j | 6a | 0e35||||| | 0e35||||| |
k | 6b | 0e36||||| | 0e36||||| |
l | 6c | 0e48||||| | 0e48||||| |
m | 6d | 0e51||||| | 0e51||||| |
n | 6e | 0e70||||| | 0e70||||| |
o | 6f | 0e7c||||| | 0e7c||||| |
p | 70 | 0e7e||||| | 0e7e||||| |
q | 71 | 0e89||||| | 0e89||||| |
r | 72 | 0e8a||||| | 0e8a||||| |
s | 73 | 0e91||||| | 0e91||||| |
t | 74 | 0e99||||| | 0e99||||| |
u | 75 | 0e9f||||| | 0e9f||||| |
v | 76 | 0ea2||||| | 0ea2||||| |
w | 77 | 0ea4||||| | 0ea4||||| |
x | 78 | 0ea6||||| | 0ea6||||| |
y | 79 | 0ea7||||| | 0ea7||||| |
z | 7a | 0ea9||||| | 0ea9||||| |
{ | 7b | 074a||||| | 074a||||| |
| | 7c | 074c||||| | 074c||||| |
} | 7d | 074e||||| | 074e||||| |
∼ | 7e | 0750||||| | 0750||||| |
Based on this table, the following collating sequences are obtained:
Binary order:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}∼
Case insensitive dictionary order (where uppercase and lowercase letters are equal):
'-!"#$%&()*,./:;?@[\]^_`{|}∼+<=>0123456789aAbBcCdDeEfFgGhHiIjJkKlLmM
nNoOpPqQrRsStTuUvVwWxXyYzZ
Case sensitive dictionary order compares strings in the same order as case insensitive order in the first pass. If two strings are equal in the first pass using primary weights, then in the second pass, the strings are compared using case weight (no diacritic weight here) to differentiate the two strings where the uppercase character has a greater weight than the lowercase character.
String comparison under different collating sequences is a complex issue. Comparisons are based on weights assigned to characters to reflect their collation order. The present invention is directed to three cases: (1) all characters have unique weights, (2) some characters have equal weights, and (3) characters have multiple sets of weights. The first and third cases affect comparisons with the operators <, ≦, >, and ≧. However, the second case also affects equality predicates, which are commonly used and have high selectivity.
Assume that the target collating sequence has the same primary weights for the characters c1, c2, . . . , cn but different secondary weights. The precise values of the secondary weights are unimportant, so assume that they are 1, 2, . . . , n, respectively. Consider the comparison X≦s1 and what characters in the target collating sequence should be included in the rule for ≦. If a character s2 that is less than or equal to s1 in the source collating sequence maps to one of the characters c1, c2, . . . , cn, then all of the characters c1, c2, . . . , cn are included in the rule because they are not distinguished at the first level of comparison. This applies, in particular, if s2=s1.
If the marking method described above is used, all characters are marked that have a primary weight less than or equal to the primary weight of s1. Once the marking has been performed, intervals are created. However, several characters may now lie on the boundary of an interval because they have the same primary weight. Selection then proceeds based on their secondary weights. If this happens on the lower bound of an interval, the one with the lowest secondary weight is chosen. If it happens on the upper bound, the one with the highest secondary weight is chosen. If secondary weights do not separate them either, tertiary weights are considered, and so on. In this way, intervals are created that are guaranteed to include every character in the target collating sequence that corresponds to a character which is less than or equal to s1 in the source collating sequence (and possibly some more). The conversion method is described with respect to the flowchart shown in FIG. 9. Since the conversion will be imprecise in any case, only two rules are used: one for < and ≦, and one for > and ≧.
Initially, at step 700, a character "t" to be sorted or searched against is chosen at the query site. All the characters in the remote collating sequence are initialized or unmarked at step 710. The weight of each character "s" in the remote collating sequence is compared against the weight of character "t" at step 720. If the character "s" has a lower weight, it is marked along with every character in the remote collating sequence that has the same primary weight as the character "s". At step 730, sequences of adjacent marked characters in the remote collating sequence are determined. For each sequence, at step 740, an interval is created between the character having the lowest primary weight and the character having the highest primary weight. The interval is checked at step 750 to determine if more than one character shares the lowest primary weight and/or the highest primary weight. If more than one character shares the lowest primary weight, at step 760, the character with the lowest secondary weight is chosen as the lower endpoint. Similarly, if more than one character shares the highest primary weight, the character with the highest secondary weight is chosen as the upper endpoint. This process repeats, via step 750, considering the next level of weights as needed, until each interval endpoint is uniquely determined. At this point, the intervals are outputted as the rule for x<t and x≦t for that subject character.
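The endpoint selection for one run of adjacent marked characters can be sketched as follows; the function name is illustrative, and the weight table is the a-e/A-E example given below:

```python
def interval_endpoints(run, weights):
    """Pick the endpoints of one run of adjacent marked characters:
    order by (primary, secondary) weight, so ties on the primary
    weight are broken by the lowest secondary weight at the lower
    bound and the highest secondary weight at the upper bound."""
    lo = min(run, key=lambda c: weights[c])
    hi = max(run, key=lambda c: weights[c])
    return lo, hi

# Weights from the a-e/A-E example: uppercase has secondary weight 1.
weights = {}
for i, (up, low) in enumerate(zip('ABCDE', 'abcde'), start=1):
    weights[up] = (i, 1)
    weights[low] = (i, 2)

# Run of marked characters with primary weight at most three.
print(interval_endpoints(['A', 'a', 'B', 'b', 'C', 'c'], weights))
```

Under these assumed weights the run yields the interval [`A`, `c`], consistent with the converted predicate X≧`A` AND X≦`c` discussed below the example.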
For example, assuming that the character set includes only ten characters (a-e, A-E), the source collating sequence is abcdeABCDE and the primary and secondary weights of the target collating sequence are as shown below.
Character | Primary weight | Secondary weight |
A | 1 | 1 |
a | 1 | 2 |
B | 2 | 1 |
b | 2 | 2 |
C | 3 | 1 |
c | 3 | 2 |
D | 4 | 1 |
d | 4 | 2 |
E | 5 | 1 |
e | 5 | 2 |
The method of FIG. 9 is then applied to this example to generate a conversion rule for each character of the source collating sequence.
Note that the rules for "C" do not add any information because the interval covers all characters. Based on the rule for "c", the predicate X<"c" would be converted to X≧`A` and X≦`c`. This may appear to be an exact conversion, but it is not. To satisfy the source predicate, the value has to begin with "a", "b", or "c". When evaluating the converted predicate, any character with a primary weight less than or equal to three is acceptable, that is, any one of "a", "b", "c", "A", "B", and "C".
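The imprecision can be checked mechanically on single characters. The sketch below uses the weights from the table above; variable names are illustrative. The converted predicate selects a superset of the rows the source predicate selects, which is why the query site must re-apply the original predicate to the returned set.

```python
# Target-side weights from the table above (primary, secondary).
target_weight = {'A': (1, 1), 'a': (1, 2), 'B': (2, 1), 'b': (2, 2),
                 'C': (3, 1), 'c': (3, 2), 'D': (4, 1), 'd': (4, 2),
                 'E': (5, 1), 'e': (5, 2)}
# Source collating sequence abcdeABCDE, as position-based weights.
source_order = {ch: i for i, ch in enumerate('abcdeABCDE')}

chars = 'abcdeABCDE'
# Source predicate X < 'c' evaluated in the source collation.
source_result = {x for x in chars if source_order[x] < source_order['c']}
# Converted predicate X >= 'A' AND X <= 'c' evaluated in the target collation.
converted_result = {x for x in chars
                    if target_weight['A'] <= target_weight[x] <= target_weight['c']}

# The converted predicate returns extra rows (here A, B, C, and c), which
# the query site filters out by re-applying the original predicate.
assert source_result == {'a', 'b'}
assert converted_result == {'A', 'a', 'B', 'b', 'C', 'c'}
assert source_result <= converted_result
```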
Sometimes, special sort elements, such as the hyphen and apostrophe, are treated differently from other punctuation symbols so that words like coop and co-op stay together in a list. In Windows NT, this is achieved by leaving the primary weight empty and differentiating at another weight level (the special weight). This is a special case of multiple-level weights in which some characters have the minimum (empty) primary weight.
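The empty-primary-weight idea can be illustrated with a sort key in which the hyphen contributes nothing at the primary level and is differentiated only at a later level. The weight assignment below is an illustrative sketch, not the actual Windows NT sorting tables.

```python
def sort_key(word):
    # Primary level: letters only; the hyphen has an empty primary weight,
    # so "co-op" and "coop" compare equal at this level.
    primary = [c for c in word.lower() if c != '-']
    # Later (special) level: positions of the special characters break ties.
    special = [i for i, c in enumerate(word) if c == '-']
    return (primary, special)

words = ['cool', 'co-op', 'coop', 'copy']
# "coop" and "co-op" stay together instead of "co-op" sorting with symbols.
print(sorted(words, key=sort_key))  # ['cool', 'coop', 'co-op', 'copy']
```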
The following examples are based on the collating sequences shown in Table 2. For conversion between unique-weight and non-unique-weight collations, the rules for the characters `3`, `B`, and `t` are shown below for binary order and case-insensitive dictionary order. Five rules are shown, corresponding to the operators <, ≦, >, ≧, and =, respectively. Each rule is either a precise conversion rule (represented by ==>) or an imprecise conversion rule (represented by ∼=>). The empty character, represented by ", denotes the beginning or end of the target collating sequence.
Case 1: source collating sequence is case insensitive order, target collating sequence is binary order:
`3` | ||
==> [", `2`] [`:`, `@`] [`[`, `{grave over ( )}`] [`{`, "] | // for < operator | |
==> [", `3`] [`:`, `@`] [`[`, `{grave over ( )}`] [`{`, "] | // for ≦ operator | |
==> [`4`, `9`] [`A`, `Z`] [`a`, `z`] | // for > operator | |
==> [`3`, `9`] [`A`, `Z`] [`a`, `z`] | // for ≧ operator | |
==> `3` | // for = operator | |
`b`, `B` | ||
==> [", `A`] [`[`, `a`] [`{`, "] | // for < operator | |
==> [", `B`] [`[`, `b`] [`{`, "] | // for ≦ operator | |
==> [`C`, `Z`] [`c`, `z`] | // for > operator | |
==> [`B`, `Z`] [`b`, `z`] | // for ≧ operator | |
==> `b`, `B` | // for = operator | |
`t`, `T` | ||
==> [", `S`] [`[`, `s`] [`{`, "] | // for < operator | |
==> [", `T`] [`[`, `t`] [`{`, "] | // for ≦ operator | |
==> [`U`, `Z`] [`u`, `z`] | // for > operator | |
==> [`T`, `Z`] [`t`, `z`] | // for ≧ operator | |
==> `t`, `T` | // for = operator | |
Case 2: source collating sequence is binary order, target collating sequence is case insensitive order:
`3` | ||
==> [", `/`] [`0`, `2`] | // for < operator | |
==> [", `/`] [`0`, `3`] | // for ≦ operator | |
==> [`:`, `>`] [`4`, "] | // for > operator | |
==> [`:`, `>`] [`3`, "] | // for ≧ operator | |
==> `3` | // for = operator | |
`B` | ||
∼=> [", `@`] [`+`, `a`] | // for < operator | |
∼=> [", `@`] [`+`, `b`] | // for ≦ operator | |
∼=> [`[`, `∼`] [`A`, "] | // for > operator, since it includes `a` | |
∼=> [`[`, `∼`] [`A`, "] | // for ≧ operator, since it includes `a` | |
∼=> `B` | // for = operator | |
`t` | ||
∼=> [", `{grave over ( )}`] [`+`, "] | // for < operator, since it includes `Z` | |
∼=> [", `{grave over ( )}`] [`+`, "] | // for ≦ operator, since it includes `Z` | |
∼=> [`{`, `∼`] [`u`, "] | // for > operator | |
∼=> [`{`, `∼`] [`t`, "] | // for ≧ operator | |
∼=> `T` | // for = operator | |
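Once a rule's intervals are known, the converted predicate can be rendered as text for the remote site. Below is a minimal sketch, assuming the target accepts BETWEEN-style range predicates; the function name and rendering are illustrative, not from the patent, and the empty string stands for the beginning or end of the target collating sequence.

```python
def intervals_to_sql(column, intervals):
    """Render a list of (low, high) character intervals as a disjunction
    of range predicates on `column`; '' marks an open end."""
    parts = []
    for lo, hi in intervals:
        if lo == '' and hi == '':
            return 'TRUE'  # the interval covers the whole sequence
        if lo == '':
            parts.append(f"{column} <= '{hi}'")
        elif hi == '':
            parts.append(f"{column} >= '{lo}'")
        else:
            parts.append(f"{column} BETWEEN '{lo}' AND '{hi}'")
    return ' OR '.join(parts)

# The rule for X < `3` in the case shown just above: [", `/`] [`0`, `2`].
print(intervals_to_sql('X', [('', '/'), ('0', '2')]))
# X <= '/' OR X BETWEEN '0' AND '2'
```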
For collating sequences with multiple-level weights, exemplary rules between binary order and case-sensitive dictionary order are shown below.
Case 1: source collating sequence is binary order, target collating sequence is case sensitive order:
`3` | ||
==> [", `/`] [`0`, `2`] | // for < operator | |
==> [", `/`] [`0`, `3`] | // for ≦ operator | |
==> [`:`, `>`] [`4`, "] | // for > operator | |
==> [`:`, `>`] [`3`, "] | // for ≧ operator | |
==> `3` | // for = operator | |
`B` | ||
==> [", `@`] [`+`, `A`] | // for < operator | |
∼=> [", `@`] [`+`, `B`] | // for ≦ operator | |
∼=> [`[`, `∼`] [`A`, "] | // for > operator, since it includes `a` | |
∼=> [`[`, `∼`] [`A`, "] | // for ≧ operator, since it includes `a` | |
==> `B` | // for = operator | |
`t` | ||
∼=> [", `{grave over ( )}`] [`+`, "] | // for < operator, since it includes `Z` | |
∼=> [", `{grave over ( )}`] [`+`, "] | // for ≦ operator, since it includes `Z` | |
==> [`{`, `∼`] [`u`, "] | // for > operator | |
==> [`{`, `∼`] [`t`, "] | // for ≧ operator | |
==> `t` | // for = operator | |
Case 2: source collating sequence is case sensitive order, target collating sequence is binary order:
`3` | ||
==> [", `2`] [`:`, `@`] [`[`, `{grave over ( )}`] [`{`, "] | // for < operator | |
==> [", `3`] [`:`, `@`] [`[`, `{grave over ( )}`] [`{`, "] | // for ≦ operator | |
==> [`4`, `9`] [`A`, `Z`] [`a`, `z`] | // for > operator | |
==> [`3`, `9`] [`A`, `Z`] [`a`, `z`] | // for ≧ operator | |
==> `3` | // for = operator | |
`B` | ||
==> [", `A`] [`[`, `b`] [`{`, "] | // for < operator | |
==> [", `B`] [`[`, `b`] [`{`, "] | // for ≦ operator | |
==> [`C`, `Z`] [`c`, `z`] | // for > operator | |
==> [`B`, `Z`] [`b`, `z`] | // for ≧ operator | |
==> `b`, `B` | // for = operator | |
`t` | ||
==> [", `S`] [`[`, `s`] [`{`, "] | // for < operator | |
==> [", `T`] [`[`, `t`] [`{`, "] | // for ≦ operator | |
==> [`U`, `Z`] [`u`, `z`] | // for > operator | |
==> [`T`, `Z`] [`t`, `z`] | // for ≧ operator | |
==> `t`, `T` | // for = operator | |
It should be noted that there are several special cases in which predicates can be pushed down to a remote site without conversion. For example, if both the source and target collating sequences have unique-weight characters, equality predicates can be pushed down to the target site directly. As described above, character translation may be implemented if the two collating sequences are based on different character sets. Moreover, in most collating sequences, the digit characters 0 to 9 have unique weights and the same order. Much application data, such as phone numbers, zip codes, and SINs, contains only digit characters, or digits with a limited set of symbols such as `-` in fixed positions. String comparison predicates on this type of data can be pushed down directly, or with minor changes if there are some additional symbols. Furthermore, if the character sets at both the source and target sites are ASCII compatible, that is, have ASCII code points 0 to 127 as part of the character set (which is true for most Windows NT character sets, except Unicode), and both sites use binary order for collation, then any comparison with a string containing only ASCII characters can be pushed down to the target site directly.
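The last special case can be sketched as a simple test applied before conversion; the function name and flag are illustrative, not from the patent.

```python
def can_push_down_directly(constant, both_binary_ascii):
    """Return True when a comparison with `constant` can be sent to the
    remote site unchanged: both sites use binary order over
    ASCII-compatible character sets, and the string constant contains
    only ASCII characters (code points 0 to 127)."""
    return both_binary_ascii and all(ord(ch) < 128 for ch in constant)

print(can_push_down_directly('555-1234', True))   # True
print(can_push_down_directly('naïve', True))      # False: 'ï' is non-ASCII
print(can_push_down_directly('abc', False))       # False: collations differ
```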
Although illustrated and described herein with reference to certain specific embodiments, the present invention is nevertheless not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the invention.
Zhang, Weiye, Larson, Gustav Per-Åke