Disclosed are a novel system and method for inferring types of database queries. In one embodiment, a program and an associated database schema that includes a type hierarchy are accessed. The program includes query operations to a database that contains relations described by the database schema. Types are inferred from definitions in the program by replacing each database relation in the program with its type in the database schema. A new program is generated from the inferred types, and the new program accesses only unary relations in the database. In another embodiment, each inferred type is tested for type emptiness. In response to type emptiness being found for an inferred type, one or more operations are performed, including removing the type, providing a notification regarding the emptiness found for the type, and more.
0. 53. A computer-implemented method comprising:
obtaining a query program that queries a database described by a database schema that specifies a column type for every column of every extensional predicate that occurs in the query program and also described by a type hierarchy that is of the form ∀x.h(x) in which h(x) does not contain quantifiers or equations and only contains a single free variable x;
representing the query program by a first type program having no free relation variables and having no fixpoint definitions containing intensional relation variables, wherein the column type is at least a portion of a second type program, wherein each of the first type program and the second type program is a derived program, and wherein each of the first type program and the second type program uses monadic extensionals, a type program being a program that only makes use of type symbols and no other extensional relation symbols, and wherein the first type program and the second type program do not contain negation;
translating the first type program into a typing, wherein the typing is a lean set of non-degenerate n-ary type tuple constraints (ttcs), each ttc being defined by
(i) a tuple (t1, . . . tn) of n type propositions ti, a type proposition being a quantifier-free type program with no fixpoint definitions and no free relation variables and having precisely one free element variable,
(ii) an equivalence relation over the set of integers {1, . . . , n} that defines a partition of the set of integers {1, . . . , n}, and
(iii) a set C of inhabitation constraints, each inhabitation constraint being a type proposition, the inhabitation constraints modeling existential quantification;
such that the following three requirements are satisfied whenever i and j belong to the same partition under the equivalence relation:
ti=tj,
C<:tk for all k∈{1, . . . , n}, and
c<:c′ for c, c′∈C implies c=c′;
testing a first ttc of the typing for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for a type program represented by the ttc.
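By way of non-limiting illustration only, a ttc and two of the requirements recited above might be sketched as follows in Python. The sketch assumes, purely for illustration, that a type proposition is modeled as the frozenset of atomic kinds it admits, so that t <: u becomes a subset test; the class name `TTC` and the helper functions are hypothetical and are not part of the claims. The requirement relating C to the tuple components is not modeled here.

```python
from dataclasses import dataclass

# Assumption: a type proposition is the frozenset of atomic kinds it admits,
# so t <: u is modeled as t <= u and t = u as t == u.

@dataclass(frozen=True)
class TTC:
    props: tuple          # (t1, ..., tn), stored 0-based
    partition: frozenset  # frozenset of frozensets of column numbers 1..n
    inhab: frozenset      # the set C of inhabitation constraints

def is_partition(ttc):
    """The classes are non-empty, disjoint, and cover {1, ..., n}."""
    seen = sorted(i for cls in ttc.partition for i in cls)
    return all(ttc.partition) and seen == list(range(1, len(ttc.props) + 1))

def equal_within_classes(ttc):
    """ti = tj whenever i and j belong to the same class."""
    return all(len({ttc.props[i - 1] for i in cls}) == 1 for cls in ttc.partition)

def inhab_is_antichain(ttc):
    """c <: c' for c, c' in C implies c = c'."""
    return all(c == d for c in ttc.inhab for d in ttc.inhab if c <= d)

# Hypothetical entity kinds, chosen only for illustration.
EMPLOYEE = frozenset({"employee"})
COMPANY = frozenset({"company"})
CLASSES = frozenset({frozenset({1, 2}), frozenset({3})})

good = TTC((EMPLOYEE, EMPLOYEE, COMPANY), CLASSES, frozenset({EMPLOYEE, COMPANY}))
bad = TTC((EMPLOYEE, COMPANY, COMPANY), CLASSES, frozenset({EMPLOYEE, EMPLOYEE | COMPANY}))
```

In this sketch `good` passes all three checks, while `bad` equates columns 1 and 2 without giving them the same proposition and carries a redundant inhabitation constraint.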
0. 41. A computer-implemented method comprising:
obtaining a query program that queries a database described by a database schema that specifies a column type for every column of every extensional predicate that occurs in the query program and also described by a type hierarchy that is of the form ∀x.h(x) in which h(x) does not contain quantifiers or equations and only contains a single free variable x;
representing the query program by a first type program having no free relation variables and having no fixpoint definitions containing intensional relation variables, wherein the column type is at least a portion of a second type program, wherein each of the first type program and the second type program is a derived program, and wherein each of the first type program and the second type program uses monadic extensionals, a type program being a program that only makes use of type symbols and no other extensional relation symbols and that only has instances of negation in front of type symbols that are the names of monadic extensionals;
translating the first type program into a typing, wherein the typing is a lean set of non-degenerate n-ary type tuple constraints (ttcs), each ttc being defined by
(i) a tuple (t1, . . . tn) of n type propositions ti, a type proposition being a quantifier-free type program with no fixpoint definitions and no free relation variables and having precisely one free element variable,
(ii) an equivalence relation over the set of integers {1, . . . , n} that defines a partition of the set of integers {1, . . . , n}, and
(iii) a set C of inhabitation constraints, each inhabitation constraint being a type proposition, the inhabitation constraints modeling existential quantification;
such that the following three requirements are satisfied whenever i and j belong to the same partition under the equivalence relation:
ti=tj,
C<:tk for all k∈{1, . . . , n}, and
c<:c′ for c, c′∈C implies c=c′;
testing a first ttc of the typing for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for a type program represented by the ttc.
0. 51. A computer program product, the computer program product comprising a hardware storage device comprising instructions to cause a computer system to perform operations comprising:
obtaining a query program that queries a database described by a database schema that specifies a column type for every column of every extensional predicate that occurs in the query program and also described by a type hierarchy that is of the form ∀x.h(x) in which h(x) does not contain quantifiers or equations and only contains a single free variable x;
representing the query program by a first type program having no free relation variables and having no fixpoint definitions containing intensional relation variables, wherein the column type is at least a portion of a second type program, wherein each of the first type program and the second type program is a derived program, and wherein each of the first type program and the second type program uses monadic extensionals, a type program being a program that only makes use of type symbols and no other extensional relation symbols, and wherein the first type program and the second type program do not contain negation;
translating the first type program into a typing, wherein the typing is a lean set of non-degenerate n-ary type tuple constraints (ttcs), each ttc being defined by
(i) a tuple (t1, . . . tn) of n type propositions ti, a type proposition being a quantifier-free type program with no fixpoint definitions and no free relation variables and having precisely one free element variable,
(ii) an equivalence relation over the set of integers {1, . . . , n} that defines a partition of the set of integers {1, . . . , n}, and
(iii) a set C of inhabitation constraints, each inhabitation constraint being a type proposition, the inhabitation constraints modeling existential quantification;
such that the following three requirements are satisfied whenever i and j belong to the same partition under the equivalence relation:
ti=tj,
C<:tk for all k∈{1, . . . , n}, and
c<:c′ for c, c′∈C implies c=c′;
testing a first ttc of the typing for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for a type program represented by the ttc.
0. 52. A system comprising:
a computer system comprising one or more computers each having a memory and a processor communicatively coupled to the memory; and
a type inferencer computer program loaded in the computer system, wherein the type inferencer is adapted to cause the computer system to perform the operations of:
obtaining a query program that queries a database described by a database schema that specifies a column type for every column of every extensional predicate that occurs in the query program and also described by a type hierarchy that is of the form ∀x.h(x) in which h(x) does not contain quantifiers or equations and only contains a single free variable x;
representing the query program by a first type program having no free relation variables and having no fixpoint definitions containing intensional relation variables, wherein the column type is at least a portion of a second type program, wherein each of the first type program and the second type program is a derived program, and wherein each of the first type program and the second type program uses monadic extensionals, a type program being a program that only makes use of type symbols and no other extensional relation symbols, and wherein the first type program and the second type program do not contain negation;
translating the first type program into a typing, wherein the typing is a lean set of non-degenerate n-ary type tuple constraints (ttcs), each ttc being defined by
(i) a tuple (t1, . . . tn) of n type propositions ti, a type proposition being a quantifier-free type program with no fixpoint definitions and no free relation variables and having precisely one free element variable,
(ii) an equivalence relation over the set of integers {1, . . . , n} that defines a partition of the set of integers {1, . . . , n}, and
(iii) a set C of inhabitation constraints, each inhabitation constraint being a type proposition, the inhabitation constraints modeling existential quantification;
such that the following three requirements are satisfied whenever i and j belong to the same partition under the equivalence relation:
ti=tj,
C<:tk for all k∈{1, . . . , n}, and
c<:c′ for c, c′∈C implies c=c′;
testing a first ttc of the typing for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for a type program represented by the ttc.
0. 21. A computer program product, the computer program product comprising a hardware storage device comprising instructions to cause a computer system to perform operations comprising:
obtaining a query program that queries a database described by a database schema that specifies a column type for every column of every extensional predicate that occurs in the query program and also described by a type hierarchy that is of the form ∀x.h(x) in which h(x) does not contain quantifiers or equations and only contains a single free variable x;
representing the query program by a first type program having no free relation variables and having no fixpoint definitions containing intensional relation variables, wherein the column type is at least a portion of a second type program, wherein each of the first type program and the second type program is a derived program, and wherein each of the first type program and the second type program uses monadic extensionals, a type program being a program that only makes use of type symbols and no other extensional relation symbols and that only has instances of negation in front of type symbols that are the names of monadic extensionals;
translating the first type program into a typing, wherein the typing is a lean set of non-degenerate n-ary type tuple constraints (ttcs), each ttc being defined by
(i) a tuple (t1, . . . tn) of n type propositions ti, a type proposition being a quantifier-free type program with no fixpoint definitions and no free relation variables and having precisely one free element variable,
(ii) an equivalence relation over the set of integers {1, . . . , n} that defines a partition of the set of integers {1, . . . , n}, and
(iii) a set C of inhabitation constraints, each inhabitation constraint being a type proposition, the inhabitation constraints modeling existential quantification;
such that the following three requirements are satisfied whenever i and j belong to the same partition under the equivalence relation:
ti=tj,
C<:tk for all k∈{1, . . . , n}, and
c<:c′ for c, c′∈C implies c=c′;
testing a first ttc of the typing for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for a type program represented by the ttc.
0. 31. A system comprising:
a computer system comprising one or more computers each having a memory and a processor communicatively coupled to the memory; and
a type inferencer computer program loaded in the computer system, wherein the type inferencer is adapted to cause the computer system to perform the operations of:
obtaining a query program that queries a database described by a database schema that specifies a column type for every column of every extensional predicate that occurs in the query program and also described by a type hierarchy that is of the form ∀x.h(x) in which h(x) does not contain quantifiers or equations and only contains a single free variable x;
representing the query program by a first type program having no free relation variables and having no fixpoint definitions containing intensional relation variables, wherein the column type is at least a portion of a second type program, wherein each of the first type program and the second type program is a derived program, and wherein each of the first type program and the second type program uses monadic extensionals, a type program being a program that only makes use of type symbols and no other extensional relation symbols and that only has instances of negation in front of type symbols that are the names of monadic extensionals;
translating the first type program into a typing, wherein the typing is a lean set of non-degenerate n-ary type tuple constraints (ttcs), each ttc being defined by
(i) a tuple (t1, . . . tn) of n type propositions ti, a type proposition being a quantifier-free type program with no fixpoint definitions and no free relation variables and having precisely one free element variable,
(ii) an equivalence relation over the set of integers {1, . . . , n} that defines a partition of the set of integers {1, . . . , n}, and
(iii) a set C of inhabitation constraints, each inhabitation constraint being a type proposition, the inhabitation constraints modeling existential quantification;
such that the following three requirements are satisfied whenever i and j belong to the same partition under the equivalence relation:
ti=tj,
C<:tk for all k∈{1, . . . , n}, and
c<:c′ for c, c′∈C implies c=c′;
testing a first ttc of the typing for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for a type program represented by the ttc.
0. 1. A computer-implemented method comprising:
accessing a program with one or more queries to a database that contains relations described by at least one database schema;
receiving the database schema and at least one entity type hierarchy for the database; and
inferring a first type program from definitions in the program by replacing each use of a database relation in the program by its type in the database schema, and the type is at least a portion of a second type program,
wherein each of the first type program and the second type program is a derived program using type symbols without other extensional relation symbols, and each of the first type program and the second type program does not contain negation,
wherein the first type program and the second type program use monadic extensions;
testing a portion of the first type program that has been inferred for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for the type.
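By way of non-limiting illustration only, the replacement of each use of a database relation by its type in the database schema might be sketched as follows in Python. The relation names `employs` and `owns`, the entity types, and the helper `infer_type_atoms` are hypothetical and do not appear in the claims. The sketch handles a single conjunctive query body and ignores disjunction, recursion, and the type hierarchy.

```python
# Hypothetical schema: each relation maps to the column type of every column.
SCHEMA = {
    "employs": ("company", "person"),
    "owns": ("person", "car"),
}

def infer_type_atoms(body, schema):
    """Replace every use of a database relation by monadic type atoms.

    body is a list of (relation, variables) pairs read as a conjunction.
    The result is a list of (type_symbol, variable) pairs, also read as a
    conjunction, that mentions only unary type symbols.
    """
    typed = []
    for relation, variables in body:
        column_types = schema[relation]
        if len(column_types) != len(variables):
            raise ValueError(f"arity mismatch for {relation}")
        for type_symbol, variable in zip(column_types, variables):
            atom = (type_symbol, variable)
            if atom not in typed:  # drop duplicate conjuncts
                typed.append(atom)
    return typed

# q(x, z) :- employs(x, y), owns(y, z)
query_body = [("employs", ("x", "y")), ("owns", ("y", "z"))]
type_program = infer_type_atoms(query_body, SCHEMA)
```

Under these assumptions the resulting type program constrains `x` to `company`, `y` to `person`, and `z` to `car`, and refers only to unary type symbols.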
0. 2. The computer-implemented method of
in response to type emptiness being found for a portion of the first type program that has been inferred during determining of the testing portion of the first type program, performing at least one of:
removing a type test;
providing a notification regarding the emptiness found for the portion of the first type program;
providing notification on combining queries by conjunction without creating empty parts in a combined query;
in response to an empty part of a query being a conjunction, finding a smallest set of query parts that has a conjunction that is empty;
in response to a search for a smallest empty part of a database query that traverses all parts of the database query, pushing a conjunction of an approximation of the smallest empty part and a context on top of a stack of approximations of all contexts where the empty part is being used; and
eliminating empty query parts to achieve virtual method resolution in an object-oriented query language.
0. 3. The computer-implemented method of
in response to type inclusion being found for a portion of the first type program that has been inferred during determining of the testing portion of the first type program, performing at least one of:
removing a type test; and
providing a notification regarding the type inclusion being found for the portion type.
0. 4. The computer-implemented method of
testing the type program that has been inferred for whether a database query is contained in a given portion of the type program; and in response to the database query not being contained in the given portion, performing at least one of:
removing a type test; and
providing a notification regarding the database query not being contained in the given portion.
0. 5. The computer-implemented method of
a tuple of type propositions;
an equivalence relation between tuple components; and
a set of inhabitation constraints.
0. 6. The computer-implemented method of
checking inclusion of a portion of the type program that has been inferred represented by a first set of ttcs into another portion of the type program represented by a second set of ttcs by computing a set of prime implicants of the second set; and
checking each ttc in the first set and finding a larger ttc in the set of prime implicants of the second set of ttcs.
0. 7. The computer-implemented method of
computing the set of prime implicants of the second set of ttcs by saturating the second set by exhaustively applying consensus operations to ensure all relevant ttcs are included.
0. 8. The computer-implemented method of
receiving two ttcs as operands;
receiving a set of indices and equating all columns whose indices occur in the set, and preserving all other equalities and inhabitation constraints of the operands, and taking disjunctions over columns in the set and conjunctions over all other columns; and
taking a union of two or more of the equivalence relations in the operands and a pointwise disjunction of the inhabitation constraints of the operands.
0. 9. The computer-implemented method of
reducing a number of consensus operations that need to be performed during saturation by omitting consensus operations where a resulting ttc will be covered by ttcs already present in the set.
0. 10. The computer-implemented method of
receiving a logical formula that represents a type hierarchy, and representing a component type proposition of a ttc as a binary decision diagram (BDD), and choosing a BDD variable order by assigning neighboring indices to type symbols that appear in a same conjunct of the logical formula that represents the type hierarchy.
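By way of non-limiting illustration only, one possible way of choosing such a variable order is a greedy pass over the conjuncts of the hierarchy formula, sketched below in Python. The function `variable_order` and the type symbols are hypothetical. The sketch does not construct any BDD, and a greedy pass cannot always place every pair of co-occurring symbols side by side.

```python
def variable_order(conjuncts):
    """Assign BDD variable indices so that symbols sharing a conjunct
    tend to receive neighboring indices.

    conjuncts is a list of lists of type symbols, one inner list per
    conjunct of the hierarchy formula. Symbols are numbered in order of
    first appearance, so the new symbols of a conjunct land next to
    each other.
    """
    order = {}
    for conjunct in conjuncts:
        for symbol in conjunct:
            if symbol not in order:
                order[symbol] = len(order)
    return order

# Hypothetical hierarchy with three conjuncts.
order = variable_order([
    ["employee", "person"],
    ["person", "company"],
    ["staff", "employee"],
])
```

Here `employee` and `person` receive adjacent indices, as do `person` and `company`; `staff` is placed last because `employee` was already numbered.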
0. 11. The computer-implemented method of
0. 12. The computer-implemented method of
0. 13. The computer-implemented method of
0. 14. The computer-implemented method of
traversing a call graph of that query, and
keeping a stack of approximations of all contexts where the procedure is being used, and
when entering a procedure in the call graph, pushing a conjunction of an approximation of a body of that procedure and a top of the stack onto the stack as a new context.
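By way of non-limiting illustration only, the traversal with a stack of context approximations might be sketched as follows in Python. The sketch assumes that each approximation is modeled as a frozenset of atomic kinds, so that conjunction becomes set intersection and emptiness becomes the empty set. The function `find_empty_contexts`, the procedure names, and the guard against recursive calls are assumptions made for the sketch.

```python
def find_empty_contexts(call_graph, body_type, root, universe):
    """Return the procedures whose body is empty in some calling context.

    call_graph maps a procedure to the procedures it calls; body_type maps
    a procedure to the approximation of its body.
    """
    empty = []
    stack = [universe]  # approximations of all enclosing contexts

    def visit(proc, path):
        # Entering a procedure: conjoin its body with the top of the stack.
        context = body_type[proc] & stack[-1]
        stack.append(context)
        if not context:
            empty.append(proc)
        else:
            for callee in call_graph.get(proc, []):
                if callee not in path:  # assumption: skip recursive calls
                    visit(callee, path | {callee})
        stack.pop()  # leaving the procedure restores the caller's context

    visit(root, frozenset({root}))
    return empty

# Hypothetical query with four procedures.
universe = frozenset({"person", "company", "file"})
body_type = {
    "main": frozenset({"person", "company"}),
    "a": frozenset({"person"}),
    "b": frozenset({"file"}),
    "c": frozenset({"company"}),
}
call_graph = {"main": ["a", "b"], "a": ["c"]}
result = find_empty_contexts(call_graph, body_type, "main", universe)
```

In this example `c` is empty when reached through `a`, and `b` is empty when called from `main`, so both are reported.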
0. 15. The computer-implemented method of
the value will be included in this portion of the first type program regardless of contents stored in the database; and
the value will not be included in the portion of the first type program.
0. 16. A system comprising:
a memory;
a processor communicatively coupled to the memory; and
a type inferencer communicatively coupled to the memory and the processor, wherein the type inferencer is adapted to:
accessing a program with one or more query operations to a database that contains relations described by at least one database schema;
receiving the database schema and at least one entity type hierarchy for the database; and
inferring a first type program from definitions in the program by replacing each use of a database relation in the program by its type in the database schema, and the type is at least a portion of a second type program,
wherein each of the first type program and the second type program is a derived program using type symbols without other extensional relation symbols, and each of the first type program and the second type program does not contain negation,
wherein the first type program and the second type program use monadic extensions;
testing a portion of the first type program that has been inferred for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for the type.
0. 17. The system of
in response to type emptiness being found for a portion of the first type program that has been inferred during determining of the testing portion of the first type program, performing at least one of:
removing a type test;
providing a notification regarding the emptiness found for the portion of the first type program;
providing notification on combining queries by conjunction without creating empty parts in a combined query;
in response to an empty part of a query being a conjunction, finding a smallest set of query parts that has a conjunction that is empty;
in response to a search for a smallest empty part of a database query that traverses all parts of the database query, pushing a conjunction of an approximation of the smallest empty part and a context on top of a stack of approximations of all contexts where the empty part is being used; and
eliminating empty query parts to achieve virtual method resolution in an object-oriented query language.
0. 18. The system of
in response to type inclusion being found for a portion of the first type program that has been inferred during determining of the testing portion of the first type program, performing at least one of:
removing a type test; and
providing a notification regarding the type inclusion being found for the portion type.
0. 19. A non-transitory computer program product, the computer program product comprising instructions for:
accessing a program with one or more query operations to a database that contains relations described by at least one database schema;
receiving the database schema and at least one entity type hierarchy for the database; and
inferring a first type program from definitions in the program by replacing each use of a database relation in the program by its type in the database schema, and the type is at least a portion of a second type program,
wherein each of the first type program and the second type program is a derived program using type symbols without other extensional relation symbols, and each of the first type program and the second type program does not contain negation,
wherein the first type program and the second type program use monadic extensions;
testing a portion of the first type program that has been inferred for type emptiness and type inclusion; and
providing at least one of error information and optimization information regarding the type emptiness and type inclusion being found for the type.
0. 20. The non-transitory computer program product of
in response to type emptiness being found for a portion of the first type program that has been inferred during determining of the testing portion of the first type program, performing at least one of:
removing a type test;
providing a notification regarding the emptiness found for the portion of the first type program;
providing notification on combining queries by conjunction without creating empty parts in a combined query;
in response to an empty part of a query being a conjunction, finding a smallest set of query parts that has a conjunction that is itself empty;
in response to a search for a smallest empty part of a database query that traverses all parts of the database query, pushing a conjunction of an approximation of that part and a context on top of a stack; and
eliminating empty query parts to achieve virtual method resolution in an object-oriented query language.
0. 22. The computer program product of claim 21, wherein:
the type propositions of the ttcs do not contain negation.
0. 23. The computer program product of claim 21, wherein:
the type hierarchy includes one or more of (i) a statement of implication between entity types, (ii) a statement of equivalence of entity types, or (iii) a statement of disjointness of entity types.
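By way of non-limiting illustration only, a hierarchy containing one statement of each kind, together with brute-force tests for type emptiness and type inclusion, might be sketched as follows in Python. The entity types and the helpers `models`, `is_empty`, and `is_subtype` are hypothetical. Enumerating all truth assignments is exponential in the number of type symbols and is used here only because it is easy to check by hand.

```python
from itertools import product

# Hypothetical entity types, chosen only for illustration.
SYMBOLS = ["person", "employee", "staff", "company"]

def hierarchy(v):
    """h(x): a quantifier-free formula over type symbols for one element x."""
    return (
        (not v["employee"] or v["person"])       # implication: employee -> person
        and (v["staff"] == v["employee"])        # equivalence: staff <-> employee
        and not (v["person"] and v["company"])   # disjointness: person, company
    )

def models():
    """Yield every truth assignment to SYMBOLS that satisfies the hierarchy."""
    for bits in product([False, True], repeat=len(SYMBOLS)):
        v = dict(zip(SYMBOLS, bits))
        if hierarchy(v):
            yield v

def is_empty(t):
    """A type proposition is empty if no element allowed by h can satisfy it."""
    return not any(t(v) for v in models())

def is_subtype(t, u):
    """t <: u if every element allowed by h that satisfies t also satisfies u."""
    return all(u(v) for v in models() if t(v))
```

For instance, the proposition "employee and company" is empty under this hierarchy, because every employee is a person and no person is a company.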
0. 24. The computer program product of claim 21, wherein:
translating the first type program into the typing comprises eliminating all non-maximal and degenerate ttcs from the set of ttcs.
0. 25. The computer program product of claim 21, further comprising instructions to cause the computer system to perform the operations of:
representing the ttcs in the typing by encoding the component type propositions of the ttcs as binary decision diagrams.
0. 26. The computer program product of claim 21, further comprising instructions to cause the computer system to perform the operations of:
building a ttc ∥τ∥ from a pre-ttc, the pre-ttc being a structure τ=(t1, . . . , tn|p|c), in which t1, . . . , tn are a tuple of type propositions, p is a relation over {1, . . . , n}, c is a set of inhabitation constraints, which are type propositions, and τ does not satisfy at least one of the three requirements, by setting:
∥τ∥:=(∥t1∥ where
∥ti∥q=∧{tj|i˜qj}, where i˜qj states that i and j belong to the same partition in q, and
0. 27. The computer program product of claim 21, wherein the query program is a first query program and the typing is a first typing, the operations further comprising:
obtaining a second query program that queries a second database that is described by the database schema and the type hierarchy;
representing the second query program by a third type program having no free relation variables and having no fixpoint definitions containing intensional relation variables;
translating the third type program into a second typing; and
performing a containment check on the first typing and the second typing to determine whether the first type program contains the third type program.
0. 28. The computer program product of claim 27, wherein the data in the second database is different from the data in the first database.
0. 29. The computer program product of claim 21, further comprising:
performing type specialization and type erasure using the typing to optimize the query program.
0. 30. The computer program product of claim 21, wherein the first type program is without fixpoint definitions and has intensional relation variables r1, . . . , rm and element variables x1, . . . , xn, and wherein translating the first type program into a typing comprises performing recursion on the structure of the first type program to translate the first type program, denoted φ, to a mapping (φ): Typingm→Typing, the mapping being a mapping from a tuple of m typings {right arrow over (T)}, wherein each of the m typings is associated with a respective one of the m relation variables r1, . . . , rm and the recursion is defined by:
⊥({right arrow over (T)})=∅
xi≐xj({right arrow over (T)})={(T, . . . , T|{[i,j]})}
u(xj)({right arrow over (T)})={Tttcj:=u}
¬u(xj)({right arrow over (T)})={Tttcj:=¬u}
ri({right arrow over (x)})({right arrow over (T)})=Ti
ψ∧χ({right arrow over (T)})=ψ({right arrow over (T)})
ψ∨χ({right arrow over (T)})=ψ({right arrow over (T)})∨χ({right arrow over (T)})
∃x
wherein
the i-th existential of a ttc τ=(t1, . . . , tn|p|c) is defined as ∃i(τ):=(t1, . . . , ti-1, ti+1, . . . , tn|p′|min(C∪{ti}))
where p′ is the restriction of p to {1,. . . ,i−1,i+1, . . . ,n} and
the i-th existential of a typing T is defined pointwise as ∃i(T) := max⊥ {∃i(τ)|τ∈T},
where the operation max⊥ on a set of ttcs removes all non-maximal and degenerate ttcs from the set of ttcs;
binary meets and joins of two typings T and T′ are defined pointwise and respectively as T
{Tttc} is a greatest element under the order <: on a typing, and {Tttcj:=u} is a second ttc resulting from a first ttc {Tttc} by replacing the j-th type proposition in the tuple of type propositions of the first ttc by u in the second ttc and adding u to the inhabitation constraints of the second ttc; and
{(T, . . . , T|
wherein the overall typing is the least fixpoint of the resulting typings {right arrow over (T)} after the recursion is complete.
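By way of non-limiting illustration only, the i-th existential of a single ttc might be sketched as follows in Python. The sketch assumes that a type proposition is modeled as the frozenset of atomic kinds it admits, so that <: becomes a subset test, and it reads min as keeping only the minimal elements of a set under <:. The helpers `minimal` and `existential` and the entity kinds are hypothetical. Column indices are kept as they were rather than renumbered.

```python
def minimal(props):
    """Keep only the minimal elements under <: (modeled here as subset)."""
    return frozenset(c for c in props if not any(d < c for d in props))

def existential(i, props, partition, inhab):
    """Project column i out of a ttc (t1, ..., tn | p | C).

    props maps each column index to its type proposition, partition is a
    set of frozensets of column indices, and inhab is the set C.
    """
    # Drop the i-th component of the tuple.
    new_props = {k: t for k, t in props.items() if k != i}
    # Restrict the partition to the remaining indices.
    new_partition = {cls - {i} for cls in partition if cls - {i}}
    # The projected column survives as an inhabitation constraint.
    new_inhab = minimal(set(inhab) | {props[i]})
    return new_props, new_partition, new_inhab

# Hypothetical kinds: every employee is also a worker.
EMPLOYEE = frozenset({"employee"})
WORKER = frozenset({"employee", "contractor"})

props = {1: EMPLOYEE, 2: WORKER, 3: EMPLOYEE}
partition = {frozenset({1, 3}), frozenset({2})}
projected = existential(2, props, partition, frozenset({EMPLOYEE}))
```

Projecting out column 2 leaves columns 1 and 3 equated, and the proposition of the removed column is absorbed by the stronger constraint already present in C.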
0. 32. The system of claim 31, wherein:
the type propositions of the ttcs do not contain negation.
0. 33. The system of claim 31, wherein:
the type hierarchy includes one or more of (i) a statement of implication between entity types, (ii) a statement of equivalence of entity types, or (iii) a statement of disjointness of entity types.
0. 34. The system of claim 31, wherein:
translating the first type program into the typing comprises eliminating all non-maximal and degenerate ttcs from the set of ttcs.
0. 35. The system of claim 31, the operations further comprising:
representing the ttcs in the typing by encoding the component type propositions of the ttcs as binary decision diagrams.
0. 36. The system of claim 31, the operations further comprising:
building a ttc ∥τ∥ from a pre-ttc, the pre-ttc being a structure τ=(t1, . . . , tn|p|c), in which t1, . . . , tn are a tuple of type propositions, p is a relation over {1, . . . , n}, c is a set of inhabitation constraints, which are type propositions, and τ does not satisfy at least one of the three requirements, by setting:
∥τ∥:=(∥t1∥ where
∥ti∥q=∧{tj|i˜qj}, where i˜qj states that i and j belong to the same partition in q, and
0. 37. The system of claim 31, wherein the query program is a first query program and the typing is a first typing, the operations further comprising:
obtaining a second query program that queries a second database that is described by the database schema and the type hierarchy;
representing the second query program by a third type program having no free relation variables and having no fixpoint definitions containing intensional relation variables;
translating the third type program into a second typing; and
performing a containment check on the first typing and the second typing to determine whether the first type program contains the third type program.
0. 38. The system of claim 37, wherein the data in the second database is different from the data in the first database.
0. 39. The system of claim 31, the operations further comprising:
performing type specialization and type erasure using the typing to optimize the query program.
0. 40. The system of claim 31, wherein the first type program is without fixpoint definitions and has intensional relation variables r1, . . . , rm and element variables x1, . . . , xn, and wherein translating the first type program into a typing comprises performing recursion on the structure of the first type program to translate the first type program, denoted φ, to a mapping (φ): Typingm→Typing, the mapping being a mapping from a tuple of m typings {right arrow over (T)}, wherein each of the m typings is associated with a respective one of the m relation variables r1, . . . , rm and the recursion is defined by:
⊥({right arrow over (T)})=∅ xi≐xj({right arrow over (T)})={(T, . . . , T|{[i,j]})} u(xj)({right arrow over (T)})={Tttcj:=u} ¬u(xj)({right arrow over (T)})={Tttcj:=¬u} ri({right arrow over (x)})({right arrow over (T)})=Ti ψ∧χ({right arrow over (T)})=ψ({right arrow over (T)}) ψ∨χ({right arrow over (T)})=ψ({right arrow over (T)})∨χ({right arrow over (T)}) ∃x wherein
the i-th existential of a ttc τ=(t1, . . . , tn|p|c) is defined as ∃i(τ):=(t1, . . . ,ti-1,t1+1, . . . ,tn|p′|min(C∪{ti}))
where p′ is the restriction of p to {1, . . . ,i−1,i+1, . . . , n} and
the i-th existential of a typing T is defined pointwise as ∃i(T):=max⊥{∃i(τ)|τ∈T},
where the operation max⊥ on a set of ttcs removes all non-maximal and degenerate ttcs from the set of ttcs;
binary meets and joins of two typings T and T′ are defined pointwise and respectively as T
{Tttc} is a greatest element under the order <: on a typing, and {Tttcj=u} is a second ttc resulting from a first ttc {Tttc} by replacing the j-th type proposition in the tuple of type propositions of the first ttc by u in the second ttc and adding u to the inhabitation constraints of the second ttc; and
{(T, . . . ,T|
wherein the overall typing is the least fixpoint of the resulting typings T⃗ after the recursion is completed.
0. 42. The method of claim 41, wherein:
the type propositions of the ttcs do not contain negation.
0. 43. The method of claim 41, wherein:
the type hierarchy includes one or more of (i) a statement of implication between entity types, (ii) a statement of equivalence of entity types, or (iii) a statement of disjointness of entity types.
0. 44. The method of claim 41, wherein:
translating the first type program into the typing comprises eliminating all non-maximal and degenerate ttcs from the set of ttcs.
0. 45. The method of claim 41, further comprising:
representing the ttcs in the typing by encoding the component type propositions of the ttcs as binary decision diagrams.
0. 46. The method of claim 41, further comprising:
building a ttc ∥τ∥ from a pre-ttc, the pre-ttc being a structure τ=(t1, . . . , tn|p|c), in which t1, . . . , tn are a tuple of type propositions, p is a relation over {1, . . . , n}, c is a set of inhabitation constraints, which are type propositions, and τ does not satisfy at least one of the three requirements, by setting:
∥τ∥:=(∥t1∥ where
∥ti∥q=∧{tj|i˜qj}, where i˜qj states that i and j belong to the same partition in q, and
0. 47. The method of claim 41, wherein the query program is a first query program and the typing is a first typing, the method further comprising:
obtaining a second query program that queries a second database that is described by the database schema and the type hierarchy;
representing the second query program by a third type program having no free relation variables and having no fixpoint definitions containing intensional relation variables;
translating the third type program into a second typing; and
performing a containment check on the first typing and the second typing to determine whether the first type program contains the third type program.
0. 48. The method of claim 47, wherein the data in the second database is different from the data in the first database.
0. 49. The method of claim 41, further comprising:
performing type specialization and type erasure using the typing to optimize the query program.
0. 50. The method of claim 41, wherein the first type program is without fixpoint definitions and has intensional relation variables r1, . . . , rm and element variables x1, . . . , xn, and wherein translating the first type program into a typing comprises performing recursion on the structure of the first type program to translate the first type program, denoted φ, to a mapping (φ): Typingm→Typing, the mapping being a mapping from a tuple of m typings T⃗, wherein each of the m typings is associated with a respective one of the m relation variables r1, . . . , rm and the recursion is defined by:
(⊥)(T⃗)=∅,
(xi≐xj)(T⃗)={(T, . . . , T|{[i,j]})},
(u(xj))(T⃗)={Tttcj:=u},
(¬u(xj))(T⃗)={Tttcj:=¬u},
(ri(x⃗))(T⃗)=Ti,
(ψ∧χ)(T⃗)=(ψ)(T⃗)∧(χ)(T⃗),
(ψ∨χ)(T⃗)=(ψ)(T⃗)∨(χ)(T⃗), and
(∃xi.ψ)(T⃗)=∃i((ψ)(T⃗)), wherein
the i-th existential of a ttc τ=(t1, . . . , tn|p|C) is defined as ∃i(τ):=(t1, . . . ,ti−1,ti+1, . . . ,tn|p′|min(C∪{ti}))
where p′ is the restriction of p to {1, . . . , i−1, i+1, . . . , n} and
the i-th existential of a typing T is defined pointwise as ∃i(T):=max⊥{∃i(τ)|τ∈T},
where the operation max⊥ on a set of ttcs removes all non-maximal and degenerate ttcs from the set of ttcs;
binary meets and joins of two typings T and T′ are defined pointwise and respectively as T
{Tttc} is a greatest element under the order <: on a typing, and {Tttcj=u} is a second ttc resulting from a first ttc {Tttc} by replacing the j-th type proposition in the tuple of type propositions of the first ttc by u in the second ttc and adding u to the inhabitation constraints of the second ttc; and
{(T, . . . ,T|
wherein the overall typing is the least fixpoint of the resulting typings T⃗ after the recursion is completed.
|
states that the first column of the extensional relation salary will contain values of type employee, and the second one will contain floating point numbers. A database schema provides this column-wise typing information for every extensional relation.
Given the schema and the hierarchy, type inference should infer the following implications from the definitions in the program:
bonus(x, y)→employee(x) ∧ float(y)
factor(x, y)→employee(x) ∧ float(y)
query(x, y)→senior(x) ∧ manager(x) ∧ float(y)
Note how these depend on facts stated in the type hierarchy: for example, in inferring the type of factor, we need to know that employee is the union of junior and senior.
Type inference will catch errors: when a term is inferred to have the type ⊥, we know that it denotes the empty relation, independent of the contents of the database. Also, the programmer may wish to declare types for intensional relations in her program, and then it needs to be checked that the inferred type implies the declared type.
The benefits of having such type inference for queries are not confined to merely checking for errors. Precise types are also useful in query optimisation. In particular, the above query can be optimised to:
query(x, y)←manager(x) ∧ senior(x) ∧ salary(x, z) ∧ y=0.15×z
This rewriting relies on the fact that we are asking for a manager, and therefore the disjunct in the definition of bonus that talks about students does not apply: a student is a part-time employee, and managers cannot be part-time. Again notice how the complex type hierarchy is crucial in making that deduction. The elimination of disjuncts based on the types in the calling context is named type specialisation; type specialisation is very similar to virtual method resolution in optimising compilers for object-oriented languages. Similarly, the junior alternative in the definition of factor can be eliminated. Once type specialisation has been applied, we can eliminate a couple of superfluous type tests: for instance, ¬student(x) is implied by manager(x). That is called type erasure. The key to both type specialisation and type erasure is an efficient and complete test for type containment, which takes all the complex facts about types into account.
Overview. The general problem we address is as follows. Given the class of all Datalog programs, we identify a sublanguage of type programs. Type inference assigns an upper bound ┌p┐ in the language of type programs to each program p. To say that ┌p┐ is an upper bound is just to say that p (plus the schema and the type hierarchy) implies ┌p┐. The idea of such ‘upper envelopes’ is due to Chaudhuri [8, 9].
In Section 2 we propose a definition of the class of type programs: any program where all extensionals called are monadic. Type programs can, on the other hand, contain (possibly recursive) definitions of intensional predicates of arbitrary arity. We then define a type inference procedure ┌p┐, and prove that its definition is sound in the sense that ┌p┐ is truly an upper bound of p. Furthermore, it is shown that for negation-free programs p, the inferred type ┌p┐ is also optimal: ┌p┐ is the smallest type program that is an upper bound of p.
For negation-free programs, the definition of ┌p┐ is in terms of the semantics of p, so that the syntactic presentation of a query does not affect the inferred type. It follows that the application of logical equivalences by the programmer or by the query optimiser does not affect the result of type inference.
The restriction that programs be negation-free can be relaxed to allow negation to occur in front of sub-programs that already have the shape of a type program. Not much more can be hoped for, since sound and optimal type inference for programs with arbitrary negations is not decidable.
To also do type checking and apply type optimisations, we need an effective way of representing and manipulating type programs. To this end, we identify a normal form for a large and natural class of type programs in Section 3. The class consists of those type programs where the only negations are on monadic extensionals (i.e., entity types). There is a simple syntactic containment check on this representation, inspired by our earlier work reported in [13].
That simple containment check is sound but not complete. A geometric analysis of the incompleteness in Section 4 suggests a generalisation of the celebrated procedure of Quine [22] for computing prime implicants. We present that generalised procedure, and show that the combination of the simple containment check plus the computation of prime implicants yields a complete algorithm for testing containment of type programs.
The algorithm we present is very different from the well-known approach of propositionalisation, which could also be used to decide containment. The merit of our novel algorithm is that it allows an implementation that is efficient in common cases. We discuss that implementation in Section 5. We furthermore present experiments with an industrial database system that confirm our claims of efficiency on many useful queries.
Finally, we discuss related work in Section 6, and conclude in Section 8.
In summary, the original contributions of this paper include:
2. Type Programs
It is customary to write Datalog programs as a sequence of rules as seen in
This logic, presented for example in [10], extends first order logic (including equality) with fixpoint definitions of the form [r(x1, . . . , xn)≡ϕ]ψ. Here, r is an n-ary relation symbol, the xi are pairwise different element variables, and ϕ and ψ are formulae. Intuitively, the formula defines r by the formula ϕ, which may contain recursive occurrences of r, and the defined relation can then be used in ψ.
The xi are considered bound in ϕ and free in ψ, whereas the relation r is free in ϕ and bound in ψ. To ensure stratification, free relation symbols are not allowed to occur under a negation, which avoids the problematic case of recursion through negation. For example, the formula
[s(x, y)≡x≐y ∨ ∃z.e(x, z) ∧ s(z, y)]s(x, y) (1)
defines the relation s to be the reflexive transitive closure of e. By way of illustration (and only in this example), free occurrences of element variables and relation variables have been marked up in bold face. The formula is trivially stratified since no negations occur anywhere.
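The least-fixpoint reading of formula (1) can be illustrated on a finite database. The following is a minimal sketch, assuming a small hypothetical domain and edge relation e; the function name is illustrative and not part of the disclosure. It starts from the empty relation and reapplies the defining equation of s until nothing changes.

```python
def reflexive_transitive_closure(domain, e):
    """Least fixpoint of s(x, y) ≡ x = y ∨ ∃z. e(x, z) ∧ s(z, y)."""
    s = set()  # start from the empty relation
    while True:
        # One application of the defining equation to the current s.
        step = {(x, x) for x in domain} | {
            (x, y) for (x, z) in e for (z2, y) in s if z == z2
        }
        if step == s:  # nothing new: the least fixpoint is reached
            return s
        s = step


# Hypothetical example data.
domain = {1, 2, 3, 4}
e = {(1, 2), (2, 3)}
closure = reflexive_transitive_closure(domain, e)
```

Because the defining equation is monotonic in s and the domain is finite, the loop is guaranteed to terminate.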
In this program, the relation symbol s is an intensional predicate since it is given a defining equation. On the other hand, the relation e is extensional, which we assume to be defined as part of a database on which the program is run. We will denote the set of extensional relation symbols as ℰ, and require it to be disjoint from the set of intensional relation symbols.
Model-theoretically, the role of the database is played by an interpretation ℐ which assigns, to every n-ary relation symbol e ∈ ℰ, a set of n-tuples over some non-empty domain. To handle free variables we further need an assignment σ that maps free element variables to domain elements and free relation variables to sets of tuples.
Just as in first order logic, we can then define a modelling judgement ℐ ⊨σ ϕ stating that formula ϕ is satisfied under interpretation ℐ and assignment σ. In particular, stratification ensures that formulae can be interpreted as continuous maps of their free relation variables, so that fixpoint definitions can be semantically interpreted by the least fixpoints of their defining equations.
For a formula ϕ with a single free element variable x, we set [[ϕ]]ℐ:={c | ℐ ⊨x:=c ϕ}, omitting the subscript where unambiguous.
Just as the semantics of the intensional predicates is determined with respect to the semantics of the extensional predicates, the types of the intensional predicates are derived with respect to the types of the extensional predicates. These are provided by a schema 𝒮 which assigns, to every n-ary extensional predicate, an n-tuple of types. As is customary, types are themselves monadic predicates from a set 𝒯 ⊂ ℰ of type symbols. For consistency, we require that the schema assigns a type symbol to itself as its type.
For the program in (1), for example, we might have two type symbols a and b, with the schema specifying that 𝒮(e)=(a, b) and of course 𝒮(a)=(a), 𝒮(b)=(b).
Semantically, we understand this to mean that in every interpretation of the extensional symbols conforming to the schema, the extension of a given column of an extensional symbol should be exactly the same as the extension of the type symbol assigned to it by the schema.
We can define a first order formula, likewise denoted as 𝒮, which expresses this constraint: For every n-ary relation symbol e ∈ ℰ \ 𝒯 and every 1≤i≤n where u ∈ 𝒯 is the type assigned to the i-th column of e, the formula contains a conjunct
∀xi.(∃x1, . . . , xi−1, xi+1, . . . , xn.e(x1, . . . , xn)) ↔ u(xi) (2)
Our example schema above, for instance, would be expressed by the formula
(∀x.(∃y.e(x,y)) ↔ a(x)) ∧ (∀y.(∃x.e(x, y)) ↔ b(y))
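Constraint (2) can be checked directly on a finite database. The sketch below assumes a hypothetical representation in which an interpretation is a dictionary from relation names to sets of tuples and the schema is a dictionary from relation names to tuples of type names; neither representation comes from the source.

```python
def column(relation, i):
    """Projection of a relation (a set of tuples) onto its i-th column."""
    return {row[i] for row in relation}


def conforms(interp, schema):
    """Check formula (2): each column of e equals the extension of its type."""
    return all(
        column(interp[e], i) == column(interp[u], 0)
        for e, types in schema.items()
        for i, u in enumerate(types)
    )


# The example schema: e has column types (a, b); a and b type themselves.
schema = {"e": ("a", "b"), "a": ("a",), "b": ("b",)}
good = {"e": {(1, 2), (2, 3)}, "a": {(1,), (2,)}, "b": {(2,), (3,)}}
bad = {"e": {(1, 2)}, "a": {(1,), (2,)}, "b": {(2,)}}  # 2 has type a but is not in column 1
```

Here `conforms(good, schema)` holds, while `bad` violates the bi-implication because the extension of a is larger than the first column of e.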
In the literature, the types assigned by the schema are often not required to fit exactly, but merely to provide an over-approximation of the semantics. We could achieve this by relaxing (2) to use an implication instead of a bi-implication.
Instead, we take a more general approach by allowing arbitrary additional information about the types to be expressed through a hierarchy formula ℋ. Our only requirement is that this formula be a closed first order formula containing only type symbols (and no other relation symbols).
In particular, the hierarchy may stipulate subtyping relations; in our example, the hierarchy could be ∀x.a(x)→b(x), expressing that a is a subtype of b. Perhaps more interestingly, it can also express disjointness of types, e.g. by stating that ∀x.¬a(x) ∨ ¬b(x); we could then deduce that in the definition of the reflexive transitive closure e can, in fact, be iterated at most once. Many other kinds of constraints are just as easily expressed, and provide opportunities for advanced optimisations.
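We now approximate programs by type programs, which are programs that only make use of type symbols (and no other extensional relation symbols), and do not contain negation. (Compare this with the hierarchy, which can contain negations, but no fixpoint definitions.) To every SLFP formula ϕ we assign a type program ┌ϕ┐ by replacing negated subformulae with ⊤, and every use of an extensional relation symbol with the conjunction of its column types as given by the schema. For example, the type of the program (1) under the example schema is
[s(x, y)≡x≐y ∨ ∃z.a(x) ∧ b(z) ∧ s(z, y)]s(x, y)
which is semantically equivalent to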
x≐y ∨ a(x) ∧ b(y)
We will see in the next section that, in fact, fixpoint definitions can always be eliminated from such type programs, yielding formulae of monadic first order logic.
As the example suggests, the types assigned to programs semantically over-approximate them in the following sense:
Theorem 1 (Soundness). For every program ϕ we have
𝒮, ℋ, ϕ ⊨ ┌ϕ┐
That is, every interpretation and assignment satisfying the schema, the hierarchy and ϕ also satisfies ┌ϕ┐.
Proof Sketch. We can show by induction on ϕ the following stronger version of the theorem: For any interpretation ℐ with ℐ ⊨ 𝒮 and any two assignments σ and σ′ which assign the same values to element variables, and where σ(r) ⊂ σ′(r) for every intensional relation symbol r, we have ℐ ⊨σ′ ┌ϕ┐ whenever ℐ ⊨σ ϕ.
From this, of course, it follows that 𝒮, ϕ ⊨ ┌ϕ┐, and again 𝒮, ℋ, ϕ ⊨ ┌ϕ┐ by monotonicity of entailment.
Perhaps more surprisingly, our type assignment also yields the tightest such over-approximation, in spite of our very lax restrictions on the hierarchy:
Theorem 2 (Optimality). For a negation-free program ϕ and a type program ϑ, if we have
𝒮, ℋ, ϕ ⊨ ϑ
then also
𝒮, ℋ, ┌ϕ┐ ⊨ ϑ
To prove this theorem, we need a monotonicity result for the type assignment.
Lemma 3. For two negation-free programs ϕ and ψ, if it is the case that 𝒮, ℋ, ϕ ⊨ ψ then also 𝒮, ℋ, ┌ϕ┐ ⊨ ┌ψ┐.
An easy way to prove this result is to make use of cartesianisation, which is the semantic equivalent of the typing operator ┌⋅┐. For a relation R ⊂ Dn, we define its cartesianisation cart(R)=π1(R) × . . . × πn(R) to be the cartesian product of the columns of R. Likewise, the cartesianisation cart(ℐ) of an interpretation ℐ is the interpretation which assigns cart(ℐ(e)) to every relation symbol e.
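Cartesianisation of a finite relation is straightforward to compute. The following is a minimal sketch, assuming relations are represented as sets of equal-length tuples; the explicit `arity` parameter is an assumption made so that the empty relation is handled uniformly.

```python
from itertools import product


def cart(relation, arity):
    """Cartesian product of the columns of a relation (a set of tuples)."""
    columns = [{row[i] for row in relation} for i in range(arity)]
    return set(product(*columns))


# Hypothetical example: two rows whose columns get recombined freely.
R = {(1, "x"), (2, "y")}
cart_R = cart(R, 2)
```

The result always contains the original relation, which mirrors the fact that the inferred type over-approximates the program.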
It is then not hard to see that
cart(ℐ) ⊨σ ϕ ⇔ ℐ ⊨σ ┌ϕ┐ (3)
for any negation-free formula ϕ, interpretation ℐ and assignment σ whenever ℐ ⊨ 𝒮. Also, cart(ℐ) ⊨ 𝒮 if ℐ ⊨ 𝒮, and cart(ℐ) ⊨ ℋ iff ℐ ⊨ ℋ.
From this observation, we easily get the
Proof of Lemma 3. Assume 𝒮, ℋ, ϕ ⊨ ψ, and let ℐ, σ be given with ℐ ⊨ 𝒮, ℐ ⊨ ℋ and ℐ ⊨σ ┌ϕ┐.
By the observations above we have cart(ℐ) ⊨ 𝒮 and cart(ℐ) ⊨ ℋ, and by (3) also cart(ℐ) ⊨σ ϕ, so by assumption cart(ℐ) ⊨σ ψ; applying (3) again we get ℐ ⊨σ ┌ψ┐.
We briefly pause to remark that Lemma 3 also shows that type inference for negation-free programs respects semantic equivalence: If ϕ and ψ are semantically equivalent (under 𝒮 and ℋ), then so are their types. This does, of course, not hold in the presence of negation, for given a type symbol u, we have ┌u(x)┐=u(x) yet ┌¬¬u(x)┐=⊤.
Continuing with our development, we can now give the
Proof of Theorem 2. Assume 𝒮, ℋ, ϕ ⊨ ϑ; by Lemma 3 this means 𝒮, ℋ, ┌ϕ┐ ⊨ ┌ϑ┐. But since ϑ is a type program we have ┌ϑ┐=ϑ, which gives the result.
So far we have handled negation rather crudely, trivially approximating it by ⊤. In fact, we can allow negations in type programs and amend the definition of ┌⋅┐ for negations to read
┌¬ϕ┐ := ¬ϕ if ϕ is a type program, and ⊤ otherwise.
All the above results, including soundness and optimality, go through with this new definition, and the optimality and monotonicity results can be generalised to hold on all programs which contain only harmless negations, i.e. negations where the negated formula is a type program.
There is little hope for an optimality result on an even larger class of programs: Using negation, equivalence checking of two programs can be reduced to emptiness, and a program is empty iff its optimal typing is ⊥. As equivalence is undecidable [23], any type system that is both sound and optimal for all programs would likewise have to be undecidable.
Our proposed definition of type programs is significantly more liberal than previous proposals in the literature. Frühwirth et al. in a seminal paper on types for Datalog [16] propose the use of conjunctive queries, with no negation, no disjunction, no equalities and no existentials.
As an example where the additional power is useful, consider a database that stores the marriage register of a somewhat traditionally minded community. This database might have types male and female, specified to be disjoint by the type hierarchy, and an extensional relation married with schema (female, male). (Note that this means that all male and female entities in the database are married, i.e. occur somewhere in the married relation.)
Using our above terminology, we have 𝒯={male, female} and ℰ=𝒯 ∪ {married}; the hierarchy ℋ is ∀x.¬male(x) ∨ ¬female(x).
Let us now define an intensional predicate spouse:
[spouse(x, y)≡married(x, y) ∨ married(y, x)]spouse(x, y)
What is the type of spouse(x, y)? In the proposal of [16], we could only assign it (female(x) ∨ male(x)) ∧ (female(y) ∨ male(y)), so both arguments could be either male or female. By contrast, when employing our definition of type programs, the inferred type is
(female(x) ∧ male(y)) ∨ (male(x) ∧ female(y))
This accurately reflects that x and y must be of opposite sex. By properly accounting for equality, we can also infer that the query spouse(x, y) ∧ x≐y has type ⊥ under the hierarchy: nobody can be married to themselves.
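The difference between the two types can be observed on a concrete database. The sketch below uses a small hypothetical marriage register (all names are invented for illustration) that conforms to the schema (female, male), and compares the inferred type with the coarser conjunctive type.

```python
# Hypothetical database conforming to the schema married: (female, male).
married = {("alice", "bob"), ("carol", "dave")}
female = {w for (w, _) in married}  # column 1 of married is exactly female
male = {h for (_, h) in married}    # column 2 of married is exactly male
disjoint = not (female & male)      # the hierarchy: male and female are disjoint

# spouse(x, y) ≡ married(x, y) ∨ married(y, x)
spouse = married | {(y, x) for (x, y) in married}


def inferred_type(x, y):
    """(female(x) ∧ male(y)) ∨ (male(x) ∧ female(y))"""
    return (x in female and y in male) or (x in male and y in female)


def coarse_type(x, y):
    """(female(x) ∨ male(x)) ∧ (female(y) ∨ male(y))"""
    return (x in female or x in male) and (y in female or y in male)


people = female | male
self_married = [x for x in people if inferred_type(x, x)]
```

Both types cover every tuple of spouse, but only the coarse type admits pairs such as ("alice", "carol"), and `self_married` is empty, matching the observation that spouse(x, y) ∧ x≐y has type ⊥.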
3. Representing Type Programs
For query optimisation and error checking, the single most important operation on types is containment checking, i.e. we want to decide, for two type programs ϑ and ϑ′, whether ℋ ∧ ϑ ⊨ ϑ′.
¬c(z)
≡ {sorting by variable}
∃z.a(x) ∧ c(x) ∧ b(y) ∧ x≐y ∧ z≐y ∧ ¬c(z)
≡ {pushing in ∃}
a(x) ∧ c(x) ∧ b(y) ∧ x≐y ∧ ∃z.z≐y ∧ ¬c(z)
≡ {eliminating ≐ under ∃}
a(x) ∧ c(x) ∧ b(y) ∧ x≐y ∧ ¬c(y)
≡ {sorting again}
a(x) ∧ c(x) ∧ b(y) ∧ ¬c(y) ∧ x≐y
We will now introduce data structures that represent formulae in solved form, and develop a containment checking algorithm that works directly on this representation.
We define an order on type propositions by taking ϕ<: ψ to mean
where ∥ti∥q=∧{tj|i˜q j}.
For a TTC τ=(t1, . . . , tn|p|C), we define τi:=ti, and write τ[i:=r] for the TTC resulting from τ by replacing the component ti (and every component tj where i˜p j) by r and adding r to the inhabitation constraints.
The formula represented by a TTC is easily recovered:
Definition 2. Given a list of n different element variables x1, . . . , xn, we define the type program [τ](x1, . . . , xn) corresponding to a TTC τ=(t1, . . . , tn|p|C) as
∧1≤i≤n ti(xi) ∧ ∧i˜pj xi≐xj ∧ ∧c∈C ∃y.c(y)
A TTC is called degenerate if ⊥ is among its inhabitation constraints; it represents an unsatisfiable formula. The equivalence relation is called trivial if it is in fact the identity relation id, and the set of inhabitation constraints is called trivial if it only constrains the component types to be inhabited. When writing a TTC, we will often leave out trivial partitionings or inhabitation constraints.
We extend the order <: to TTCs component-wise:
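Definition 3. For two n-ary TTCs τ=(t1, . . . , tn|p|C) and τ′=(t1′, . . . , tn′|p′|C′) we define τ<:τ′ to hold iff ti<:ti′ for every 1≤i≤n, the partitioning p′ is at least as fine as p (i.e. i˜p′j implies i˜pj), and for every c′ ∈ C′ there is some c ∈ C with c<:c′.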
It is routine to check that this defines a partial order with maximal element
Since all type programs can be brought into solved form, we can represent them as sets of TTCs.
Definition 4. A lean set of non-degenerate TTCs is called a typing.
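The formula represented by a typing is the disjunction of the formulae represented by its TTCs: [T](x1, . . . , xn) := ∨τ∈T [τ](x1, . . . , xn).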
The order <: is extended to typings by defining T<:T′ to hold if, for every τ ∈ T there is a τ′ ∈ T′ with τ<:τ′.
To convert any set of TTCs into a typing we can eliminate all non-maximal and degenerate TTCs from it; this operation we will denote by max⊥.
The order on typings is again a partial order with least element ∅ and greatest element {⊤TTC}, where ⊤TTC is the greatest TTC. Binary meets and joins are defined pointwise,
and joins are given by
T ∨ T′=max⊥(T ∪ T′) (6)
We will now show that every type program can be translated to a typing. We already have enough structure to model conjunction and disjunction; existential quantification is also not hard to support.
Definition 5. The i-th existential of a TTC τ=(t1, . . . , tn|p|C), where 1≤i≤n, is defined as
∃i(τ):=(t1, . . . , ti−1, ti+1, . . . , tn|p′|min(C ∪ {ti}))
where p′ is the restriction of p to {1, . . . , i−1, i+1, . . . , n}.
The existential of a typing is defined pointwise:
∃i(T):=max⊥{∃i(τ)|τ ∈ T}
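The operations on TTCs and typings introduced so far can be sketched in code. The following is an illustrative sketch under several assumptions that are not taken from the source: type propositions are modelled as frozensets of abstract "kinds" of entity (so ⊥ is the empty set and <: on propositions is set inclusion), the equivalence relation is stored as a set of 0-based index pairs, columns are renumbered after projection, and the component-wise order is read as: components shrink, equalities grow, and every inhabitation constraint on the right is implied by one on the left.

```python
from typing import NamedTuple


class TTC(NamedTuple):
    types: tuple      # one type proposition per column
    eqs: frozenset    # equality constraints as index pairs (i, j), 0-based
    cons: frozenset   # inhabitation constraints (type propositions)


def minimal(props):
    """min: keep only the smallest propositions (the others are implied)."""
    return frozenset(c for c in props if not any(d < c for d in props))


def leq(s, t):
    """s <: t, component-wise (an assumed reading of the order)."""
    return (all(a <= b for a, b in zip(s.types, t.types))
            and t.eqs <= s.eqs
            and all(any(c <= d for c in s.cons) for d in t.cons))


def max_bot(ttcs):
    """Drop degenerate TTCs, then every TTC strictly below another one."""
    live = {t for t in ttcs if frozenset() not in t.cons}
    return {t for t in live
            if not any(leq(t, u) and not leq(u, t) for u in live)}


def join(typing1, typing2):
    """T ∨ T′ = max⊥(T ∪ T′), as in equation (6)."""
    return max_bot(typing1 | typing2)


def exists(i, t):
    """∃i(τ): drop column i and record that its type must be inhabited."""
    keep = [k for k in range(len(t.types)) if k != i]
    renumber = {old: new for new, old in enumerate(keep)}
    eqs = frozenset((renumber[a], renumber[b])
                    for a, b in t.eqs if i not in (a, b))
    return TTC(tuple(t.types[k] for k in keep), eqs,
               minimal(t.cons | {t.types[i]}))


def exists_typing(i, typing):
    """∃i(T) := max⊥{∃i(τ) | τ ∈ T}."""
    return max_bot({exists(i, t) for t in typing})


# Hypothetical propositions a and b; A | B models a ∨ b.
A, B = frozenset({"a"}), frozenset({"b"})
tau = TTC((A | B, A), frozenset(), frozenset({A}))
small = TTC((A, A), frozenset(), frozenset({A}))
dead = TTC((A, A), frozenset(), frozenset({frozenset()}))  # degenerate: ⊥ must be inhabited
```

In this sketch `max_bot({tau, small, dead})` keeps only `tau`, since `small` lies below `tau` and `dead` is degenerate. A production implementation would more plausibly encode the propositions as binary decision diagrams, as the claims suggest.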
To see that this models the existential quantifier on type programs, we can easily convince ourselves of the following:
Lemma 6. Let τ be an n-ary TTC, ℐ an interpretation and σ an assignment. Then ℐ ⊨σ [∃i(τ)] iff there is a domain element d such that ℐ ⊨σ,xi:=d [τ].
Although typings cannot directly represent type programs with fixpoint definitions or free relation variables, we can translate every closed type program to a typing by eliminating definitions. Let us start with a translation relative to an assignment to free relation symbols.
Definition 6. Let ϕ be a type program without fixpoint definitions containing the intensional relation variables r1, . . . , rm and the element variables x1, . . . , xn. We translate it to a mapping (ϕ): Typingm→Typing by recursion on the structure of ϕ:
(⊥)(T⃗)=∅
(xi≐xj)(T⃗)={(⊤, . . . , ⊤|{{i,j}})}
(u(xj))(T⃗)={⊤TTC[j:=u]}
(¬u(xj))(T⃗)={⊤TTC[j:=¬u]}
(ri(x⃗))(T⃗)=Ti
(ψ∧χ)(T⃗)=(ψ)(T⃗)∧(χ)(T⃗)
(ψ∨χ)(T⃗)=(ψ)(T⃗)∨(χ)(T⃗)
(∃xi.ψ)(T⃗)=∃i((ψ)(T⃗))
Here ⊤TTC denotes the greatest TTC, and ⊤TTC[j:=u] is the TTC resulting from it by replacing the j-th component by u and adding u to the inhabitation constraints.
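The structural recursion can be sketched for a deliberately restricted fragment. The code below is an illustration only: it assumes formulae without equalities, quantifiers or relation variables, ignores inhabitation constraints, represents a TTC as a plain tuple of type propositions, and models each proposition as the set of hypothetical "worlds" (kinds of entity) in which it holds. The tuple-based formula encoding and the names `WORLDS` and `EXT` are invented for the example.

```python
from itertools import product

WORLDS = frozenset(range(4))  # hypothetical kinds of entity
EXT = {"a": frozenset({0, 1}), "b": frozenset({1, 2})}  # worlds where each type holds


def maximal(rows):
    """max⊥ for this fragment: drop empty columns, then non-maximal rows."""
    rows = {r for r in rows if all(r)}  # an empty column is uninhabitable
    return {r for r in rows
            if not any(r != s and all(x <= y for x, y in zip(r, s))
                       for s in rows)}


def translate(phi, n):
    """Translate a formula over n variables into a set of n-tuples."""
    op = phi[0]
    if op == "bot":
        return set()
    if op in ("type", "not"):  # u(xj) or ¬u(xj), with 0-based j
        _, u, j = phi
        row = [WORLDS] * n     # start from the greatest row (⊤, ..., ⊤)
        row[j] = EXT[u] if op == "type" else WORLDS - EXT[u]
        return maximal({tuple(row)})
    left, right = translate(phi[1], n), translate(phi[2], n)
    if op == "and":            # pointwise meet of the two typings
        return maximal({tuple(x & y for x, y in zip(r, s))
                        for r, s in product(left, right)})
    if op == "or":             # join: union followed by max⊥
        return maximal(left | right)
    raise ValueError(f"unknown connective: {op}")


# (a(x0) ∧ b(x1)) ∨ (a(x0) ∧ ¬a(x0)): the second disjunct is unsatisfiable.
phi = ("or",
       ("and", ("type", "a", 0), ("type", "b", 1)),
       ("and", ("type", "a", 0), ("not", "a", 0)))
typing = translate(phi, 2)
```

The unsatisfiable disjunct disappears during translation, so `typing` contains the single row for a(x0) ∧ b(x1).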
It is easy to check that (ϕ) is a monotonic mapping for every ϕ with respect to the order <:, and since the set of typings is finite its least fixpoint can be computed by iteration. Thus we can translate every type program ϕ which does not have free relation variables into a typing (ϕ) which represents it:
Lemma 7. For every type program ϕ without free relation variables, ϕ and [(ϕ)] are semantically equivalent.
Proof Sketch. In fact, we can show that, for any type program ϕ with m free relation variables and every m-tuple T⃗ of typings, ϕ([T⃗]) is semantically equivalent to [(ϕ)(T⃗)], where [T⃗] is the m-tuple of type programs we get from applying [⋅] to every component of T⃗.
This can be proved by structural induction on ϕ by showing that [⋅] commutes with the logical operators (using Lemma 6 for the case of existentials) and with fixpoint iteration. The lemma then follows for the case m=0.
4. Containment Checking
We are now going to investigate the relation between the orders <: (component-wise or syntactic subtyping) and ⊨ (semantic subtyping). The former is convenient in that it can be checked component-wise and only involves propositional formulae, so it would be nice if we could establish that T1<:T2 iff [T1] ⊨ [T2].
Unfortunately, this is not true. The ordering <: is easily seen to be sound in the following sense:
Theorem 8. For any two typings T and T′, if T<:T′ then [T] ⊨ [T′].
However, it is not complete. Consider, for example, the typings T={(b ∨ c, a)} and T′={(a ∨ b, a), (c ∨ d, a)}. Clearly we have [T] ⊨ [T′], yet T≮:T′.
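The counterexample can be checked by brute force. The sketch below assumes an empty hierarchy and TTCs with trivial partitions and inhabitation constraints, so that a TTC is just a pair of type propositions; each proposition is modelled as the set of "worlds" (subsets of the type symbols a, b, c, d) satisfying it. Under these assumptions semantic containment reduces to containment of sets of world pairs. All function names are illustrative.

```python
from itertools import combinations, product

SYMS = "abcd"
# A world records which type symbols hold of one domain element.
WORLDS = [frozenset(c)
          for k in range(len(SYMS) + 1)
          for c in combinations(SYMS, k)]


def any_of(*syms):
    """The proposition u1 ∨ ... ∨ uk as the set of worlds satisfying it."""
    return frozenset(w for w in WORLDS if any(s in w for s in syms))


def syntactic(t1, t2):
    """T <: T′: every TTC of T is component-wise below some TTC of T′."""
    return all(any(all(x <= y for x, y in zip(t, u)) for u in t2)
               for t in t1)


def points(typing):
    """All tuples of worlds admitted by at least one TTC of the typing."""
    return {pt for t in typing for pt in product(*t)}


def semantic(t1, t2):
    """[T] ⊨ [T′] in the restricted setting described above."""
    return points(t1) <= points(t2)


T = [(any_of("b", "c"), any_of("a"))]
T2 = [(any_of("a", "b"), any_of("a")), (any_of("c", "d"), any_of("a"))]
```

Here `syntactic(T, T2)` is false while `semantic(T, T2)` is true: the single rectangle of T is covered by the two rectangles of T′ together, but by neither one alone.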
If we interpret type symbols as intervals of real numbers, this example has an intuitive geometric interpretation, shown in
The individual TTCs then become rectangles in the two-dimensional plane, and it is easy to see that the single TTC in T (depicted by the single rectangle in heavy outline filled with vertical lines) is contained in the union of the TTCs in T′ (the two rectangles filled with slanted lines), but not in either one of them in isolation.
A similar problem can arise from the presence of equality constraints: Consider the typings T={(a ∨ b, a ∨ b|{{1, 2}})} and T′={(a, a), (b, b)}. The former appears in
Intuitively, the problem is that <: tests containment of TTCs in isolation, whereas the semantic inclusion check considers containments between sets of TTCs.
The following lemma paves the way for a solution:
Lemma 9. The mapping [⋅] is the lower adjoint of a Galois connection, i.e. there is a mapping (|⋅|) from type programs to typings such that
[T] ⊨ ϑ iff T<:(|ϑ|)
Proof. As we have established, typings and type programs form complete lattices under their respective orders. Since [⋅] distributes over joins it must be the lower adjoint of a Galois connection between these lattices.
An immediate consequence of this is that [T1] ⊨ [T2] implies T1<:(|[T2]|). To check, then, whether T1 is semantically contained in T2, we compute (|[T2]|) and check for component-wise inclusion.
One possible implementation of (|⋅|) comes directly from the definition of the Galois connection: For any type program ϑ, (|ϑ|) is the greatest typing T such that [T] ⊨ ϑ.
where C ∨ C′={c ∨ c′|c ∈ C, c′ ∈ C′}.
To see why this is necessary, assume we have type symbols a, b, c with ℋ ⊨ ∀x.a(x) ∨ b(x)→c(x), and consider typings T={(c|id|a ∨ b)} and T′={(c|id|a), (c|id|b)}. Clearly, [T]=c(x1) ∧ ∃z.(a(z) ∨ b(z))=[T′], yet T≮:T′ since {a ∨ b}≮:{a} and {a ∨ b}≮:{b}. However, if we add the existential consensus (c|id|a ∨ b) to T′, we will be able to prove that T<:T′.
It is not hard to verify that adding the consensus of two TTCs to a typing does not change its semantics.
Lemma 10. For two TTCs τ, τ′ and an index set J we have [τ ⊕J τ′] ⊨ [{τ, τ′}] and [τ ⊕∃ τ′] ⊨ [{τ, τ′}].
Proof Sketch. Clearly, if J=∅ the consensus is just the meet, so nothing needs to be proved.
Otherwise, let ℐ and σ be given such that ℐ ⊨σ [τ ⊕J τ′]. Then, for some i ∈ J, either ℐ ⊨σ ∧j∈J tj(xi) or ℐ ⊨σ ∧j∈J tj′(xi); in the former case ℐ ⊨σ [τ], and in the latter case ℐ ⊨σ [τ′].
For the existential consensus, assume ℐ ⊨σ [τ ⊕∃ τ′] and further assume there is some inhabitation constraint c of τ such that ℐ ⊭ ∃y.c(y) (for otherwise ℐ ⊨σ [τ] is immediate). We must have ℐ ⊨ ∃y.c(y) ∨ c′(y) for all inhabitation constraints c′ of τ′, so certainly ℐ ⊨ ∃y.c′(y). We then easily see that ℐ ⊨σ [τ′].
Remember that typings do not contain non-maximal elements. The following result shows that we do not lose anything by eliminating them.
Lemma 11. Consensus formation is monotonic in the sense that for TTCs σ, σ′, τ, τ′ with σ<:σ′ and τ<:τ′ and for an index set J we have σ ⊕J τ<: σ′ ⊕J τ′ and σ ⊕∃ τ<:σ′ ⊕∃ τ′
Proof Sketch. The components of the two TTCs are combined using conjunction and disjunction, which are monotonic with respect to subtyping, and the partitionings and inhabitation constraints are combined using set union which is also monotonic.
Definition 9. An n-ary typing T is said to be saturated if the consensus of any two TTCs contained in it is covered by the typing; i.e. for any τ, τ′ ∈ T and any index set J ⊂ {1, . . . , n} we have τ ⊕J τ′<:T and τ ⊕∃ τ′<:T.
Lemma 12. Every typing T can be converted into a semantically equivalent, saturated typing sat(T) by exhaustive consensus formation.
Proof. We use Algorithm 2 to collect all consensus TTCs of two given TTCs. As shown in Algorithm 1, this is performed for every pair of TTCs in the typing, and this procedure is repeated until a fixpoint is reached.
Notice that in each iteration (except the last) the set of TTCs covered by the typing becomes larger, and since there are only finitely many TTCs, the termination condition must become true eventually.
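The shape of this saturation loop is the same as in the classical propositional procedure that the consensus on TTCs generalises. Because the definition of the TTC consensus operators is not reproduced here, the sketch below substitutes the classical consensus on product terms (conjunctions of literals); it should be read as an illustration of exhaustive consensus formation in that simpler setting, not as the algorithm of the disclosure. A term is modelled as a frozenset of (variable, polarity) pairs.

```python
from itertools import combinations


def consensus(s, t):
    """Classical consensus of two product terms, or None if undefined."""
    clash = [(v, b) for (v, b) in s if (v, not b) in t]
    if len(clash) != 1:  # defined only when exactly one variable clashes
        return None
    v, b = clash[0]
    return (s - {(v, b)}) | (t - {(v, not b)})


def absorb(terms):
    """Drop every term that is implied by a strictly more general term."""
    return {t for t in terms if not any(s < t for s in terms)}


def saturate(terms):
    """Add consensus terms until every consensus is already covered."""
    terms = absorb(set(terms))
    while True:
        new = set()
        for s, t in combinations(terms, 2):
            c = consensus(s, t)
            # Only keep a consensus that no existing term already covers.
            if c is not None and not any(u <= c for u in terms):
                new.add(c)
        if not new:  # fixpoint: the set is saturated
            return terms
        terms = absorb(terms | new)


# x·y + ¬x·z has the additional prime implicant y·z.
x, nx, y, z = ("x", True), ("x", False), ("y", True), ("z", True)
primes = saturate({frozenset({x, y}), frozenset({nx, z})})
```

As in the proof above, each round except the last strictly enlarges what the set covers, and since there are only finitely many terms the loop terminates.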
We now generalise the concepts of implicants and prime implicants.
Definition 10. A TTC τ is an implicant for a typing T′ if we have [τ] ⊨ [T′]; it is a prime implicant if it is a <:-maximal implicant.
More explicitly, a TTC π is a prime implicant for a typing T′ if (1) [π] ⊨ [T′], and (2) π is maximal with this property, i.e. every implicant τ of T′ with π<:τ also satisfies τ<:π.
The second condition is equivalent to saying that for any τ with π<:τ and τ≮:π we have [τ] ⊭ [T′].
Lemma 13. Every implicant implies a prime implicant.
Proof. The set of all implicants of a typing is finite, so for any implicant τ there is a maximal implicant π with τ<:π, which is then prime by definition.
Remark 14. If π is a prime implicant of T, then π<:T iff π ∈ T.
Proof. The direction from right to left is trivial. For the other direction, suppose π<:T, i.e. π<:π′ for some π′ ∈ T. Certainly [π′] ⊨ [T], so π′<:π since π is prime. But this means that π=π′ ∈ T.
We want to show that sat(T) contains all prime implicants of T, so we show that saturation can continue as long as there is some prime implicant not included in the saturated set yet.
Lemma 15. If there is a prime implicant π of T with π ∉ T, then T is not saturated.
Proof. Let π be a prime implicant of T with π ∉ T. Consider the set
M:={τ|[τ] ⊨ [π], τ≮:T}
This set is not empty (it contains π by the preceding remark) and it is finite, so we can choose a <:-minimal element ψ ∈ M.
The proof proceeds by considering three cases. In the first, we shall show that a consensus step is possible, thus proving that T is not saturated. In the second, we show that an existential consensus step can be made, again proving non-saturation. Finally, we show that either the first or second case must apply, as their combined negation leads to a contradiction.
We will show that in this case ψ<:T, contrary to our assumptions. We know that [ψ] ├ [T], so any interpretation I and assignment σ with I ├ ∀x.h(x) and I ├σ [ψ] will satisfy some TTC τ ∈ T. Put differently, if I ├σ (∧ψi)(xi) for all i and I ├ ∃y.(∧c)(y) for all inhabitation constraints c, and moreover σ satisfies the partition of ψ, then we will find τ ∈ T with I ├σ [τ].
Observe that for every i the (propositional) formula |h|∧|ψi| can be written as a conjunction li, 1 ∧ . . . ∧ li, m
Let such a decomposition be fixed and let Li:={li, 1, . . . , li, m
We define an interpretation I over the domain {x[1], . . . , x[n]} ∪ C of variables modulo the partition of ψ plus the inhabitation constraints by
I(b)={x[i]|b ∈ Li} ∪ {c|b ∈ Lc}
for every type symbol b. The assignment is simply defined by σ(xi)=x[i].
It is easy to see that x[i] ∈ [[l]] iff l ∈ Li for all 1≤i≤n, and c ∈ [[l]] iff l ∈ Lc for all inhabitation constraints c, from which we deduce I ├σ (∧ψi)(xi) for all i, and I ├ ∃y.(∧c)(y) for all inhabitation constraints c. By definition, σ satisfies the partition of ψ. So we have some τ ∈ T with I ├σ [τ].
We claim that ψ<:T.
Indeed, let an index 1≤i≤n be given. Then we can write τi as a disjunction of conjunctions, such that one of its disjuncts, of the form li, 1′ ∧ . . . ∧ li, m
Since σ satisfies the partition of τ, this partition must be finer than the partition of ψ.
Finally, let an inhabitation constraint d of τ be given. Since ∃y.(∧d)(y) holds, the interpretation of ∧d cannot be empty. As above, we can write ∧d in a disjunctive form such that all the literals in one of its disjuncts are non-empty in that interpretation. So there must be some domain element that occurs in the denotation of all these literals. If it is one of the x[i], then we have ψi<:d, which implies c<:d for some inhabitation constraint c of ψ; if it is an inhabitation constraint c, then we have c<:d directly.
Taking these facts together, we get ψ<:τ, whence ψ<:T. This contradicts our assumption, and we conclude that this subcase cannot occur.
Now we can declare victory:
Theorem 16. If [T1] ├ [T2], then T1<:sat(T2).
Proof. Assume [T1] ├ [T2] and let τ ∈ T1 be given. Then [τ] ├ [T2], so certainly [τ] ├ [sat(T2)], since saturation does not change the semantics. By Lemma 13 this means that there is a prime implicant π of sat(T2) with τ<:π. By Lemma 15 we must have π ∈ sat(T2), so τ<:sat(T2). Since this holds for any τ ∈ T1 we get T1<:sat(T2).
5. Implementation
It is not immediately clear that the representation for type programs proposed in the preceding two sections can be implemented efficiently. While the mapping ┌⋅┐ from programs to type programs is, of course, easy to implement and linear in the size of the program, the translation of type programs into their corresponding typings involves fixpoint iterations to eliminate definitions. Saturation is also potentially very expensive, since we have to compute the consensus on every combination of columns for every pair of TTCs in a typing, and repeat this operation until a fixpoint is reached.
We report in this section on our experience with a prototype implementation based on Semmle's existing type checking technology as described in [13]. The type checker computes typings for programs and immediately proceeds to saturate them to prepare for containment checking. It uses a number of simple heuristics to determine whether a consensus operation needs to be performed. This drastically reduces the overhead of saturation in practice and makes our type inference algorithm practically usable.
TTCs can be compactly represented by encoding their component type propositions as binary decision diagrams. The same can be done for the hierarchy, so checking containment of type propositions can be done with simple BDD operations.
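What such a containment check decides can be sketched without a BDD library. The following is an illustration only, assuming exhaustive truth-table enumeration as a stand-in for the BDD operations; the type symbols `employee`, `manager` and `student`, the example hierarchy, and the function `contained` are hypothetical.

```python
from itertools import product

def contained(t1, t2, hierarchy, symbols):
    """Check t1 <: t2 under the hierarchy: no assignment satisfies hierarchy and t1 but not t2."""
    for values in product([False, True], repeat=len(symbols)):
        env = dict(zip(symbols, values))
        if hierarchy(env) and t1(env) and not t2(env):
            return False  # counterexample found
    return True

# Hypothetical type symbols and hierarchy: every manager is an employee,
# and employees and students are disjoint.
symbols = ["employee", "manager", "student"]

def hierarchy(e):
    return (not e["manager"] or e["employee"]) and not (e["employee"] and e["student"])

def manager(e):
    return e["manager"]

def employee(e):
    return e["employee"]

def not_student(e):
    return not e["student"]

manager_is_employee = contained(manager, employee, hierarchy, symbols)
employee_is_manager = contained(employee, manager, hierarchy, symbols)
```

Enumeration is exponential in the number of type symbols, so it is only practical for tiny examples; a BDD encoding answers the same question by checking that the conjunction of the hierarchy, the first proposition, and the negation of the second is the constant false.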
As with any use of BDDs, it is crucial to find a suitable variable order. We choose a very simple order that tries to assign neighbouring indices to type symbols that appear in the same conjunct of the hierarchy formula. For example, if the hierarchy contains a subtyping statement u1(x)→u2(x) or a disjointness constraint ¬ (u1(x) ∧ u2(x)), then u1 and u2 will occupy adjacent BDD variables. This heuristic is simple to implement and yields reasonable BDD sizes as shown below.
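One way such an ordering could be computed is sketched here. The exact procedure is not spelled out in the text, so this is a greedy approximation under the assumption that each conjunct of the hierarchy is given as a list of the type symbols it mentions; the function name `variable_order` is hypothetical.

```python
def variable_order(conjuncts):
    """Assign BDD variable indices so that symbols first seen in the same conjunct are neighbours."""
    order = {}
    for symbols in conjuncts:
        for s in symbols:
            if s not in order:
                # Next free index; symbols introduced together end up adjacent.
                order[s] = len(order)
    return order

# Hypothetical hierarchy: u1(x) -> u2(x), and not (u3(x) and u4(x))
example_order = variable_order([["u1", "u2"], ["u3", "u4"]])
```

In this simplified version a symbol keeps the index it received in the first conjunct that mentions it, so a symbol shared by several conjuncts is adjacent only to the symbols it was first seen with.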
To mitigate the combinatorial explosion that saturation might bring with it, we avoid useless consensus formation, i.e. we will not form a consensus τ ⊕ τ′ if τ ⊕ τ′<:{τ, τ′} or τ ⊕ τ′ is degenerate. The following lemma provides a number of sufficient conditions for a consensus to be useless:
Lemma 17. Let τ=(t1, . . . , tn|p|C) and τ′=(t1′, . . . , tn′|p′|C′) be two TTCs and J ⊂ {1, . . . , n} a set of indices. Then the following holds:
Proof. Recall that τ ⊕J τ′=∥(u1, . . . , un|p ∪ p′ ∪ J×J|C ∪ C′)∥ where
We prove the individual claims:
Otherwise we have (τ ⊗J τ′)i=(τ ⊗J′ τ′)i, so overall τ ⊗J τ′<:τ ⊗J′ τ′.
This suggests an improved implementation of Algorithm 2, shown in two parts as Algorithm 3 and Algorithm 4, which straightforwardly exploit the above properties to avoid performing useless consensus operations whose result would be discarded by the max⊥ operator anyway. In the latter, we make use of an operation P
To show that these improvements make our type inference algorithm feasible in practice, we measure its performance on some typical programs from our intended application domain. The Datalog programs to be typed arise as an intermediate representation of programs written in a high-level object-oriented query language named .QL [14, 15], which are optimised and then compiled to one of several low-level representations for execution.
We measure the time it takes to infer types for the 89 queries in the so-called “Full Analysis” that ships with our source code analysis tool SemmleCode. These queries provide statistics and overview data on structural properties of the analysed code base, compute code metrics, and check for common mistakes or dubious coding practices.
A type inference for the query being compiled is performed at three different points during the optimisation process, giving a total of 267 time measurements. All times were measured on a machine running a Java 1.6 virtual machine under Linux 2.6.28-13 on an Intel Core2 Duo at 2.1 GHz with four gigabytes of RAM, averaging over ten runs, and for each value discarding the lowest and highest reading.
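The averaging scheme just described can be sketched as follows. This is a minimal illustration, assuming the readings for one measurement are available as a list of seconds; the function name `trimmed_mean` and the sample values are hypothetical and are not measurements from the experiments.

```python
def trimmed_mean(readings):
    """Average the readings after discarding one lowest and one highest value."""
    if len(readings) < 3:
        raise ValueError("need at least three readings to discard two of them")
    ordered = sorted(readings)
    kept = ordered[1:-1]  # drop the single lowest and single highest reading
    return sum(kept) / len(kept)

# Made-up sample readings in seconds, for illustration only.
sample = [0.5, 0.25, 0.75, 1.0, 4.0]
sample_mean = trimmed_mean(sample)
```

Discarding the extremes makes the reported value less sensitive to a single slow run, for example one disturbed by garbage collection or other activity on the machine.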
Of the 267 calls to type inference, 65% finish in 0.5 seconds or less, 71% take no more than one second, 93% no more than two, and in 98% of the cases type inference is done in under three seconds. Only two type inferences take longer than four seconds, at around 4.5 seconds each.
The size of the programs for which types are inferred varies greatly, with most programs containing between 500 and 1500 subterms, but some programs are significantly larger at more than 3000 subterms. Interestingly, program size and type inference time are only very weakly correlated (ρ<0.4), and the correlation between the number of stratification layers (which roughly corresponds to the number of fixpoint computations to be done) and type inference time is also not convincing (ρ<0.6).
The low correlation is evident in
This suggests that the asymptotic behaviour of the algorithm in terms of input size is masked by the strong influence of implementation details (at least for the kind of programs we would expect to deal with in our applications), and we can expect significant performance gains from fine tuning the implementation.
Our experiments also show that saturation, while extremely expensive in theory, is quite well-behaved in practice: in 78% of all cases, no new TTCs are added during the first iteration of Algorithm 1, so the typing is already saturated. In another 17% of all cases we need only one more iteration, and we almost never (≈0.01%) need to do more than four iterations.
Although in every individual iteration we potentially have to take the consensus over every combination of columns, our heuristics manage to exclude all combinations of more than one column in 94% of all cases, and sometimes (14%) can even show that no consensus formation is needed at all.
Since our type inference algorithm makes heavy use of BDDs, some statistics about their usage may also be of interest. We use about 300 BDD variables (one for every type symbol), with most of the BDDs constructed during type checking being of rather moderate size: the BDD to represent the type hierarchy (which encodes about 800 constraints automatically derived during translation from .QL to Datalog) occupies around 4000 nodes, while the BDDs for individual type propositions never take more than 100 nodes.
While we have anecdotal evidence to show that the optimisation techniques we have presented in earlier work [13] benefit greatly from combining them with the richer type hierarchies our type system supports, we leave the precise investigation of this matter to future work.
6. Benefits
The present invention handles negated types accurately. Also, the present invention handles inhabitation constraints (i.e., existentials in type programs) precisely.
We achieve pleasing theoretical properties and can support a rich language of type constraints by sacrificing polynomial time guarantees, although our experiments show that this is a reasonable tradeoff for our application area.
While the performance of our prototype implementation is promising, other implementation approaches certainly exist and may be worth exploring. Bachmair et al. [4] develop a decision procedure for monadic first order logic based on the superposition calculus. Since type programs can readily be expressed as monadic first order formulae, their algorithm could be used to decide type containment. Another possibility would be to use Ackermann's translation from monadic first order logic to equality logic [2], and then employ a decision procedure for this latter logic.
7. Review of High Level Flow Diagrams
8. Conclusion
We have presented a type inference procedure that assigns an upper envelope to each Datalog program. That envelope is itself a Datalog program that makes calls to monadic extensionals only. The algorithm is able to cope with complex type hierarchies, which may include statements of implication, equivalence and disjointness of entity types.
The type inference procedure is itself an extremely simple syntactic mapping. The hard work goes into an efficient method of checking containment between type programs. We achieve this via a novel adaption of Quine's algorithm for the computation of the prime implicants of a logical formula. Generalising from logical formulae to type programs, we bring types into a saturated form on which containment is easily checked.
As shown by our experiments, the algorithm for inferring a type and saturating it works well on practical examples. While it may still exhibit exponential behaviour in the worst case, such extreme cases do not seem to arise in our application area. Thus our algorithm is a marked improvement over well-known simpler constructions that always require an exponential overhead.
Many avenues for further work remain. Perhaps the most challenging of these is the production of good error messages when type errors are identified: this requires printing the Boolean formulae represented via TTCs in legible form. We have made some progress on this, employing Coudert et al.'s restrict operator on BDDs, which is another application of prime implicants [12].
There is also substantial further engineering work to be done in the implementation. Careful inspection of the statistics shows that our use of BDDs is very much unlike their use in typical model checking applications [26], and we believe this could be exploited in the use of a specialised BDD package. For now we are using JavaBDD [25], which is a literal translation of a C-based BDD package into Java.
Finally, we will need to investigate how to best exploit the advanced features offered by our type system. In particular, much experience remains to be gained in how to make it easy and natural for the programmer to specify the constraints making up the type hierarchy, and which kinds of constraints benefit which kinds of programs.
9. Non-Limiting Hardware Examples
Overall, the present invention can be realized in hardware or a combination of hardware and software. The processing system according to a preferred embodiment of the present invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems and image acquisition sub-systems. Any kind of computer system—or other apparatus adapted for carrying out the methods described herein—is suited. A typical combination of hardware and software is a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein.
An embodiment of the processing portion of the present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program means or computer programs in the present context mean any expression, in any language, code, or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code, or notation; and b) reproduction in a different material form.
Computer system 1000 also optionally includes a communications interface 1024. Communications interface 1024 allows software and data to be transferred between computer system 1000 and external devices. Examples of communications interface 1024 include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 1024 are in the form of signals which may be, for example, electronic, electromagnetic, optical, or other signals capable of being received by communications interface 1024. These signals are provided to communications interface 1024 via a communications path (i.e., channel) 1026. This channel 1026 carries signals and is implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, and/or other communications channels.
Although specific embodiments of the invention have been disclosed, those having ordinary skill in the art will understand that changes can be made to the specific embodiments without departing from the spirit and scope of the invention. The scope of the invention is not to be restricted, therefore, to the specific embodiments. Furthermore, it is intended that the appended claims cover any and all such applications, modifications, and embodiments within the scope of the present invention.
Each of the following twenty-seven references is hereby incorporated by reference in its entirety.
Patent | Priority | Assignee | Title |
6006214, | Dec 04 1996 | International Business Machines Corporation | Database management system, method, and program for providing query rewrite transformations for nested set elimination in database views |
6338055, | Dec 07 1998 | INNOVATION TECHNOLOGY GROUP, INC | Real-time query optimization in a decision support system |
7089266, | Jun 02 2003 | The Board of Trustees of the Leland Stanford Jr. University; BOARD OF TRUSTEES OF THE LELAND STANFORD JR UNIVERSITY | Computer systems and methods for the query and visualization of multidimensional databases |
7512642, | Jan 06 2006 | International Business Machines Corporation | Mapping-based query generation with duplicate elimination and minimal union |
20020069193, | |||
20050086639, | |||
20090234801, | |||
20100017395, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jul 14 2011 | SCHAEFER, MAX | Semmle Limited | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 045690 | /0401 | |
Jul 15 2011 | MOOR, OEGE DE | Semmle Limited | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 045690 | /0401 | |
Apr 21 2017 | Microsoft Technology Licensing, LLC | (assignment on the face of the patent) | / | |||
Nov 29 2019 | Semmle Limited | GITHUB SOFTWARE UK LTD | CHANGE OF NAME SEE DOCUMENT FOR DETAILS | 051244 | /0305 | |
Jan 16 2020 | GITHUB SOFTWARE UK LTD | Microsoft Technology Licensing, LLC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 051710 | /0252 |