A conceptualization method uses maximum or other substrings of a string pattern to find specific N-tuples of substring triples with N≧2 and m=1 . . . N inside a reference set (SET_r_i) of strings (STR_n_i). Each N-tuple is considered as a candidate for representing related concepts. Each concatenation of the substrings triples is an explicit member of the reference set (SET_r_i). Each middle substring out of middle substrings is unequal to another middle substring out of middle substrings within the substring triples found inside the reference set (SET_r_i). Each prefix substring (X_i) is equal to all other prefix substrings (X_i) within the substring triples found inside the reference set (SET_r_i). Each suffix substring (Z_i) is equal to all other suffix substrings (Z_i) within the substring triples found inside the reference set (SET_r_i). Either the prefix substring (X_i) or the suffix substring (Z_i) is not empty.

Patent: 8311795
Priority: Jan 11 2008
Filed: Dec 31 2008
Issued: Nov 13 2012
Expiry: Jun 02 2031
Extension: 883 days
Original assignee entity: Large
EXPIRED
1. A string pattern conceptualization method, particularly for a pattern of words, comprising:
setting, via a processor, a reference set (SET_r_i) comprising a plurality of strings (STR_n_i);
inside the reference set (SET_r_i), finding specific N-tuples ([Y1_i|Y2_i| . . . |Ym_i]) of substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) with N≧2 and m=1 . . . N; and
considering each N-tuple ([Y1_i|Y2_i| . . . |Ym_i]) as a candidate for representing related concepts;
where:
each concatenation (X_i∘Y1_i∘Z_i; X_i∘Y2_i∘Z_i; . . . ; X_i∘Ym_i∘Z_i) of the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) is an explicit member of the reference set (SET_r_i);
each middle substring (Y1_i, Y2_i, . . . , Ym_i) out of middle substrings (Y1_i, Y2_i, . . . , Ym_i) is unequal to another middle substring (Y1_i, Y2_i, . . . ,Ym_i) out of middle substrings (Y1_i, Y2_i, . . . ,Ym_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
each prefix substring (X_i) is equal to all other prefix substrings (X_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
each suffix substring (Z_i) is equal to all other suffix substrings (Z_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i); and
either the prefix substring (X_i) or the suffix substring (Z_i) is not empty.
11. A data processing system, comprising:
a memory element adapted to store strings; and
a processor programmed to:
set a reference set (SET_r_i) of strings (STR_n_i);
inside the reference set (SET_r_i), find specific N-tuples ([Y1_i|Y2_i| . . . |Ym_i]) of substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) with N≧2 and m=1 . . . N; and
consider each N-tuple ([Y1_i|Y2_i| . . . |Ym_i]) as a candidate for representing related concepts;
where the processor is further programmed to provide that:
each concatenation (X_i∘Y1_i∘Z_i; X_i∘Y2_i∘Z_i; . . . ; X_i∘Ym_i∘Z_i) of the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) is an explicit member of the reference set (SET_r_i);
each middle substring (Y1_i, Y2_i, . . . , Ym_i) out of the middle substrings (Y1_i, Y2_i, . . . , Ym_i) is unequal to another middle substring (Y1_i, Y2_i, . . . , Ym_i) out of the middle substrings (Y1_i, Y2_i, . . . , Ym_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
each prefix substring (X_i) is equal to all other prefix substrings (X_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
each suffix substring (Z_i) is equal to all other suffix substrings (Z_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i); and
either the prefix substring (X_i) or the suffix substring (Z_i) is not empty.
10. A computer program product comprising a computer useable storage device that stores a computer readable program, wherein the computer readable program when executed on a computer causes the computer to do the following steps at least one time:
setting a reference set (SET_r_i) comprising a plurality of strings (STR_n_i);
inside the reference set (SET_r_i), finding specific N-tuples ([Y1_i|Y2_i| . . . |Ym_i]) of substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) with N≧2 and m=1 . . . N;
considering each N-tuple ([Y1_i|Y2_i| . . . |Ym_i]) as a candidate for representing related concepts;
where:
each concatenation (X_i∘Y1_i∘Z_i; X_i∘Y2_i∘Z_i; . . . ; X_i∘Ym_i∘Z_i) of the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) is an explicit member of the reference set (SET_r_i);
each middle substring (Y1_i, Y2_i, . . . , Ym_i) out of the middle substrings (Y1_i, Y2_i, . . . , Ym_i) is unequal to another middle substring (Y1_i, Y2_i, . . . ,Ym_i) out of the middle substrings (Y1_i, Y2_i, . . . ,Ym_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
each prefix substring (X_i) is equal to all other prefix substrings (X_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
each suffix substring (Z_i) is equal to all other suffix substrings (Z_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
either the prefix substring (X_i) or the suffix substring (Z_i) is not empty.
2. The method of claim 1, further comprising ranking candidates according to attributes of at least one of the prefix substring (X_i) and the suffix substring (Z_i).
3. The method of claim 2, further comprising replacing one or more occurrences of the concepts in the string pattern with a most frequently occurring concept yielding an altered string pattern.
4. The method of claim 3, further comprising doing the following steps in the altered string pattern:
setting the reference set (SET_r_i) of strings (STR_n_i);
denoting each string (STR_n_i) with its occurrence count (OCC_n_i);
inside the reference set (SET_r_i), finding specific substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) of the substring N-tuples ([Y1_i|Y2_i| . . . |Ym_i]) with N≧2 and m=1 . . . N;
considering each N-tuple ([Y1_i|Y2_i| . . . |Ym_i]) as the candidate for representing related concepts;
where:
each prefix substring (X_i) and each suffix substring (Z_i) of each substring triple (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) is an explicit member of the reference set (SET_r_i),
each concatenation (X_i∘Y1_i∘Z_i; X_i∘Y2_i∘Z_i; . . . ; X_i∘Ym_i∘Z_i) of the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) is an explicit member of the reference set (SET_r_i);
each middle substring (Y1_i, Y2_i, . . . , Ym_i) is unequal to another middle substring (Y1_i, Y2_i, . . . , Ym_i) within the substring triples (X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i) found inside the reference set (SET_r_i);
either the prefix substring (X_i) or the suffix substring (Z_i) is not empty.
5. The method of claim 1, further comprising filtering a result of one or more N-tuples ([Y1_i|Y2_i| . . . |Ym_i]) found as candidates for representing related concepts considering specific types of a desired result.
6. The method of claim 5, said filtering comprising at least one of the following steps:
using a minimum length for middle substrings (Y1_i, Y2_i, . . . , Ym_i);
using a maximum length for the middle substrings (Y1_i, Y2_i, . . . , Ym_i);
using a minimum length for at least one of the prefix and suffix substring (X_i, Z_i);
requesting that the prefix substring (X_i) ends with a particular regular expression;
requesting that the suffix substring (Z_i) starts with a particular regular expression; and
requesting that at least one of the middle substrings (Y1_i, Y2_i, . . . , Ym_i) has a certain value (V).
7. The method of claim 1, further comprising comparing at least two occurrence counts (Occ1, Occ2, . . . , Occ_m) if at least one of two or more of the middle substrings (Y1_i, Y2_i, . . . , Ym_i) and two or more concatenations of the prefix and middle substrings (X_i∘Y1_i; X_i∘Y2_i; . . . ; X_i∘Ym_i) are explicit members of the reference set (SET_r_i).
8. The method of claim 7, further comprising at least one of the following steps:
if the occurrence count (Occ1, Occ2, . . . , Occ_m) of a first respective middle substring (Y1_i, Y2_i, . . . , Ym_i) is significantly less than the occurrence count (Occ1, Occ2, . . . , Occ_m) of another middle substring (Y1_i, Y2_i, . . . , Ym_i) being compared with, considering the respective first middle substring (Y1_i, Y2_i, . . . , Ym_i) being a specialized concept of the other middle substring (Y1_i, Y2_i, . . . , Ym_i); and
if the occurrence count (Occ1, Occ2, . . . , Occ_m) of the first respective middle substring (Y1_i, Y2_i, . . . , Ym_i) is significantly greater than the occurrence count (Occ1, Occ2, . . . , Occ_m) of another middle substring (Y1_i, Y2_i, . . . , Ym_i) being compared with, considering the respective first middle substring (Y1_i, Y2_i, . . . , Ym_i) being a generalized concept of the other middle substring (Y1_i, Y2_i, . . . , Ym_i).
9. The method of claim 1, further comprising using a string-pattern analysis method, particularly for a pattern of words or a genome pattern, for providing maximum substrings (STR_A_C) as an input for the conceptualization method, comprising the following steps for at least one iteration (A):
defining a subset (SET_A) of substrings (STR_A_B) in said pattern;
keeping track of all said substrings (STR_A_B) and their occurrence counts (Occ_A_B) in said subset (SET_A) of substrings (STR_A_B); and
pruning away each substring (STR_A_B) if said substring (STR_A_B) is subsumed by a longer substring (STR_A_C) in said subset (SET_A) of substrings (STR_A_B) with a same occurrence count (Occ_A_C).
12. The data processing system according to claim 11, further comprising at least one input/output controller and at least one system bus.

This application claims priority to and claims the benefit of European Patent Application Serial No. 08100346.9 titled “STRING PATTERN CONCEPTUALIZATION METHOD AND PROGRAM PRODUCT FOR STRING PATTERN CONCEPTUALIZATION,” which was filed in the European Patent Office/Federal Republic of Germany Processing Location on Jan. 11, 2008, and which is incorporated herein by reference in its entirety.

1. Field of the Invention

The invention relates to a string pattern conceptualization method and to a program product for string pattern conceptualization.

2. Related Art

Searching string patterns such as text or biological sequence data is a commercially prosperous area. However, methods that are used successfully in, for instance, an Internet environment cannot readily be transferred to enterprise environments. Additionally, content-oriented issues are becoming more and more interesting. Content-oriented methods are semantic-based and less dependent on Internet-specific properties. Compared to link analyses and the like, these methods are far more complex and typically language dependent.

In most of today's computer systems, the text representation does not reflect the real chunks of which the text is composed. In particular, a sequence of words that always co-occur is usually not represented as a chunk, but as a set of distinct words. Knowing the real chunks in a text (i.e., long and very long substrings), however, is desirable for several reasons. It allows for a more compact representation of the text and for a better understanding of the text content, since the beginnings and endings of frequently encountered chunks are important spots in the text. In particular, elements occurring adjacent to chunks are frequently related to each other, which, for example, would allow for an automatic detection of taxonomies.

For reasons of complexity, however, prior art algorithms have problems in finding the maximum substrings even in short texts since the potential number of substrings explodes with the size of the texts.

A main task in content-oriented analyses usually is an adequate conceptualization (i.e., acquiring the concepts which are handled in a text as precisely as possible). It is known in the art of conceptualization to find a concept of a text in several steps, such as linguistic analysis, noun group determination, statistical relevance determination, etc. When processing text for search or other tasks such as conceptualization, categorization, or clustering, the first step usually is to identify a basic set of terms that higher-level components should operate on. This process tries to identify meaningful parts of the overall text, often using immediate context, that may be considered as “concepts” or at least “concept candidates.”

In most cases, concepts are represented as noun groups in a language. In order to find noun groups in text, a syntactic analysis, which is language dependent and computationally expensive, is needed.

In most cases and across languages, noun groups are formed by consecutive elements of the text. In English, a noun group usually is a sequence of adjectives followed by a sequence of nouns; in German, it is a sequence of adjectives followed by a single (but potentially compound) noun. Not all noun groups should be truly considered as “concepts” but only as “candidates.” Usually, some part of the noun group constitutes the concept (i.e., a class of objects) and the rest has the function of identifying a particular object or instance of the concept. Therefore, identifying noun groups is not enough to get to a concept level. Some type of contextual analysis is needed. Besides requiring enormous computing power, such analyses often are language dependent.

However, even in applications such as genome analysis, although only few letters are used as an “alphabet” to represent the essential components, time and space consuming scaling problems appear.

In the paper of S. Kurtz and C. Schleiermacher, “REPuter: fast computation of maximal repeats in complete genomes”, Bioinformatics Applications Notes, Oxford University Press, vol. 15, no. 5, 1999, p. 426-427, a software tool is implemented that computes exact repeats and palindromes in entire genomes. DNA (deoxyribonucleic acid) is a long polymer made from repeating units called nucleotides, wherein the DNA double helix is held together by hydrogen bonds between four bases attached to the two strands. The four bases found in DNA are adenine (abbreviated A), cytosine (abbreviated C), guanine (abbreviated G) and thymine (abbreviated T). These four bases are attached to the sugar/phosphate in the strands to form the complete nucleotide. Although genomes in DNA can be represented by an alphabet of only four characters (i.e., capital letters A, C, G, T), the analysis reveals inherent scaling problems. For instance, 160 MByte of storage space are needed for 11 MByte of data when doing the genome analysis. For the handling of 63 characters (26 capital letters, 26 lower case letters, 10 digits (0-9) and 1 whitespace), or even 256 characters for an 8-bit character set such as extended ASCII, the suffix tree in the memory grows dramatically.

The invention provides a string pattern conceptualization method and a program product for string pattern conceptualization.

The features of the independent claims, and the other claims and the specification, disclose advantageous and alternative embodiments of the invention.

A string pattern conceptualization method, particularly for conceptualization of a pattern of words, is proposed, comprising doing the following steps one or more times: setting a reference set of strings; inside the reference set, finding specific N-tuples of substring triples; and considering each N-tuple as a candidate for representing related concepts; where each concatenation of the substring triples is an explicit member of the reference set; each middle substring is unequal to another middle substring within the substring triples found inside the reference set; each prefix substring is equal to each other prefix substring within the substring triples found inside the reference set; each suffix substring is equal to each other suffix substring within the substring triples found inside the reference set; and either the prefix or the suffix is not empty.

The proposed method is time and resource efficient when compared to conventional methods, particularly when combined with a method to find maximal substrings in a string pattern, which is described below. The proposed method is virtually independent of language, and therefore development effort may be significantly reduced. The proposed method is string based and does not involve linguistic syntactic and/or semantic processing steps. In effect, the proposed method provides approximative substitutes for concepts in a text. As no linguistic analysis is necessary, the proposed method may require much less computing power and is independent of the language of the text, particularly when combined with the method for finding maximal substrings described below.

According to another aspect of the invention, a program product comprising a computer useable storage medium including a computer readable program is proposed, wherein the computer readable program when executed on a computer causes the computer to perform the following steps one or more times: setting a reference set of strings; inside the reference set, finding specific N-tuples of substring triples; and considering each N-tuple as a candidate for representing related concepts.

A respective data processing system is also proposed.

The above-mentioned string pattern analysis method, particularly for a pattern of words or a bio-informatics pattern, comprises the following iterative steps: defining a subset of substrings in said pattern, keeping track of all said substrings in said subset of substrings, and pruning away each substring that is subsumed by a longer substring in said subset of substrings with the same occurrence count. Favorably, the method allows finding maximal substrings in the string pattern. Efficiency is improved, resulting in a better scalability of systems which are used to perform the preferred method. These systems may be cheaper and faster. The string pattern analysis method may be used for analyzing mass data such as from bio-informatics, genome analysis, real-time data of satellites and the like. The method may be used for content management and search engines, for instance. The method is space and time efficient because it is not necessary to keep track of the complete set of substrings at once. Instead, the method keeps track of only a subset of substrings and prunes away those substrings which are subsumed in other substrings. For instance, a subsumed substring may be a smaller substring that always co-occurs with the same leading or trailing neighbor and/or is contained inside the substring and/or occurs with the same frequency of occurrence as the substring. This pruning step is very favorable for reducing the complexity in a string pattern occurring in practice (e.g., text). Typically, in algorithms known in the art, the string pattern (e.g., text) has to be stored in full length and usually several times the full length. This results in a high demand for computing power and high storage consumption. The preferred method is more efficient in computing power and storage consumption. One application of the string pattern analysis method may be conceptualization of text.

The combination of the string conceptualization method and string pattern analysis method may further comprise: defining a minimum number of occurrences (MinOcc) for substrings (STR_A_B) to be pruned away; defining a first minimum length (Lmin_1) of substrings (STR_A_B) to be considered in the first iteration (A=1); defining a first maximum length (Lmax_1) of substrings (STR_A_B) to be considered in the first iteration (A=1); and iteratively doing the following steps: searching the pattern for substrings (STR_A_B) with a length in an interval between said minimum length (Lmin_A) and maximum length (Lmax_A); and either leaving the iteration if none of said substrings (STR_A_B) found has the maximum length (Lmax_A), or continuing to search the pattern for substrings with increased new minimum and maximum lengths (Lmin_(A+1), Lmax_(A+1)).

Optionally, such a combination may comprise defining the new minimum length (Lmin_(A+1)) above the maximum length of the previous iteration (Lmax_A) and defining said new maximum length (Lmax_(A+1)) greater than or equal to said new minimum length (Lmin_(A+1)). Additionally, the combination may comprise defining the new minimum length (Lmin_(A+1)) without a gap above the maximum length of the previous iteration (Lmax_A).

Such a combination may optionally comprise at least one of the following: pruning away all substrings (STR_A_B) with an occurrence count (Occ_A_B) less than said defined minimum number of occurrences (MinOcc); and presenting maximum substrings (STR_A_C) to a user.

The present invention, together with the above-mentioned advantages, may best be understood from the following detailed description of the embodiments, without being restricted to the embodiments, wherein the drawings schematically show:

FIG. 1 is an example of an implementation of a flow chart of a preferred string pattern analysis method according to an embodiment of the present subject matter;

FIG. 2 is a block diagram of an example of an implementation of a preferred data processing system for performing the preferred method according to FIG. 1 according to an embodiment of the present subject matter;

FIG. 3a is an example of an implementation of a flow chart of a first portion of a first preferred conceptualization method for a string pattern according to an embodiment of the present subject matter;

FIG. 3b is an example of an implementation of a flow chart of a second portion of a first preferred conceptualization method for a string pattern according to an embodiment of the present subject matter;

FIG. 4a is an example of an implementation of a flow chart of a first portion of a second preferred conceptualization method for a string pattern for N-tuples with N=2 according to an embodiment of the present subject matter;

FIG. 4b is an example of an implementation of a flow chart of a second portion of a second preferred conceptualization method for a string pattern for N-tuples with N=2 according to an embodiment of the present subject matter;

FIG. 5 is an example of an implementation of a preferred data processing system for performing the preferred method according to FIG. 3a and FIG. 3b.

A preferred string pattern analyzing method comprises the steps of defining a subset of substrings in said pattern, keeping track of all said substrings in said subset of substrings, and pruning away each substring that is subsumed by a longer substring in said subset of substrings with the same occurrence count. The invention is exemplified for text as a string pattern. It is to be understood, however, that the invention is not restricted to text and can be applied to any string pattern such as in genome analysis and the like.

A preferred embodiment of the method is depicted as flow chart 100 in FIG. 1, wherein a string pattern is analyzed to find substrings STR_A_B contained in the pattern. A is an index indicating the actual number of the iteration A, with A running between 1 and D, wherein D denotes a total number of iterations A. B is a parameter denoting individual substrings in the iteration step A.

In a first step 102, a threshold for a minimum occurrence MinOcc of a substring STR_A_B is defined, and substrings STR_A_B below said threshold MinOcc are ignored. Preferably, the minimum occurrence is MinOcc=2. Typically, the threshold MinOcc is kept constant for all iterations A. However, the minimum occurrence MinOcc may be increased. Due to the threshold, the full and complete text is not considered as one substring STR_A_B. Therefore, substrings STR_A_B are always subsets SET_A of the full text.

In steps 104 and 106 a first minimum length Lmin_1 and a first maximum length Lmax_1 of substrings STR_1_B to be considered in a first iteration step with A=1 looping over the text are defined. Preferred first values are Lmin_1=1 and Lmax_1=5, for instance.

In step 108, the pattern is searched for substrings STR_1_B with a length in an interval between said minimum length Lmin_1 and said maximum length Lmax_1.

Step 110 counts the occurrences Occ_A_B of each substring STR_A_B found with a length in the interval between Lmin_1 and Lmax_1.

In optional step 112, all substrings STR_A_B with an occurrence count Occ_A_B less than the minimum occurrence threshold MinOcc are pruned away.

For each iteration A, a subset of substrings SET_A is defined in the pattern. The set SET_A of substrings STR_A_B is different for each iteration A. The method keeps track of all the substrings STR_A_B and their occurrence counts Occ_A_B in said subset SET_A of substrings.

In step 114, for each substring STR_A_C found, all other sub-substrings STR_A_B are pruned away that fulfill at least one of the following conditions: (1) being contained inside the substring STR_A_C in said subset SET_A of substrings STR_A_B, (2) being shorter than the substring STR_A_C, (3) occurring with the same frequency as the substring STR_A_C (i.e., with the same occurrence count Occ_A_C). Preferably, all three conditions are fulfilled for substrings STR_A_B being pruned away. Due to this step, the number of substrings STR_A_B to be stored and analyzed is dramatically reduced to the number of maximum substrings STR_A_C. The index C denotes the maximum substring.
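The pruning of step 114 can be sketched as follows. This is a minimal illustration in Python, assuming the preferred case in which all three conditions must hold; the function name `prune_subsumed` and the sample counts are hypothetical and not taken from the specification.

```python
def prune_subsumed(counts):
    """Keep only substrings that are not subsumed by a longer substring.

    `counts` maps each tracked substring STR_A_B to its occurrence count
    Occ_A_B. A substring is pruned when some other tracked substring
    (1) contains it, (2) is longer, and (3) has the same occurrence count.
    """
    kept = {}
    for s, occ in counts.items():
        subsumed = any(
            len(t) > len(s) and s in t and counts[t] == occ
            for t in counts
        )
        if not subsumed:
            kept[s] = occ
    return kept


# Hypothetical counts for the pattern "abcabc", lengths 1 to 3, after
# substrings occurring fewer than MinOcc=2 times have been dropped.
counts = {"a": 2, "b": 2, "c": 2, "ab": 2, "bc": 2, "abc": 2}
maximal = prune_subsumed(counts)  # only "abc" survives
```

A shorter substring with a higher count than every substring containing it is kept, since it also occurs outside the longer substring.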

If none of the substrings STR_A_B found has the maximum length Lmax_A (step 116), the loop is left (end in step 118). If at least one substring STR_A_B has a length of Lmax_A, new minimum and maximum lengths are defined in step 120 and steps 108-116 are repeated in the next iteration loop with A=A+1. Preferably, the iteration A may stop, for instance, if a maximum number D of iterations A or a predefined maximum length Lmax_A is exceeded. Preferably, the substrings STR_A_C found may be stored in a TRIE structure or any other suitable structure.

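Step 120 defines new minimum and maximum length variables, with the new minimum length set to Lmin_(A+1)=(Lmax_A)+1 and the new maximum length set to Lmax_(A+1)=(Lmax_A)*2. With the preferred first values Lmin_1=1 and Lmax_1=5, for instance, the second iteration considers lengths 6 to 10 and the third iteration lengths 11 to 20.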
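The loop of steps 102 to 120 might be sketched as below. This is an illustrative Python sketch under several assumptions that the specification does not fix: occurrences are counted character-wise and may overlap, results of earlier iterations are carried along and pruned against longer substrings found later, and the names `maximal_substrings` and `max_iter` are hypothetical.

```python
from collections import Counter


def maximal_substrings(text, min_occ=2, lmin=1, lmax=5, max_iter=10):
    """Iteratively collect maximal repeated substrings of `text`."""
    result = {}
    for _ in range(max_iter):  # at most D iterations
        # Steps 108/110: count substrings with lmin <= length <= lmax.
        counts = Counter(
            text[i:i + n]
            for n in range(lmin, lmax + 1)
            for i in range(len(text) - n + 1)
        )
        # Step 112: drop substrings occurring fewer than MinOcc times.
        counts = {s: c for s, c in counts.items() if c >= min_occ}
        result.update(counts)
        # Step 114: prune substrings contained in a longer substring
        # with the same occurrence count.
        result = {
            s: c for s, c in result.items()
            if not any(
                len(t) > len(s) and s in t and result[t] == c
                for t in result
            )
        }
        # Step 116: stop if no surviving substring reached length lmax.
        if not any(len(s) == lmax for s in counts):
            break
        # Step 120: Lmin_(A+1) = Lmax_A + 1, Lmax_(A+1) = Lmax_A * 2.
        lmin, lmax = lmax + 1, lmax * 2
    return result


print(maximal_substrings("abcabc"))  # {'abc': 2}
```

Only one length interval is counted at a time, which is what keeps the tracked set small compared with counting all substrings at once.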

The preferred method is, when applied to a text, independent of language. Therefore, it may enable advanced text functions not covered today.

By way of example, when all maximal substrings STR_A_C in a text are found with the preferred method, in a very simple embodiment all maximal substrings STR_A_C may be used as an approximative substitute for concepts of the text. This may be improved when statistical relevance of the maximal substrings STR_A_C found is included. An additional preprocessing step prior to identifying maximal substrings STR_A_C may be performed, and thus linguistic variants of words contained in the substrings may be eliminated, which results in improvement of the quality of the substrings found. Additionally or alternatively, it is possible to reduce inflected forms of words in the pattern to their morphological stems. This is still cost efficient as no sophisticated complex linguistic analysis is necessary. The substring STR_A_C found may be presented to a user, preferably together with its count. Further, when determining sub-substrings STR_A_B of each of the identified maximal substrings STR_A_C, a statistical filter may be applied to avoid “overfitting”. Such sub-substrings STR_A_B may be considered in a statistical and/or a linguistic based selection procedure. For instance, if a maximal substring in a text is “ABC CORPORATION HAS”, the verb “HAS” may be omitted.

A further quality improvement may be achieved if the method blinds out the highest detail level and goes back to a lower detail level if a frequency of one or more maximal substrings STR_A_C found in the text drops below a defined threshold (e.g., if the maximal substrings STR_A_C found become very rare). This threshold may be chosen dependent on the complexity of the text or the like.

The example elucidated above is one possible usage of the method to identify maximal substrings STR_A_C in a text.

The invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention may take the form of a computer program product accessible from a computer-usable or computer readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium may be any apparatus that may contain, store, communicate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device). Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

A preferred data processing system 200 (computer) as depicted in FIG. 2 suitable for storing and/or executing program code will include at least one processor 202 coupled directly or indirectly to memory elements 204 through a system bus 206. The memory elements 204 may include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O-devices 208, 210 (including, but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system 200 either directly or through intervening I/O controllers 212.

Network adapters 214 may also be coupled to the system 200 to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

Another preferred aspect of the invention, a string-pattern-conceptualization method, is depicted in a flowchart 250 in FIG. 3a. By way of example, maximal substrings found in a method as elucidated above may be applied in such a conceptualization method. The preferred conceptualization method, however, is not restricted to the use of maximal substrings; instead, any set of substrings of a string pattern may be used.

The goal is to compute a set SN of N-tuples of strings; in the following, these N-tuples of strings are called “N-siblings”. It is of interest to find the maximum set (i.e., all N-siblings for N≧2). Ranking the set SN and pruning the set SN to the set most relevant to a given scenario is an optional preferred embodiment.

Preferably, the string pattern may be a text. The method provides an approximative conceptualization by detecting related concepts from a given amount of text (e.g., in a search machine). The preferred method is time and resource efficient, particularly when combined with the preferred string pattern analysis method to find maximal substrings in a string pattern as described above. In order to avoid unnecessary repetitions, reference is made to the preceding description of the string pattern analysis method if such an analysis method should be combined with the preferred conceptualization method.

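$$\mathit{SN} = \{\, (Y_1, \ldots, Y_N) \mid \bigwedge_{1 \le i \le N} Y_i \ne e \;\wedge\; \bigwedge_{i \ne j} Y_i \ne Y_j \;\wedge\; \exists X \in U, Z \in U : ( \bigwedge_{1 \le i \le N} X \circ Y_i \circ Z \in R \;\wedge\; (X \ne e \vee Z \ne e) ) \,\}$$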
In the simplest case, N=2. In this case, the method looks for “2-siblings”, i.e., pairs of substrings:
$$\mathit{S2} = \{\, (Y_1, Y_2) \mid Y_1 \ne e \;\wedge\; Y_2 \ne e \;\wedge\; Y_1 \ne Y_2 \;\wedge\; \exists X \in U, Z \in U : ( X \circ Y_1 \circ Z \in R \;\wedge\; X \circ Y_2 \circ Z \in R \;\wedge\; (X \ne e \vee Z \ne e) ) \,\}$$

This means that siblings Y1, Y2 are searched which are different from each other and both occur in the set of strings R as concatenations of identical prefixes X and suffixes Z, where at least the prefix X or the suffix Z is non-empty.

The reference set R is denoted as SET_r_i in the following text, but R is used instead of SET_r_i for the above-mentioned formulas for compactness of the formulas.

As an example, assume that the set SET_r_i of strings consists just of these two strings:

1. “If the process is a long-running business process then the output is”,

2. “If the process is a microflow then the output is”.

Y1=“long-running business process”

Y2=“microflow”

X=“If the process is a”

Z=“then the output is”
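The example above can be reproduced with a short sketch. It is only an illustration under stated assumptions, not the patented procedure: the function names `shared_prefix_len`, `split_pair` and `two_siblings` are hypothetical, and each pair of strings is split at its longest shared prefix and longest shared suffix, although the definition of S2 admits any common prefix X and suffix Z.

```python
from itertools import combinations


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def split_pair(s1: str, s2: str):
    """Return (X, Y1, Y2, Z) using the longest shared prefix and suffix, or None."""
    p = shared_prefix_len(s1, s2)
    # The shared suffix must not overlap the shared prefix in the shorter string.
    limit = min(len(s1), len(s2)) - p
    q = shared_prefix_len(s1[::-1][:limit], s2[::-1][:limit])
    x, z = s1[:p], s1[len(s1) - q:]
    y1, y2 = s1[p:len(s1) - q], s2[p:len(s2) - q]
    # Constraints of S2: Y1 and Y2 non-empty and different, X or Z non-empty.
    if y1 and y2 and y1 != y2 and (x or z):
        return x, y1, y2, z
    return None


def two_siblings(reference_set):
    """Collect one (X, Y1, Y2, Z) split for every qualifying pair of strings."""
    found = []
    for s1, s2 in combinations(sorted(reference_set), 2):
        split = split_pair(s1, s2)
        if split is not None:
            found.append(split)
    return found


SET_r = {
    "If the process is a long-running business process then the output is",
    "If the process is a microflow then the output is",
}
for x, y1, y2, z in two_siblings(SET_r):
    print(repr(x.strip()), [y1, y2], repr(z.strip()))
```

In this sketch X and Z keep the blank that separates them from the siblings; stripping that blank gives the X and Z listed above.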

In a more general case, this method may allow the empty string to be one of the siblings; semantically this means that the concepts of a set of siblings of which one sibling is the empty string represent optional concepts, that may or may not occur in a particular context.

In this more general case, the formula simplifies to:

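$$\mathit{SN} = \{\, (Y_1, \ldots, Y_N) \mid \bigwedge_{i \ne j} Y_i \ne Y_j \;\wedge\; \exists X \in U, Z \in U : ( \bigwedge_{1 \le i \le N} X \circ Y_i \circ Z \in R \;\wedge\; (X \ne e \vee Z \ne e) ) \,\}$$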
and the associated formula for N=2 simplifies to:
$$\mathit{S2} = \{\, (Y_1, Y_2) \mid Y_1 \ne Y_2 \;\wedge\; \exists X \in U, Z \in U : ( X \circ Y_1 \circ Z \in R \;\wedge\; X \circ Y_2 \circ Z \in R \;\wedge\; (X \ne e \vee Z \ne e) ) \,\}$$

FIG. 3a depicts a flow chart 250 representing a preferred string pattern conceptualization method, particularly for a pattern of words. The preferred method represents a quite general case and comprises steps which are performed one or more (i) times. In step 252, a reference set SET_r_i (denoted as R in the example above) of strings STR_n_i is set. The reference set SET_r_i may, but need not, consist of maximal substrings. The strings STR_n_i may overlap each other. In step 254, a search is done inside the reference set SET_r_i for specific N-tuples of substring triples X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i with N≧2 and m=1 . . . N. In step 256, each N-tuple [Y1_i|Y2_i| . . . |Ym_i] is considered as a candidate for representing related concepts, wherein each concatenation X_i∘Y1_i∘Z_i; X_i∘Y2_i∘Z_i; . . . ; X_i∘Ym_i∘Z_i of the substring triples X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i is an explicit member of the reference set SET_r_i. Each middle substring out of Y1_i, Y2_i, . . . , Ym_i is unequal to every other middle substring out of Y1_i, Y2_i, . . . , Ym_i within the substring triples found inside the reference set SET_r_i. Each prefix substring X_i is equal to all other prefix substrings X_i within the substring triples found inside the reference set SET_r_i. Each suffix substring Z_i is equal to all other suffix substrings Z_i within the substring triples found inside the reference set SET_r_i, and either the prefix X_i or the suffix Z_i is not empty. In step 258, it is checked whether a new iteration has to be done. If yes, the iteration index i is increased by one and the loop starts again at step 252. If no, the method ends in step 260.
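The search of step 254 can be sketched as an exhaustive enumeration. This is a minimal illustration under assumptions rather than the method as implemented: the name `n_siblings` and the `allow_empty` flag are hypothetical, every split of every string into X, Y, Z is tried (which is only practical for small reference sets), and the middle strings are grouped by their shared context (X, Z).

```python
from collections import defaultdict


def n_siblings(reference_set, allow_empty=False):
    """Map each context (X, Z) to the tuple of distinct middle strings it frames.

    allow_empty=True corresponds to the more general case in which the empty
    string may be one of the siblings (an optional concept).
    """
    groups = defaultdict(set)
    for s in reference_set:
        for i in range(len(s) + 1):
            for j in range(i, len(s) + 1):
                x, y, z = s[:i], s[i:j], s[j:]
                # Either the prefix or the suffix must be non-empty.
                if (x or z) and (y or allow_empty):
                    groups[(x, z)].add(y)
    # Only contexts with at least two distinct middle strings yield N-tuples (N >= 2).
    return {ctx: tuple(sorted(ys)) for ctx, ys in groups.items() if len(ys) >= 2}


R = {"the red car", "the blue car", "the green car"}
print(n_siblings(R)[("the ", " car")])
```

Because every string in a group was produced by splitting a member of the reference set, each concatenation X∘Y∘Z is a member of that set by construction. The result also contains shifted variants of the same finding, such as the context ("the", " car"), which is one reason a later ranking or filtering step is useful.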

Optional steps between steps 256 and 258 of flow chart 250, which may be performed individually or in any combination, are indicated in FIG. 3b. In optional step 262, competing candidate N-tuples [Y1_i|Y2_i| . . . |Ym_i] are ranked considering their context. For ranking it is favourable to denote each string STR_n_i with its occurrence count OCC_n_i and to rank according to the occurrence count. An example of this is given below with reference to FIG. 4a.

In optional step 264, candidate N-tuples [Y1_i|Y2_i| . . . |Ym_i] are identified that occur multiple times in different contexts. In optional step 266, filtering can be used to restrict results. In optional step 268, generalizations/specializations may be found in concepts.

FIG. 4a depicts a flow chart 300 representing another preferred embodiment of the string pattern conceptualization method, particularly for a pattern of words. The preferred embodiment comprises steps which are performed one or more (i) times. In step 302, (1) a reference set SET_r_i of strings STR_n_i is set and (2) each string STR_n_i is denoted with its occurrence count OCC_n_i. The reference set SET_r_i may by way of example consist of maximal substrings. The strings STR_n_i may overlap each other. In step 304, a search is done inside the reference set SET_r_i for specific N-tuples of substring triples X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; . . . ; X_i,Ym_i,Z_i with N≧2 and m=1 . . . N. Each N-tuple [Y1_i|Y2_i| . . . |Ym_i] is considered as a candidate for representing related concepts.

With N=2 the N-tuple is a pair [Y1|Y2] taken from the substring triples X_i,Y1_i,Z_i and X_i,Y2_i,Z_i. With N=3 the N-tuple is a triple [Y1|Y2|Y3] taken from the substring triples X_i,Y1_i,Z_i; X_i,Y2_i,Z_i; X_i,Y3_i,Z_i, etc. For each iteration i, the reference set SET_r_i reasonably is a different set of strings STR_n_i. By way of example, N-tuples may be obtained either by running the algorithm for pairs and then post-processing the results by creating groups from pairs, or directly by the algorithm.
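The post-processing variant, creating groups from pairs, can be sketched as follows. The function name `group_pairs` and the input format (a list of (X, Y1, Y2, Z) tuples) are assumptions made for this illustration; the text does not prescribe a data layout.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Merge 2-sibling results that share the same prefix X and suffix Z.

    pairs is an iterable of (X, Y1, Y2, Z) tuples; the result maps each
    context (X, Z) to the sorted tuple of all siblings found in it.
    """
    by_context = defaultdict(set)
    for x, y1, y2, z in pairs:
        by_context[(x, z)].update((y1, y2))
    return {ctx: tuple(sorted(ys)) for ctx, ys in by_context.items()}


pairs = [
    ("the ", "red", "blue", " car"),
    ("the ", "blue", "green", " car"),
    ("a ", "cat", "dog", ""),
]
print(group_pairs(pairs))
```

Merging is valid because every pair in a group shares the same X and Z, so all concatenations X∘Y∘Z of the merged tuple are members of the reference set and all siblings are distinct.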

As already expressed as relational formula above, the N-tuples [Y1_i|Y2_i| . . . |Ym_i] inside the SET_r_i preferably fulfill the following constraints:

For ranking, one or more of the following criteria (1), (2), (3) are used:

The formulas used in the criteria (1) and (2) factor in, as a squared term, the affinity between X_i, Z_i, and the middle substrings: if these prefixes and suffixes co-occur almost exclusively with the found middle strings, this deserves a high rank. In addition, the formulas factor in, as a linear term, the frequencies of the complete concatenations: if this number is significantly high, this deserves a high rank.
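The formulas of criteria (1) and (2) are not reproduced in this text, so the following is only a hypothetical score with the shape described above: a squared affinity term multiplied by a linear frequency term. The function name `rank_score`, its parameters, and all numbers are invented for illustration and are not values from the patent.

```python
def rank_score(sibling_counts, context_count):
    """Hypothetical score: affinity squared times total frequency.

    sibling_counts: occurrence counts of the concatenations X+Yk+Z of the tuple.
    context_count:  how often X and Z frame any middle string at all.
    """
    total = sum(sibling_counts)
    # Affinity is close to 1 when X and Z occur almost only with these siblings.
    affinity = total / context_count
    return affinity ** 2 * total


# Invented numbers: the same sibling frequencies in a specific and a generic context.
specific = rank_score([6, 4], context_count=12)
generic = rank_score([6, 4], context_count=1000)
print(specific > generic)
```

Under this assumed score, a context that frames many unrelated middle strings is pushed down the ranking even when the sibling frequencies are identical.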

By way of example, frequent substrings like “the . . . and” with X=“the” and Z=“and” may be eliminated by criteria (1) and (2).

Here are the values for an example with concrete numbers:

Criterion (3), comprised in optional step 308, provides a further refinement of the base algorithm or any one of the refinements elucidated above. The ranking of candidates may take advantage of multiple independent occurrences of N-tuples [Y1_i|Y2_i| . . . |Ym_i]. This is achieved by doing at least one of:

Rank candidates with the highest rank being the most significant result.

In the next optional step 310, a refinement of the base algorithm or any one of the refinement steps 306, 308 may be performed by filtering a result of one or more N-tuples [Y1_i|Y2_i| . . . |Ym_i] found as candidates for representing related concepts, considering specific types of a desired result. The analysis may thus be focused on specific results the user may want to find.
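Two filters that appear in the examples later in this description, a length range for the concepts and “show only concepts surrounded by blanks”, can be sketched as follows. The function name `filter_candidates`, its parameters, and the candidate layout (a mapping from a context (X, Z) to a tuple of siblings) are assumptions of this sketch, as is the reading that “surrounded by blanks” means X ends with a blank and Z starts with one.

```python
def filter_candidates(candidates, min_len=None, max_len=None, blanks_only=False):
    """Keep only candidates whose context and siblings pass all active filters."""
    kept = {}
    for (x, z), siblings in candidates.items():
        # Assumed reading of "surrounded by blanks": a blank on both sides.
        if blanks_only and not (x.endswith(" ") and z.startswith(" ")):
            continue
        if min_len is not None and any(len(y) < min_len for y in siblings):
            continue
        if max_len is not None and any(len(y) > max_len for y in siblings):
            continue
        kept[(x, z)] = siblings
    return kept


candidates = {
    ("the ", " car"): ("blue", "red"),
    ("th", "e"): ("", "os"),
    ("set ", " to"): ("off", "on"),
}
print(filter_candidates(candidates, blanks_only=True))
print(filter_candidates(candidates, min_len=1, max_len=3))
```

Filters of this kind only restrict the result; they do not change how candidates are found.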

For filtering at least one of the following may be done:

In optional refinement step 312, generalizations and/or specializations in concepts may be revealed. This may be achieved by comparing at least two occurrence counts Occ_1, Occ_2, . . . , Occ_m if two or more of the middle strings Y1_i, Y2_i, . . . , Ym_i and/or two or more concatenations of prefix and middle strings X_i+Y1_i; X_i+Y2_i; . . . ; X_i+Ym_i are explicit members of the reference set SET_r_i. By way of example, if the occurrence count, e.g. Occ_4, of a first middle string, e.g. Y4_i, is significantly greater than the occurrence count of another middle string with which it is compared, the first middle string Y4_i may reasonably be considered a more generalized concept than the other middle string.
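The comparison of occurrence counts can be sketched as below. The text does not quantify “significantly greater”, so the `ratio` threshold is a hypothetical parameter, as are the function name `generalizations` and the example counts.

```python
def generalizations(occurrences, ratio=5.0):
    """Return (general, specific) pairs of middle strings.

    occurrences maps each middle string to its occurrence count. A string is
    treated as more general than another if it occurs at least `ratio` times
    as often; the threshold is an assumption of this sketch.
    """
    pairs = []
    for general, n_general in occurrences.items():
        for specific, n_specific in occurrences.items():
            if general == specific or n_specific <= 0:
                continue
            if n_general >= ratio * n_specific:
                pairs.append((general, specific))
    return sorted(pairs)


# Invented counts for illustration only.
occ = {"process": 120, "microflow": 8, "long-running business process": 10}
print(generalizations(occ))
```

With these invented counts, “process” is reported as more general than both other strings, while the two rarer strings are not ordered relative to each other.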

In decision step 314 it is decided if another iteration shall be done. If yes, iteration index i is set to i=i+1 and steps 302 to 314 are repeated. If no, next step 316 is the end of the algorithm.

A further favorable refinement of the algorithm, depicted in FIG. 4b, may be applied to the base algorithm steps 302-304 or to any of the previous refinements by calling the algorithm iteratively.

The algorithm as described so far is run to identify the most significant N-tuple [Y1_i|Y2_i| . . . |Ym_i] of concepts at step 318.

Here, one or more occurrences of the N concepts in the string pattern are replaced by the most frequently occurring concept, yielding an altered string pattern at step 320. For example, a copy of a source text is made and all occurrences of the concepts of the N-tuple [Y1_i|Y2_i| . . . |Ym_i] are replaced by the most frequently occurring concept. Then the algorithm is rerun on that copy of the source text (i.e., the steps elucidated in FIG. 3a are performed on the altered string pattern).
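The replacement of step 320 can be sketched as follows. The function name `replace_with_most_frequent` is hypothetical, and the sketch makes one simplifying assumption: a sibling is replaced only where it occurs between the prefix X and the suffix Z, which avoids touching unrelated occurrences of the same characters elsewhere in the text.

```python
def replace_with_most_frequent(text, x, z, siblings):
    """Replace every sibling found in the context X...Z by the most frequent one.

    Returns the altered text and the concept that was kept.
    """
    # Count each sibling in its context on the unmodified text.
    counts = {y: text.count(x + y + z) for y in siblings}
    target = max(siblings, key=lambda y: counts[y])
    for y in siblings:
        if y != target:
            text = text.replace(x + y + z, x + target + z)
    return text, target


source = (
    "If the process is a long-running business process then the output is A. "
    "If the process is a microflow then the output is B. "
    "If the process is a microflow then the output is C."
)
altered, kept = replace_with_most_frequent(
    source,
    "If the process is a ",
    " then the output is",
    ("long-running business process", "microflow"),
)
print(kept)
print(altered)
```

The altered text can then be passed to the conceptualization method again, so that contexts which previously differed only in this one concept become identical and may reveal further siblings.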

According to a preferred embodiment of the invention, a program product comprising a computer useable storage medium including a computer readable program is proposed, wherein the computer readable program when executed on a computer causes the computer to perform, one or i times, the steps of the method described above.

Preferably, each string STR_n_i may be denoted with its occurrence count OCC_n_i. This is favorably done when the candidates should be ranked.

By way of example some results are shown. Imagine that the algorithm discovers that the following related concepts are the most significant pair of related concepts:

Other examples may be performed by running the algorithm on a BPC samples web site.

One filter is “show only concepts surrounded by blanks”. The three most significant pairs in the output are these:

Running the algorithm (e.g., on the same BPC samples web site) with the filter set to 1 to 3 characters yields a most significant pair in the output with unbalanced occurrence counts:

Furthermore, the invention may take the form of a computer program product accessible from a computer-usable or computer readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium may be any apparatus that may contain, store, communicate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk.

Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

A preferred data processing system 400 (computer) as depicted in FIG. 5 suitable for storing and/or executing program code will include at least one processor 402 coupled directly or indirectly to memory elements 404 through a system bus 406. The memory elements 404 may include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O-devices 408, 410 (including, but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system 400 either directly or through intervening I/O controllers 412.

Network adapters 414 may also be coupled to the system 400 to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

While the foregoing has been with reference to particular embodiments of the invention, it will be appreciated by those skilled in the art that changes in these embodiments may be made without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims.

Seiffert, Roland, Arning, Andreas

Executed on | Assignor | Assignee | Conveyance | Frame/Reel/Doc
Dec 15 2008 | ARNING, ANDREAS | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 022044/0677
Dec 16 2008 | SEIFFERT, ROLAND | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 022044/0677
Dec 31 2008 | | International Business Machines Corporation | (assignment on the face of the patent) |
Date | Maintenance Fee Events
Apr 15 2016 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity.
Jul 06 2020 | REM: Maintenance Fee Reminder Mailed.
Dec 21 2020 | EXP: Patent Expired for Failure to Pay Maintenance Fees.

