A system and method of searching a database in which documents of different languages are included. The system includes a synonym or keyword dictionary which is bi-directional and allows for translation of keywords between a first language and other languages. The translated keywords for the document are stored in an inverted index which is then used for searching, either in a selected language, a second language or in all languages, as determined by the user. This use of multiple searches and a translated synonym dictionary avoids the need to translate the entire document and avoids the inaccuracies which may result from translation.
1. A computerized method of searching documents written in a plurality of languages, the method comprising:
receiving a first query that includes at least one keyword in a first language;
generating a second query by translating the at least one keyword into a second language;
applying the first query against documents including at least one document written in the first language and at least one document written in the second language with the at least one keyword in the first language;
applying the second query against documents written in the second language;
generating a first set of results based on the first query, wherein the first set of results includes each document written in the first language that matches the first query; and
generating a second set of results based on the first and second queries, wherein the second set of results includes each document written in the second language that matches at least one of the first query or the second query.
11. A computer system comprising:
a system for searching documents written in a plurality of languages, the system comprising at least one computer, wherein the searching is implemented using a method including:
receiving a first query that includes at least one keyword in a first language;
generating a second query by translating the at least one keyword into a second language;
applying the first query against documents including at least one document written in the first language and at least one document written in the second language with the at least one keyword in the first language;
applying the second query against documents written in the second language;
generating a first set of results based on the first query, wherein the first set of results includes each document written in the first language that matches the first query; and
generating a second set of results that includes each document written in the second language based on the first and second queries, wherein the second set of results matches at least one of the first query or the second query.
16. A document searching program stored on a computer-useable medium, which causes a computer system to perform a method when executed on the computer system, wherein the documents are written in a plurality of languages, the method comprising:
receiving a first query that includes at least one keyword in a first language;
generating a second query by translating the at least one keyword into a second language;
applying the first query against documents including at least one document written in the first language and at least one document written in the second language with the at least one keyword in the first language;
applying the second query against documents written in the second language;
generating a first set of results based on the first query, wherein the first set of results includes each document written in the first language that matches the first query; and
generating a second set of results that includes each document written in the second language based on the first and second queries, wherein the second set of results matches at least one of the first query or the second query.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
applying the first query against an inverted index in the first language; and
applying the first query against an inverted index in the second language.
7. The method of
identifying keywords from each of the plurality of documents;
translating each identified keyword into each of the plurality of languages; and
creating an index in each of the plurality of languages, wherein the applying the first query uses at least one of the plurality of indexes.
8. The method of
9. The method of
12. The system of
13. The system of
15. The system of
identifying keywords from each of the plurality of documents;
translating each identified keyword into each of the plurality of languages; and
creating an index in each of the plurality of languages, wherein the system for applying the first query uses at least one of the plurality of indexes.
17. The program of
18. The program of
19. The program of
20. The program of
identifying keywords from each of the plurality of documents;
translating each identified keyword into each of the plurality of languages; and
creating an index in each of the plurality of languages, wherein the applying the first query uses at least one of the plurality of indexes.
This application is a continuation of U.S. patent application Ser. No. 11/151,047, filed on 13 Jun. 2005 now U.S. Pat. No. 7,433,894, which is a continuation of U.S. patent application Ser. No. 10/066,346, filed on 1 Feb. 2002 now U.S. Pat. No. 6,952,691, both of which are hereby incorporated herein by reference.
1. Field of the Invention
The present invention relates to the field of searching a database using search term(s) entered by a user. More particularly, the present invention is a system and method for searching a database that includes material in different languages, where the search term(s) are entered in one of the languages and the database need not be translated into the other languages.
2. Background Art
Various methods have been proposed for searching a database wherein the database includes material in multiple languages. One approach is to translate the entire database into the language in which a search term is entered or the language of the user. However, this could involve a large amount of translation for a sizable database (and multiple translations if the database is used by users in different languages). Further, each process of translating a document has the potential for losing (or distorting) some of the meaning of the original text.
For these reasons, it is desirable to avoid translating the documents to allow for a search in a particular language.
Another approach is to use a synonym list and apply it to the search term(s) entered in one language. That is, the text of the documents in the database remains in the original language, and synonyms in each language for each search term are used for the search of the database. This system may work in some cases but is undesirable in others, because considering all of the synonyms in the different languages could lead to incorrect results. For example, the word for “network” in Spanish is “red”, and a search on “network” which blindly translates the search term would incorrectly find English documents which include the color “red”.
Further, some documents include text in one language and key words presented in a different language to avoid changing their meaning. Thus, it is desirable to be able to search a database which includes such terms, and it would not be effective to search only for the translated form of the word.
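The false match described above can be illustrated with a minimal sketch. The synonym table, the documents, and the function names are hypothetical and chosen only for illustration; they are not taken from the described system. The first function applies every synonym to every document regardless of language, while the second applies each synonym only to documents written in that synonym's language.

```python
# Hypothetical toy data, for illustration only.
SYNONYMS = {"network": {"en": ["network"], "es": ["red"]}}

DOCS = [
    {"id": 1, "lang": "en", "text": "the network cable is unplugged"},
    {"id": 2, "lang": "en", "text": "paint the door red"},
    {"id": 3, "lang": "es", "text": "la red local no responde"},
]


def blind_search(term):
    """Match every synonym against every document, ignoring language."""
    by_lang = SYNONYMS.get(term, {})
    words = {w for ws in by_lang.values() for w in ws} or {term}
    return [d["id"] for d in DOCS if words & set(d["text"].split())]


def language_aware_search(term):
    """Match each synonym only against documents in that synonym's language."""
    by_lang = SYNONYMS.get(term, {})
    hits = []
    for d in DOCS:
        words = set(by_lang.get(d["lang"], []))
        if words & set(d["text"].split()):
            hits.append(d["id"])
    return hits


print(blind_search("network"))           # [1, 2, 3]  (document 2 is a false match)
print(language_aware_search("network"))  # [1, 3]
```

Restricting each synonym to its own language removes the false match on the color “red”, although on its own it would miss a document that quotes the search term untranslated.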
As will be apparent to one skilled in the relevant art, the process of translating and searching in multiple languages can consume substantial computing resources. Many of the multi-language database searching techniques require a powerful computer or take an inordinate amount of time to process a single search, the amount depending on the size of the database, the number of supported languages and the nature of the queries. However, the computing resources have a cost associated with them, either in requiring a larger or faster system or in terms of tying up the computer while a large task is running to the exclusion of other users. Further, a search which takes a long period of time may prevent the user from interactively modifying the search to obtain meaningful results. Accordingly, it is desirable to avoid using large computing resources.
Accordingly, existing systems and methods for searching databases have undesirable disadvantages and limitations which will be apparent to those skilled in the art in view of the following description of the present invention.
The present invention overcomes the disadvantages and limitations of the prior art systems by providing a simple, yet effective, method and system for searching a database including documents in multiple supported languages. The present invention also supports searching a database in which the text is comprised of documents written in multiple languages, including those documents which are written in one language but which include words or phrases from a second language.
The present invention has the advantage that a translation of the documents in the database into each of the supported languages is not required.
The present invention also has the advantage that the meaning of the original document is not lost or distorted through a translation process to allow searching of the document in different languages.
The present invention also allows for the searching of a database in a native or natural language while finding documents which are written in other languages.
Other objects and advantages of the system and method of the present invention will be apparent to those skilled in the relevant art, in view of the following description of the preferred embodiment, taken together with the accompanying drawings and the appended claims.
Having thus described some of the objects and advantages of the present invention, other objects and advantages will be apparent to those skilled in the art in view of the following description of the invention taken in conjunction with the accompanying drawings in which:
In the following description of the preferred embodiment, the best implementation of practicing the invention presently known to the inventor will be described with some particularity. However, this description is intended as a broad, general teaching of the concepts of the present invention through a specific embodiment, and is not intended to limit the present invention to what is shown in this embodiment, especially since those skilled in the relevant art will recognize many variations and changes to the specific structure and operation shown and described with respect to these figures.
However, some technical documents are written in a native language (such as Spanish) but use technical terms from another language (for example, from English). In such a system, searching the national language database for the national language equivalent of a search term will not find the search term if it is included in the document in another language.
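One way to address this case, following the steps recited in the claims, is to apply both the original search term and its translation to the national language documents. The following is a minimal sketch under stated assumptions: the dictionary contents, the documents, and the names `DICTIONARY`, `matches` and `search` are hypothetical, and simple whitespace tokenization stands in for whatever keyword matching an actual system would use.

```python
# Hypothetical one-way dictionary for the (first, second) language pair.
DICTIONARY = {("en", "es"): {"network": "red"}}

DOCS = [
    {"id": 1, "lang": "en", "text": "configure the network adapter"},
    {"id": 2, "lang": "es", "text": "reinicie la red local"},
    {"id": 3, "lang": "es", "text": "configure el adaptador de network"},
    {"id": 4, "lang": "en", "text": "the red light is blinking"},
]


def matches(doc, keyword):
    """True if the keyword appears as a whole word in the document text."""
    return keyword in doc["text"].lower().split()


def search(keyword, first_lang, second_lang):
    """Return (first-language results, second-language results)."""
    translated = DICTIONARY.get((first_lang, second_lang), {}).get(keyword)
    # First set: first-language documents that match the original keyword.
    first = [d["id"] for d in DOCS
             if d["lang"] == first_lang and matches(d, keyword)]
    # Second set: second-language documents that match the original
    # keyword OR its translation.
    second = [d["id"] for d in DOCS
              if d["lang"] == second_lang
              and (matches(d, keyword)
                   or (translated is not None and matches(d, translated)))]
    return first, second


print(search("network", "en", "es"))  # ([1], [2, 3])
```

In this sketch, document 3 is found because the English term is applied to the Spanish documents, and document 4 is not returned because the translated term “red” is never applied to English documents.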
Thus, the process of creating an inverted index involves the steps of creating an index in each language in block 232 and creating a merged inverted index in block 234 using the keyword dictionary 220, which includes synonyms in each supported language. While two languages are shown in the figures of the present invention, the present invention can easily be expanded to support the desired number of languages, and, while English is described as one language for the documents and for the searches, the present invention is not limited to serving documents in English and another language could be substituted, if desired.
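The two indexing steps can be sketched as follows. This is an illustrative sketch only, not the implementation of blocks 232 and 234: the layout of the keyword dictionary, the keying of the merged index by (language, keyword) pairs, and all names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical keyword dictionary: each entry groups the synonyms of one
# concept by language.
KEYWORD_DICTIONARY = [
    {"en": "network", "es": "red"},
    {"en": "printer", "es": "impresora"},
]

# Hypothetical documents: id -> (language, text).
DOCS = {
    1: ("en", "network printer setup"),
    2: ("es", "la impresora de red"),
}


def build_language_indexes(docs):
    """Build one inverted index (word -> document ids) per language."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, (lang, text) in docs.items():
        for word in text.lower().split():
            indexes[lang][word].add(doc_id)
    return indexes


def build_merged_index(indexes, dictionary):
    """Merge per-language postings so any synonym finds all languages."""
    merged = defaultdict(set)
    for entry in dictionary:
        doc_ids = set()
        # Each synonym is looked up only in the index of its own language.
        for lang, word in entry.items():
            doc_ids |= indexes[lang].get(word, set())
        for lang, word in entry.items():
            merged[(lang, word)] |= doc_ids
    return merged


indexes = build_language_indexes(DOCS)
merged = build_merged_index(indexes, KEYWORD_DICTIONARY)
print(sorted(merged[("en", "network")]))  # [1, 2]
```

Because each synonym is looked up only in the index of its own language, the Spanish word “red” is never matched against English documents; a term that appears untranslated in a document of another language would still need the original query to be applied against that language's index.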
In
The present invention, it will be recognized, is especially adapted for use in a data processing system such as a general purpose computer with a stored program containing computer program means including a plurality of instructions. Those instructions will generally be written in a high level language which is readable by a human and translated into machine language, that is, simple instructions which are understood by the data processing system. In an appropriate instance such instructions could be written directly in machine language, if desired, an approach which allows for efficiency of execution but which is more difficult to program. The present invention is not limited to any particular input language.
As used in the present document, software, computer program and computer program means are used interchangeably. Software in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form. The Unicode system for managing different languages has been used in the description of the preferred embodiment, but other suitable methods for representing different languages could also be used to advantage in the present invention, if desired.
The term national language has been used to represent a language associated with a user of the system. This language could be any language supported by the system, and might include different languages for different users. So, “national language” might represent Spanish for a Mexican or a person from Spain and might represent French for a person from France or other French-speaking locales. Appropriate synonym tables are available for a variety of common languages as are systems for locating key words and separating common text with little uniqueness from key words which are descriptive of the document under consideration. Such key word locating systems are often technologically directed and identify words which are of interest to the technology under consideration.
Of course, many modifications of the present invention will be apparent to those skilled in the relevant art in view of the foregoing description of the preferred embodiment, taken together with the accompanying drawings and the appended claims. For example, the present invention has been described in connection with documents and searches in English and in a national language, whereas the number of supported languages need not be two and need not include only a single national language. Further, in some circumstances, the documents could be written in a combination of supported languages. Additionally, some elements of the present invention can be used to advantage without the corresponding use of other elements. For example, the use of the synonym or keyword dictionary is not the only way to accomplish the translation of keywords into other languages. Further, various other devices could be substituted to advantage depending on the environmental circumstances. Accordingly, the foregoing description of the preferred embodiment should be considered as merely illustrative of the principles of the present invention and not in limitation thereof.
Drissi, Youssef, Kim, Moon Ju, Kozakov, Lev, Leon Rodriguez, Juan
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Aug 21 2008 | International Business Machines Corporation | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Jan 21 2015 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Jan 15 2019 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
May 15 2023 | REM: Maintenance Fee Reminder Mailed. |
Oct 30 2023 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |