In a QA (question/answer) system, candidate answers in response to a received question are ranked by probabilities estimated by a language model. The language model is built from an ordered centroid created from the question and from information learned from an information source such as the Internet.
11. A computer implemented method for building a language model for use in a QA system receiving definitional questions, the method comprising:
receiving a definitional question and determining a question focus of the definitional question;
obtaining information relevant to the question focus;
generating a centroid vector based on the question focus and the information relevant to the question focus;
generating an ordered centroid based on the centroid vector; and
building the language model based on the ordered centroid.
1. A computer implemented method for building a language model for use in a QA system comprising:
receiving a question, comprising a definitional question or a factoid, and determining a question focus of the question;
querying a source of information with the question focus and obtaining one or more relevant documents;
generating a centroid vector based on the question focus and said one or more relevant documents;
generating an ordered centroid based on the centroid vector; and
utilizing a computer processor that is a component of a computing device to build the language model based on the ordered centroid.
16. A computer readable medium having instructions, which when executed by a computer, implement a QA system that builds a language model, the instructions comprising:
receiving a definitional question and determining a question focus of the definitional question;
querying a source of information with the question focus and obtaining one or more relevant documents;
generating a centroid vector based on the question focus and said one or more relevant documents;
generating an ordered centroid based on the centroid vector; and
utilizing a computer processor that is a component of a computing device to build the language model based on the ordered centroid.
7. The computer implemented method of
building the language model using co-occurring terms with the question focus.
8. The computer implemented method of
obtaining relevant sentences and/or phrases having the question focus and one or more co-occurring terms.
9. The computer implemented method of
10. The computer implemented method of
12. The computer implemented method of
13. The computer implemented method of
receiving results based on the query comprising the question focus and clue words based on the type of the definitional question;
processing the results to obtain expansion terms; and
querying the source of information with the question focus and selected expansion terms based on the results; and
wherein generating the centroid vector comprises using the results from querying the source of information with the question focus and selected expansion terms based on the results.
14. The computer implemented method of
15. The computer implemented method of
The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
With the explosive growth of the Internet, it is possible to obtain information on just about any topic. Although queries provided to search engines may take any number of forms, one particular form that occurs frequently is the “definitional question.” A definitional question is a question of the type such as, but not limited to, “What is X?”, “Who is Y?”, etc. Statistics from 2,516 Frequently Asked Questions (FAQ) extracted from the Internet FAQ Archives (http://www.faqs.org/faqs/) show that around 23.6% are definitional questions, underscoring the importance of this type of question.
A definitional question answering (QA) system attempts to provide relatively long answers to such questions. Stated another way, the answer to a definitional question is not a single named entity, quantity, etc., but rather a list of information nuggets. A typical definitional QA system extracts definitional sentences that contain the most descriptive information about the search term from a document or documents and summarizes the sentences into definitions.
Many QA systems utilize statistical ranking methods based on obtaining a centroid vector (profile). In particular, for a given question, a vector is formed consisting of the most frequent co-occurring terms with the question target as the question profile. Candidate answers extracted from a given large corpus are ranked based on their similarity to the question profile. The similarity is normally the TFIDF score in which both the candidate answer and the question profile are treated as a bag of words in the framework of Vector Space Model (VSM).
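The VSM-style ranking just described can be pictured with a small sketch. The helper names, tokenization, and TFIDF weighting below are illustrative assumptions, not code from the patent:

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Bag-of-words TFIDF vector for a list of tokens."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 1.0) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates_vsm(profile_tokens, candidate_answers, idf):
    """Rank candidate answers by their similarity to the question profile (centroid)."""
    profile = tfidf_vector(profile_tokens, idf)
    scored = [(cosine(tfidf_vector(ans.lower().split(), idf), profile), ans)
              for ans in candidate_answers]
    return sorted(scored, key=lambda x: x[0], reverse=True)
```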
VSM is based on an independence assumption. Specifically, VSM assumes that terms in a vector are statistically independent from one another. However, the terms in an answer or nugget come from a sentence, where words are commonly not independent. For example, if a definitional question is “Who is Tiger Woods?”, a candidate answer may include the words “born” and “1975”, which are not independent. In particular, the sentence may include the phrase “. . . born in 1975 . . . ”. However, the existing VSM framework does not accommodate term dependence.
This Summary and the Abstract are provided to introduce some concepts in a simplified form that are further described below in the Detailed Description. The Summary and Abstract are not intended to identify key features or essential features of the claimed subject matter, nor are they intended to be used as an aid in determining the scope of the claimed subject matter. In addition, the description provided herein and the claimed subject matter should not be interpreted as being directed to addressing any of the shortcomings discussed in the Background.
One aspect described herein provides term dependence to improve the answer reranking for questions in a QA system. Although other forms of questions can be presented to the QA system such as a factoid, reranking of answers to definitional questions is particularly beneficial. The QA system described uses a language model to capture the term dependence. Since a language model is a probability distribution that captures the statistical regularities of natural language use, the language model is used to rerank the candidate answers.
In one embodiment, given a question such as a definitional question q, an ordered centroid, denoted as OC, is learned from a large information source such as the Internet, and a language model LM(OC) is trained with it. Candidate answers obtained from another information source such as an online encyclopedia are then ranked by probabilities estimated by LM(OC). In further specific embodiments, bigram and biterm language models are used. Both of these language models have been beneficial in capturing term dependence and have thereby improved the ranking of the candidate answers.
One general concept herein described includes reranking candidate answers in a QA system using a language model. Referring to
At this point it should be noted that the modules illustrated in
In addition, it should also be noted that input question 108 and output answer 110 are not limited to textual information in that audible or other forms of input and output communication can be used. Similarly, information accessed by QA system 100 is not limited to textual data. In other words, audible and visual information could also be accessed and processed using the techniques described below. For instance, if the information accessed is audible information, a speech recognizer can be used to convert the audible information to text for processing as discussed below.
Depending on the type of question, such as a definitional question rather than a factoid question, it may be helpful to expand the query of the question, as illustrated by optional step 204. Definitional questions are normally short (e.g., “Who is Tiger Woods?”). Question expansion is used to refine the query intention. Steps 206, 208 and 210 illustrate one technique for expanding the question.
Question expansion can include reformulating the question, which may then take the form of a more general query, by simply adding clue words to the question at step 206. For example, for the “Who is . . . ?” question, words such as “biography,” “life story,” or “life history” can be added. Likewise, for the “What is . . . ?” question, words such as “is usually”, “refers to”, etc. can be added. Many known techniques can be used to add clue words to the query based on the type of question. One technique for learning which words to add is described by Deepak Ravichandran and Eduard Hovy in “Learning Surface Text Patterns for a Question Answering System,” Proceedings of the 40th Annual Meeting of the ACL, pp. 41-47, 2002.
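As a concrete illustration of step 206, a minimal sketch of clue-word reformulation, using only the example clue words given above; the mapping keys and function name are assumptions, and a deployed system would learn these patterns rather than hard-code them:

```python
# Illustrative mapping from question pattern to clue words, taken from the
# examples in the text above.
CLUE_WORDS = {
    "who is": ["biography", "life story", "life history"],
    "what is": ["is usually", "refers to"],
}

def reformulate_query(question_focus, question_pattern):
    """Append the clue words for the question pattern to the question focus."""
    return " ".join([question_focus] + CLUE_WORDS.get(question_pattern, []))

# Example: reformulate_query("Aaron Copland", "who is")
# -> "Aaron Copland biography life story life history"
```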
At step 208, the Internet or another large corpus 124 is accessed using, for example, a search engine that is provided with the question focus or reformulated query in order to obtain snippets (small portions) of information about the question focus. As is well known, when a query is provided to a search engine, the search engine returns links to documents having the words contained in the query. In addition to the links, the search engine commonly displays small portions of each document that contain the words of the query. From these returned snippets, a selected number (e.g., five) of the terms that most frequently co-occur with the question focus are added to the question focus as query expansion terms.
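A minimal sketch of step 208, under the assumption that the search engine's snippets are available as plain strings; the tokenization, stop-word list, and function name are illustrative, not from the patent:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}  # illustrative only

def expansion_terms(question_focus, snippets, top_n=5):
    """Return the top-N terms that most frequently co-occur with the question
    focus in the returned snippets, for use as query expansion terms."""
    focus = question_focus.lower()
    counts = Counter()
    for snippet in snippets:
        text = snippet.lower()
        if focus not in text:
            continue
        tokens = [t.strip(".,!?\"'()") for t in text.split()]
        counts.update(t for t in tokens if t and t not in STOP_WORDS and t not in focus)
    return [term for term, _ in counts.most_common(top_n)]
```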
At step 210, a centroid vector is generated by selecting the terms that most strongly co-occur with the question focus. Each candidate term t is weighted using Co(t,T), which denotes the number of sentences in which t co-occurs with the question focus or target T, and Count(t), which gives the number of sentences containing the word t. The weight can also incorporate the inverse document frequency of t, idf(t) (e.g., approximated using statistics from the British National Corpus (BNC); see http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bnc-readme.html), as a measurement of the global importance of the word.
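The weighting equation for selecting centroid terms is not reproduced in the text above. The sketch below uses one plausible combination of the quantities that are defined, Co(t,T), Count(t), and idf(t); the exact formula and the top-N cutoff are assumptions, not the patent's equation:

```python
import math

def centroid_weight(term, co_with_target, count, idf):
    """Assumed weight for a term t: reward frequent co-occurrence with the
    target T, normalize by the term's overall frequency, and scale by idf."""
    co = co_with_target.get(term, 0)   # Co(t, T): sentences where t co-occurs with T
    cnt = count.get(term, 0)           # Count(t): sentences containing t
    if cnt == 0:
        return 0.0
    return math.log(co + 1) / math.log(cnt + 1) * idf.get(term, 1.0)

def centroid_vector(terms, co_with_target, count, idf, top_n=100):
    """Keep the top-N weighted terms as the centroid vector (question profile);
    the cutoff of 100 is illustrative."""
    weights = {t: centroid_weight(t, co_with_target, count, idf) for t in terms}
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
```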
At step 218, the ordered centroid is obtained. Specifically, for each sentence in W, the terms that appear in the centroid vector are retained, in order, as the ordered centroid list. Words not contained in the centroid vector are treated as stop words and ignored. For example, for the question “Who is Aaron Copland?”, the ordered centroid list is provided below (where the words/phrases shown in bold are extracted and placed in the ordered centroid list):
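The example list itself is not reproduced here. As a sketch of the extraction performed in step 218, assuming sentences are already available as strings, only words in the centroid vector are kept, in their original order:

```python
def ordered_centroid(sentences, centroid_terms):
    """For each sentence, keep only the centroid-vector terms, in order;
    all other words are treated as stop words and dropped."""
    ordered = []
    for sentence in sentences:
        ordered.extend(w for w in sentence.lower().split() if w in centroid_terms)
    return ordered
```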
At step 220, a language model is trained using the ordered centroid for each question that is given.
At this point, it may be helpful to provide a discussion concerning the form or type of language model that can be used. In practice, a language model is often approximated by N-gram models such as a Unigram model:
P(w1,n)=P(w1)P(w2) . . . P(wn)
or, a Bigram model:
P(w1,n)=P(w1)P(w2|w1) . . . P(wn|wn−1)
The unigram model makes a strong assumption that each word occurs independently. However, the bigram model takes the local context into consideration. Biterm language models are similar to bigram language models except that the constraint of order on terms is relaxed. Therefore, a document containing “information retrieval” and a document containing “retrieval (of) information” will be assigned the same generation probability. The biterm probabilities can be approximated using the frequency of occurrence of terms, for example, using the so-called min-Adhoc approximation as represented by the following equation:
where C(X) gives the occurrences of the string X. It has been found that bigram and biterm language models are particularly advantageous. As a smoothing approach, linear interpolation of unigrams and bigrams can also be employed.
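The min-Adhoc equation itself is omitted above. Its standard form in the biterm language modeling literature, which is assumed here to be the equation the text refers to, is:
P(ti|ti−1) ≈ (C(ti−1 ti) + C(ti ti−1)) / (C(ti−1) + C(ti))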
Commonly, training of the language models 120 as described above in steps 202, 210, 218 and 220 is performed based on possible input questions and prior to receipt of an actual input question 108 that will receive a corresponding output answer 110. Nevertheless, if desired, QA system 100 can also be implemented using a computing environment capable of performing the steps of method 200 just after receipt of input question 108 from a user and before the corresponding language model 120 is used to rerank candidate answers in the manner discussed below.
At step 304, using a suitable candidate answer generating module 104 (e.g. having a search engine), a corpus of information 128 is accessed using the question 108 to obtain candidate answers 130 contained in one or more relevant documents. Corpus 128 can take many forms. For instance, corpus 128 may be a general, computer-based encyclopedia, stored locally on or in communication with the computer implementing QA system 102. In addition, corpus 128 may be a general information corpus, or be directed to a specific area such as medical information.
At step 306, the document(s) are separated into sentences or other suitable phrases, and those sentences or phrases containing the question focus are retained as candidate answers 130. In one embodiment, in order to improve recall, simple heuristic rules can be used to handle the problem of co-reference resolution. In other words, if a sentence is deemed to contain the question focus and the next sentence starts with “he”, “she”, “it”, or “they”, then the next sentence is also retained.
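A sketch of step 306 and the co-reference heuristic, assuming the document has already been split into sentences; the function name is an assumption, and duplicate retention of a sentence that matches both rules is not handled here:

```python
PRONOUNS = {"he", "she", "it", "they"}

def candidate_sentences(sentences, question_focus):
    """Retain sentences containing the question focus; also retain the next
    sentence when it starts with a pronoun (simple co-reference heuristic)."""
    focus = question_focus.lower()
    kept = []
    for i, sentence in enumerate(sentences):
        if focus in sentence.lower():
            kept.append(sentence)
            if i + 1 < len(sentences):
                first_words = sentences[i + 1].lower().split()
                if first_words and first_words[0].strip(",") in PRONOUNS:
                    kept.append(sentences[i + 1])
    return kept
```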
At step 308, reranking module 106 receives the candidate answers 130 and, using the appropriate language model 120, reranks the candidate answers based on term dependence. In particular, given a candidate answer A=t1t2 . . . ti . . . tn and a bigram or biterm back-off language model trained as discussed above, the probability of generating A can be estimated by the following equation:
where OC stands for the language model of the ordered centroid and λ is the mixture weight combining the unigram and bigram (or biterm) probabilities. After taking the logarithm and exponential, the following equation can be realized:
It should be noted that this equation penalizes verbose candidate answers. This can be alleviated by adding a brevity penalty, BP,
where Lref is a constant standing for the length of reference answer (i.e., centroid vector). LA is the length of the candidate answer. By combining the immediately preceding equations, a final scoring function can be realized
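The equations referenced in the preceding paragraphs are not reproduced in this text. One plausible final scoring function, consistent with the surrounding description (interpolated unigram and bigram or biterm probabilities, length normalization, and a brevity penalty), would take the form below; this is an assumed reconstruction, not necessarily the patent's exact equation:
Score(A) = BP·exp[(1/LA)·Σi log(λ·P(ti|OC) + (1−λ)·P(ti|ti−1,OC))]
BP = exp(min(0, 1 − Lref/LA))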
It should be noted that the unigram model can also be applied; its scoring function is similar to that above, the main difference being that only the unigram probability P(ti|OC) is of concern in a unigram-based scoring function.
In Equation (1), three parameters need to be estimated: P(ti|OC), P(ti|ti-1, OC) and λ. For P(ti|OC) and P(ti|ti-1, OC), maximum likelihood estimation (MLE) can be employed such that
where CountOC(X) is the occurrences of the string X in the ordered centroid and NOC stands for the total number of tokens in the ordered centroid.
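The MLE equations themselves are omitted above. Under the definitions just given, the standard maximum likelihood estimates, which are assumed to match the omitted equations, are:
P(ti|OC) = CountOC(ti)/NOC
P(ti|ti−1, OC) = CountOC(ti−1 ti)/CountOC(ti−1)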
For a biterm language model, the aforementioned min-Adhoc approximation can be used in place of the bigram counts.
In the case of unigram modeling, smoothing is not needed because the only terms of concern are those in the centroid vector, whereas bigram and biterm probabilities may have already been smoothed by interpolation.
The λ can be learned from a training corpus using an Expectation Maximization (EM) algorithm. Specifically, λ can be estimated by maximizing the likelihood of all training instances, given the bigram or biterm model:
BP and P(t1) are ignored because they do not affect λ. λ can be estimated using an EM iterative procedure such as:
where INS denotes all training instances and |INS| gives the number of training instances, which is used as a normalization factor; lj gives the number of tokens in the jth instance in the training data.
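The EM update itself is not shown above. A standard EM procedure for a two-component interpolation weight is sketched below; the per-instance normalization mirrors the description of |INS| and lj, but the details are an assumption rather than the patent's exact update:

```python
def estimate_lambda(instances, p_uni, p_bi, n_iter=20, lam=0.5):
    """Assumed EM procedure for the mixture weight lam in
    lam*P(t|OC) + (1-lam)*P(t|t_prev, OC), averaged over training instances."""
    for _ in range(n_iter):
        total = 0.0
        for tokens in instances:                      # INS: all training instances
            resp = 0.0
            pairs = list(zip(tokens, tokens[1:]))
            for prev, cur in pairs:
                uni = lam * p_uni(cur)
                mix = uni + (1 - lam) * p_bi(prev, cur)
                if mix > 0:
                    resp += uni / mix                 # E-step: responsibility of unigram part
            if pairs:
                total += resp / len(pairs)            # normalize within the instance (~lj)
        lam = total / len(instances)                  # normalize by |INS| (M-step)
    return lam
```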
If desired at step 310 illustrated in
Steps 314 and 316 comprise a loop. At step 314, the jth element of CA, denoted CAj, is obtained. The cosine similarity between CAj and each element i of A is computed and expressed as Sij. Then let Skj = max{S1j, S2j, . . . , Sij}; if Skj < a threshold (e.g., 0.75), the jth element is added to the set A. At step 316, if the length of A exceeds a predefined threshold, the loop exits; otherwise, j = j+1 and processing returns to step 314.
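A sketch of the loop in steps 314 and 316, assuming a bag-of-words cosine similarity over the candidate answers; the helper names, tokenization, and length threshold are illustrative:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two token lists (bag-of-words)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def remove_redundancy(ranked_answers, sim_threshold=0.75, max_answers=10):
    """Walk the ranked candidates CA; keep an answer only if it is not too
    similar to any answer already kept (steps 314 and 316)."""
    kept = []
    for answer in ranked_answers:                     # CA_j, in ranked order
        tokens = answer.lower().split()
        max_sim = max((cosine_sim(tokens, k) for k in kept), default=0.0)
        if max_sim < sim_threshold:
            kept.append(tokens)
        if len(kept) >= max_answers:                  # predefined length threshold
            break
    return [" ".join(k) for k in kept]
```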
In addition to the examples herein provided, other well known computing systems, environments, and/or configurations may be suitable for use with concepts herein described. Such systems include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The concepts herein described may be embodied in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Those skilled in the art can implement the description and/or figures herein as computer-executable instructions, which can be embodied on any form of computer readable media discussed below.
The concepts herein described may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both locale and remote computer storage media including memory storage devices.
With reference to
Computer 410 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 410 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 410.
The system memory 430 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 431 and random access memory (RAM) 432. A basic input/output system 433 (BIOS), containing the basic routines that help to transfer information between elements within computer 410, such as during start-up, is typically stored in ROM 431. RAM 432 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 420. By way of example, and not limitation,
The computer 410 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 410 through input devices such as a keyboard 462, a microphone 463, and a pointing device 461, such as a mouse, trackball or touch pad. These and other input devices are often connected to the processing unit 420 through a user input interface 460 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port or a universal serial bus (USB). A monitor 491 or other type of display device is also connected to the system bus 421 via an interface, such as a video interface 490.
The computer 410 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 480. The remote computer 480 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 410. The logical connections depicted in
When used in a LAN networking environment, the computer 410 is connected to the LAN 471 through a network interface or adapter 470. When used in a WAN networking environment, the computer 410 typically includes a modem 472 or other means for establishing communications over the WAN 473, such as the Internet. The modem 472, which may be internal or external, may be connected to the system bus 421 via the user-input interface 460, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 410, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
It should be noted that the concepts herein described can be carried out on a computer system such as that described with respect to
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not limited to the specific features or acts described above as has been held by the courts. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.