An embodiment includes a computer-implemented method. For one embodiment, the computer-implemented method includes displaying a plurality of first groups of words in a browser for display to a user, constructing a new combination of search terms using more than one user-selected word groups, wherein the user-selected word groups are selected by the user from the plurality of first groups of words displayed to the user; invoking a search service with the new combination of search terms, displaying results of the search service using the new combination of search terms in the browser, and storing the results in memory coupled to a processor.
|
5. A computer-implemented method comprising:
automatically extracting a plurality of groups of words from a set comprising a first document, wherein each group of the plurality of groups comprises a word;
automatically determining a plurality of first counts of a number of times said each of the groups of words in said plurality matches said set;
automatically determining a plurality of second counts of the number of times said each group of words in said plurality matches a corpus of second documents;
obtaining a weight of said each group of words based on the first counts and second counts, comprising automatically performing function fitting on at least first counts of said plurality of groups of words and corresponding second counts of said plurality of groups of words to obtain a fitted function, and using at least one processor in automatically comparing a first count of said each group of words in the plurality of first counts to an evaluation of said fitted function at a second count of said each group of words in the plurality of second counts, to obtain a weight of said each group of words;
automatically ranking based on said weight, said at least one group of words relative to another group of words in said plurality of groups; and
selecting a plurality of first groups of words from the plurality of groups, based on said weights;
displaying the plurality of first groups of words in a browser for display to a user;
constructing a new combination of search terms using more than one user-selected word groups, wherein the user-selected word groups are selected by the user from the plurality of first groups of words displayed to the user;
matching the new combination of search terms using an inverted index;
displaying results of the match in the browser; and
storing the results in memory coupled to a processor.
1. A computer-implemented method comprising:
automatically extracting a plurality of groups of words from a set comprising a first document, wherein each group of the plurality of groups comprises a word;
automatically determining a plurality of first counts of a number of times said each of the groups of words in said plurality matches said set;
automatically determining a plurality of second counts of the number of times said each group of words in said plurality matches a corpus of second documents;
obtaining a weight of said each group of words based on the first counts and second counts, comprising automatically performing function fitting on at least first counts of said plurality of groups of words and corresponding second counts of said plurality of groups of words to obtain a fitted function, and using at least one processor in automatically comparing a first count of said each group of words in the plurality of first counts to an evaluation of said fitted function at a second count of said each group of words in the plurality of second counts, to obtain a weight of said each group of words;
automatically ranking based on said weight, said at least one group of words relative to another group of words in said plurality of groups; and
selecting a plurality of first groups of words from the plurality of groups, based on said weights;
displaying the plurality of first groups of words in a browser for display to a user;
constructing a new combination of search terms using more than one user-selected word groups, wherein the user-selected word groups are selected by the user from the plurality of first groups of words displayed to the user;
invoking a search service with the new combination of search terms;
displaying results of the search service using the new combination of search terms in the browser; and
storing the results in memory coupled to a processor.
9. A non-transitory computer-readable medium comprising a plurality of instructions, the instructions comprising:
instructions to automatically extract a plurality of groups of words from a set comprising a first document, wherein each group of the plurality of groups comprises a word;
instructions to automatically determine a plurality of first counts of a number of times said each of the groups of words in said plurality matches said set;
instructions to automatically determine a plurality of second counts of the number of times said each group of words in said plurality matches a corpus of second documents;
instructions to obtain a weight of said each group of words based on the first counts and second counts, comprising automatically performing function fitting on at least first counts of said plurality of groups of words and corresponding second counts of said plurality of groups of words to obtain a fitted function, and using at least one processor in automatically comparing a first count of said each group of words in the plurality of first counts to an evaluation of said fitted function at a second count of said each group of words in the plurality of second counts, to obtain a weight of said each group of words;
instructions to automatically rank based on said weight, said at least one group of words relative to another group of words in said plurality of groups; and
instructions to select a plurality of first groups of words from the plurality of groups, based on said weights;
instructions to display the plurality of first groups of words in a browser for display to a user;
instructions to construct a new combination of search terms using more than one user-selected word groups, wherein the user-selected word groups are selected by the user from the plurality of first groups of words displayed to the user;
instructions to invoke a search service with the new combination of search terms;
instructions to display results of the search service using the new combination of search terms in the browser; and
instructions to store the results in memory coupled to a processor.
2. The computer-implemented method of
3. The computer-implemented method of
4. The computer-implemented method of
6. The computer-implemented method of
7. The computer-implemented method of
8. The computer-implemented method of
10. The non-transitory computer-readable medium of
11. The non-transitory computer-readable medium of
12. The non-transitory computer-readable medium of
|
This patent application is a continuation of U.S. patent application Ser. No. 12/703,758, filed Feb. 10, 2010, having the title "Finding Relevant Documents," which is herein incorporated by reference.
There are 1.3 billion people on the web and over 100 million active websites. The Internet's universe of information and people, both published openly and addressed directly to the user, is growing every day. Published content includes web pages, news sources, RSS feeds, social networking profiles, blog postings, job sites, classified ads, and other user-generated content such as reviews. Email (both legitimate and spam), text messages, newspapers, subscriptions, etc. are addressed directly to the user. The growth of Internet users and competition among publishers is leading to a backlog of hundreds or thousands of unread email messages, RSS items, and other web content in user inboxes and readers, forcing users to settle somewhere between the extremes of either reading all the items or starting fresh (as in "email bankruptcy").
Accordingly, the inventors of the current patent application believe that use of a search engine (such as BING available from Microsoft Corporation or GOOGLE available from Google Inc.) is not enough, because its use is like using a fishing line: useful for finding what you want right now. The current inventors have made an invention (described in the next paragraph) that can be used more like a fishing net, to help you capture content tailored to your interests, and for which it is either painful or inefficient to repeatedly use a conventional search engine. Google Inc. offers a service called Google Alerts, which delivers email messages of the latest relevant Google results (web, news, etc.) based on the user's choice of a query or topic. Conventional uses of Google Alerts include monitoring a developing news story, keeping current on a competitor or industry, getting the latest on a celebrity or event, and keeping tabs on your favorite sports teams. However, Google Alerts requires the user to enter one or more "search terms" in order to initiate the service. Hence, the relevance of documents identified by Google Alerts depends on the search terms selected by the user. The current inventors believe that it is not easy for users to manually generate appropriate search terms without using the invention as discussed below.
An embodiment includes a computer-implemented method. The computer-implemented method includes automatically extracting a plurality of groups of words from a set comprising a first document, wherein each group of the plurality of groups comprises a word. The computer-implemented method further includes automatically determining a plurality of first counts of a number of times said each of the groups of words in said plurality matches said set, automatically determining a plurality of second counts of the number of times said each group of words in said plurality matches a corpus of second documents, obtaining a weight of said each group of words based on the first counts and second counts, automatically ranking based on said weight, said at least one group of words relative to another group of words in said plurality of groups, and selecting a plurality of first groups of words from the plurality of groups, based on said weights. The computer-implemented method further includes displaying the plurality of first groups of words in a browser for display to a user, constructing a new combination of search terms using more than one user-selected word groups, wherein the user-selected word groups are selected by the user from the plurality of first groups of words displayed to the user; invoking a search service with the new combination of search terms, displaying results of the search service using the new combination of search terms in the browser, and storing the results in memory coupled to a processor.
One or more computers are programmed in accordance with the invention to implement a particular machine that receives as input one or more documents that contain text that is relevant to a user ("interest documents"). As used herein the term "document" includes, but is not limited to, any file stored in a computer-readable storage medium and from which text can be extracted and displayed to a user, such as a file produced by a word processor (e.g. an RTF file or an HTM file by Microsoft WORD), a file in a portable document format (PDF, e.g. by Adobe ACROBAT), a file containing text in ASCII format (e.g. Microsoft NOTEPAD), a file produced by an audio editing program (e.g. a "WAV" file by Audacity), a file produced by an image editing program (e.g. a "JPG" file by Adobe PHOTOSHOP), a file produced by scanning a physical book, or a file produced by typesetting a book, a magazine, or an article by a publisher.
The computer(s) use the interest document(s) to automatically identify word groups that are then used in computerized filtering of documents in accordance with the invention. One illustrative embodiment of the just-described particular machine is illustrated by a server computer 120 (
Unless otherwise described below, a user operates client computer 101 in the normal manner. For example, the user may use web browser 102 to conduct a search on the Internet, via a search engine 105 that returns identifiers 106 of documents that are responsive to the user's search terms. The user may then use computer 101 to retrieve one or more documents 103 from the Internet, e.g. obtain a copy of a document 108 from a website 107 in the normal manner. Additionally, the user may use computer 101 to access a web server 109 to subscribe to an RSS feed to obtain document identifiers 110, and/or read blogs thereon. The user may also use computer 101 to obtain email messages in the form of documents 112 from an email server 111. In a similar manner, the user may use computer 101 to obtain documents 114 that contain real-time social media, such as Tweets supplied by server 113 executing Twitter software. Also, the user may use computer 101 to obtain documents 116 that contain social networking profiles supplied by server 115 (e.g. at the website Facebook). Additionally, the user may use computer 101 to obtain from server 117, one or more documents 118 containing professional networking profiles (e.g. on Linked-in).
Server computer 120 typically includes a web framework 121 that interacts with a web browser 102 in client computer 101, to supply thereto one or more web pages, e.g. in Hyper Text Markup Language (HTML) and/or word groups. In some embodiments, several of the web pages supplied by web framework 121 contain one or more identifiers of documents that have been determined to be relevant to the user by software program instructions 130 (“relevance engine”) stored in a memory 1106 of server computer 120 and executed by a processor 1105 illustrated in
Depending on the embodiment, instead of or in addition to supplying identifiers of documents, relevance engine 130 supplies identifiers of one or more word groups to web framework 121, as illustrated in
In
In some embodiments (“A”), an interest document 127 (or a document identifier) is uploaded by a user via client computer 101, and subscription documents 128 are retrieved by document crawler 122 based on instructions thereto as described below. In other embodiments (“B”), a user submits as an interest document, a description of a job for which a person needs to be hired. In still other embodiments (“C”), a faculty member at a university submits as an interest document, a paper written by the faculty member.
Although in some embodiments, interest document 127 is explicitly identified by a user (e.g. in input box 161 as described in the next paragraph, in reference to
In one embodiment ("D"), a web page that is currently displayed in web browser 102 is automatically identified by one of computers 101, 120, 104 as an interest document for the user of web browser 102. In another illustrative embodiment ("E"), one or more documents identified by corresponding hyperlinks in the web page currently displayed in web browser 102 are automatically identified by one of computers 101, 120, 104 as the interest document(s) for the user of web browser 102. In still another embodiment ("F"), one or more documents identified as search results (e.g. by corresponding hyperlinks, titles and/or snippets) in the web page currently displayed in web browser 102 are automatically identified by one of computers 101, 120, 104 as the interest document(s) for the user of web browser 102. In yet another illustrative embodiment ("G"), a running transcription of an audio or video that is currently being played by computer 101 (e.g. in web browser 102) is automatically identified by one of computers 101, 120, 104 as an interest document for the user of web browser 102.
In one more illustrative embodiment ("H"), one or more books or documents described (e.g. by title and author) in the web page currently displayed in web browser 102 are automatically identified by computer 101 as the interest document(s) for the user of web browser 102. In another embodiment ("I"), a text transcript of an audio recording or audio stream or a video recording or a video stream is used as an interest document. Such a text transcript is generated in some embodiments (e.g. "G" and "I" described above) by speech-to-text conversion performed by a suitably programmed computer. Note that in the embodiments ("A"-"I") described above, the interest documents may or may not be identified using a uniform resource locator (URL), depending on the implementation. For example, although a URL is used in some implementations, in certain implementations a proprietary identifier is used to automatically identify the interest document.
The following TABLE 1 summarizes inputs in various embodiments.
ILLUSTRATIVE USE CASES | INTEREST DOCUMENT(s) | SUBSCRIPTION CORPUS | REFERENCE CORPUS
--- | --- | --- | ---
Embodiment "A" | User-supplied document (e.g. 161, 162 in FIG. 1E and 127 in FIG. 1A) | User-identified corpus of documents (such as emails, RSS feeds, e.g. 165 in FIG. 1G to FIG. 1O) | World wide web (e.g. 141 in FIG. 1A)
Embodiment "B" (head hunter) | Job Description (e.g. 161, 162 in FIG. 1E and 127 in FIG. 1A) | Resumes (e.g. 117, 118 in FIG. 1A) | Resumes (e.g. 117, 118 in FIG. 1A)
Embodiment "C" (university researcher) | Research paper (e.g. 161, 162 in FIG. 1E and 107, 108 in FIG. 1A) | Journal articles in archive ARXIV (e.g. 128 in FIG. 1A) | World wide web (e.g. 141 in FIG. 1A)
Embodiment "D" (web surfing) | Web page currently displayed by web browser (e.g. 103 in FIG. 8B, FIG. 9E) | World wide web (e.g. 140 in FIG. 8B) | World wide web (e.g. 140 in FIG. 8B)
Embodiment "E" | Documents identified by hyperlinks currently displayed in a webpage by web browser | World wide web (e.g. 140 in FIG. 8B) | World wide web (e.g. 140 in FIG. 8B)
Embodiment "F" | Documents identified as search results currently displayed by web browser in response to a user's search (see FIG. 9H) | World wide web (e.g. 140 in FIG. 8B) | World wide web (e.g. 140 in FIG. 8B)
Embodiment "G" | Running transcription of audio/video currently being played in web browser | Set of advertisements (e.g. see FIG. 9G and 840, 841 in FIG. 8B) | World wide web (e.g. 140 in FIG. 8B)
Embodiment "H" (On-line Retailer) | Book description currently displayed by web browser (e.g. 103 in FIG. 8A, FIG. 9C, FIG. 9D) | Books available for sale (by an on-line retailer such as Amazon, e.g. 108 in FIG. 8A, FIG. 9C, FIG. 9D) | World wide web (e.g. 141 in FIG. 8A)
Embodiment "I" | Existing transcript of pre-recorded audio/video (e.g. 103 in FIG. 8B) | Transcripts of other audios/videos available for viewing | World wide web (e.g. 140 in FIG. 8B)
Embodiment "K" (Advertisement Server) | Web page currently displayed by web browser (e.g. 103 in FIG. 8B, see FIG. 9F) | Set of advertisements (e.g. 840 in FIG. 8B, see FIG. 9G) | Set of advertisements (e.g. 840 in FIG. 8B, see FIG. 9G)
Embodiment "L" (Advertisement Server) | Web page currently displayed by web browser (e.g. 103 in FIG. 8B, see FIG. 9F) | Set of advertisements (e.g. 840 in FIG. 8B, see FIG. 9G) | World wide web (e.g. 140 in FIG. 8B)
Medical Research Server | User-supplied medical article (see 161, 162 in FIG. 1E) | Journal articles in a subspeciality of Medicine | Journal articles in all of Medicine
Social Networking | User profile on social networking website | User Profiles (e.g. 115, 116 in FIG. 1A) | User Profiles (e.g. 115, 116 in FIG. 1A)
On-line Retailer | Book description currently displayed by web browser (e.g. 103 in FIG. 8A, FIG. 9C, FIG. 9D) | Books available for sale (by an on-line retailer such as Amazon, e.g. 108 in FIG. 8A, FIG. 9C, FIG. 9D) | Books available for sale (by an on-line retailer such as Amazon, e.g. 108 in FIG. 8A, FIG. 9C, FIG. 9D)
Search results used and/or displayed in one or more of the embodiments described herein are obtained automatically by any of computers 101, 120 and 104 using a search service 140 available on the Internet. Accordingly, for a given word group, the search results and the number of occurrences on the web for the given word group are obtained in one embodiment by using the search service Yahoo-BOSS API (which is an example of search service 140) as described in reference to "resultset_web", "totalhits" and "deephits" in the Yahoo-BOSS API Guide available over the Internet at the link obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%developer=yahoo=com%search%boss%boss_guide%ch02s02=html. An example of output received by computer 120 from the search service Yahoo-BOSS, shown below, is retrieved from the Internet at the link obtained by replacing "%" with "/" and replacing "=" with "." in the following string:
http:%%developer=yahoo=com%search%boss%boss_guide%Web_Search=html
<ysearchresponse responsecode="200">
  <nextpage><![CDATA[/ysearch/web/v1/foo?appid={yourBOSSappid}&format=xml&start=10]]></nextpage>
  <resultset_web count="10" start="0" totalhits="29440998" deephits="881000000">
    <result>
      <abstract><![CDATA[World <b>soccer</b> coverage from ESPN, including Premiership, Serie A, La Liga, and Major League <b>Soccer</b>. Get news headlines, live scores, stats, and tournament information.]]></abstract>
      <date>2008/06/08</date>
      <dispurl><![CDATA[www.<b>soccernet.com</b>]]></dispurl>
      <clickurl>http://us.lrd.yahoo.com/_ylc=X3oDMTFkNXVldGJyBGFwcGlkA2Jvc3NkZW1vBHBvcwMwBHNlcnZpY2UDWVNIYXJjaARzcmNwdmlkAw--/SIG=10u3e8260/**http%3A//www.soccernet.com/</clickurl>
      <size>94650</size>
      <title>ESPN Soccernet</title>
      <url>http://www.soccernet.com/</url>
    </result>
  </resultset_web>
</ysearchresponse>
Although an illustrative embodiment uses the Yahoo-BOSS service as described above, other embodiments may use other such services, e.g. MICROSOFT, GOOGLE, ONERIOT.
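By way of a non-limiting illustration, the hit counts in such a response may be read with a few lines of Python using only the standard library; the element and attribute names below ("resultset_web", "totalhits", "deephits", "url") are taken from the sample response above, while the function name is merely illustrative.

# Illustrative sketch: extract hit counts and result URLs from a BOSS-style
# XML response such as the sample shown above.
import xml.etree.ElementTree as ET

def parse_hit_counts(xml_text):
    """Return (totalhits, deephits, list of result URLs) from a search response."""
    root = ET.fromstring(xml_text)
    resultset = root.find("resultset_web")
    if resultset is None:
        return 0, 0, []
    totalhits = int(resultset.get("totalhits", "0"))
    deephits = int(resultset.get("deephits", "0"))
    urls = [result.findtext("url") for result in resultset.findall("result")]
    return totalhits, deephits, urls

Either totalhits or deephits may then serve as the reference corpus count of a word group, as noted above.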
As illustrated in
Depending on the operation being performed by server computer 120, the documents being fetched by document crawler 122 may be either interest documents for use by relevance engine 130 in automatically identifying word groups, or subscription documents 128 for use by relevance engine 130 in determining relevance thereof, based on word groups automatically identified from the interest document(s). An interest document 127 used by relevance engine 130 can be, for example, a research paper, an article that the user wrote, or any document that the user thinks is highly relevant. As noted above, an interest document 127 (
In one embodiment, subscription documents are identified in one or more streams that are preselected by a user, as illustrated in
In yet another embodiment, the subscription documents are automatically identified to be any subset of the documents available on the world wide web. In a first example, documents available at a predetermined website, such as the Wall Street Journal are identified as the subscription documents. In a second example, documents available in a proprietary database, such as all books available for sale on Amazon are identified as the subscription documents. In a third example, documents available at a predetermined website, such as the ARXIV are identified as the subscription documents. In a fourth example, a set of documents containing advertisements are identified as the subscription documents in another embodiment (“J”).
Depending on the embodiment, subscription documents are either identified via input box 161 of
Although individual screens are not further illustrated in the attached figures, as would be readily apparent from this current detailed description of the invention, subscription documents in several embodiments may include blogs or blog posts, incoming email or email folders, RSS feeds from websites or news sites, web search results bounded in a specified time (past hour, past day, past week, past month, past year, and all results), documents (links) or profiles appearing in social networks (Facebook, Linked-in) and real-time social media (Twitter). In another embodiment, the subscription corpus includes search results (or RSS feeds) from queries using automatically selected word groups from a list of interest documents, independent of the user selection.
In addition to storing the user-supplied URL, web framework 121 invokes document crawler 122 with the user-supplied document identifier(s) 162. In response to receipt of a user-uploaded document, web framework 121 passes the received document to document crawler 122, which in turn directly invokes document to text & hyperlink converter 123. As noted above, in response, document crawler 122 uses the document identifier(s) 162 to retrieve the identified document(s). Thereafter, document crawler 122 supplies the retrieved documents to server computer 120 executing software 123 ("document to text & hyperlink converter" or "document converter"), described next.
Document to text & hyperlink converter 123 generates and stores in computer memory, text and hyperlinks from original documents 108, 112, 114, 116 and 118, which may originally be in one of several document formats such as HTML, PDF, Microsoft WORD, PostScript, or plain text. In several embodiments, all documents are converted into an intermediate format, on which multiple tools of server computer 120 operate. In one illustrative embodiment, the intermediate format used by server computer 120 is HTML, and all documents not in HTML are first converted to HTML. For example, PDF documents are converted by server computer 120 to HTML using a tool called pdftohtml invoked with options '-q', '-i', '-nodrm', '-noframes', '-stdout'. The tool pdftohtml is available on the Internet, at the website address obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%poppler=freedesktop=org%. For information on the just-described pdftohtml tool, see the documentation "pdftohtml version 0.10.7", incorporated by reference herein in its entirety.
In another illustrative embodiment, the intermediate format is XML. Some embodiments avoid misleadingly high counts for word groups contained in page headers of a single document, by server computer 120 checking if the number of times an identical line (containing a specific word group) appears in a document is more than a predetermined number, such as 4 or 5 and if so discarding redundant lines (found to be identical) from the text generated as output by document to text & hyperlink converter 123.
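By way of a non-limiting illustration, the redundant-line check just described may be sketched in Python as follows; the threshold of 4 repetitions follows the example above, and whether a single copy of a repeated line is retained or all copies are dropped is an implementation choice.

# Illustrative sketch: discard redundant copies of lines (e.g. page headers)
# that appear identically more than a predetermined number of times.
from collections import Counter

def drop_redundant_lines(lines, max_repeats=4):
    counts = Counter(lines)
    kept, seen = [], set()
    for line in lines:
        if counts[line] > max_repeats:
            if line in seen:
                continue          # discard the redundant copy
            seen.add(line)        # keep a single copy of the repeated line
        kept.append(line)
    return kept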
Once a document is in the intermediate (e.g. HTML or XML) format, the text and hyperlinks are extracted in some embodiments by server computer 120 using an HTML and/or XML parser to locate and extract text from elements of the document tree (Document Object Model). Some elements containing non-text are identified by server computer 120 based on the name of the surrounding HTML and XML tags, e.g. by use of another tool called BeautifulSoup available at the website address obtained from the following string in the above-described manner: http:%%www=crummy=com%software%BeautifulSoup%documentation=html. Documentation on the just-described tool from this website is incorporated by reference herein in its entirety.
Several HTML/XML tags in a document are ignored by server computer 120 in some embodiments of the invention. In several embodiments, server computer 120 is programmed to review attributes of the tag <div> in an HTML document to check if any one of the following character strings (regardless of whether upper case or lower case) is present as a value of any <div> attribute, and if so the tagged section is ignored: hide, hidden, poll, comment, header, footer, extra, noscript, script, style, option, col4wide margin-left, reallywide clear-left, masterVideoCenter hidden, printSummary.
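A non-limiting sketch of this filtering, using the BeautifulSoup parser mentioned above, is shown below; the marker strings are the ones listed in the preceding paragraph, and the function name is illustrative only.

# Illustrative sketch: remove <div> sections whose attribute values contain
# any of the listed marker strings (case-insensitive).
from bs4 import BeautifulSoup

IGNORED_MARKERS = ("hide", "hidden", "poll", "comment", "header", "footer",
                   "extra", "noscript", "script", "style", "option",
                   "col4wide margin-left", "reallywide clear-left",
                   "mastervideocenter hidden", "printsummary")

def strip_ignored_divs(html):
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div"):
        values = []
        for value in div.attrs.values():
            # class-like attributes are returned as lists; normalize to strings
            values.append(" ".join(value) if isinstance(value, list) else str(value))
        if any(marker in v.lower() for v in values for marker in IGNORED_MARKERS):
            div.decompose()       # ignore the entire tagged section
    return str(soup)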
In addition to specific tag names, a text-to-tag ratio (TTR) is used in some embodiments of document to text & hyperlink converter 123 to extract, from within an HTML document, the text of its body (discarding its advertisements and its headers), etc. Specifically, several of the just-described embodiments of server computer 120 remove script and remark tags as well as empty lines from the HTML document and then compute a ratio of the count of non-HTML-tag ASCII characters in a line of text to the count of HTML tags in that line, unless the count of HTML tags is zero, in which case the TTR is simply set to the length of the line.
In several such embodiments, lines in the HTML document with a TTR equal to or above two standard deviations are automatically determined by converter 123 to be content (e.g. the HTML web page's body) and those lines whose TTR is less than two standard deviations are determined to be non-content (e.g. the HTML web page's header and advertisements). For more detail on how converter 123 implements the calculation and use of TTR, see the following document that is incorporated by reference herein in its entirety: "Text Extraction from the Web via Text-to-Tag Ratio" by Tim Weninger and William H. Hsu, published at pp. 23-28, 2008, in the 19th International Conference on Database and Expert Systems Application and available on the Internet at the website address obtained from the following string in the above-described manner: http:%%www=uni-weimar=de%medien%webis%research%workshopseries%tir-08%proceedings%18_paper—652=pdf. In some embodiments, the text inside HTML/XML tags used primarily for display formatting is retained and the tag itself is ignored in the calculation of TTR. Examples of HTML/XML tags whose text is retained but whose tags are ignored are 'br', 'sub', 'sup', 'pre', 'plaintext', 'blockquote', 'q', 'cite', 'span'. In such embodiments, other tags such as 'h1', 'h2', etc., and 'p', 'hr', 'o:p' are replaced with a period '.' to indicate the end of a sentence.
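A non-limiting Python sketch of the TTR heuristic described in the preceding two paragraphs follows; the two-standard-deviation threshold is computed here over the per-line TTR values, and the exact baseline (and the handling of retained formatting tags) is an implementation detail.

# Illustrative sketch: compute a per-line text-to-tag ratio (TTR) and classify
# lines as content or non-content using a two-standard-deviation threshold.
import re
import statistics

TAG_RE = re.compile(r"<[^>]+>")

def text_to_tag_ratio(line):
    """Ratio of non-tag ASCII characters to HTML tags; if a line has no tags,
    the TTR is simply the length of the line."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line)
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return ascii_count / len(tags) if tags else len(line)

def split_content(lines):
    ttrs = [text_to_tag_ratio(line) for line in lines]
    threshold = 2 * statistics.pstdev(ttrs) if len(ttrs) > 1 else 0
    content = [l for l, t in zip(lines, ttrs) if t >= threshold]
    noncontent = [l for l, t in zip(lines, ttrs) if t < threshold]
    return content, noncontent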
As noted above, server computer 120 is programmed to extract text and hyperlinks in each subscription document and each interest document by executing converter 123. Certain embodiments use search service 140 to access a subscription corpus (which equals the world wide web) and in these embodiments indexing may not be necessary. In other embodiments that do not use search service 140 to access a subscription corpus, the extracted text is thereafter stored in an index (“inverted index”) of words, by using software 124 (“text indexer”), such as Sphinx available on the Internet, at the website address obtained by replacing “%” with “/” and replacing “=” with “.” in the following string: http:%%www=sphinxsearch=com. In certain embodiments interest document(s) and subscription document(s) are both indexed, with the indexed interest document(s) being used for snippet generation and the indexed subscription documents being used to find matches to word groups (identified from interest documents). In alternative embodiments, only the subscription documents are indexed for use in finding matches to word groups. In the alternative embodiments, the snippets are generated by matching the word groups against the text of the interest document stored in the database, using an SQL query to do the matching.
Note that the word "occur" is used herein whenever a word group is identically present in a document, whereas the word "match" is used more loosely herein, to mean presence of one or more words in a word group or a variant thereof based on stemming, or even another word related thereto, e.g. depending on a mode as follows. In some embodiments, by server computer 120 using an inverted index with position information, one or more of the following modes of matching are supported: matching all words in the word group (the default mode), matching any word in the word group, matching the word group identically, matching a word group as a boolean expression, proximity matching, and matching a query as an expression in the Sphinx query language. Specifically, to implement matching of multi-word phrases, an index 125 is implemented (and stored) in server computer 120 of some embodiments, and configured to include position information, such as an offset of each word in a document from the beginning of that document.
Certain embodiments of server computer 120 implement only matching of a word group identically in the subscription document ("phrase matching") to preserve the meaning of the word group and therefore improve relevance of the matching documents. Other embodiments of server computer 120 use proximity matching, wherein proximity distance is specified in words, adjusted for word count, and applied to all words within quotes. For instance, "lion dog tiger" within a proximity distance of 5 means that there must be a span of less than 8 words which contains all 3 words. Still other embodiments of server computer 120 match stemmed versions of the word group; stemming can be used with either phrase matching or proximity matching.
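A simplified, non-limiting sketch of a positional inverted index supporting phrase matching and proximity matching as described above is given below; it is not the Sphinx API, and all names are illustrative only.

# Illustrative sketch: a positional inverted index with phrase matching and
# proximity matching (proximity distance adjusted for word count, as above).
from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: [word offsets]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word][doc_id].append(offset)
    return index

def phrase_match(index, word_group):
    """Doc ids in which the word group occurs identically."""
    words = word_group.lower().split()
    if not words or words[0] not in index:
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words[1:], 1)):
                hits.add(doc_id)
                break
    return hits

def proximity_match(index, word_group, distance):
    """Doc ids in which all words appear within less than a
    (distance + word count)-word span, e.g. distance 5 with 3 words -> 8-word span."""
    words = word_group.lower().split()
    span = distance + len(words)
    doc_sets = [set(index.get(w, {})) for w in words]
    if not doc_sets or not all(doc_sets):
        return set()
    hits = set()
    for doc_id in set.intersection(*doc_sets):
        events = sorted((p, i) for i, w in enumerate(words) for p in index[w][doc_id])
        for j, (start, _) in enumerate(events):
            in_window = {i for p, i in events[j:] if p - start + 1 < span}
            if len(in_window) == len(words):
                hits.add(doc_id)
                break
    return hits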
In one illustrative embodiment, text indexer 124 is configured to store each subscription document A with a field (“in-link field”) that contains a list of identifiers of subscription documents that hyperlink to that subscription document A (“in-link identifiers”) as shown in
In addition to text 151, converter 123 also supplies to indexer 124 zero or more identifiers, such as identifiers 152 and 153 of two subscription documents that internally contain hyperlinks to the text 151. As noted above, these and other such hyperlinks are identified by converter 123 during document conversion (e.g. from PDF to HTML). In
The just-described configuration of indexer 124, by using a document to be indexed as well as identifiers of incoming hyperlinks not only enables a match API 126 to match a word group to subscription document A, but also enables matching that word group to each subscription document B to which subscription document A points. Specifically, a relevance engine 130 in server computer 120 invokes match API 126 with the input illustrated in
For example, in
Also, the same word group “long dwell detection” (in field 154) is to be further checked for match with subscription documents having integer identifiers 25, 42, 41, 34 and 31 (in list field 196 of
In one embodiment, the server computer 120 includes relevance engine 130, document crawler 122, and document to text & hyperlink converter 123, which are implemented using both standard and supplemental libraries written in a programming language called Python, which is available on the Internet, at the website address obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%www=python=org, and described in the book "Learning Python" by Mark Lutz, published by O'Reilly Media (2009).
The list of supplemental libraries that are used in the just-described embodiment includes a library to perform natural language processing functions such as sentence tokenization, word tokenization, part-of-speech classification, stop word filtering, and stemming, called NLTK, which is available on the Internet, at the website address obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%www=nltk=org and described in the book "Natural Language Processing with Python—Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper, published by O'Reilly Media, 2009. The list of supplemental libraries also includes a library to perform concurrent network downloads and access to email servers, used by the document crawler, called Twisted, which is available on the Internet, at the website address obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%www=twistedmatrix=com and described in the book "Twisted Network Programming Essentials" by Abe Fettig, published by O'Reilly Media (2005).
In one illustrative embodiment, web framework 121 is implemented in server computer 120 by software called Pylons, which is available on the Internet, at the website address obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%www=pylonshq=com, as well as described in "The Definitive Guide to Pylons" by James Gardner, published by Apress (2008). The server computer 120 includes web server software 1908 (
The web framework 121 used in server computer 120 of some embodiments consists of software libraries that operate in conjunction with the web server software 1908 to enable user-interest display logic 380 (
In the illustrative embodiment, database 150 is implemented in server computer 120 by a relational database management software system called MySQL, which is available on the Internet, at the website address obtained as described above from the following string: http:%%www=mysql=com.
In addition to using standard SQL to access the relational database, in the illustrative embodiment, objects stored in the database 150 are accessed in software using Object Relational Mapper software called SQLAlchemy, which is available on the Internet, at the website address obtained as described above from the following string: http:%%www=sqlalchemy=org, and a declarative layer on top of SQLAlchemy called Elixir, which is also available on the Internet, at the website address obtained as described above from the following string: http:%%elixir=ematia=de. Elixir and SQLAlchemy are used together to generate the relational database schema and access methods to store, retrieve, and modify objects in the relational database.
Operation of relevance engine 130 of many embodiments is now described starting with receipt of an interest document 200 illustrated in
TABLE 2

WORD GROUP | REFERENCE CORPUS COUNT | INTEREST CORPUS COUNT | WEIGHT
--- | --- | --- | ---
budget for passive | 25 | 1 | 0.04
portal at border | 25 | 1 | 0.04
passive detection of heu | 31 | 1 | 0.032258
uniform detection coverage | 63 | 1 | 0.015873
plutonium show | 100 | 1 | 0.01
detection technique need | 100 | 1 | 0.01
x-ray interrogation | 158 | 1 | 0.006329
detection of heu | 158 | 1 | 0.006329
available detection technique | 158 | 1 | 0.006329
interior nest | 251 | 1 | 0.003984
light road vehicle | 251 | 1 | 0.003984
border at border | 316 | 1 | 0.003165
mode terrorist | 398 | 1 | 0.002513
sharp attenuation | 630 | 1 | 0.001587
vehicle from light | 794 | 1 | 0.001259
garage door wide open | 1000 | 1 | 0.001
nuclear detection system | 1584 | 1 | 0.000631
uniform detection | 1995 | 1 | 0.000501
active neutron | 1995 | 1 | 0.000501
terrorist vehicle | 3162 | 1 | 0.000316
border terrorist | 5011 | 1 | 0.0002
number minutes | 6309 | 1 | 0.000159
available detection | 6309 | 1 | 0.000159
detection coverage | 7943 | 1 | 0.000126
passive detection | 12589 | 1 | 7.94E-05
border container | 15848 | 1 | 6.31E-05
handheld detector | 19952 | 1 | 5.01E-05
technique need | 39810 | 1 | 2.51E-05
nuclear terrorist | 39810 | 1 | 2.51E-05
nuclear detection | 39810 | 1 | 2.51E-05
light road | 50118 | 1 | 2E-05
worldwide transportation | 125892 | 1 | 7.94E-06
link budget | 398107 | 1 | 2.51E-06
transportation mode | 630957 | 1 | 1.58E-06
detection technique | 630957 | 1 | 1.58E-06
door wide open | 794328 | 1 | 1.26E-06
time detection | 2511886 | 1 | 3.98E-07
type of vehicle | 3162277 | 1 | 3.16E-07
national border | 3162277 | 1 | 3.16E-07
road vehicle | 5011872 | 1 | 2E-07
sufficient number | 5011872 | 1 | 2E-07
heu | 25118864 | 4 | 1.59E-07
air passenger | 6309573 | 1 | 1.58E-07
gamma ray | 7943282 | 1 | 1.26E-07
private jet | 7943282 | 1 | 1.26E-07
detection system | 7943282 | 1 | 1.26E-07
neutron | 19952623 | 2 | 1E-07
plutonium | 10000000 | 1 | 1E-07
attenuation | 15848931 | 1 | 6.31E-08
garage door | 19952623 | 1 | 5.01E-08
detector | 79432823 | 3 | 3.78E-08
suffice | 31622776 | 1 | 3.16E-08
terrorist | 1.58E+08 | 4 | 2.52E-08
wide open | 39810717 | 1 | 2.51E-08
tanker | 39810717 | 1 | 2.51E-08
uranium | 39810717 | 1 | 2.51E-08
interrogation | 50118723 | 1 | 2E-08
front door | 50118723 | 1 | 2E-08
deter | 50118723 | 1 | 2E-08
enrich | 50118723 | 1 | 2E-08
detection | 2.51E+08 | 5 | 1.99E-08
detect | 1.58E+08 | 2 | 1.26E-08
x-ray | 79432823 | 1 | 1.26E-08
pu | 2E+08 | 2 | 1E-08
gamma | 1E+08 | 1 | 1E-08
livestock | 1E+08 | 1 | 1E-08
passive | 1.26E+08 | 1 | 7.94E-09
consist | 1.26E+08 | 1 | 7.94E-09
nest | 1.26E+08 | 1 | 7.94E-09
handheld | 1.58E+08 | 1 | 6.31E-09
physically | 1.58E+08 | 1 | 6.31E-09
shield | 1.58E+08 | 1 | 6.31E-09
uniform | 2E+08 | 1 | 5.01E-09
exclusively | 2E+08 | 1 | 5.01E-09
passenger | 2E+08 | 1 | 5.01E-09
container | 2E+08 | 1 | 5.01E-09
sufficient | 2.51E+08 | 1 | 3.98E-09
nationwide | 3.16E+08 | 1 | 3.16E-09
As seen from the last several rows of TABLE-2 above, the word groups of many embodiments include groups of single words, such as "nationwide", "sufficient", "container", "passenger", "exclusively", etc., as well as groups of multiple words. In several embodiments, the word groups of an interest document are identified by relevance engine 130 automatically classifying each word in the interest document into a part of speech in English grammar, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction and interjection. Then one or more patterns of positions of parts of speech relative to one another are used by relevance engine 130 to automatically select multiple groups of words.
The current inventors believe that identification of word groups based on identification of a part of speech (POS) of each word in an interest document is a valuable aspect of many embodiments of relevance engine 130 for the following two reasons. Firstly, the current inventors have found that detection of word groups that relies on interest corpus statistics is unreliable, unpredictable, and produces irrelevant phrases and unrecognizable phrases. Secondly, the current inventors have further found that results of identifying a part of speech (POS) can be tested at the sentence level independent of the rest of the document or corpus, and better correlates with intuitive and natural understanding of the subject matter being discussed in the interest corpus.
In many embodiments, patterns that are used by relevance engine 130 include all word groups in which a noun is preceded by at least one of another noun, an adjective and a preposition. Such word groups (“noun phrases”) improve the relevance of documents that are identified by relevance engine 130 because nouns describe subject matter and concepts, in contrast to subject-verb-object (SVO) phrases which describe actions and therefore produce irrelevant phrases and unrecognizable phrases. Hence this is a third reason for the current inventors' belief that identification of noun phrases based on user's interest document(s), and use of such identified noun phrases to select documents are valuable aspects of certain embodiments of relevance engine 130.
Note that neither identification of POS nor use of noun phrases is required in several alternative embodiments of relevance engine 130. Specifically, several alternative embodiments identify word groups based on word count statistics within a document or set of documents, to identify words appearing together more often than would be expected given random, independent occurrences of the words. Some such alternative embodiments of relevance engine 130 use a likelihood ratio and hypothesis testing (t-test, chi-square test) as described in pages 162-172 in Chapter 5 of the book entitled "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schutze, published by the Massachusetts Institute of Technology, 1999 (sixth printing with corrections, 2003). Note that Chapter 5 of the just-described book is incorporated by reference herein in its entirety.
In many embodiments, relevance engine 130 automatically selects certain groups of words to be used in identifying subscription documents, based on how infrequently the word groups occur or match one or more corpus(es). For example in some embodiments, relevance engine 130 calculates a weight of each identified group of words that has been extracted from one or more interest document(s) as described above. Depending on the embodiment, the weight of a word group is a function of either one of or both of: an interest count of a number of times the group of words matches the interest documents and a reference count of the number of times the corresponding group of words matches a corpus of reference documents.
In a first example, an embodiment of relevance engine 130 ranks word groups 206, 207, 208 and 209 (
In a second example, another embodiment ranks word groups 206, 207, 208 and 209 by sorting them in ascending order of an inverse of the number of times each word group appears in an interest corpus (including one or more interest documents identified by the user). In
In a third example, yet another embodiment ranks word groups 206, 207, 208 and 209 by sorting them in ascending order of a weight based on the ratio of the reference corpus count to the interest corpus count. The ratios for word groups 206, 207, 208 and 209 are 25/3, 27/2, 33/1, and 38/1 and these word groups are again illustrated in
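For illustration only, the three ranking alternatives just described may be sketched as a single Python helper; the dictionaries of counts and the mode names are assumptions made for the example.

# Illustrative sketch: rank word groups in ascending order of weight, where the
# weight is the reference corpus count, the inverse of the interest corpus
# count, or their ratio, per the three examples above.
def rank_word_groups(groups, interest_counts, reference_counts, mode="ratio"):
    def weight(group):
        if mode == "reference":
            return reference_counts[group]
        if mode == "inverse_interest":
            return 1.0 / interest_counts[group]
        return reference_counts[group] / interest_counts[group]
    return sorted(groups, key=weight)

# e.g. ratios of 25/3, 27/2, 33/1 and 38/1 reproduce the ordering of the third example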
One or more highest ranked word groups resulting from the above described ranking are used in accordance with the invention, either directly or indirectly depending on the embodiment, in computerized filtering or selection of documents that contain content of various types, such as results of searching the world wide web, social networking profiles (e.g. on Facebook), professional networking profiles (e.g. on Linked-in), blog postings, job sites, classified ads, and other user generated content like reviews. In some embodiments the documents being filtered or selected by server computer 120 occur in streams that change over time, such as news sources, RSS feeds, incoming email, and real-time social media (e.g. Twitter).
In certain embodiments (“first embodiments”), the automatically ranked word groups are supplied by server computer 120 via web browser 102 to the user who then manually submits them to a conventional service, such as Google Alerts. In other embodiments (“second embodiments”), automatically ranked word groups are directly used automatically by server computer 120 that is further programmed in accordance with the invention, to eliminate the need for a user to manually generate and supply search terms for personalized filtering, as discussed next.
In some embodiments, relevance engine 130 automatically repeatedly uses the highest ranked group of words with match API 126 to identify from among a set of identifiers of subscription documents 220 (
In the illustration of
In some embodiments, relevance engine 130 automatically stores in the memory 1106 of server computer 120, information about the ranked document identifiers to enable web framework 121 to generate the display illustrated in
Relevance engine 130 further stores in memory 1106 a score 244 (e.g. of value −14.0833) showing that this document has the highest score. Relevance engine 130 also stores in memory 1106, a group of words 246 (e.g. "nuclear dhs's decision") that caused the document identifier 242 to be displayed, and corresponding thereto a flag 245 (shown as a check mark in a box control in
Although only recordation by relevance engine 130 of one group of words 246 has been discussed above, relevance engine 130 actually records several such word groups in memory 1106, e.g. in
In some embodiments of the invention, relevance engine 130 additionally stores in memory 1106 one or more snippets of text in a subscription document that surround one or more word groups identified therewith. For example,
In some embodiments, relevance engine 130 is implemented by five functional blocks 310, 320, 350, 380 and 390, as illustrated in
Several embodiments of relevance engine 130 include a word group extraction engine 310 (
Thereafter, each word in each sentence is tokenized by word tokenization block 312 (
After word tokenization in block 312, the entire sentence is passed to block 316, wherein each word of the sentence is in turn tagged with its part of speech (noun, verb, adjective, etc.), as described in, for example, pages 341-380 in Chapter 10 of the book entitled "Foundations of Statistical Natural Language Processing" by Christopher Manning and Hinrich Schutze, 1999. Pages 341-380 of the just-described book are incorporated by reference herein in their entirety. For additional detail see the article entitled "Taggers in NLTK" available on the Internet, at the website address obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%docs=huihoo=com%nltk%0=9=5%guides%tag=html. The just-described article is incorporated by reference herein in its entirety. In one embodiment, POS classification block 316 uses a Brill tagger. In another embodiment, POS classification block 316 uses a Markov model tagger.
Each sequence of words that matches a predetermined pattern 315 (such as adjective-noun or noun-noun or adjective-adjective-noun) is selected by multi-word group detection block 317 as a noun phrase, e.g. as described in pages 153-157 of the book entitled "Foundations of Statistical Natural Language Processing" by Christopher Manning and Hinrich Schutze, 1999, incorporated by reference herein in its entirety. There are several alternative embodiments for selecting phrases, based on word co-occurrence statistics and word proximity, e.g. as described in Chapter 5 of the book entitled "Foundations of Statistical Natural Language Processing" by Christopher Manning and Hinrich Schutze, 1999, which is incorporated by reference herein in its entirety. See also the article by Church and Hanks entitled "Word Association Norms, Mutual Information, and Lexicography" at the link obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%www=aclweb=org%anthology-new%J%J90%J90-1003=pdf, which is also incorporated by reference herein in its entirety. Also see the article entitled "Using statistics in lexical analysis" available on the Internet at the link obtained by replacing "%" with "/" and replacing "&" with "." in the following string: http:%%citeseerx&ist&psu&edu%viewdoc%summary?doi=10&1&1&136&6572, also incorporated by reference herein in its entirety.
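A non-limiting sketch of word tokenization, POS classification and multi-word group detection (blocks 312, 316 and 317) using the NLTK library mentioned above is shown here; the tag patterns correspond to the adjective-noun, noun-noun and adjective-adjective-noun patterns named above, and NLTK's tokenizer and tagger models must be installed separately.

# Illustrative sketch: tag each word with its part of speech and select word
# sequences matching patterns such as adjective-noun, noun-noun and
# adjective-adjective-noun.
import nltk

PATTERNS = [("JJ", "NN"), ("NN", "NN"), ("JJ", "JJ", "NN")]

def noun_phrase_groups(text):
    groups = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tags = [tag[:2] for _, tag in tagged]   # NNS, NNP, JJR etc. collapse to NN, JJ
        words = [word.lower() for word, _ in tagged]
        for pattern in PATTERNS:
            n = len(pattern)
            for i in range(len(tags) - n + 1):
                if tuple(tags[i:i + n]) == pattern:
                    groups.append(" ".join(words[i:i + n]))
    return groups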
In another embodiment, word groups are selected by block 317 based on matching to specific content headers such as fields in an email (from, to, subject, etc.) or meta-data inside the document (such as HTML/XML tags). In some embodiments, a word group relevance evaluator 320 in relevance engine 130 is implemented in two blocks 323 and 328, wherein block 323 is used to perform an operation on demand, e.g. in response to user upload of an interest document, and block 328 is used to perform an operation periodically, e.g. every hour. Block 323 includes a block 321 that obtains a count ("reference corpus count") of each word group in a reference corpus, such as the world wide web.
In one illustrative embodiment, block 321 stores the reference corpus count of each looked-up word group in database 150, for re-use in future. Hence block 321 first checks whether the word group is found in the database and, if it is not, issues a query to a count server 141 (in a search service 140 available on the Internet as noted above), and the result is stored in the database. Depending on the implementation, either the "totalhits" or the "deephits" returned by Yahoo-BOSS (used as search service 140) may be used as the reference corpus count. Some embodiments of relevance engine 130 directly use the reference corpus count as a weight while alternative embodiments additionally use an interest corpus count. In the alternative embodiments, block 323 additionally includes a block 322 that obtains a count of each word group in the interest corpus, by looking up index 125, and the looked-up count is stored in the database 150.
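The cache-then-query behavior of block 321 may be sketched, for illustration only, as follows; query_count_server stands in for the call to count server 141 (e.g. the totalhits lookup discussed earlier), and the table layout is an assumption made for the example.

# Illustrative sketch: look up the reference corpus count in the database
# first, and only on a miss query the count server and store the result.
import sqlite3

def get_reference_count(conn, word_group, query_count_server):
    conn.execute("CREATE TABLE IF NOT EXISTS ref_counts "
                 "(word_group TEXT PRIMARY KEY, count INTEGER)")
    row = conn.execute("SELECT count FROM ref_counts WHERE word_group = ?",
                       (word_group,)).fetchone()
    if row is not None:
        return row[0]                        # re-use the stored count
    count = query_count_server(word_group)   # e.g. totalhits from search service 140
    conn.execute("INSERT INTO ref_counts VALUES (?, ?)", (word_group, count))
    conn.commit()
    return count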
In one such embodiment, in a block 324, periodic operation block 328 in relevance engine 130 computes, as the word group weight, a ratio whose denominator is equal to the number of times the word group appears in the interest corpus and whose numerator is equal to the number of times it appears in search results on the web or other reference corpus. Other embodiments use as the weight an arbitrary function of the numerator and/or denominator, i.e. any function of either the reference corpus count or the interest corpus count or both counts. In the just-described embodiment, periodic operation block 328 in relevance engine 130 also includes a block 325 that identifies word groups for use in selecting relevant documents, based at least partially on weight.
Accordingly, block 325 of several embodiments includes a block 326 that sorts all word groups identified from an interest document in ascending order of their weight, as computed in block 324. Also as noted above, in one embodiment the word group weight is the count of the number of times the word group appears in search results on the web or other reference corpus. All groups of words in the sorted list are supplied to document relevance ranker block 350, for use in selection of relevant documents from a subscription corpus.
In one embodiment, to create a subscription corpus automatically, relevance engine 130 does not use search results for all word groups generated from the interest corpus by block 326, because of the large number of word groups from an interest corpus. Instead, in one embodiment, the word groups from one or more interest document(s) are automatically selected by first starting with the list of all word groups, eliminating those which have a reference corpus count less than some predetermined threshold (like 10^4), sorting the remaining word groups based on weight, and selecting the top N of the word groups remaining in the list (for example N=20). These top N word groups are then provided to block 329 for invoking a document crawler to perform N web searches using the N word groups, and the results of these searches (and optionally hyperlinked documents therefrom) are used to form a subscription corpus for use by document relevance ranker block 350.
Some embodiments automatically select the top N word groups by eliminating not only those word groups which have a reference corpus count less than a first predetermined threshold (like 10^2) as just described, but also additional word groups which have a reference corpus count greater than a second predetermined threshold (like 10^7), i.e. the top N word groups are selected for having their reference corpus counts within a predetermined range. Although some embodiments use a single range as just described, additional ranges may be used in other embodiments. For example, one illustrative embodiment uses three ranges to select three lists of word groups as follows: a high list 911 (
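For illustration, the selection of word groups described in the preceding two paragraphs may be sketched as follows; the thresholds and N are parameters taken from the examples above, and whether the "top" of the weight-sorted list is its lowest-weight or highest-weight end depends on how the weight is defined in a given embodiment (the sketch sorts in ascending order, as in block 326).

# Illustrative sketch: keep word groups whose reference corpus count lies
# within a predetermined range, sort the survivors by weight, and return the
# top N for use in web searches that form the subscription corpus.
def select_top_word_groups(word_groups, ref_counts, weights,
                           low=10**4, high=10**7, n=20):
    candidates = [g for g in word_groups if low <= ref_counts[g] <= high]
    candidates.sort(key=lambda g: weights[g])
    return candidates[:n]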
In some embodiments, in a list 911, word groups that satisfy a predetermined condition form a first set 914 (
In some embodiments, one or more lists of word groups matching the interest document are supplied by word group relevance evaluator 320 (via branch 351) to a database and stored therein. The word groups in the database are read by user interest display logic 380 that in turn supplies the word group list(s) to web framework 121. In one such embodiment, web framework 121 directly accesses a search service 140 on the Internet to generate and store in computer memory search results in the form of a list of documents, each item in the list identifying details of a document, such as a URL, a title and a snippet; web framework 121 then supplies this list to web browser 102 in client computer 101 for display to the user as shown in
In the example illustrated above in reference to TABLE-2, the following seven word groups are used to form a subscription corpus: passive detection, border container, handheld detector, technique need, nuclear terrorist, nuclear detection, light road. Note that the first word group “passive detection” is picked for having a reference corpus count of greater than 10,000. The remaining six word groups are picked sequentially thereafter based on weight. These seven word groups are thereafter used by a document crawler 122 to invoke seven searches by a search engine and the results are used as the subscription corpus in one embodiment. In another embodiment, the just-described results as well as documents hyperlinked therefrom together form the subscription corpus.
In some embodiments, a document relevance ranker block 350 initializes a document count to zero in act 331 and thereafter goes to act 332. In act 332, block 350 selects a group of words as the current group, from the top of sorted groups of words, excluding any word groups that have been eliminated by the user, e.g. by clicking on a check box control as described above. Next, in act 333, block 350 checks if a predetermined number (e.g. 10) of unique subscription documents have been found so far. If the answer is yes, then block 350 goes to act 336. In act 336, block 350 checks if the weight of the current word group (selected in act 332) exceeds a limit on the largest value of the minimum matching word group weight among the subscription documents found so far. If the answer in act 336 is yes, block 350 goes to act 337 to sort the documents as illustrated in
In some embodiments of act 334, block 350 invokes the match API block 126 using the current word group and document identifiers of all documents obtained from block 329. In other embodiments of act 334, block 350 invokes the match API block 126 using the current word group and document identifiers obtained from an RSS feed and/or an email folder as described above. Next, in act 335, block 350 checks to see if a document has been found yet. If no document is found, block 350 returns to act 332 described above, to select another group of words as the current group from the sorted list (excluding user-eliminations). If the answer in act 335 is yes, then one or more documents have been found to be relevant, and block 350 goes to act 338 to perform document ranking.
Specifically, in act 338, block 350 decides whether to enter a ranking block 340 depending on whether any documents are currently unranked (i.e. their score is not fully computed, including a tie-breaker for identically ranked documents). If there are no unranked documents in act 338, then block 350 returns to act 332 described above. While there are any unranked documents in act 338, block 350 enters block 340 with an unranked document as the current document. In an act 341, block 350 checks if this is the first time that a document is found while traversing the list of word groups in acts 332-338. In act 341, block 350 further checks if the document count is less than a predetermined number. If both conditions are met, then block 350 goes to act 342, else it goes to act 343.
In act 342, block 350 records the weight of the current word group as a minimum matching word group weight for this current document, and initializes the accumulated detection count for this document to 1 and increments a document count by 1. Next block 350 returns to act 338 (described above). In act 343, block 350 checks if the word group weight is less than a sum of this document's minimum matching word group weight and a predetermined limit. If the answer is yes, then block 350 goes to act 344 and increments by 1 the accumulated detection count for this document, and then goes to act 338 (described above). If the answer in act 343 is no, then block 350 simply returns to act 338.
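The bookkeeping of acts 341-344 can be summarized by the following R sketch. It is deliberately simplified and is not the described control flow verbatim: match_documents() is a hypothetical stand-in for the match API of block 126, the stopping tests of acts 333 and 336 are omitted, and the final sort of act 337 is reduced to one plausible ordering.

rank_documents <- function(sorted_groups, match_documents,
                           max_docs = 10, weight_limit = 0.2) {
  docs <- list()  # per document: minimum matching word group weight + count
  for (i in seq_len(nrow(sorted_groups))) {
    g <- sorted_groups[i, ]
    for (doc_id in match_documents(g$group)) {
      key <- as.character(doc_id)
      if (!(key %in% names(docs))) {
        if (length(docs) >= max_docs) next   # act 341: document count limit
        # act 342: record this group's weight and start the detection count
        docs[[key]] <- list(min_weight = g$weight, detections = 1)
      } else if (g$weight < docs[[key]]$min_weight + weight_limit) {
        # acts 343-344: a further match within the limit adds one detection
        docs[[key]]$detections <- docs[[key]]$detections + 1
      }
    }
  }
  # act 337, one plausible ordering: documents with more detections first
  docs[order(vapply(docs, function(d) d$detections, numeric(1)),
             decreasing = TRUE)]
}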
In some embodiments, block 350 includes a document sorter which is invoked in act 337 as described above. As illustrated in
In several embodiments, a user interest display logic 380 in relevance engine 130 performs acts 401-404 illustrated in
In some embodiments, display logic 380 receives identifiers of one or more word groups (via branch 351) from word group relevance evaluator 320, and on receipt performs act 405 (
In several embodiments, a user interest feedback logic 390 in relevance engine 130 performs acts 411-412 illustrated in
As illustrated in
Note that although deselection of a check box has been described above as an illustrative example, some embodiments also enable the user to re-select a check box that was previously deselected to exclude a word group, thereby initiating re-ranking that includes the previously excluded word group. While in some embodiments a web page displays a check box for excluding a word group, other embodiments display a control that enables the user to modify the weight of a word group, e.g. to promote or demote the word group relative to other word groups automatically identified as being of interest to the user. Several such embodiments display a list of all word groups automatically identified as being of interest to the user, and also display controls wherein the user can change the relative weights of the word groups.
Some embodiments in accordance with the invention convert the reference corpus count of each word group to decibel units, i.e. ten times logarithm of the reference corpus count to base 10, and then use this result in scoring and ranking each document relative to other documents to be displayed to the user.
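In R this conversion is simply ten times the base-10 logarithm; how the decibel value then enters the per-document score is not spelled out here.

refcount_decibels <- function(refcount) 10 * log10(refcount)

refcount_decibels(126)   # about 21 dB, e.g. for "in-vehicle detector" in TABLE 3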
Although in some embodiments the identifiers of relevant documents selected by document relevance ranker 350 (
In some embodiments, each word group is stored in database 150 in association with a unique identifier in the form of an integer. In these embodiments, integer identifiers of all word groups matching an interest document, corresponding counts in the interest document, and corresponding reference corpus counts are also stored sequentially with an identifier of the interest document in database 150, instead of storing one record for each word group. Use of integer identifiers to identify word groups as just described improves speed of storage and retrieval of this information during operation of the relevance engine 130, compared to embodiments that directly use character strings of the word groups in such storage and retrieval.
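As an illustration only (the actual schema of database 150 is not reproduced here), the interning of word group strings into integer identifiers might look like the following R sketch; the function and column names are assumptions.

word_group_ids <- new.env(hash = TRUE)   # word group string -> integer id

intern_word_group <- function(group) {
  # assign the next unused integer id the first time a word group is seen
  if (!exists(group, envir = word_group_ids, inherits = FALSE)) {
    assign(group, length(ls(word_group_ids)) + 1L, envir = word_group_ids)
  }
  get(group, envir = word_group_ids, inherits = FALSE)
}

# Record stored per interest document: integer ids with the two counts,
# instead of one record per word-group character string.
record_for_document <- function(groups, first_counts, second_counts) {
  data.frame(group_id = vapply(groups, intern_word_group, integer(1)),
             first_count = first_counts,
             second_count = second_counts)
}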
The server computer 120 of
Main memory 1106 also may be used for storing temporary variables or other intermediate information (e.g. index 125 shown in
Server computer 120 may be coupled via bus 1102 to a display device or video monitor 1112 such as a cathode ray tube (CRT) or a liquid crystal display (LCD), for displaying information to a user, e.g. identifiers of the automatically selected documents are displayed on display 1112. An input device 1114, including alphanumeric and other keys (e.g. of a keyboard), is coupled to bus 1102 for communicating information and changes to objects 216 and 217 to processor 1105. Another type of user input device is cursor control 1116, such as a mouse, a trackball, or cursor direction keys for communicating information and command selections to processor 1105 and for controlling cursor movement on display 1112. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
As described elsewhere herein, automatic identification of word groups and automatic selection of documents are performed by server computer 120 in response to processor 1105 executing one or more sequences of one or more instructions for a processor that are contained in main memory 1106. Such instructions may be read into main memory 1106 from another computer-readable storage medium, such as storage device 1110. Execution of the sequences of instructions contained in main memory 1106 causes processor 1105 to perform the operations of a process described herein and illustrated in
The term "computer-readable storage device" as used herein refers to any storage device that participates in providing instructions to processor 1105 for execution. Such a storage device may take many forms, including but not limited to (1) non-volatile computer memory, and (2) volatile memory. Common forms of non-volatile computer memory include, for example, a floppy disk, a flexible disk, hard disk, optical disk, magnetic disk, magnetic tape, or any other magnetic device, a CD-ROM, any other optical device, punch cards, paper tape, any other physical device with patterns of holes, a PROM, an EPROM, a FLASH-EPROM, or any other memory chip or cartridge that can be used as storage device 1110. Volatile memory includes dynamic memory, such as main memory 1106 which may be implemented in the form of a random access memory or RAM.
Instead of or in addition to a storage device, a transmission link may be used to provide instructions to processor 1105. A transmission link includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1102. A transmission link can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications, any of which can be used to implement a carrier wave as described herein.
Accordingly, instructions to processor 1105 can be provided by a transmission link or by a storage device from which a computer can read information, such as data and/or code. Specifically, various forms of transmission link and/or storage device may be involved in providing one or more sequences of one or more instructions to processor 1105 for execution. For example, the instructions may initially be stored on a storage device, such as a magnetic disk, of a remote computer. The remote computer can load the instructions into its dynamic memory (RAM) and send the instructions over a telephone line using a modem.
A modem local to server computer 120 can receive information about a user's interest document on the telephone line and use an infra-red transmitter to transmit the information in an infra-red signal. An infra-red detector can receive the information carried in the infra-red signal and appropriate circuitry can place the information on bus 1102. Bus 1102 carries the information to main memory 1106, from which processor 1105 retrieves and executes the instructions. The instructions received by main memory 1106 may optionally be stored on storage device 1110 either before or after execution by processor 1105.
Server computer 120 also includes a communication interface 1115 coupled to bus 1102. Communication interface 1115 provides a two-way data communication coupling to a network link 1120 that is connected to a local network 1122. Local network 1122 may interconnect multiple computers (as described above). For example, communication interface 1115 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1115 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1115 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 1120 typically provides data communication through one or more networks to other data devices. For example, network link 1120 may provide a connection through local network 1122 to a host computer 1125 or to data equipment operated by an Internet Service Provider (ISP) 1126. ISP 1126 in turn provides data communication services through the world wide packet data communication network 1124 now commonly referred to as the “Internet”. Local network 1122 and network 1124 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 1120 and through communication interface 1115, which carry the digital data to and from server computer 120, are exemplary forms of carrier waves transporting the information.
Server computer 120 can send messages and receive data, including program code, through the network(s), network link 1120 and communication interface 1115. In the Internet example, a server 1100 might transmit information related to objects 216 and 217 retrieved from a distributed database system through Internet 1124, ISP 1126, local network 1122 and communication interface 1115. The instructions for performing the operations of
Note that
Instead of requiring a user to supply search terms, some computer(s) of second embodiments are programmed in accordance with the invention to require a user to supply or identify one or more interest documents. Other computer(s) of second embodiments are programmed to automatically identify one or more interest documents as described above in reference to the embodiments of TABLE 1. The programmed computers of the second embodiments then use the interest documents to automatically identify one or more word groups, and then use the automatically identified word group(s) to automatically select documents from a corpus (“subscription corpus”). Identifiers of the automatically selected documents are included in a new document that is initially stored in a computer memory, and eventually transmitted to a client computer and displayed to the user, e.g. in an email reader, or in a web browser.
Depending on the embodiment, a subscription corpus may be either identified by the user to client computer(s) 101 or alternatively identified automatically by server computer(s) 120. In some of the just-described alternative embodiments, the server computer(s) 120 use the interest documents to further identify additional word groups which are in addition to the above-described automatically identified word groups. These additional word groups are thereafter used to filter another corpus (“super-subscription corpus”) that is two or more orders of magnitude larger than the subscription corpus (e.g. illustrated in
Several embodiments in accordance with the invention also use a subscription corpus that is generated by use of one or more word groups that are automatically identified by use of interest documents (as described below) to conduct one or more searches on the world wide web, using a search engine 105. The results from a search engine, illustrated in
Additionally, certain third embodiments further include, in the new document, at least one control to receive user input on one or more groups of words included therein. The control can be, for example, a hyperlink that is activated by the user clicking on a displayed word group, or a check-box that is displayed adjacent to the displayed word group. Depending on the embodiment, the programmed computer(s) are programmed to respond to user input via such a control in one of several ways. For example, in some embodiments, the programmed computer(s) use the user's input via such a control to exclude one or more word group(s) from use in filtering documents in the future, for this user. Hence, several such third embodiments respond to the user's input via such control(s) by performing another iteration that excludes the user-identified word group(s) during filtering, and then return new results to the user. As another example, in several embodiments, the programmed computer(s) use the user's input via such a control to display text in the selected document surrounding the word group, e.g. display words that precede the word group, followed by the word group itself, followed by words that occur subsequent to the word group, i.e. in the same sequence as in the document.
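The last of these controls amounts to a keyword-in-context display. A minimal sketch in R, assuming the selected document is available as plain text and ignoring punctuation and markup; the function and parameter names are illustrative.

surrounding_text <- function(doc_text, word_group, context_words = 5) {
  words <- strsplit(doc_text, "\\s+")[[1]]
  group_words <- strsplit(word_group, "\\s+")[[1]]
  n <- length(group_words)
  if (length(words) < n) return(NA_character_)
  for (i in seq_len(length(words) - n + 1)) {
    if (all(tolower(words[i:(i + n - 1)]) == tolower(group_words))) {
      from <- max(1, i - context_words)                    # words preceding
      to <- min(length(words), i + n - 1 + context_words)  # words following
      return(paste(words[from:to], collapse = " "))
    }
  }
  NA_character_
}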
Although some embodiments of the type described above in reference to act 324 in
Although operation 328 is periodic in some embodiments, in other embodiments operation 328 can be performed aperiodically, e.g. in response to a predetermined event. Accordingly, operation 628 is shown in
A computer is programmed in some embodiments of the invention to perform acts 631-637 illustrated in
After a pair of counts is determined for each of several word groups, the programmed computer uses at least one processor to compute a weight of each word group in act 634. Specifically, in act 634 of
In some embodiments, act 634 of
After a function is fitted, in an act 711 the computer compares the first count for a given word group to the fitted function in order to obtain a weight of the given group of words. The comparison in act 711 can be performed by a suitably programmed computer in different ways, depending on the embodiment. In some embodiments, the fitted function is evaluated at the second count of the given word group as per act 712 in
The above-described function fitting in an act 701 can be performed in different ways depending on the embodiment. For example, in some embodiments, a function is fitted by identifying it from a predetermined family of functions as per act 702, while in other embodiments the function is identified based on a formula as per act 708. Two examples of such formulas used in some embodiments are a simple moving average and an exponential moving average. Act 702 can also be implemented differently depending on the embodiment, e.g. by identifying a function from among a family of parametric functions, such as linear functions, quadratic functions or exponential functions, or alternatively from among non-parametric functions, such as an infinite series, e.g. the Fourier series. In a first example, one illustrative embodiment uses a family of linear functions of the form y=bx+c, wherein y denotes the first count, x denotes the second count, and wherein b and c are constants that identify a single function (fitted to the pairs of word group counts) on performance of act 702. In a second example, another illustrative embodiment uses a family of quadratic functions of the form y=ax^2-bx+c, wherein y denotes the first count, x denotes the second count, and wherein a, b and c are constants that identify a single function (fitted to the pairs) on performance of act 702.
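As one concrete rendering, the R sketch below fits a single function from either family to the (second count, first count) pairs using ordinary least squares (the quantile-regression fit actually shown with TABLE 3 appears further below), and then computes a weight for one word group by comparing its first count to the fitted value at its second count, in the manner of acts 711-712. Function and column names are illustrative only; note that lm() reports the coefficient of x directly, so its sign convention may differ from the b written above.

fit_from_family <- function(first_counts, second_counts,
                            family = c("linear", "quadratic")) {
  # one (x, y) pair per word group: x = second count, y = first count
  family <- match.arg(family)
  d <- data.frame(x = second_counts, y = first_counts)
  if (family == "linear") {
    lm(y ~ x, data = d)           # two constants: intercept and slope
  } else {
    lm(y ~ x + I(x^2), data = d)  # three constants: intercept, x and x^2 terms
  }
}

# Weight of a single word group: how far its observed first count lies above
# the fitted function evaluated at its second count.
group_weight <- function(fit, first_count, second_count) {
  first_count - predict(fit, newdata = data.frame(x = second_count))
}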
In several embodiments, act 702 is performed by the computer performing a regression analysis in an act 703, to select from among a family of functions, a function that minimizes a sum of deviations between first counts and corresponding values of the function (evaluated at the second count). In one illustrative example, the computer 120 is programmed to determine at least one constant that uniquely identifies a single function from the family, by performing a quantile regression as illustrated by act 704, while in another such example the computer performs a linear regression as per act 705. For additional implementation detail of such examples, see the description of
In some embodiments of act 707, the computer is programmed to divide a range of the second count across all ordered pairs into intervals, and generate an average within each interval, and then connect up the average of each interval at the midpoint of each interval to generate a piece-wise linear function. In other embodiments of act 706, the computer is programmed to also divide a range of the second count across all ordered pairs into intervals, and generate a threshold within each interval. The threshold may be based on a predetermined statistical criterion, such as the highest N word groups within each interval (e.g. N=10), ranked by the first count. Then the threshold of each interval is connected up at the mid-point of each interval by the programmed computer, to generate a piece-wise linear function.
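A sketch of these interval-based alternatives, assuming equal-width intervals over the second counts; the per-interval statistic is the mean for the averaging variant, and could be replaced by, e.g., the N-th largest first count in the interval for the threshold variant. Names and interval count are assumptions for illustration.

piecewise_fit <- function(first_counts, second_counts, n_intervals = 10,
                          statistic = mean) {
  breaks <- seq(min(second_counts), max(second_counts),
                length.out = n_intervals + 1)
  bins <- cut(second_counts, breaks = breaks, include.lowest = TRUE)
  per_interval <- tapply(first_counts, bins, statistic)
  midpoints <- (head(breaks, -1) + tail(breaks, -1)) / 2
  keep <- !is.na(per_interval)        # skip intervals containing no word groups
  # join the per-interval values at the interval midpoints with straight lines
  approxfun(midpoints[keep], per_interval[keep], rule = 2)
}

# Threshold variant, e.g. the 10th largest first count within each interval:
#   statistic = function(y) sort(y, decreasing = TRUE)[min(10, length(y))]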
In some embodiments of act 624 (
In some embodiments, module "rq" is invoked from the "quantreg" package, which is available for use in the statistical language R. Documentation for the quantreg package for R is available at the CRAN website address obtained by replacing "%" with "." in the following string: "cran%r-project%org". In some embodiments, module "lm" is invoked in server computer 120 as a built-in function of the statistical language R, e.g. as described in the book entitled "An Introduction to R, Notes on R: A Programming Environment for Data Analysis and Graphics", Version 2.10.1, (2009-12-14) by W. N. Venables, D. M. Smith and the R Development Core Team, incorporated by reference herein in its entirety.
In certain embodiments, quantile regression is performed by server computer 120 as described in an article by Hunter DR, and Lange K, entitled “Quantile regression via an MM algorithm” published in J Comput Graphical Stat 2000; (9): 60-77 which is incorporated by reference herein in its entirety, and available at the URL obtained by replacing “%” with “/” and replacing “=” with “.” in the following string: http:%%www=stat=psu=edu% ˜dhunter%papers%qr=pdf.
Moreover, several embodiments of server computer 120 use the "quantreg" package (including an "rq" module) as described in the article entitled "QUANTILE REGRESSION IN R: A VIGNETTE" by Roger Koenker, published Nov. 4, 2009, that is incorporated by reference herein in its entirety, and available at the URL obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%cran=r-project=org%web%packages%quantreg%vignettes%rq=pdf.
For additional information on quantreg as used in server computer 120 of some embodiments, see the user manual entitled Package 'quantreg', published Nov. 5, 2009, that is incorporated by reference herein in its entirety and available at the URL obtained by replacing "%" with "/" and replacing "=" with "." in the following string: http:%%cran=r-project=org%web%packages%quantreg%quantreg=pdf.
In a few embodiments, server computer 120 is programmed to perform quantile regression as described in an article entitled "A gentle introduction to quantile regression for ecologists", published in Front Ecol Environ 2003; 1(8): 412-420, that is incorporated by reference herein in its entirety.
Furthermore, some embodiments of server computer 120 perform quantile regression as described in the article entitled "QUANTILE REGRESSION" by Roger Koenker and Kevin F. Hallock, published in Journal of Economic Perspectives, Volume 15, No. 4, Fall 2001, pages 143-156, which is incorporated by reference herein in its entirety.
In several embodiments, a function is automatically selected by computer 120 from among a family of functions to minimize a sum of deviations between the first counts in ordered pairs of a group of words and corresponding values of the function evaluated at the respective second counts in ordered pairs of the group of words. In some such embodiments, the function is selected by computer 120 from among a family of functions based on having the minimum sum of deviations of ordered pairs of the group of words from the function, such that the deviation of each ordered pair has a value equal to (i) a predetermined multiple "tau" of the difference r of the first count from the corresponding value of said function if r is non-negative, and (ii) a complement of the predetermined multiple, namely (tau-1), multiplied by the difference r if r is negative, wherein 0<tau<1.
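Written out, with r denoting the residual (the first count minus the value of the candidate function at the corresponding second count), this is the quantile-regression check loss. A small R rendition of the criterion, as a sketch only:

check_loss <- function(r, tau = 0.95) {
  # tau * r for non-negative residuals, (tau - 1) * r for negative residuals
  ifelse(r >= 0, tau * r, (tau - 1) * r)
}

# Sum of deviations of the ordered pairs from a candidate function fn;
# the fitted function is the member of the family minimizing this sum.
total_deviation <- function(first_counts, second_counts, fn, tau = 0.95) {
  sum(check_loss(first_counts - fn(second_counts), tau))
}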
In some embodiments, computer 120 invokes the quantile regression function "rq" as follows, "q1 = rq(y ~ x, data = D, tau = 0.95)", to identify the two constants b and c of a specific linear function y=bx+c (described above), illustrated by straight line 791 in
One illustrative example of certain above-described embodiments is shown below in TABLE 3. In this example, the module “rq” is invoked as follows:
> r$q2
Call:
rq(formula = lintcount ~ lrefcount + I(lrefcount^2), tau = 0.95,
data = x)
wherein lintcount is the logarithm to the base 10 of the interest corpus count, and
lrefcount is the logarithm to the base 10 of the reference corpus count. In response, "rq" returns the following coefficients in the above formula y=ax^2-bx+c (e.g. Intercept is the zeroth order coefficient "c", and so on).
(Intercept)      0.31191287
lrefcount       -0.02357209
I(lrefcount^2)   0.01268922
In this illustrative example, the weight (called “deviation” below) for the word group “in-vehicle detector” is calculated as follows:
lintcount=log10(intcount)=1.38
lrefcount=log10(refcount)=2.10
lfitted = 0.3119 - 0.0236*lrefcount + 0.01269*(lrefcount^2) = 0.32
deviation=lintcount−lfitted=1.38−0.32=1.06
TABLE 3
Word group | Interest corpus count | Reference corpus count | log base 10 of Interest corpus count | log base 10 of Reference corpus count | Value of fitted curve | deviation
in-vehicle detector | 24 | 126 | 1.38 | 2.1 | 0.32 | 1.06
u-238 | 35 | 398107 | 1.54 | 5.6 | 0.58 | 0.97
gamma ray | 42 | 7943282 | 1.62 | 6.9 | 0.75 | 0.87
u-232 | 20 | 31623 | 1.3 | 4.5 | 0.46 | 0.84
nuclear material | 34 | 3162278 | 1.53 | 6.5 | 0.69 | 0.84
detection distance | 17 | 31623 | 1.23 | 4.5 | 0.46 | 0.77
mev gamma | 15 | 7943 | 1.18 | 3.9 | 0.41 | 0.76
plutonium | 34 | 10000000 | 1.53 | 7 | 0.77 | 0.76
nuclear detection | 15 | 39811 | 1.18 | 4.6 | 0.47 | 0.7
detector reading | 11 | 2512 | 1.04 | 3.4 | 0.38 | 0.66
mev gamma ray | 11 | 3162 | 1.04 | 3.5 | 0.38 | 0.66
grade of plutonium | 10 | 794 | 1 | 2.9 | 0.35 | 0.65
in-vehicle | 26 | 10000000 | 1.41 | 7 | 0.77 | 0.65
disarm program | 9 | 251 | 0.95 | 2.4 | 0.33 | 0.63
detector area | 11 | 10000 | 1.04 | 4 | 0.42 | 0.62
10 cm lead | 8 | 251 | 0.9 | 2.4 | 0.33 | 0.57
neutron emission | 10 | 15849 | 1 | 4.2 | 0.44 | 0.56
linear attenuation coefficient | 9 | 3981 | 0.95 | 3.6 | 0.39 | 0.56
mev | 20 | 6309573 | 1.3 | 6.8 | 0.74 | 0.56
rand-mipt | 8 | 1000 | 0.9 | 3 | 0.36 | 0.55
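The computation shown above for TABLE 3 can be reproduced in R roughly as follows. The fitting call is the one quoted earlier and assumes a data frame x holding one row per word group with columns lintcount and lrefcount; the deviation of the first row is then recomputed from the reported coefficients.

library(quantreg)   # provides rq(); see the references cited above

# Fit (needs the full set of word-group counts):
#   q2 <- rq(lintcount ~ lrefcount + I(lrefcount^2), tau = 0.95, data = x)
#   coef(q2)

# Deviation ("weight") of "in-vehicle detector", from the reported coefficients:
lintcount <- log10(24)    # 1.38
lrefcount <- log10(126)   # 2.10
lfitted   <- 0.31191287 - 0.02357209 * lrefcount + 0.01268922 * lrefcount^2
deviation <- lintcount - lfitted
round(c(lfitted = lfitted, deviation = deviation), 2)   # 0.32 and 1.06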
In some embodiments (“H”), computers 800 of an on-line retailer (such as Amazon) include server computer 120 that is internally connected (e.g. via a proprietary network) to search engine 105 and website 107 as illustrated in
In certain embodiments (“K”), a website 107 is included in server computer 120. In these embodiments, as illustrated in
Screens displayed during an illustrative interaction of a user with one embodiment of server computer 120 are shown in
Note that web page 900 is constructed by server computer 120 of this embodiment to include lists 911-913, generated by use of the interest document as noted above. Accordingly, on viewing web page 900, the user can selectively combine their own search term “terrorist” with one or more of the word groups displayed in one of lists 911-913. In the illustrative example, the user clicks on the word group 916 (
In this manner, the user can refine their search by appropriately combining any one or more of the word groups in lists 911-913 with search terms (if any) typed by the user in box 901. In one embodiment, any word group present in search box 901 can be easily removed from use in the next search, either by the user operating the delete button on the keyboard or alternatively by the user clicking on the same word group in one of lists 911-913 (e.g. a first click on a word group adds it to box 901 and a second click on the word group removes it from box 901).
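The add-on-first-click, remove-on-second-click behaviour just described amounts to a simple toggle over the terms currently in box 901; a minimal sketch (names illustrative):

toggle_search_term <- function(current_terms, word_group) {
  if (word_group %in% current_terms) {
    setdiff(current_terms, word_group)   # second click: remove from box 901
  } else {
    c(current_terms, word_group)         # first click: add to box 901
  }
}

toggle_search_term(c("terrorist"), "nuclear detection")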
In the illustrative example, web page 920 constructed by server computer 120 identifies several search services in a section 924, such as Google, Yahoo, YouTube, Oneriot, Crunchbase, Books, Patents, Videos, Images, News, Blogs etc. The search service currently in use is shown with different attributes (e.g. bolded and without underlining) in section 924 in this example, relative to the other search services that are available (e.g. shown underlined, as hyperlinks). In the illustrative interaction the user now clicks on the "Books" hyperlink 925 and on doing so, server computer 120 is notified by web browser 102. Server computer 120 then queries the user-selected search service 140 (in this case GOOGLE books), and generates a web page of the type illustrated in
In the illustrative interaction, the user returns to the search results shown in
Although each document description illustrated in
In several embodiments, a fitted function is selected by computer 120 from among a family of functions based on deviations of ordered pairs of the group of words from the function, such that the deviation has a value based on a difference of the first count from the corresponding value of the function. The deviations can be, for example, absolute differences or squared differences (as in linear regression), depending on the embodiment.
Numerous modifications and adaptations of the embodiments described herein will become apparent to the skilled artisan in view of this disclosure.
Numerous modifications and adaptations of the embodiments described herein are encompassed by the scope of the invention.