A human (hand-labeled) ranking of URL results for a search query is compared against actual click data for the respective query/URL pairs (e.g., which URLs users actually clicked when those URLs were presented in response to the search query in the real world). The human ranking or ordering of the URL results (e.g., pre-existing relevance ranking) for the query can then be adjusted, if necessary, based upon the real world click data (e.g., click relevance ranking). The modified pre-existing relevance ranking can be used in providing future search results.
1. A method for improving relevance of web searches for a query, comprising:
computing a click relevance ranking of a plurality of query and url pairs based upon user log data comprising user click data, the computing comprising:
aggregating the user log data by query and url;
reducing click bias by determining a normalized click rate;
creating a click relevance ordering of the plurality of query and url pairs;
creating a directed acyclic graph of a relevance relationship between the plurality of query and url pairs; and
mapping the directed acyclic graph into a linear ordering; and
identifying and correcting mislabeled query and url pairs within a pre-existing relevance ranking based upon the click relevance ranking, at least some of at least one of the computing or the identifying performed at least in part via a processing unit.
18. A tangible computer-readable storage medium comprising processor-executable instructions that when executed perform a method for improving relevance of web searches for a query, comprising:
computing a click relevance ranking of a plurality of query and url pairs based upon user log data comprising user click data, the computing comprising:
aggregating the user log data by query and url pair;
reducing click bias by determining normalized click rate;
creating a click relevance ordering of the plurality of query and url pairs;
creating a directed acyclic graph of a relevance relationship between the plurality of query and url pairs; and
mapping the directed acyclic graph into a linear ordering; and
identifying and correcting mislabeled query and url pairs within a pre-existing relevance ranking based upon the click relevance ranking.
10. A system for improving relevance of web searches for a query, comprising:
one or more processing units; and
memory comprising instructions that when executed by at least some of the one or more processing units implement the following alone or in combination with hardware:
a click relevance ranking component configured to:
compute a click relevance ranking of a plurality of query and url pairs based upon user log data comprising user click data, the computing comprising:
aggregating the user log data by query and url pair;
reducing click bias by determining normalized click rate;
creating a click relevance ordering of the plurality of query and url pairs;
creating a directed acyclic graph of a relevance relationship between the plurality of query and url pairs; and
mapping the directed acyclic graph into a linear ordering; and
a dynamic program component configured to:
identify and correct mislabeled query and url pairs within a pre-existing relevance ranking based upon the click relevance ranking.
2. The method of
respective ranks assigned to query and url pairs at one or more times;
respective total numbers of impressions associated with the ranks; or
respective total number of clicks associated with the ranks.
3. The method of
4. The method of
5. The method of
7. The method of
determining a longest common subsequence (LCS) of query and url pairs that is decreasing in both the pre-existing and click relevance ranking; and
removing labels from query and url pairs which are not in the LCS.
8. The method of
determining a longest common subsequence (LCS) of query and url pairs that is decreasing in both the pre-existing and click relevance ranking;
assigning pre-existing relevance ranking labels associated with the LCS of query and url pairs to the click relevance ranking; and
relabeling a label associated with a query and url pair not in the LCS with a new label interpolated from the click relevance ranking.
9. The method of
computing a distribution of labels in the pre-existing relevance ranking; and
relabeling one or more labels associated with the query and url pairs in the click relevance ranking according to the distribution of labels in the pre-existing relevance ranking.
11. The system of
respective ranks assigned to query and url pairs at one or more times;
respective total numbers of impressions associated with the ranks; or
respective total number of clicks associated with the ranks.
12. The system of
evaluate at least one of: total number of clicks, click rates, normalized click rates, or total number of impressions corresponding to a first url and a second url to determine whether the first url or the second url is more relevant to a query.
13. The system of
modify the pre-existing relevance ranking based upon the corrected query and url pairs.
14. The system of
map the directed acyclic graph into the linear ordering using a flooding technique.
15. The system of
determine a longest common subsequence (LCS) of query and url pairs that is decreasing in both the pre-existing and click relevance ranking; and
remove a label from a query and url pair which is not in the LCS.
16. The system of
determine a longest common subsequence (LCS) of query and url pairs that is decreasing in both the pre-existing and click relevance ranking;
assign pre-existing relevance ranking labels associated with the LCS of query and url pairs to the click relevance ranking; and
relabel a label associated with a query and url pair not in the LCS with a new label interpolated from the click relevance ranking.
17. The system of
compute a distribution of labels in the pre-existing relevance ranking; and
relabel one or more labels associated with the query and url pairs in the click relevance ranking according to the distribution of labels in the pre-existing relevance ranking.
19. The tangible computer-readable storage medium of
respective ranks assigned to query and url pairs at one or more times;
respective total numbers of impressions associated with the ranks; or
respective total number of clicks associated with the ranks.
20. The tangible computer-readable storage medium of
evaluating at least one of: total number of clicks, click rates, normalized click rates, or total number of impressions corresponding to a first url and a second url to determine whether the first url or the second url is more relevant to a query.
This application is a continuation of U.S. patent application Ser. No. 12/056,302 filed on Mar. 27, 2008, entitled “IMPROVED WEB SEARCHING,” which is hereby incorporated by reference in its entirety.
The internet has vast amounts of information distributed over a multitude of computers, thereby providing users with large amounts of information on varying topics. This is also true for a number of other communication networks, such as intranets and extranets. Finding information from such large amounts of data can be difficult.
Search engines have been developed to address the problem of finding information on a network. Users can enter one or more search terms into a search engine. The search engine will return a list of network locations (e.g., uniform resource locators (URLs)) that the search engine has determined contain relevant information. Often search engines rely upon human judges to decide on the relevancy of search results. This generally involves a group of relevancy experts employed or otherwise engaged by a search engine entity to hand label a number of query/URL pairs. These labels are used for training ranking algorithms, relevance evaluation, and a variety of other search engine tasks.
Human labeling is an expensive and labor-intensive task. Financial and logistical constraints therefore allow only a small fraction of query/web page pairs to be labeled by experts. Furthermore, the quality of the labels is of great importance, as labels are also used as “ground truth” when evaluating relevancy performance of search engines. Unfortunately, the quality of some of the human expert labels used in search engines may be less than desirable. Further, the quality of labels varies among different judges based on their experience and quality of work. For any given query, a significant number of relevancy labels may be inconsistent or incorrect.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key factors or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
As provided herein, the relevance of web based search results can be improved through a method of identifying and correcting mislabeled query/URL pairs based upon a click relevance ranking computed from user data comprising user click information. The click relevance ranking is formed by applying a set of relevance ordering rules to user log data aggregated by query and URL and by mapping the results of the relevance ordering rules into a linear ordering. For a given query, the aggregated user log data comprises a relative total number of impressions, a relative total number of clicks received, and a rank associated with the query/URL pair at the time of the total number of impressions and total number of clicks received. The click relevance ranking is used to identify and correct mislabeled query/URL pairs of other rankings according to a number of disclosed methods. Other embodiments and methods are also disclosed.
To the accomplishment of the foregoing and related ends, the following description and annexed drawings set forth certain illustrative aspects and implementations. These are indicative of but a few of the various ways in which one or more aspects may be employed. Other aspects, advantages, and novel features of the disclosure will become apparent from the following detailed description when considered in conjunction with the annexed drawings.
The claimed subject matter is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. It may be evident, however, that the claimed subject matter may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the claimed subject matter.
At 104 a pre-existing relevance ranking is provided for a word query. The pre-existing relevance ranking can be based upon any method or algorithm of relevance ranking. In one embodiment, the pre-existing relevance ranking is a pre-existing ranking that is formed from human expert ranking.
At 106 search result URLs for the word query are click relevance ranked. Click relevance ranking is performed on a plurality of search result query/URL pairs and is based on user log information comprising user click data. A set of relevancy ordering rules is applied to the user log information to form a click relevance ranking.
At 108 mislabeled query/URL pairs in the pre-existing relevance ranking are identified and corrected by utilizing the click relevance ranking. The click relevance ranking and the pre-existing relevance ranking are compared using a number of different disclosed methods. The comparison identifies errors in the pre-existing relevance ranking. The pre-existing relevance ranking is then modified based upon the click relevance ranking to correct the identified errors.
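One of the comparisons recited in the claims keeps the longest subsequence of query/URL pairs that is ordered the same way in both rankings and treats the remaining pairs as suspect. The following is a minimal sketch of that idea, not a definitive implementation. It assumes numeric human labels where a higher value means more relevant, allows ties between labels, and uses a simple quadratic dynamic program; the function names `longest_non_increasing` and `flag_mislabeled` are hypothetical.

```python
def longest_non_increasing(values):
    """Return indices of one longest subsequence of `values` that never increases."""
    n = len(values)
    if n == 0:
        return []
    best = [1] * n    # best[i]: length of the longest valid subsequence ending at i
    prev = [-1] * n   # back-pointers used to rebuild the subsequence
    for i in range(n):
        for j in range(i):
            if values[j] >= values[i] and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    picked = []
    while end != -1:
        picked.append(end)
        end = prev[end]
    return picked[::-1]


def flag_mislabeled(click_order, human_label):
    """List URLs whose human label disagrees with the click relevance ranking.

    click_order: URLs for one query, most relevant first (click relevance ranking).
    human_label: dict mapping URL -> numeric label (assumed: higher is more relevant).
    """
    labels_in_click_order = [human_label[url] for url in click_order]
    consistent = set(longest_non_increasing(labels_in_click_order))
    return [url for i, url in enumerate(click_order) if i not in consistent]


if __name__ == "__main__":
    order = ["a", "b", "c", "d", "e"]
    labels = {"a": 4, "b": 1, "c": 3, "d": 2, "e": 0}
    print(flag_mislabeled(order, labels))  # ['b']
```

In this example the labels read 4, 1, 3, 2, 0 in click order; the longest consistent run is 4, 3, 2, 0, so only `b` is flagged. The flagged pairs could then have their labels removed or replaced, depending on the embodiment.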
For respective query/URL pairs a click rate is calculated at 206. The click rate is equal to the sum of the total number of clicks received for respective ranks divided by the sum of the total number of impressions for respective ranks.
The click bias is reduced at 208. Reduction of the click bias is performed by calculating a normalized click rate from the aggregated user log data. The normalized click rate allows an unbiased comparison of the relevance of different query/URL pairs associated with the same query. The unbiased comparison mitigates the influence of ranking on query/URL pair relevance.
At 210 a click relevance ordering is performed. The click relevance ordering is performed by applying a set of click relevance ordering rules to the aggregated user log data and the normalized click rate to form a relevance hierarchy of query/URL pairs. The click relevance ordering rules compare data (e.g., normalized click rate, total impressions, etc.) associated with a first URL to data associated with a second URL to determine if the first URL is more relevant than the second URL. Click relevance ordering rules are applied to the different combinations of the URLs for the word query. For some comparisons, the click relevance ordering rules may be undecided about the relation between the two URLs.
A directed acyclic graph of the relations between query/URL pairs is formed at 212. The results of the click relevance ordering of query/URL pairs are used to form the acyclic graph. The acyclic graph provides a graphical representation of the relevance comparisons between query/URL pairs associated with the word query.
At 214 the acyclic graph is mapped into a linear ordering. Linear ordering determines query/URL pair relations that were undecided by the click relevance ordering rules. Mapping the acyclic graph into a linear ordering can be performed by a number of methods, such as flooding.
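As a rough illustration of mapping an acyclic relevance graph into a linear ordering, the sketch below uses Kahn's topological sort. This is a swapped-in standard technique, not the flooding technique named above, whose details are not reproduced in this text. The function name `linearize`, the edge representation, and the tie-breaking by input order are assumptions made for the example.

```python
from collections import defaultdict, deque


def linearize(urls, more_relevant_than):
    """Map a relevance DAG into a linear ordering, most relevant first.

    urls: list of URLs for one query.
    more_relevant_than: iterable of (winner, loser) pairs decided by the
        click relevance ordering rules; undecided pairs have no edge.
    """
    successors = defaultdict(list)
    indegree = {url: 0 for url in urls}
    for winner, loser in more_relevant_than:
        successors[winner].append(loser)
        indegree[loser] += 1

    # URLs that no other URL beats are ready to be placed first.
    ready = deque(url for url in urls if indegree[url] == 0)
    order = []
    while ready:
        url = ready.popleft()
        order.append(url)
        for loser in successors[url]:
            indegree[loser] -= 1
            if indegree[loser] == 0:
                ready.append(loser)

    if len(order) != len(urls):
        raise ValueError("relevance relation contains a cycle")
    return order


if __name__ == "__main__":
    edges = [("a", "b"), ("a", "c"), ("c", "d")]
    print(linearize(["a", "b", "c", "d"], edges))  # ['a', 'b', 'c', 'd']
```

The pair `b`/`c` is undecided in the example graph; the linear ordering settles it, here simply by the order in which the URLs were supplied.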
A more detailed example of aggregating user log data is set forth in
In
It will be appreciated, however, that other algorithms may be utilized for estimating a click bias. Once the click rate at rank 402 and click weight at rank 404 are calculated, the normalized click rate can be calculated by taking the sum of the product of the number of clicks (Ci) 306 and click weight at rank (CWi) 404 divided by the total number of impressions (Ii) 304.
The normalized click rate is calculated for respective URL/query pairs.
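The aggregation and rate calculations can be sketched as follows. This is an illustrative reading, not the definitive formula: it takes the click rate as total clicks over total impressions, and the normalized click rate as the sum over ranks of clicks times click weight, divided by total impressions. The log row layout `(query, url, rank, clicked)`, the function names, and the click weight values are all hypothetical; how click weights are estimated is not shown here.

```python
from collections import defaultdict


def aggregate(log_rows):
    """Aggregate log rows by (query, url), then by rank.

    log_rows: iterable of (query, url, rank, clicked) tuples (assumed layout).
    Returns {(query, url): {rank: [impressions, clicks]}}.
    """
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for query, url, rank, clicked in log_rows:
        cell = stats[(query, url)][rank]
        cell[0] += 1          # every row is one impression at that rank
        if clicked:
            cell[1] += 1
    return stats


def click_rate(per_rank):
    """Total clicks over total impressions across all ranks."""
    impressions = sum(i for i, _ in per_rank.values())
    clicks = sum(c for _, c in per_rank.values())
    return clicks / impressions if impressions else 0.0


def normalized_click_rate(per_rank, click_weight):
    """Sum of clicks times per-rank click weight, over total impressions."""
    impressions = sum(i for i, _ in per_rank.values())
    weighted = sum(c * click_weight[rank] for rank, (_, c) in per_rank.items())
    return weighted / impressions if impressions else 0.0


if __name__ == "__main__":
    rows = [
        ("q", "u1", 1, True), ("q", "u1", 1, False),
        ("q", "u1", 3, True), ("q", "u1", 3, False),
    ]
    weights = {1: 1.0, 3: 2.0}  # hypothetical click weights per rank
    per_rank = aggregate(rows)[("q", "u1")]
    print(click_rate(per_rank))                      # 0.5
    print(normalized_click_rate(per_rank, weights))  # 0.75
```

A click at a lower-ranked position counts for more under these weights, which is how the normalization offsets the advantage that top positions have in attracting clicks.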
A more detailed example of click relevance ordering rules utilized in click relevance ranking is set forth in
if Cm≈Cn and CRm>CRn and NCRm>NCRn (1)
if Im≈In and CRm>CRn and NCRm>NCRn (2)
if Im≈In and CRm>2×CRn and NCRm≈NCRn (3)
if Im≈In and CRm≈CRn and NCRm>2×NCRn (4)
if Im>100 and In>100 and CRm>5×CRn and NCRm>0.8×NCRn (5)
if CRm>50×CRn (6)
wherein Cm and Cn are the total number of clicks received, CRm and CRn are the click rate, NCRm and NCRn are the normalized click rate, and Im and In are the total number of impressions for a first URL, URLm, and a second URL, URLn. It will be appreciated, however, that these rules are merely exemplary and that different rules may be utilized to determine click relevance.
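A sketch of how rules (1) to (6) might be evaluated in sequence for a pair of URLs is given below. Several points are assumptions rather than part of the description: the `approx` helper reads "≈" as "within 10% of the larger value", rule (6) is read as a comparison of click rates, and the names `UrlStats` and `first_matching_rule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UrlStats:
    impressions: int  # I
    clicks: int       # C
    cr: float         # CR, click rate
    ncr: float        # NCR, normalized click rate


def approx(a, b, rel_tol=0.1):
    """Hypothetical reading of '≈': the values differ by at most 10%."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))


def first_matching_rule(m, n):
    """Return the number of the first rule under which URL m beats URL n.

    Returns None when no rule fires, i.e. the relation is undecided.
    """
    rules = [
        lambda: approx(m.clicks, n.clicks) and m.cr > n.cr and m.ncr > n.ncr,
        lambda: approx(m.impressions, n.impressions) and m.cr > n.cr and m.ncr > n.ncr,
        lambda: approx(m.impressions, n.impressions) and m.cr > 2 * n.cr and approx(m.ncr, n.ncr),
        lambda: approx(m.impressions, n.impressions) and approx(m.cr, n.cr) and m.ncr > 2 * n.ncr,
        lambda: (m.impressions > 100 and n.impressions > 100
                 and m.cr > 5 * n.cr and m.ncr > 0.8 * n.ncr),
        lambda: m.cr > 50 * n.cr,  # assumed to compare click rates
    ]
    for number, rule in enumerate(rules, start=1):
        if rule():
            return number
    return None


if __name__ == "__main__":
    url_m = UrlStats(impressions=1000, clicks=200, cr=0.20, ncr=0.25)
    url_n = UrlStats(impressions=1000, clicks=50, cr=0.05, ncr=0.06)
    print(first_matching_rule(url_m, url_n))  # 2
    print(first_matching_rule(url_n, url_m))  # None
```

Calling the function in both directions for each pair of URLs yields either a decided relation or an undecided pair.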
At 502 the first relevance order rule (1) is applied. If a first URL and a second URL satisfy rule (1) then the first URL is determined to be more relevant than the second URL and the flow chart goes to 514. If the first and second URLs do not satisfy rule (1), then the second rule (2) is applied at 504 to the first and second URLs. If the first URL and the second URL satisfy the second rule (2) then the first URL is more relevant than the second URL. If the first and second URLs do not satisfy the second rule (2) then the third rule (3) is applied at 506 to both URLs. A similar application of rules (4) to (6) applies in 508 to 512. The click relevance ordering rules are applied to the different combinations of query/URL pairs for the word query. It is possible that not all relationships between query/URL pairs will be decided by the click relevance ordering rules. In such a situation, a subsequent linear ordering (e.g., 214 of
A more detailed example of mapping the acyclic graph into a linear ordering by a flooding technique is set forth in
In
Once the URL search results are click relevance ranked, the click relevance ranking is used to identify and correct mislabeled URLs in the pre-existing relevance ranking of the word query. Exemplary methods are set forth in
A first embodiment of a method to identify and correct mislabeled query/URL pairs is set forth in
An additional embodiment of a method used to identify and correct mislabeled query/URL pairs is set forth in
An additional embodiment of a method used to identify and correct mislabeled query/URL pairs is set forth in
In a further embodiment of the embodiments shown in
Still another embodiment involves a computer-readable medium comprising processor-executable instructions configured to apply one or more of the techniques presented herein. An exemplary computer-readable medium that may be devised in these ways is illustrated in
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
As used in this application, the terms “component,” “module,” “system”, “interface”, and the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
Although not required, embodiments are described in the general context of “computer readable instructions” being executed by one or more computing devices. Computer readable instructions may be distributed via computer readable media (discussed below). Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions may be combined or distributed as desired in various environments.
In other embodiments, device 1502 may include additional features and/or functionality. For example, device 1502 may also include additional storage (e.g., removable and/or non-removable) including, but not limited to, magnetic storage, optical storage, and the like. Such additional storage is illustrated in
The term “computer readable media” as used herein includes computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data. Memory 1508 and storage 1516 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by device 1502. Any such computer storage media may be part of device 1502.
Device 1502 may also include communication connection(s) 1526 that allow device 1502 to communicate with other devices. Communication connection(s) 1526 may include, but are not limited to, a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 1502 to other computing devices. Communication connection(s) 1526 may include a wired connection or a wireless connection. Communication connection(s) 1526 may transmit and/or receive communication media.
The term “computer readable media” may include communication media. Communication media typically embodies computer readable instructions or other data in a “modulated data signal” such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” may include a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
Device 1502 may include input device(s) 1524 such as keyboard, mouse, pen, voice input device, touch input device, infrared cameras, video input devices, and/or any other input device. Output device(s) 1522 such as one or more displays, speakers, printers, and/or any other output device may also be included in device 1502. Input device(s) 1524 and output device(s) 1522 may be connected to device 1502 via a wired connection, wireless connection, or any combination thereof. In one embodiment, an input device or an output device from another computing device may be used as input device(s) 1524 or output device(s) 1522 for computing device 1502.
Components of computing device 1502 may be connected by various interconnects, such as a bus. Such interconnects may include a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), firewire (IEEE 1394), an optical bus structure, and the like. In another embodiment, components of computing device 1502 may be interconnected by a network. For example, memory 1508 may be comprised of multiple physical memory units located in different physical locations interconnected by a network.
Those skilled in the art will realize that storage devices utilized to store computer readable instructions may be distributed across a network. For example, a computing device 1530 accessible via network 1528 may store computer readable instructions to implement one or more embodiments provided herein. In one configuration, computing device 1530 includes at least one processing unit 1532 and memory 1534. Depending on the exact configuration and type of computing device, memory 1534 may be volatile (such as RAM, for example), non-volatile (such as ROM, flash memory, etc., for example) or some combination of the two. In one embodiment, computer readable instructions to implement one or more embodiments provided herein may be in memory 1534. For example, the memory may comprise a browser 1536 in relation to one or more of the embodiments herein.
Computing device 1502 may access computing device 1530 and download a part or all of the computer readable instructions for execution. Alternatively, computing device 1502 may download pieces of the computer readable instructions, as needed, or some instructions may be executed at computing device 1502 and some at computing device 1530.
Various operations of embodiments are provided herein. In one embodiment, one or more of the operations described may constitute computer readable instructions stored on one or more computer readable media, which if executed by a computing device, will cause the computing device to perform the operations described. The order in which some or all of the operations are described should not be construed as to imply that these operations are necessarily order dependent. Alternative ordering will be appreciated by one skilled in the art having the benefit of this description. Further, it will be understood that not all operations are necessarily present in each embodiment provided herein.
Moreover, the word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word exemplary is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims may generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The disclosure includes all such modifications and alterations and is limited only by the scope of the following claims. In particular regard to the various functions performed by the above described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
Wang, Xuanhui, Chellapilla, Kumar, Mityagin, Anton
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 29 2010 | Microsoft Corporation | (assignment on the face of the patent) | / | |||
Oct 14 2014 | Microsoft Corporation | Microsoft Technology Licensing, LLC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 034544 | /0001 |
Date | Maintenance Fee Events |
May 27 2016 | REM: Maintenance Fee Reminder Mailed. |
Oct 16 2016 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |