A computer-based system and method of determining whether to re-fetch a previously retrieved document across a computer network is disclosed. The method utilizes a statistical model to determine whether the previously retrieved document has likely changed since it was last accessed. The statistical model continuously improves its accuracy by training internal probability distributions to reflect the actual experience with change rate patterns of the documents accessed. The decision of whether to access the document is based on the probability of change compared against a desired synchronization level, random selections, maximum limits on the amount of time since the document was last accessed, and other criteria. Once the decision to access is made, the document is checked for changes and this information is used to train the statistical model.
8. A computer-implemented method for retrieving one document in a plurality of documents from either a remote server or a local cache, the one document having been previously retrieved from the remote server and a copy thereof stored in the cache, the method comprising:
Maintaining historical information representing prior changes to the one document; the historical information representing prior changes to the one document comprises for one document, a change count representing the number of times the one document has been modified, an access count representing the number of times the one document has been retrieved from the remote server, a first access time representing the time the one document was first retrieved from the remote server, and a last access time representing the time the one document was last retrieved from the remote server; Initiating a document retrieval request procedure for retrieving particular documents in the plurality of documents; and
Determining whether to access the one document from the remote server or the local cache, wherein the determination is based on a probabilistic analysis of the historical information representing prior changes to the one document, wherein the probabilistic analysis comprises: Computing a probability that the one document has changed since the one document was last retrieved from the remote server, the probability that the one document has changed since the document was last retrieved from the remote server being computed without examining the one document and
Beginning with a probability that a pre-defined proportion of documents in the plurality of documents has changed, training the probability that the pre-defined proportion of documents has changed using the historical information representing prior changes to the one document to achieve the probability that the one document has changed; Wherein the step of training the probability comprises: creating a timeline using the historical information, the timeline having representations thereon of no change intervals, change intervals, and no change chunk intervals; Training the document probability distribution for each no change interval; Training the document probability distribution for each change interval; and Training the document probability distribution for each no change chunk interval.
1. A computer-readable medium storing computer-executable instructions for retrieving one document in a plurality of documents from either a remote server or a local cache, the one document having been previously retrieved from the remote server and a copy thereof stored in the cache, that, when executed, comprise:
Maintaining historical information representing prior changes to the one document; the historical information representing prior changes to the one document comprises for one document, a change count representing the number of times the one document has been modified, an access count representing the number of times the one document has been retrieved from the remote server, a first access time representing the time the one document was first retrieved from the remote server, and a last access time representing the time the one document was last retrieved from the remote server;
Initiating a document retrieval request procedure for retrieving particular documents in the plurality of documents;
Determining whether to access the one document from the remote server or the local cache, wherein the determination is based on a probabilistic analysis of the historical information representing prior changes to the one document, wherein the probabilistic analysis comprises: Computing a probability that the one document has changed since the one document was last retrieved from the remote server, the probability that the one document has changed since the document was last retrieved from the remote server being computed without examining the one document and
Beginning with a probability that a pre-defined proportion of documents in the plurality of documents has changed, training the probability that the pre-defined proportion of documents has changed using the historical information representing prior changes to the one document to achieve the probability that the one document has changed; Wherein the step of training the probability comprises: creating a timeline using the historical information, the timeline having representations thereon of no change intervals, change intervals, and no change chunk intervals; Training the document probability distribution for each no change interval; Training the document probability distribution for each change interval; and Training the document probability distribution for each no change chunk interval.
15. A system for retrieving one document in a plurality of documents from either a remote server or a local cache, the one document having been previously retrieved from the remote server and a copy thereof stored in the cache, the system comprising:
a processor; and a memory having computer-executable instructions stored thereon, the computer-executable instructions including instructions for:
Maintaining historical information representing prior changes to the one document; the historical information representing prior changes to the one document comprises for one document, a change count representing the number of times the one document has been modified, an access count representing the number of times the one document has been retrieved from the remote server, a first access time representing the time the one document was first retrieved from the remote server, and a last access time representing the time the one document was last retrieved from the remote server;
Initiating a document retrieval request procedure for retrieving particular documents in the plurality of documents; and
Determining whether to access the one document from the remote server or the local cache, wherein the determination is based on a probabilistic analysis of the historical information representing prior changes to the one document, wherein the probabilistic analysis comprises: Computing a probability that the one document has changed since the one document was last retrieved from the remote server, the probability that the one document has changed since the document was last retrieved from the remote server being computed without examining the one document and
Beginning with a probability that a pre-defined proportion of documents in the plurality of documents has changed, training the probability that the pre-defined proportion of documents has changed using the historical information representing prior changes to the one document to achieve the probability that the one document has changed; Wherein the step of training the probability comprises: creating a timeline using the historical information, the timeline having representations thereon of no change intervals, change intervals, and no change chunk intervals; Training the document probability distribution for each no change interval; Training the document probability distribution for each change interval; and Training the document probability distribution for each no change chunk interval.
2. The computer-readable medium of
If the determination to access the one document is positive, identifying the one document for retrieval from the cache during the document retrieval procedure.
3. The computer-readable medium of
4. The computer-readable medium of
5. The computer-readable medium of
6. The computer-readable medium of
The historical information representing prior changes to the one document includes a hash value associated with the one document, the hash value being a representation of the one document; and
The probabilistic analysis includes a comparison of the hash value included in the historical information with another hash value calculated from information retrieved from the one document stored on the remote server each time the one document is retrieved from the remote server.
7. The computer-readable medium of
9. The method of
If the determination to access the one document is positive, identifying the one document for retrieval from the cache during the document retrieval procedure.
10. The method of
11. The method of
12. The method of
13. The method of
The historical information representing prior changes to the one document includes a hash value associated with the one document, the hash value being a representation of the one document; and
The probabilistic analysis includes a comparison of the hash value included in the historical information with another hash value calculated from information retrieved from the one document stored on the remote server each time the one document is retrieved from the remote server.
14. The method of
16. The system of
If the determination to access the one document is positive, identifying the one document for retrieval from the cache during the document retrieval procedure.
17. The system of
18. The system of
19. The system of
20. The system of
The historical information representing prior changes to the one document includes a hash value associated with the one document, the hash value being a representation of the one document; and
The probabilistic analysis includes a comparison of the hash value included in the historical information with another hash value calculated from information retrieved from the one document stored on the remote server each time the one document is retrieved from the remote server.
21. The system of
This application is a continuation of U.S. patent application Ser. No. 09/603,695, now U.S. Pat. No. 6,883,135, filed Jun. 26, 2000 and entitled “Proxy Server Using a Statistical Model”, the contents of which are hereby incorporated by reference in their entirety.
This application is related by subject matter to inventions disclosed in commonly assigned U.S. patent application Ser. No. 09/493,748, now abandoned, filed Jan. 28, 2000, and entitled “ADAPTIVE WEB CRAWLING USING A STATISTICAL MODEL,” the contents of which are hereby incorporated by reference in their entirety.
The present invention relates generally to the field of network information software and, more particularly, to methods and systems for retrieving data from network sites.
In recent years, there has been a tremendous proliferation of computers connected to a global network known as the Internet. A “client” computer connected to the Internet can download digital information from “server” computers connected to the Internet. Client application software executing on client computers typically accepts commands from a user and obtains data and services by sending requests to server applications running on server computers connected to the Internet. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hyper Text Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the “Gopher” document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as “the Web.” The World Wide Web is an information service on the Internet providing documents and links between documents. The World Wide Web is made up of numerous Web sites located around the world that maintain and distribute documents. The location of a document on the Web is typically identified by a document address specification commonly referred to as a Uniform Resource Locator (URL). A Web site may use one or more Web server computers that store and distribute documents in one of a number of formats including the Hyper Text Markup Language (HTML). An HTML document contains text and metadata or commands providing formatting information. HTML documents also include embedded “links” that reference other data or documents located on any Web server computers. The referenced documents may represent text, graphics, or video in respective formats.
A Web browser is a client application or operating system utility that communicates with server computers via FTP, HTTP, and Gopher protocols. Web browsers receive documents from the network and present them to a user. Internet Explorer, available from Microsoft Corporation, of Redmond, Wash., is an example of a popular Web browser application.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
Generally, a proxy server is a server that sits between a secure network, such as a corporate intranet, and a non-secure network, such as the Internet. It processes requests from computers on the intranet for access to resources on the Internet, while limiting or blocking access to the intranet from external computer systems. For efficiency purposes, it may in some cases attempt to fulfill these requests itself.
In a typical proxy server implementation, the proxy server operates to filter requests for Web pages from the corporate intranet to the Internet. Web page requests are routed by the proxy server to the non-secure network and upon receipt of a requested Web page from the non-secure network, the proxy server forwards the Web page to the end user.
Proxy servers are often configured with a local cache area which might be located on a disc drive and in which are stored Web pages that have previously been accessed. Upon receipt of a request for a previously accessed Web page, the proxy server can access the copy of the Web page stored on local disc rather than request the page from the non-secure network.
Thus, the cache contains copies of Web pages, wherein the actual Web pages exist on the non-secure network. Of course, the actual Web pages may, and often do, change. When a Web page on the non-secure network changes, the copy of the Web page stored in cache becomes out-of-date. In order to minimize the probability that an out-of-date Web page will be routed to a user, it is necessary to periodically refresh the cache, i.e., re-fetch the Web page from the non-secure network.
In existing proxy servers, the decision of whether to re-fetch a Web page is made by referencing information stored in the Web page header. Generally, Web page headers may have stored therein an expiration date and a modification time. The expiration date identifies an estimated date after which the Web page can no longer be considered to be current and the modification time identifies the time the Web page was last modified. In existing proxy servers, if a Web page's expiration date has expired, the proxy server issues a request across the non-secure network to forward a new copy of the Web Page if the modification time for the Web page stored on the non-secure network is different than that stored on the proxy server. Thus, if the modification time indicates that the Web page has changed, the Web page on the proxy server is updated.
There are, however, problems presented in relying on header information for making re-fetch decisions. For example, the header information for many Web pages does not include expiration dates and modification times, thereby making it impossible to rely on this information for re-fetch decisions. Additionally, the expiration date, even when present, is not necessarily reliable as it represents only an estimate of when a Web page may be changed. Furthermore, Web page header information is stored with the actual Web pages on the non-secure network. In order to check the modification time for a Web page and make a re-fetch decision, it is necessary to access the modification time across the non-secure network. Making connections over the non-secure network slows the decision process and adds to system overhead.
Therefore, it is desirable to have an improved proxy server. More specifically, it would be a significant improvement in the art to have a mechanism by which a proxy server can selectively access either an original document located across a network or a previously retrieved copy of the document stored locally in cache based in part on the probability that the document has actually changed in some substantive way since it was last accessed. Preferably, such a mechanism will make the decision to access or not to access the original Web document without having to establish a connection with a host server that stores the original of the document. The mechanism would also preferably provide a way to continually improve the accuracy of its decisions to retrieve a document either from cache or across a network based on the actual experience of the proxy server as it tracks changed documents encountered during Web accesses. If a decision is made by the proxy server to access a document across the web as opposed to the copy in cache, the mechanism should provide a way to quickly and accurately determine if the original document has indeed changed. The present invention is directed to providing such a mechanism.
Briefly, the present invention is directed toward remedying these shortcomings by providing an improved proxy server for retrieving data from a computer network. The proxy server employs novel systems and methods to intelligently determine, based in part on a statistical model and prior document retrievals, which documents are most likely to have changed since a previous retrieval and adaptively decide on whether to access a copy of a document stored in cache or to access the original document across a network.
In accordance with an aspect of the invention, each document retrieval request begins with an active probability distribution containing a plurality of probabilities indicative that a document has changed at a given change rate. A history map is maintained by the proxy server that references a number of documents that have previously been accessed. For each referenced document in the history map, a document probability distribution is initialized as a copy of the active probability distribution. The document probability distribution is trained under a statistical model. The training is based on changes to the document experienced by the proxy server during the previous document retrievals. A probability that the document has changed during an interval of interest is then computed based on the document probability distribution and the statistical model. A decision to access or not to access the document is made with the aid of this computed probability.
In accordance with additional aspects of the invention, the document probability distribution is trained for events as experienced with the document upon previous accesses. These events may include “change events” or “no change events.” A change event may be where the document was found to have changed in some substantive manner since the last access of the document. A no change event may be where an access to the document determines that the document has not changed. A no change event determination may be made in many ways, such as by evaluating a time stamp associated with the document, or if no substantive change is found when a hash value of the currently retrieved document matches a hash value of the previously retrieved document. Events such as “no change chunk events” may also be interpolated from experienced events, as is described in detail below.
The probability that the document has changed (the “document change probability”) is computed based on the document probability distribution. A bias is then computed based on the document change probability in conjunction with a synchronization level. The synchronization level may be a predefined value that specifies the percentage of documents that are expected to be synchronized at any given time. A decision whether to access the document is made based on a “coin-flip” using the computed bias.
In accordance with further aspects of the invention, the methods and systems of the present invention conserve computer resources by balancing the need for accuracy in the statistical model against the computer storage and computing resources available. In an actual embodiment of the invention, a minimal amount of historical information is maintained for each document in a history map. This historical information is used by the methods and systems of the present invention to interpolate change events, no change events, and no change chunk events by mapping data recorded in the history map to a timeline. From the interpolation, the variables required by the statistical model can be determined with reasonable accuracy, given the limited resources available to the proxy server and the need for speedy processing when conducting a document retrieval.
In accordance with still further aspects of the invention, when a proxy server in accordance with the invention first begins operating, a training probability distribution is initialized to essentially zero by multiplying a copy of a base probability distribution (containing a starting point estimate of probabilities that a document will change at a given change rate) by a small diversity factor. The training probability distribution recursively accumulates the document probability distribution for each document that is retrieved across the network. By summing each probability in the training probability distribution with a corresponding probability from each document probability distribution, the training probability distribution represents the accumulated experience associated with the document probability distributions for all documents processed. Periodically, the training probability distribution is stored and used as the active probability distribution for future document retrievals. This feed-back of the training probability distribution into the active probability distribution provides for a constantly-improving statistical model for determining whether to retrieve a document from cache or across the Internet.
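By way of illustration only, the accumulation of the training probability distribution described above might be sketched as follows. The function names, the value of the diversity factor, and the normalization applied before the training distribution is reused as the active distribution are assumptions made for this sketch; the description does not specify them.

```python
from typing import List


def init_training(base: List[float], diversity_factor: float = 1e-6) -> List[float]:
    """Start the training distribution at essentially zero.

    The diversity factor value is an assumption; the text only calls it 'small'.
    """
    return [p * diversity_factor for p in base]


def accumulate(training: List[float], document_dist: List[float]) -> None:
    """Add one document's probability distribution into the training distribution."""
    if len(training) != len(document_dist):
        raise ValueError("distributions must cover the same change rates")
    for i, p in enumerate(document_dist):
        training[i] += p


def to_active(training: List[float]) -> List[float]:
    """Turn the accumulated sums into a distribution that sums to 1.

    Normalizing here is an assumption of this sketch, not a stated step.
    """
    total = sum(training)
    return [p / total for p in training]


# Hypothetical values for three sample change rates.
base = [0.25, 0.25, 0.5]
training = init_training(base)
accumulate(training, [0.7, 0.2, 0.1])
accumulate(training, [0.5, 0.3, 0.2])
active = to_active(training)
```

Because the starting values are scaled down by the diversity factor, the accumulated experience dominates the result after only a few documents, while no probability ever starts at exactly zero.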
In accordance with other aspects of this invention, a secure hash function is used to determine a hash value corresponding to each previously retrieved document. The hash value is stored in a history map and is used in subsequent document retrievals to determine whether the corresponding current document is modified. A secure hash function may be used to obtain a new hash value, which is compared with the hash value for the previously retrieved document data. If the hash values are equal, the current document is considered to be substantively equivalent to the previously retrieved document data. If the hash values differ, the current document is considered to be modified and a change counter is incremented for the document. An access counter may also be incremented each time a network access is attempted for the current document.
As will be readily appreciated from the foregoing description, a system and method formed in accordance with the invention minimizes the re-fetching of documents across a network. Thus, the invention provides a proxy server that responds to document requests in less time and with greater efficiency.
Other features of the invention are further apparent from the following detailed description of presently preferred exemplary embodiments of the invention taken in conjunction with the accompanying drawings, of which:
Overview
The present invention is directed to improved computer-based systems and methods for determining whether to retrieve a copy of a document from cache or to re-fetch the original document across a network. The systems and methods employ a statistical model and data collected from past retrievals to adaptively decide whether or not to re-fetch a document. Specifically, the system maintains an active probability distribution that contains a plurality of probabilities indicative that a document has changed at a given change rate. The system further maintains a history map having data stored therein for the documents that have previously been fetched and now residing in cache. For each document having an entry in the history map, a document probability distribution is initialized as a copy of the active probability distribution. The document probability distribution is revised or “trained” using the data in the history map that specifies changes to the document experienced during previous retrievals. A probability that the document may have changed is calculated based upon this “trained” document probability distribution and a statistical model, which in one embodiment is based upon a Poisson distribution. A decision to access or not to access the document is made with the aid of this computed probability that the document may have changed.
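As a rough illustration of what “training” a document probability distribution under a Poisson model could look like, the sketch below applies a Bayesian update for a single observed event. This update rule is an assumption made for illustration and is not stated in the passage above: a “no change” observation over an interval `dt` weights each sample rate `r` by `exp(-r * dt)`, a “change” observation weights it by `1 - exp(-r * dt)`, and the result is renormalized. The names `train`, `rates`, and `dt` are hypothetical.

```python
import math
from typing import List


def train(dist: List[float], rates: List[float], dt: float, changed: bool) -> List[float]:
    """Return a new distribution updated for one observed event (assumed Bayes rule)."""
    if len(dist) != len(rates):
        raise ValueError("dist and rates must have the same length")
    posterior = []
    for p, r in zip(dist, rates):
        no_change = math.exp(-r * dt)  # Poisson probability of zero changes in dt
        likelihood = (1.0 - no_change) if changed else no_change
        posterior.append(p * likelihood)
    total = sum(posterior)
    if total == 0.0:
        return list(dist)  # no information gained; keep the prior
    return [p / total for p in posterior]


# Hypothetical sample rates (changes per day): one fast, one nearly static.
rates = [1.0, 0.01]
active = [0.5, 0.5]
after_no_change = train(active, rates, dt=5.0, changed=False)
after_change = train(active, rates, dt=5.0, changed=True)
```

Under this reading, a long interval without change shifts weight toward the low change rates, and an observed change shifts weight toward the high ones.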
Prior to explaining the details of the invention, it is useful to provide a description of a suitable exemplary environment in which the invention may be implemented.
Exemplary Operating Environment
1. A Computer Environment
With reference to
A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37 and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.
The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in
When used in a LAN networking environment, the personal computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
2. A Network Environment
The computer network 216 may be a local area network 51 (
A client computer 214, such as the personal computer 20 (
When a user at the client computer 214 desires to retrieve one or more documents that may be located, for example, on server 218, the client computer transmits a request to a server computer 204. Proxy server 206 handles the request. If the document has previously been retrieved and stored in cache 208, proxy server 206 determines, as described in detail below, whether to retrieve the document from cache 208 or to re-fetch the document across network 216 from remote server 218.
As will be readily understood by those skilled in the art of computer network systems, and others, the system illustrated in
The history map further includes a hash value 416 corresponding to each document identified in the history map. A hash value results from applying a “hash function” to the document. A hash function is a mathematical algorithm that transforms a digital document into a smaller representation of the document (called a “hash value”). A “secure hash function” is a hash function that is designed so that it is computationally infeasible to find two different documents that “hash” to produce identical hash values. A hash value produced by a secure hash function serves as a “digital fingerprint” of the document. “MD5,” published by RSA Laboratories of Redwood City, Calif. in RFC 1321, is one such secure hash function suitable for use in conjunction with the present invention.
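For illustration, a hash value of the kind described above can be computed with Python's standard `hashlib` module. The function name `fingerprint` and the sample documents are hypothetical.

```python
import hashlib


def fingerprint(document_bytes: bytes) -> str:
    """Return the MD5 digest of a document as a 32-character hex string."""
    return hashlib.md5(document_bytes).hexdigest()


# Identical content yields identical hash values; any edit yields a different one.
previous = fingerprint(b"<html>hello</html>")
current = fingerprint(b"<html>hello</html>")
edited = fingerprint(b"<html>hello!</html>")
unchanged = previous == current
changed = previous != edited
```

MD5 is used here only because the text names it. It is no longer regarded as collision-resistant, so a sketch like this suits change detection rather than any security purpose.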
Historical information such as the first access time 422, the last access time 424, the change count 426, and the access count 428 are used in a statistical model for deciding if a document should be re-fetched across the network 216 or retrieved from cache 208, as is discussed below with reference to
If at step 504, it is determined that the requested document is stored in cache, at step 510 it is determined whether the document in cache 208 has “expired,” indicating that the document needs to be re-fetched across the network. The process for making this decision is described in detail below with reference to
If, at step 612, it is determined that an entry exists in history map 308 for the re-fetched document, at step 616, hash value 416 of the corresponding document is located in history map 308 and compared with the new hash value calculated at step 610. If the two values are equal, indicating that the filtered data corresponding to the newly retrieved document is the same as the filtered data corresponding to the previously retrieved version of the document, processing continues to step 622. If, at step 616, the hash values are not equal, indicating that the document has changed, at step 618, the change count 426 is incremented and hash value 416 is set equal to the new value calculated at step 610. The change made to the change count 426 indicates that the document was found to have changed in a substantive way. At step 622, the last access time 424 is set to the current time and the access count 428 is incremented.
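The bookkeeping of steps 612 through 622 might be sketched as follows. The `HistoryEntry` class, the `record_refetch` function, and the handling of a document with no existing entry are assumptions made for this sketch; only the field names follow the history map described above.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HistoryEntry:
    hash_value: str
    first_access_time: float
    last_access_time: float
    change_count: int = 0
    access_count: int = 1


def record_refetch(history: Dict[str, HistoryEntry], url: str,
                   filtered_data: bytes, now: Optional[float] = None) -> bool:
    """Update the history map after a re-fetch; return True if a change was found."""
    now = time.time() if now is None else now
    new_hash = hashlib.md5(filtered_data).hexdigest()
    entry = history.get(url)
    if entry is None:
        # Assumption: a first retrieval creates the entry with an access count of 1.
        history[url] = HistoryEntry(new_hash, now, now)
        return False
    changed = entry.hash_value != new_hash
    if changed:
        # Hash values differ: record a substantive change and keep the new hash.
        entry.change_count += 1
        entry.hash_value = new_hash
    # In either case the access is recorded.
    entry.last_access_time = now
    entry.access_count += 1
    return changed


history: Dict[str, HistoryEntry] = {}
record_refetch(history, "http://example.com/", b"version 1", now=100.0)
record_refetch(history, "http://example.com/", b"version 2", now=200.0)
```

Keeping only four counters and one hash per document reflects the stated goal of maintaining a minimal amount of historical information.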
Turning to
A method of an actual embodiment of the invention for estimating a set of starting values for the base probability distribution is illustrated in
Expressed in this way, the base probability distribution, and all probability distributions that descend from it, represent the probability that the document will change at a given rate, over a plurality of sample rates. It will be apparent to one skilled in the art that there are many ways to estimate and express initial base probability distributions while remaining within the spirit and scope of the present invention. For instance, the initial probability rates may be set to anything from normalized random numbers to actual probability rates determined experimentally over time.
Returning to
Where High is the highest expected change rate, Low is the lowest expected change rate, and N is the number of samples, or resolution. The final change rate (RN) in the change rate distribution is assigned a change rate that is low enough that the document is essentially considered static. Although one actual method for selecting sample rates has been described here, those skilled in the art will appreciate that any number of ways are available for selecting a sample rate and each may be employed by the mechanism of the invention without deviating from the spirit or scope of the invention.
As will be explained below, the active probability distribution serves as the starting point for evaluating a request for a document by providing a starting value for document probability distributions. The active probability distribution is initialized to the value of the base probability distribution. Thus, returning to
It should be noted, however, that the above values are initial values that are set when proxy server 206 first begins operation. During the operation of proxy server 206, the training probability distribution changes as described below with reference to
At decision step 816, a determination is made whether a predefined maximum amount of time has expired since the last time the document was accessed. In other words, the present invention optionally provides a mechanism to ensure that a document is retrieved across network 216 after a certain amount of time regardless of whether the document may have changed. If the time has expired, the process continues to step 812 where a “cache expired” response is returned. If not, the process continues to step 818.
At step 818, a document probability distribution is calculated for the document being processed. The calculation of the document probability distribution is illustrated in
At step 822, a weighted sum of the probabilities in the document probability distribution is taken according to the Poisson model, with DT equal to the time since the last access of the document (i.e., PNC=DPD[1]*e^(−R[1]*DT)+DPD[2]*e^(−R[2]*DT)+ . . . +DPD[n]*e^(−R[n]*DT)). The weighted sum thus computed is the probability that the document has not changed (PNC). The probability that the document has changed (PC) is the complement of PNC (PC=1−PNC).
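By way of illustration, the weighted sum of step 822 may be sketched as follows. The function name `probability_changed` and the use of plain lists for the document probability distribution and the sample rates are assumptions made for the example.

```python
import math


def probability_changed(dpd, rates, dt):
    """Return PC = 1 - PNC, where PNC = sum(DPD[i] * e^(-R[i] * DT))."""
    if len(dpd) != len(rates):
        raise ValueError("dpd and rates must have the same length")
    # Probability that the document has not changed, weighted over all sample rates.
    pnc = sum(p * math.exp(-r * dt) for p, r in zip(dpd, rates))
    return 1.0 - pnc


# Two sample rates: essentially static (rate 0) and a rate of ln 2 per unit time.
pc = probability_changed([0.5, 0.5], [0.0, math.log(2)], dt=1.0)
# PNC = 0.5 * 1 + 0.5 * 0.5 = 0.75, so PC is approximately 0.25
```

When DT is zero, every exponential term equals one, so PC is zero whenever the distribution sums to one.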
At step 824, a probability that the document will be accessed (PA) may be optionally computed and biased by both a specified synchronization level (S) and the probability that the document has changed (PC). In other words, this embodiment of the invention optionally allows the ultimate decision whether to retrieve a document to be biased by a synchronization level, specified by a system administrator. By adjusting the synchronization level, a system administrator may bias the likelihood of retrieving documents in accordance with the administrator's tolerance for having unsynchronized documents. Thus, using the formula PA=1−((1−S)/PC), where S is the desired synchronization level and PC is the probability that the document has changed as calculated in step 822, a probability (PA) that the document should be accessed is calculated.
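The formula of step 824 may be sketched as follows. The function name `access_probability` is hypothetical. The clamping of the result to the range 0 to 1, and the return of 0 when PC is zero, are assumptions added for the example so that the value can serve as a coin flip bias; the description above states only the formula itself.

```python
def access_probability(s, pc):
    """Return PA = 1 - ((1 - S) / PC), clamped to the range [0, 1]."""
    if pc <= 0.0:
        # Assumption: no chance of change means no reason to access.
        return 0.0
    pa = 1.0 - (1.0 - s) / pc
    # Assumption: clamp, since the raw formula is negative when PC < 1 - S.
    return min(1.0, max(0.0, pa))


pa = access_probability(s=0.9, pc=0.5)  # 1 - (0.1 / 0.5), approximately 0.8
```

Under this reading, a synchronization level of 1 always yields a probability of access of 1 for any nonzero PC, and lower synchronization levels suppress accesses to documents that are unlikely to have changed.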
At step 826, a coin flip is generated with a “heads” bias equal to the probability of access (PA) computed in step 824. A decision is made to either “access” or “not access” the document across the network based on the result of this coin flip. The coin flip is provided because it may be desirable to add a random component to the retrieval of documents in order to strike a balance between the conservation of resources and ensuring document synchronization. The bias PA calculated at step 824 is applied to the coin flip to influence the outcome in favor of the likelihood that the document has changed, modified by the desired synchronization level. The outcome of the coin flip is passed to decision step 830.
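A biased coin flip of the kind described at step 826 may be sketched as follows. The function name `should_access` and the injectable random number generator are assumptions made for the example.

```python
import random


def should_access(pa, rng=random):
    """Return True ("heads") with probability pa, otherwise False."""
    # rng.random() is uniform on [0, 1), so pa = 1 always yields True
    # and pa = 0 always yields False.
    return rng.random() < pa


# A seeded generator makes the sequence of outcomes repeatable for testing.
rng = random.Random(42)
heads = sum(should_access(0.8, rng) for _ in range(10000))
```

Passing the generator as a parameter is a design choice of the sketch that allows the random component to be controlled in tests.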
At decision step 830, if the outcome of the coin flip is “heads”, the instruction “cache expired” is returned at step 812. Otherwise, the instruction “cache not expired”, indicating that the document should be retrieved from cache, is returned at step 832. Following steps 812 or 832, the process of
FIGS. 10A1-2 illustrate a process for training the document probability distribution. At step 1010, the accesses 428 to a document from history map 308 are mapped to a timeline. One example of such a timeline is illustrated in
At step 1012, the process assumes that the amount of time between each change (identified by the change count 426) is uniform. Thus, the changes are evenly distributed on the timeline. The information necessary for the application of the Poisson process can be derived from the mapping of the changes to the timeline. The process continues from step 1012 to step 1014.
At step 1014, several variables are calculated from the historical information in each entry 410 of history map 308 for use in the training of the document probability distribution. The average time between accesses (intervals) is computed and stored as the interval time (DT). The number of intervals between changes is calculated (NC). The number of intervals in which a change occurred is calculated (C). A group of intervals between changes is termed a “no change chunk.” Accordingly, the number of no change chunks (NCC) is calculated. And, finally, the length of time of each no change chunk (DTC) is calculated.
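One possible reading of step 1014 may be sketched as follows. The description leaves several details open, so the sketch rests on labeled assumptions: the number of intervals is taken to be the access count minus one; each change is taken to occupy one interval; the no change intervals are divided evenly among the changes to form chunks; chunks are formed only when at least two intervals fit in each; and NC is taken to count only the no change intervals that are left to be trained individually. The names `TrainingVariables` and `training_variables` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrainingVariables:
    dt: float   # DT: average time between accesses
    c: int      # C: intervals in which a change occurred
    nc: int     # NC: no change intervals trained one at a time (assumed meaning)
    ncc: int    # NCC: number of no change chunks
    dtc: float  # DTC: length of time of each no change chunk


def training_variables(first_access_time, last_access_time, access_count, change_count):
    intervals = access_count - 1  # assumption: N accesses bound N - 1 intervals
    if intervals < 1:
        raise ValueError("at least two accesses are needed to form an interval")
    dt = (last_access_time - first_access_time) / intervals
    c = min(change_count, intervals)  # assumption: at most one change per interval
    no_change = intervals - c
    # Assumption: changes are evenly spaced, so each gap holds the same
    # whole number of no change intervals.
    per_chunk = no_change // c if c else 0
    if per_chunk < 2:
        # Assumption: a chunk of fewer than two intervals is not worth forming.
        return TrainingVariables(dt, c, no_change, 0, 0.0)
    ncc = c
    leftover = no_change - ncc * per_chunk  # remainder trained individually
    return TrainingVariables(dt, c, leftover, ncc, per_chunk * dt)


v = training_variables(0.0, 100.0, access_count=11, change_count=3)
# Ten intervals of length 10: 3 change intervals, 3 chunks of 2 intervals, 1 left over.
```

In this example the trained intervals account for the whole timeline: 3 change intervals of length 10, 3 chunks of length 20, and 1 remaining no change interval of length 10.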
An event probability distribution for a no change event is computed in a step 1030. The event probability distribution includes a plurality of probabilities (EP[N]) that the event will occur at a given change rate (N) for the interval (DT) experienced with the no change events. Each probability EP[N] is computed using the Poisson process: EP[N]=e^(−R[N]* DT) where e is the transcendental constant used as the base for natural logarithms, R[N] is the rate of change and DT is the time interval of the event. At step 1032, the event probability distribution EP[N] calculated at step 1030 is passed to a process for training the document probability distribution for the no change events. The operations performed by the process to train the document probability distribution for each no change event are illustrated in detail in FIG. 10C1-2 and described below.
At a step 1033, an event probability distribution for a change event is computed. The event probability distribution includes a plurality of probabilities (EP[N]) that the event will occur at a given change rate (N) for the interval (DT) experienced with the change events. Each probability EP[N] is computed using the Poisson process: EP[N]=1−e^(−R[N]* DT). Alternatively, the event probability distribution may be calculated by taking the complement of each probability in the event probability distribution calculated for the no change events (as calculated in step 1030). At step 1034, the event probability distribution EP[N] calculated at step 1033 is passed to a process for training the document probability distribution for the change events. As mentioned above, the operations performed by the process to train the document probability distribution are illustrated in detail in FIG. 10C1-2 and described below.
At a step 1035, an event probability distribution for a no change chunk event is computed. The event probability distribution includes a plurality of probabilities (EP[N]) that the no change chunk event will occur at a given change rate (N) for the interval (DTC) interpolated for the no change chunk events. Each probability EP[N] is computed using the Poisson process: EP[N]=e^(−R[N]*DTC). At step 1038, the event probability distribution EP[N] calculated at step 1035 is passed to a process for training the document probability distribution for the no change chunk events, as illustrated in detail in FIG. 10C1-2.
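The three event probability distributions of steps 1030, 1033, and 1035 may be sketched as follows. The function names and the use of a plain list of sample rates are assumptions made for the example.

```python
import math


def no_change_event_probabilities(rates, interval):
    """EP[N] = e^(-R[N] * interval), for no change and no change chunk events."""
    return [math.exp(-r * interval) for r in rates]


def change_event_probabilities(rates, interval):
    """EP[N] = 1 - e^(-R[N] * interval), the complement of the no change case."""
    return [1.0 - p for p in no_change_event_probabilities(rates, interval)]


rates = [0.0, math.log(2)]                                # illustrative sample rates
ep_no_change = no_change_event_probabilities(rates, 1.0)  # interval DT
ep_change = change_event_probabilities(rates, 1.0)        # interval DT
ep_chunk = no_change_event_probabilities(rates, 2.0)      # interval DTC
```

The no change chunk case uses the same expression as the no change case and differs only in the interval supplied (DTC rather than DT).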
In summary, at step 1032, the document probability distribution is trained for each no change interval. At step 1034, the document probability distribution is trained for each change interval. And at step 1038, the document probability distribution is trained for each no change chunk interval. The order in which the events/intervals are trained in steps 1032, 1034, and 1038 is believed to be immaterial. Once the document probability distribution is completely trained, the process of
In general, an interval 1020 that does not contain a change event 1019 is considered to contain a no change event 1021. Since a longer interval period has a significant effect on the probability calculated by the Poisson equation, no change intervals occurring between adjacent change intervals may be grouped into “no change chunks” 1028. A no change chunk 1028 is a group of no change intervals, which may be used to calculate a chunk time interval (DTC). In cases where there is a remainder of no change intervals which cannot be evenly placed into a no change chunk 1028, the remainder intervals are treated as no change intervals 1021 and are used to train the document probability distribution separately. It should be appreciated that although one actual embodiment is described here for mapping events onto a timeline, there are many other, equally acceptable ways for mapping events onto a timeline. Accordingly, the present invention is not limited to the specific examples described here.
FIGS. 10C1-2 illustrate one exemplary process for training the document probability distribution for occurrence of an event for each passed event type (e.g., no change event, change event and no change chunk event). Beginning with step 1050, each occurrence of an event type (e.g., C, NC, NCC) is trained. At step 1052, the probability of the event occurring is computed by summing the results of multiplying each probability in the document probability distribution (given a particular change rate) by the corresponding probability that the event has occurred (given a particular change rate): i.e., P=SUM(DPD[i]*EP[i]). This probability P is checked against a minimum probability constant that is set by the system administrator. If the probability P is less than the minimum probability value, a decision step 1054 directs the process to set P to the minimum probability value in a step 1056.
Once checked by decision step 1054 and the value of P reset, if necessary, each probability in the document probability distribution is updated at step 1058 by multiplying each probability in the (old) document probability distribution by a corresponding probability in the event probability training distribution and dividing the result by the probability of the event occurring, i.e., DPD[N]=(DPD[N]*EP[N])/P.
The document probability distribution resulting from step 1058 is checked in a decision step 1060 for an adequate normalization, by determining if the sum of the probabilities in the document probability distribution deviates from a total of 100% by more than a predetermined normalization threshold constant. If the normalization threshold constant is exceeded, the document probability distribution is normalized in a step 1062.
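One training pass of steps 1052 through 1062 may be sketched as follows. The function name `train_on_event` and the default values chosen for the minimum probability constant and the normalization threshold constant are assumptions made for the example; the description leaves both constants to be configured.

```python
def train_on_event(dpd, ep, min_probability=1e-6, normalization_threshold=1e-9):
    """Return the document probability distribution updated for one event."""
    # Step 1052: probability of the event, P = SUM(DPD[i] * EP[i]).
    p = sum(d * e for d, e in zip(dpd, ep))
    # Steps 1054 and 1056: keep P from falling below the minimum probability.
    p = max(p, min_probability)
    # Step 1058: DPD[N] = (DPD[N] * EP[N]) / P.
    updated = [d * e / p for d, e in zip(dpd, ep)]
    # Steps 1060 and 1062: normalize only if the sum has drifted too far from 1.
    total = sum(updated)
    if total > 0.0 and abs(total - 1.0) > normalization_threshold:
        updated = [u / total for u in updated]
    return updated


# A no change event makes the slower change rate more probable.
new_dpd = train_on_event([0.5, 0.5], [1.0, 0.5])
# P = 0.75, so the result is approximately [0.667, 0.333]
```

The update has the form of Bayes' rule, with the old document probability distribution acting as the prior and the event probability distribution as the likelihood of the observed event at each sample rate. When P is not clamped, the updated probabilities already sum to one, so normalization is needed only after clamping or accumulated rounding error.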
At a step 1064, if there is another event to train the document probability distribution for, the process control is passed back to step 1050 (
Thus, as described above, the present invention provides systems and methods for determining whether to retrieve a document from cache or to re-fetch the document across a network. The systems and methods employ a statistical model and data collected from past retrievals to adaptively decide whether or not to re-fetch a document. These aspects of the invention provide for a proxy server that is faster and more efficient than existing systems.
Those skilled in the art understand that computer readable instructions for performing the above described processes can be generated and stored on a computer readable medium such as a magnetic disk or CD-ROM. Further, a computer such as that described with reference to
While the invention has been described and illustrated with reference to specific embodiments, those skilled in the art will recognize that modifications and variations may be made without departing from the principles of the invention as described above and set forth in the following claims. In particular, the invention may be employed in virtually any situation wherein it is necessary to either retrieve a document from cache or from another location. Further, while the invention has been described with reference to a Poisson distribution, other statistical models might also be used. Accordingly, reference should be made to the appended claims as indicating the scope of the invention.
Meyerzon, Dmitriy, Obata, Kenji
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Nov 05 2004 | Microsoft Corporation | (assignment on the face of the patent) | ||||
Oct 14 2014 | Microsoft Corporation | Microsoft Technology Licensing, LLC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 034766 | 0001 |
Date | Maintenance Fee Events |
May 24 2013 | REM: Maintenance Fee Reminder Mailed. |
Oct 13 2013 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |