In a method for acquiring statistical information from data, an initial cumulative distribution function (cdf) that characterizes an initial set of data is acquired. The acquisition of this cdf comprises acquiring a set of quantile endpoints that define the cdf. At least one additional cdf, which characterizes a further set of data, is also acquired. Information that describes the initial cdf is combined with information that describes one or more additional cdfs, and the result is used to obtain a composite cdf that describes a combined set of data that includes the initial data set and the one or more further data sets. Then a new set of quantile endpoints that defines the composite cdf is determined. The sequence of steps described above is repeated at least once more. The previously obtained composite cdf is used as the initial cdf for each repetition of this sequence.
1. A method, comprising:
a) acquiring an initial cumulative distribution function (cdf) that characterizes an initial set of data;
b) acquiring at least one additional cdf that characterizes a further set of data; and
c) combining information that describes the initial cdf with information that describes at least one said additional cdf, thereby to obtain a composite cdf that describes a combined set of data that includes the initial data set and at least one said additional data set;
characterized in that step (a) comprises acquiring a set of quantile endpoints that define the initial cdf, and the method further comprises:
d) determining a set of quantile endpoints that define the composite cdf; and
e) repeating steps (a)-(d) at least once more, wherein each repetition of step (a) takes as the initial cdf the composite cdf obtained in the most recent execution of step (c).
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
(b) comprises receiving at least two agent records, each of which contains or represents a set of data, and computing one or more cdfs that collectively characterize the received agent records; and (c) is carried out so as to obtain a composite cdf that describes a combined set of data that includes the initial data set together with the data set contained in or represented by each agent record.
7. The method of
8. The method of
9. The method of
10. The method of
11. The method of
12. The method of
This invention relates to methods for deriving statistical information from a stream of data, and for updating that information.
There are many practical problems of data collection in which it is useful to summarize large volumes of data in a fast and reliable manner, while preserving as much information as possible. One class of problems of that kind relates to the collection of data that reflect the performance of a communication network. One example from that class of problems is the problem of collecting and summarizing the length of time to complete each of a sequence of transactions that take place on a network. In the case, e.g., of e-mail transaction times, the problem is made more difficult by the fact that the data arrive in small increments at random times, and because it is often desirable to reserve for processing and storing the data an amount of memory that is small relative to the volume of data to be processed.
Those of skill in the art have long been acquainted with the histogram as a means for summarizing statistical data. To create a histogram, the practitioner marks off endpoints along an axis that corresponds to the incoming data; i.e., to the measured values of the statistic that is to be characterized. Below, we will refer to the incoming measured values as scores, and to the corresponding axis as the data axis. Accordingly, the endpoints referred to above are marked off along the score axis. Each pair of successive endpoints defines an interval. The height of the histogram within each interval, measured along a probability axis perpendicular to the data axis, is proportional to the number of scores that fall within that interval. Below, we will find it convenient to refer to each such interval as a bucket.
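The bucket-counting procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the half-open-interval convention are assumptions, not taken from the text.

```python
from bisect import bisect_left

def histogram_counts(scores, endpoints):
    """Count how many scores fall in each bucket defined by successive endpoints.

    Buckets are the intervals (endpoints[i], endpoints[i+1]]; the left edge of
    the first bucket is included in it. Scores outside the overall range are
    ignored.
    """
    counts = [0] * (len(endpoints) - 1)
    for x in scores:
        if endpoints[0] <= x <= endpoints[-1]:
            # bisect_left finds the first endpoint >= x; clamp so the
            # leftmost endpoint itself lands in the first bucket
            i = max(bisect_left(endpoints, x), 1)
            counts[i - 1] += 1
    return counts
```

The height of each histogram bar is then proportional to the corresponding count.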
When a histogram is based on an exhaustive set of data, it can dependably represent the statistical distribution of those data. However, if the histogram is based on only a partial set of data, it might not dependably represent the full population from which that partial set was taken. In particular, a histogram based on an initial portion of a stream of data might differ substantially from a histogram based on a longer initial portion or on a subsequent portion of the data stream.
When a stream of data arrives over time, it is often most convenient to characterize the arriving data by taking an initial sequence of data values, creating a histogram, and then updating the histogram using further data values taken from subsequently arriving data. Such a procedure is especially useful when the amount of computer memory available for processing and storing the data is limited.
The quality of a histogram depends on its ability to model the population of data that it is based on, and on its ability to preserve statistical information about that population. In both of these aspects, the quality of a histogram is affected by the setting of the endpoints that define the respective buckets, and also by the procedure used to update the histogram using later-arriving data.
In the statistical study of network performance, among other fields, there has been a recognized need for methods of processing data streams to characterize the data more reliably without sacrificing useful statistical information.
We have invented an improved method for acquiring statistical information from data, which may, e.g., be already accumulated data or data that are arriving as a data stream. According to the inventive method, an initial cumulative distribution function (CDF) that characterizes an initial set of data is acquired. The acquisition of this CDF comprises acquiring a set of quantile endpoints that define the CDF. The quantile endpoints are endpoints that could be marked off along the data axis in such a way that a defined fraction of the sampled scores would lie within each corresponding bucket.
At least one additional CDF, which characterizes a further set of data, is also acquired. Information that describes the initial CDF is combined with information that describes one or more additional CDFs, and the result is used to obtain a composite CDF that describes a combined set of data that includes the initial data set and the one or more further data sets. Then a new set of quantile endpoints that defines the composite CDF is determined. The sequence of steps described above is repeated at least once more. The previously obtained composite CDF is used as the initial CDF for each repetition of this sequence.
It should be noted that in practice, the ratio of the number of scores in each interval to the total number of scores might not be precisely equal across all intervals. However, it will suffice, in general, to choose the intervals in such a way that shifting any one of them will shift at least one such ratio farther from the chosen value.
If the defined quantile levels are equally spaced along the probability axis, the resulting filled buckets will all be of equal probability, because each interval between quantiles will represent an equal fraction of all of the accumulated data. That is the case illustrated in FIG. 3. It should be noted, however, that in practice there are often good reasons to space the defined quantile levels unequally along the probability axis. In such cases, the buckets will not all be of equal probability.
We have found that for characterizing data such as those of
At block 40, the total number N of scores currently stored in the D Buffer is read. At block 50, the raw scores X1, . . . , XN are read from the D Buffer.
At block 60, a function FQ(x), referred to here as the "provisional CDF," is defined. The variable x runs along the data axis. FQ(x) is defined with reference to the probability levels pm and the endpoints Qm according to the following rules:
For x=Qm, m=1, . . . , M, FQ(x)=pm.
For intermediate values of x, i.e., for values of x between Qm-1 and Qm, the value of FQ(x) is determined by interpolating between the value at Qm-1 and the value at Qm.
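The provisional CDF of block 60 can be sketched as follows. Plain linear interpolation is used here for simplicity; as noted below, the text also contemplates interpolating linearly with respect to a transform g of the probability levels.

```python
def provisional_cdf(x, Q, p):
    """Evaluate F_Q at x, where F_Q(Q[m]) = p[m] for m = 0..M-1 and values
    between successive endpoints are obtained by linear interpolation.
    Outside [Q[0], Q[-1]] the function is clamped to the end probabilities
    (an assumption of this sketch)."""
    if x <= Q[0]:
        return p[0]
    if x >= Q[-1]:
        return p[-1]
    for m in range(1, len(Q)):
        if x <= Q[m]:
            # linear interpolation between (Q[m-1], p[m-1]) and (Q[m], p[m])
            t = (x - Q[m - 1]) / (Q[m] - Q[m - 1])
            return p[m - 1] + t * (p[m] - p[m - 1])
```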
The expression "max(0, x+5)" in the preceding formula is present because it is assumed that g-1(x) is defined for all real x. Interpolation is then done linearly with respect to g(pm).
At block 70, a function Fx(x) is computed from the raw scores in the D Buffer. The function Fx(x) approximates the statistical distribution of the data from the data stream during the period between the previous update and the present update. In the absence of other information, Fx(x) will be computed as the empirical cumulative distribution according to well-known statistical procedures. However, it should be noted that our method is also advantageously practiced using alternative methods for estimating Fx(x). For example, the estimate of Fx(x) could be based on knowledge of the changing nature of the data stream. Such knowledge can be incorporated, for example, when Fx(x) is estimated as a parametric distribution described by a set of updateable parameters.
In the illustrative embodiment described here, Fx(x) is the empirical cumulative distribution. Accordingly, in the following discussion, Fx(x) will be referred to for convenience as the "empirical CDF." However, the use of that term does not limit the scope of the possible alternative forms and definitions that Fx(x) might take.
Fx(x) is defined, for a given x, as 1/N times the total number of scores Xn that are less than or equal to x. An example B of an empirical CDF Fx(x) is included in FIG. 7. It will be apparent from
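The empirical CDF of block 70 reduces to a one-line computation; this sketch assumes the buffer is non-empty.

```python
def empirical_cdf(x, scores):
    """Fraction of the buffered scores that are less than or equal to x,
    i.e. (1/N) times the count of scores <= x."""
    return sum(1 for s in scores if s <= x) / len(scores)
```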
At block 80, a further CDF, denoted F(x), is computed as a weighted average of the provisional CDF and the empirical CDF. The weight given to the provisional CDF is proportional to T, and the weight given to the empirical CDF is proportional to N. That is, F(x) is defined by:
F(x)=[T·FQ(x)+N·Fx(x)]/(T+N).
The above averaging procedure is illustrated in FIG. 7.
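The averaging of block 80 can be sketched directly from the stated weights, assuming the combining rule F(x) = [T·FQ(x) + N·Fx(x)]/(T+N) that those weights imply.

```python
def combined_cdf(x, f_q, f_x, T, N):
    """Weighted average of the provisional CDF f_q (weight T, the number of
    scores already summarized) and the empirical CDF f_x (weight N, the
    number of newly buffered scores)."""
    return (T * f_q(x) + N * f_x(x)) / (T + N)
```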
At block 90, the Q Buffer is updated with new quantile endpoints, and T is incremented by N to reflect the fact that N more scores have entered into the computation of the current set of quantile endpoints. The new quantile endpoints are computed from the weighted average CDF F(x) according to the following rule: each new endpoint Qmnew is the score value at which F(x) reaches the corresponding probability level pm, for m=1, . . . , M.
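Assuming the new endpoint for level pm is the point where the averaged CDF reaches pm, the inversion can be sketched by bisection. The bracketing interval [lo, hi] and the tolerance are assumptions of this sketch.

```python
def quantile_endpoint(f, p, lo, hi, tol=1e-9):
    """Solve f(q) = p for q by bisection, assuming f is nondecreasing on
    [lo, hi] and that f(lo) <= p <= f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Applying this for each probability level pm yields the updated contents of the Q Buffer.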
It will be appreciated that the method described above processes incoming data block-by-block, where each block is one filling of the D Buffer. Such a method is not limited to the processing of a single stream of data that arrive sequentially in time. On the contrary, methods of the kind described above are readily adaptable for, e.g., merging short-term data records, such as daily records, into longer-term records, such as weekly records. Methods of the kind described above are also readily adaptable for merging records acquired by a collection of independent agents into a single, master record. The agents need not have operated sequentially, but instead, e.g., may have carried out concurrent data acquisition.
According to one possible scenario, each of a collection of K agents acquires data, and sends the data to a central location in a record of length I+1. Two examples of agent records are provided in FIG. 8. Agent record 100 contains Tk scores, where Tk≦I, and the record also contains the weight Tk. Where record 100 is the k'th of K records, the scores that it contains are denoted Xk,1, . . . , Xk,Tk.
Whether the agent sends a record of the type 100 or the type 110 will typically depend on the volume of data being processed by the agent. If in a particular iteration the agent is required to process more scores than can fit on its D Buffer, or if the agent's Q Buffer is already full, the agent will typically update the Q Buffer as described above in connection with
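The two record types discussed above, a raw-score record (type 100) and a quantile record (type 110), can be sketched as a small data structure. The field names here are illustrative and not taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AgentRecord:
    """One agent's contribution: either raw scores or quantile endpoints,
    in both cases accompanied by the weight T_k (the number of scores
    represented by the record)."""
    weight: int
    scores: Optional[List[float]] = None      # raw scores (record type 100)
    quantiles: Optional[List[float]] = None   # quantile endpoints (type 110)

    @property
    def holds_quantiles(self) -> bool:
        # Determines which processing branch the record takes (block 180)
        return self.quantiles is not None
```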
The generation of an output record is discussed below in connection with FIG. 10. At block 300 of
At block 150, the current approximate quantile endpoints Q1, . . . , QM are read from the Q Buffer at the central processing location. As noted above, T is zero in the initial application of the method. As a consequence, the contents of the Q Buffer are not used.
At block 160, the approximate quantile endpoints are used to define a provisional CDF FQ(x) as explained above in connection with FIG. 6. As indicated at block 170, agent record k is now obtained. If this is the first iteration, then record k is the first agent record; otherwise, it is the next agent record in sequence. As indicated at block 180, the treatment of agent record k depends on whether or not the record holds quantiles; i.e., on whether it is a record of the type 110 or a record of the type 100. If the record contains quantiles, control passes to block 190, to be described below. Otherwise, control passes to block 220, to be described below.
If control has passed to block 190, agent record k contains quantiles. Accordingly, at block 190, the quantile endpoints Rk,1, . . . , Rk,I are read from the agent record. The weight Tk, indicative of the total number of scores taken into consideration in computing the quantile endpoints, is also read.
At block 200, a provisional CDF Fk(x) is defined using the quantile endpoints from agent record k and the probability levels piR for the agent records. That is, for x=Rk,i, Fk(x)=piR. For values of x that fall between the endpoints Rk,i, interpolation is used as described above. At block 230, a representation of the resulting provisional CDF Fk(x) is stored, together with the weight Tk.
If control has passed to block 220, agent record k does not contain quantiles, but instead contains raw scores Xk,1, . . . , Xk,Tk.
At block 220, the raw scores from any number of individual agent records are optionally pooled and treated as a single data set, with Tk adjusted to reflect the total weight of the pooled scores.
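The optional pooling at block 220 amounts to concatenating the raw scores and summing the weights, so that the pool is treated as a single data set. A minimal sketch:

```python
def pool_records(records):
    """Pool raw-score agent records into one data set.

    records: iterable of (scores, weight) pairs, where weight is the T_k
    associated with each raw-score record.
    Returns the pooled scores and the adjusted total weight.
    """
    pooled_scores, pooled_weight = [], 0
    for scores, weight in records:
        pooled_scores.extend(scores)
        pooled_weight += weight
    return pooled_scores, pooled_weight
```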
As noted above, the term "empirical CDF" has been adopted for convenience, and should not be understood as limiting the possible forms that the agent CDF might take. Instead, like the method described with reference to
At block 230, Tk and the CDF Fk(x) are stored at the central processing location.
If the current agent record is not the last agent record, control now returns to block 170 for a further iteration. Otherwise, control passes to block 240.
At block 240, a new CDF, denoted Fmerged (x) in
The summations in the preceding expression are carried out over all agent records.
It will be appreciated that the preceding formula for the merged CDF gives equal weight to each score. This formula is readily generalized by permitting each of the agent weights Tk to be freely adjustable. For example, setting each of the agent weights to unity results in a merged CDF in which each agent, rather than each score, has equal weight.
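Under the weighting just described, the merged CDF of block 240 is the Tk-weighted average of the per-agent CDFs, Fmerged(x) = Σk Tk·Fk(x) / Σk Tk. A sketch:

```python
def merged_cdf(x, agent_cdfs, weights):
    """T_k-weighted average of the per-agent CDFs at a point x.

    With the true weights T_k, every score carries equal weight; setting
    every weight to 1 instead gives each agent equal weight.
    """
    total = sum(weights)
    return sum(w * f(x) for f, w in zip(agent_cdfs, weights)) / total
```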
It should be noted that arithmetic averaging is only one of various methods for updating a merged CDF, all of which lie within the scope of the present invention. For example, the merged CDF may be defined by a set of parameters, and the updating of the merged CDF may be performed by updating the parameters so that they reflect knowledge both of the previous merged CDF and of the agent records.
At block 250, new quantile endpoints Qmnew are computed for storage in the Q Buffer at the central processing location according to: each Qmnew is the score value at which Fmerged(x) reaches the corresponding probability level pm, for m=1, . . . , M.
At block 260, the weight factor T is updated by adding to it the total of all agent weight factors Tk. That is, T is replaced by T+T1+ . . . +TK.
Output records may be produced at any time, using the current CDF Fmerged(x). As noted above, a set of quantile probability levels is read at block 300 of
At block 310 of
It should be noted that when merging, e.g., hourly records into daily records, it is convenient to start the Q Buffer and the D Buffer afresh at the beginning of each new hourly period. However, there are at least some circumstances, e.g., in the analysis of network performance data, in which data from one reporting period (such as an hourly period) are relevant to performance in the next reporting period. Under such circumstances, it may be advantageous to start the Q Buffer, at the beginning of the next period, in its final state from the previous period, but with a scaled-down weight factor.
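The carry-over just described can be sketched as follows: the next period begins from the previous period's quantile state, but with T scaled down so that older data influence the new period less. The decay factor here is an illustrative parameter, not specified in the text.

```python
def carry_over(Q, T, decay=0.5):
    """Return the initial (Q Buffer contents, weight factor T) for the next
    reporting period: the previous endpoints are kept, but the weight is
    scaled down by the chosen decay factor."""
    return list(Q), T * decay
```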
Lambert, Diane; Vander Wiel, Scott Alan; Chambers, John M.; James, David A.