Discovering mixtures of models includes: initiating a learning algorithm; determining data sets that include a cluster of points in a first region of a domain and a set of points distributed near a first line extending across the domain; inferencing parameters from the cluster and the set of points; creating a description of the cluster of points in the first region of the domain and computing approximations of a first learned mixture model and a second learned mixture model; determining a first and a second probability; generating a confidence rating that each point of the cluster of points in the first region of the domain corresponds to the first learned mixture model and a confidence rating that each point of the set of points distributed near the first line corresponds to the second learned mixture model; and thus enabling determinations of the behavior of a system described by the learned mixture models.
|
1. A method, implemented in a computer readable and executable program on a computer processor, of discovering mixtures of models within data and probabilistic classification of data according to model mixtures, the method comprising:
receiving a request for discovering mixtures of models within data and probabilistic classification of data according to model mixtures;
initiating a learning algorithm, by the computer processor, causing the computer processor to execute the computer readable and executable program for simultaneously discovering mixtures of models within data and probabilistic classification of data according to mixture models of a plurality of models;
applying a random sampling operation to determine mathematical functions;
determining multiple models of the plurality of models that fit portions of mixture models of the plurality of models;
probabilistically assigning points to multiple models of the plurality of models by
using abstractions of mathematical functions to form simulated equivalent mathematical functions, causing one or more mathematical functions to be processed as one or more of the plurality of models;
comparing multiple models of the plurality of models by comparing different mathematical functions and by comparing a first quality of a first model to a second quality of a second model, wherein a number of points supporting at least one candidate model is counted to determine whether sufficient data are modeled, wherein global accounting ensures that the points supporting the at least one candidate model are counted only once when determining how many points in the data are modeled by candidate functions, and wherein comparing different mathematical functions includes using geometric properties, including overlap, the counted number of points supporting the at least one candidate model, and density; and
providing user settable thresholds for user interaction with computations of residual error and with computations of the number of points supporting the at least one candidate model corresponding to learned mixture models.
11. A non-transitory computer readable medium having a plurality of computer executable instructions in the form of a computer readable and computer executable program executed by a computer processor causing the computer processor to perform a method of discovering mixtures of models within data and probabilistic classification of data according to model mixtures, the plurality of computer executable instructions including:
instructions causing receiving a request for discovering mixtures of models within data and probabilistic classification of data according to model mixtures, wherein the non-transitory computer readable medium includes a plurality of non-transitory computer readable data storage media including storage devices, such as tape drives and disc drives;
instructions initiating a learning algorithm, by the computer processor;
instructions for applying a random sampling operation to determine mathematical functions;
instructions causing determining, by the computer processor, one of when a data set consists of a cluster of points in a first region of a domain, and determining when a set of points distributed near a first line that extends across part of the domain exists;
instructions causing inferencing parameters, of the first line, that one of describe the set of points distributed near the first line, and describe a mean and variance of the cluster of points in the first region of the domain, creating a description of the cluster of points in the first region of the domain, and describe other parameters needed to describe an instance of a function in a plurality of dimensions;
instructions causing computing, by the computer processor, approximations of a first learned mixture model corresponding to the set of points distributed near a first function and a second learned mixture model corresponding to the set of points near a second function within the domain and similar approximations for functions determined to exist within data embedded in any subspace of the domain and total domain;
instructions causing probabilistically assigning points to multiple models of a plurality of models;
instructions for using abstractions of mathematical functions to form simulated equivalent mathematical functions, causing one or more mathematical functions to be processed as one or more of the plurality of models;
instructions causing comparing multiple models of the plurality of models by comparing different mathematical functions and by comparing a first quality of a first model to a second quality of a second model, wherein a number of points supporting at least one candidate model is counted to determine whether sufficient data are modeled, wherein global accounting ensures that the points supporting the at least one candidate model are counted only once when determining how many points in the data are modeled by candidate functions, and wherein comparing different mathematical functions includes using geometric properties, including overlap, the counted number of points supporting the at least one candidate model, and density;
instructions for providing a user settable threshold for user interaction with computations of residual error and with computations of the number of points supporting the at least one candidate model corresponding to the first and second learned mixture models; and
instructions for generating a confidence rating that each point of the cluster of points in the first region of the domain corresponds to the first learned mixture model and generating a confidence rating that each point of the cluster of points in the first region of the domain corresponds to the second learned mixture model and causing determination of a behavior of a system described by the learned mixture models.
6. A system of discovering mixtures of models within data and probabilistic classification of data according to model mixtures, the system comprising:
a computer processor having a display, an input device and an output device;
a network interface communicatively coupling the computer processor to a network;
a memory having a dynamic repository, an algorithm unit and a program unit containing a computer readable and computer executable program; and
a memory controller communicatively coupling the computer processor with contents of the dynamic repository, the algorithm unit and the computer readable and computer executable program residing in the program unit, wherein when executed by the computer processor, the computer readable and computer executable program causes the computer processor to perform operations of discovering mixtures of models including operations of:
receiving a request for discovering mixtures of models within data and probabilistic classification of data according to model mixtures;
initiating a learning algorithm, by the computer processor, causing the computer processor to execute the computer readable and executable program for discovering mixtures of models within data and probabilistic classification of data according to mixture models;
applying a random sampling operation to determine mathematical functions;
determining, by the computer processor, one of when a data set consists of a cluster of points in a first region of a domain, and determining when a set of points distributed near a first line that extends across part of the domain exists;
inferencing parameters, of the first line, that one of describe the set of points distributed near the first line, and describe a mean and variance of the cluster of points in the first region of the domain, creating a description of the cluster of points in the first region of the domain, and describe other parameters needed to describe an instance of a function in a number of dimensions, wherein the number of dimensions includes 4D and higher dimensions;
computing, by the computer processor, approximations of a first learned mixture model corresponding to the set of points distributed near the first function and a second learned mixture model corresponding to the set of points near the second function within the domain and similar approximations for functions determined to exist within data;
probabilistically assigning points to multiple models of the plurality of models;
using abstractions of mathematical functions to form simulated equivalent mathematical functions, causing one or more mathematical functions to be processed as one or more of the plurality of models;
comparing multiple models of the plurality of models by comparing different mathematical functions and by comparing a first quality of a first model to a second quality of a second model, wherein a number of points supporting at least one candidate model is counted to determine whether sufficient data are modeled, wherein global accounting ensures that the points supporting the at least one candidate model are counted only once when determining how many points in the data are modeled by candidate functions, and wherein comparing different mathematical functions includes using geometric properties, including overlap, the counted number of points supporting the at least one candidate model, and density;
providing user settable thresholds for user interactions with computations of residual error and with computations of the number of points supporting the at least one candidate model corresponding to the first and second learned mixture models; and
generating a confidence rating that each point of the cluster of points in the first region of the domain corresponds to the first learned mixture model and generating a confidence rating that each point of the cluster of points in the first region of the domain corresponds to the second learned mixture model and causing determination of a behavior of a system described by the first and second learned mixture models.
2. The method according to
3. The method according to
4. The method according to
5. The method according to
7. The system according to
8. The system according to
9. The system according to
10. The system according to
12. The instructions of the non-transitory computer readable medium according to
13. The instructions of the non-transitory computer readable medium according to
14. The instructions of the non-transitory computer readable medium according to
15. The instructions of the non-transitory computer readable medium according to
16. The instructions of the non-transitory computer readable medium according to
|
The present application is related to and claims the benefit of priority under 35 USC §119(e) of prior filed provisional U.S. patent application 61/088,830, which is herein incorporated by reference in its entirety.
The present application generally relates to machine learning, data mining and to mathematical modeling in business, science, educational, legal, medical and/or military environments. More particularly, the present application relates to training mathematical models that describe a given data set and classify each point in the data with a probability of fitting each discovered model for the purpose of discovering a model from existing data.
The amount of data currently available overwhelms our capacity to perform analysis. Thus, good tools to sort through data and determine which variables are relevant and what trends or patterns exist among those variables become a paramount initial step in analyzing real-world data. In defense and security applications, the consequences of missed information can be dire.
Mixture models are the generic term given to models that consist of the combination (usually a summation) of multiple, independent functions that contribute to the distribution of points within a set. For example, a mixture model might be applied to a financial market, with each model describing a certain sector of the market, or each model describing behavior of the market under certain economic conditions. The underlying mechanism that creates the overall behavior of the system is often not directly observable, but may be inferred from measurements of the data. Also, a combination of models may be used simply for convenience and mathematical simplicity, without regard to whether it accurately reflects the underlying system behavior.
General mixture models (GMM) have been successfully applied in a wide range of applications, from financial to scientific. However, exemplary embodiments include applications having militarily-relevant tasks such as tracking and prediction of military tactics. Implementation of such militarily-relevant tasks necessitates extension of standard techniques to include a wider range of basis models, a more flexible classification algorithm that enables multiple hypotheses to be pursued along with a confidence metric for the inclusion of a data point into each candidate model in the mixture (a novel contribution in this work), the application of standard techniques for pushing the GMM solver out of a local optimum in the search for a global solution, and parallel implementations of these techniques in order to reduce the computation time necessary to arrive at these solutions. Further exemplary embodiments include domain expertise metrics and techniques for dimension reduction, which are necessary processes in dealing with real world problems, such as searching for patterns in enemy tactics, such as IED emplacements and suicide attacks.
Overlapping fields such as unsupervised machine learning, pattern classification, signal processing, and data mining also offer methods, including: principal component analysis (PCA), independent component analysis (ICA), k-means clustering, and many others all of which attempt to extract models that describe the data.
PCA transforms the basis of the domain so that each (ordered) dimension accounts for as much variability as possible. Solutions often use eigenvalue or singular value decomposition. Concerns include the assumption of a linear combination, computational expense, and sensitivity to noise and outliers.
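As an illustrative aside (not part of the original disclosure), the following is a minimal sketch of PCA computed through singular value decomposition; the data matrix `X`, the centering step, and the number of retained components are assumptions made for the example.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """Project data onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                          # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = (S ** 2) / (len(X) - 1)     # variance along each ordered dimension
    components = Vt[:n_components]                   # ordered basis of the transformed domain
    scores = Xc @ components.T                       # coordinates in the reduced basis
    return scores, components, explained_variance
```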
In regard to ICA limitations: ICA assumes mutual statistical independence of the source signals, it cannot in general identify the actual number of source signals, and it does not work well in high dimensions. ICA separates a multivariate signal into a summation of components. It identifies these by maximizing their statistical independence, often by measuring non-fit to a Gaussian model. It typically requires centering, whitening, and dimension reduction to decrease the complexity of the problem; the latter two are often accomplished by using PCA. ICA solutions can also be non-unique in the ordering of components, and the scale (including sign) of the source signals may not be properly identified.
k-means finds a pre-specified number of clusters by minimizing the sum of variances within each cluster; the solution, as specified by cluster indicators, has an equivalence to PCA components. Concerns with k-means include having to know the number of clusters, assumptions of Gaussian clustering, and reliance on good seed points, as in the sketch below. FlexMix methods allow finite mixtures of linear regression models (Gaussian and exponential distributions) and an extendable infrastructure. To reduce the parameter space, FlexMix methods restrict some parameters from varying or restrict the variance. FlexMix methods assume the number of components is known, but allow component removal for vanishing probabilities, which reduces the problems caused by overfitting, and provide two methods for unsupervised learning of Gaussian clusters: the first is a "decorrelated k-means" algorithm that minimizes an objective function of error and decorrelation for a fixed number of clusters, and the second is a "sum of parts" algorithm that uses expectation maximization to learn the parameters of a mixture of Gaussians and factor them.
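A minimal sketch of the k-means procedure referenced above (Lloyd iterations that minimize the sum of within-cluster variances); the fixed cluster count `k` and the random seed points are precisely the assumptions the text flags as concerns.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: minimize the sum of within-cluster variances."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # quality depends on these seeds
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                             # hard assignment to nearest center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```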
Learning models from unstructured data has a wide range of fields of application, and thus is a well-studied problem. The fundamental approach is to assume a situation in which an underlying mechanism, which may or may not be observable, generates data such that each observation belongs to one of some number of different sources or categories. More generally, such models may be applied indirectly to generate a model that fits, even if the underlying model is known to be different from the components used.
Gaussian mixture models are used for classifying points into clusters, enabling an analyst to extract the underlying model or models that produced a set of observations. For simple situations, this technique may be used to easily separate the data into clusters that belong to a particular model.
It should be noted that while a Gaussian function is a common basis function, there is nothing in the theory that prevents linear, quadratic, transcendental, or other basis functions from being considered as possible components, and indeed such models are being used in learning of general mixture models (GMM). The difficulty that arises in such cases is that the combinatorial explosion of possibilities becomes computationally problematic.
There are several methods used to estimate the mixture in GMM. The most common is expectation maximization (EM), which iteratively computes the model parameters and their weights and assesses the fit of the mixture model to the data. The first step at each iteration computes the "expected" classes of all data points, while the second step computes the maximum likelihood model parameters given the class member distributions of the data. The first step requires evaluation of the Gaussian or other basis function; the second is a traditional model-fitting operation. A benefit of EM is that convergence is guaranteed, but only to a local optimum, which means that the algorithm may not find the best solution; this convergence is achieved at a linear rate.
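The EM iteration described above can be sketched for a one-dimensional, two-component Gaussian mixture as follows; the initial guesses and fixed iteration count are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture (expectation then maximization)."""
    mu = np.array([x.min(), x.max()], dtype=float)   # crude initial guesses
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: expected class memberships of all data points
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * pdf
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximum-likelihood parameters given the memberships
        n_k = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
        w = n_k / len(x)
    return mu, var, w
```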
Utilization of the EM approach highlights another problem with the general mixture model approach, regardless of the basis functions included: the methods inherently must assign each data point to a particular basis model; there is no room for the uncertainty associated with this assignment, though it inherently exists within the data. Also, EM is sensitive to errors in the class assignment, introducing the possibility of missing the introduction of a new model into the mixture when new data does not quite fit. Multiple hypotheses cannot both claim to draw upon a single data point, which means that one of the hypotheses must be eliminated from consideration early in the process.
Therefore, the need exists for a method of estimating the mixtures that eliminates a problem with the general mixture model approach: the inability to represent the uncertainty associated with each data point assignment, even though that uncertainty inherently exists within the data.
Also, the need exists for reducing sensitivity to errors in the class assignment, which introduces the possibility of missing the introduction of a new model into the mixture when new data does not quite fit.
The need exists for a more pro-active defense posture in the assessment of threats against U.S. forces and installations in battlefields and other high-risk environments.
Furthermore, the need exists for threat assessment applications including militarily-relevant tasks such as tracking and prediction of military tactics, having a wider range of basis models, a more flexible classification algorithm that enables multiple hypotheses to be pursued along with a confidence metric for the inclusion of a data point into each candidate model in the mixture, the application of standard techniques for pushing the GMM solver out of a local optimum in the search for a global solution, and parallel implementations of these techniques in order to reduce the computation time necessary to arrive at these solutions.
Further, the need exists for applying a sampling technique such as the random sample consensus (RANSAC) method to classification procedures. By testing every data point against each model proposed to compose the mixture, a measure of the uncertainty associated with a particular assignment can be obtained, derived from the residual error associated with each model. A data point may thus be associated tentatively with any number of models until such time as the confidence in a particular component model becomes high enough to truly classify the data point into a particular pattern. In this way, a decision may be delayed until multiple hypotheses have had a chance to claim a data point.
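A minimal sketch of this RANSAC-style idea, assuming a simple line model, an illustrative inlier threshold, and a residual-based confidence in place of a hard inlier/outlier decision:

```python
import numpy as np

def ransac_line(x, y, trials=200, thresh=0.5, seed=0):
    """RANSAC line fit; returns the best (slope, intercept) and a per-point confidence."""
    rng = np.random.default_rng(seed)
    best, best_support = None, -1
    for _ in range(trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        m = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - m * x[i]
        residual = np.abs(y - (m * x + b))
        support = int((residual < thresh).sum())
        if support > best_support:
            best, best_support = (m, b), support
    m, b = best
    residual = np.abs(y - (m * x + b))
    # Tentative, soft membership: confidence decays with residual error instead of a
    # hard inlier/outlier decision, so a point may stay associated with several models.
    confidence = np.exp(-(residual / thresh) ** 2)
    return m, b, confidence
```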
Additionally, the need exists for embodiments which include domain expertise metrics and techniques for dimension reduction, which are necessary processes in dealing with real world problems, such as searching for patterns in enemy tactics, such as IED emplacement and suicide attacks.
Still further, the need exists for methods of extracting models that describe the data, which reduce concerns including the assumption of a linear combination, computational expense, and sensitivity to noise and outliers.
Furthermore, the need exists for methods that overcome the ICA limitations of being unable to identify the actual number of source signals, of working poorly in high dimensions, and of typically requiring centering, whitening, and dimension reduction to decrease the complexity of the problem, the latter two often accomplished by using PCA.
Further, the need exists for methods of identifying the number of source signals.
Additionally, the need exists for methods of estimating mixtures in GMM which do not suffer from combinatorial explosion of possibilities that are computationally problematic.
In addition, the need exists for applying a standard method, simulated annealing, to the possibility of getting stuck in a local minimum, which is a general problem in optimization methods. Simulated annealing probes the search space with a random jump to see whether the current neighborhood of the search seems less promising than the tested location. If the probe sees a lower cost (goodness of fit, in this case), then the jump is accepted and the search continues. While there are no guarantees, with such a large search space of possible models and parameters of those models, this is an important element in any algorithm for delivering a GMM for the observed data in a tracking or event prediction model.
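A generic sketch of the simulated-annealing probe described above; the cost function, the proposal function, and the cooling schedule are placeholders supplied by the caller, not details from the disclosure.

```python
import math
import random

def anneal(cost, propose, x0, t0=1.0, cooling=0.95, steps=1000, seed=0):
    """Generic simulated annealing: probe with random jumps to escape local optima."""
    random.seed(seed)
    x, c = x0, cost(x0)
    t = t0
    for _ in range(steps):
        x_new = propose(x)                 # random jump elsewhere in the search space
        c_new = cost(x_new)
        # Accept a better probe outright; accept a worse one with shrinking probability.
        if c_new < c or random.random() < math.exp(-(c_new - c) / max(t, 1e-12)):
            x, c = x_new, c_new
        t *= cooling                       # cool the temperature
    return x, c
```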
Further, the need exists for managing the computational load inherent in evaluating multiple models; thus the parallel architectures of modern graphics processing units (GPUs) will be applied to the problems at issue. Such units have proven applicable to a wide range of repetitive computations, especially those with gridded problems or that exist (as these models will) within well-defined domains like the search spaces in the problems described above. Evaluation of multiple points on a model in a single step is thus possible by applying a single-program, multiple-data parallel computation approach. This reduces the cost of each additional candidate model for the mixture to a constant factor, although the model may still require a complex program for individual evaluation. The number of data points, however, becomes less of a factor in the asymptotic evaluation of the program efficiency.
Also, the need exists for risk-averse measures, such as confidence metrics, both in the fit to a particular model and through the multiple-hypothesis capability of classifying a data point into multiple models in the mixture, in order to minimize risk, because, as with any optimization method, there is no way to guarantee 100 percent accuracy in the results.
Furthermore, the need exists for understanding the tactics used by enemy combatants, especially in the era of asymmetric warfare. The danger presented by IED emplacements and suicide attacks is extremely high. While no prediction algorithm can be expected to be 100 percent accurate in identifying dangers as they approach or excluding non-combatants from suspicion, there are patterns to this behavior. Thus detecting patterns that can be learned from existing data and applying them to situations in which the threat must be assessed is a critical problem for combat environments. Similar problems may be considered in maritime domain awareness, homeland security, and other safety and security applications.
A computer implemented method of discovering mixtures of models within data and probabilistic classification of data according to model mixtures includes: receiving a request for discovering mixtures of models within data and probabilistic classification of data according to model mixtures; initiating a learning algorithm, by the computer processor, causing the computer processor to execute the computer readable and executable program for discovering mixtures of models within data and probabilistic classification of data according to mixture models; applying a random sampling operation to determine mathematical functions by determining, by the computer processor, when a data set consists of a cluster of points in a first region of a domain, determining when a set of points distributed near a first line that extends across some part of the domain exists, and determining when a set of points constitutes a transcendental, hyperbolic, polynomial, or other mathematical function, which may be embedded in any number of dimensions that describe the input data, or any other type of function that extends across any number of dimensions of the domain; inferencing parameters, of the first line, that either describe the set of points distributed near the first line, or describe a mean and variance of the cluster of points in the first region of the domain, creating either a description of the cluster of points in the first region of the domain or other parameters that describe an instance of a function in an appropriate number of dimensions; computing approximations of a first learned mixture model corresponding to the set of points distributed near the first function and a second learned mixture model corresponding to the set of points near the second function within the domain, and similar approximations for any number of functions determined to exist within the data in any subspace of the domain and the entire domain; determining multiple models of the plurality of models that fit portions of mixture models of the plurality of models by probabilistically assigning points to multiple models of the plurality of models, by determining a first probability that the first learned mixture model corresponds to each point of the cluster of points in the first region of the domain and determining a second probability that the second learned mixture model corresponds to each point of the set of points distributed near the first line, wherein determining the first and second probabilities is performed by testing each point, wherein determining the first and second probabilities eliminates a requirement for a fit of each point displaced from a true position, and wherein setting a minimum number of points for each of the first and second learned mixture models distinguishes the first and second learned mixture models from a combination learned mixture model formed from parameters of the first and second learned mixture models, and repeating for each of any number of functions (models in the mixture) that may be determined to exist within the data; using abstractions of mathematical functions to form simulated equivalent mathematical functions, causing one or more mathematical functions to be processed as one or more of the plurality of models; comparing different mathematical functions, using geometric properties, including overlap, supporting point sets, and density; providing user settable thresholds for user interaction with computations of residual error and of the supporting point sets corresponding to learned mixture models; and generating a confidence rating that each point of the cluster of points in the first region of the domain corresponds to the first learned mixture model and a confidence rating that each point of the cluster of points in the first region of the domain corresponds to the second learned mixture model, thus causing determination of a behavior of a system described by the learned mixture models, and repeating for any number of functions that are determined to exist within the data.
Preferred exemplary embodiments of the present disclosure are now described with reference to the figures, in which like reference numerals are generally used to indicate identical or functionally similar elements. While specific details of the preferred exemplary embodiments are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the relevant art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the preferred exemplary embodiments. It will also be apparent to a person skilled in the relevant art that this invention can also be employed in other applications. Devices and components described in the exemplary embodiments can be off the shelf commercially available devices or specially made devices. Further, the terms "a", "an", "first", "second", and "third" etc. used herein do not denote limitations of quantity, but rather denote the presence of one or more of the referenced item(s).
In an exemplary embodiment, of particular interest is the discovery of potential models that may describe the underlying processes that generate data. Typically, the mechanisms that create the overall behavior are not directly observable, but may be inferred from measurements. The extracted models may then be used to draw insights about the events encapsulated in the data. Another issue of great interest is the use of such models as tools for prediction or assessment of likelihood of future events.
The purpose of the application is to discover mathematical models that describe a given data set and classify each point in the data with a probability of fitting each discovered model. A model may be any mathematical shape, including but not limited to lines, closed curves, geometric shapes, or general functions such as Gaussians, polynomials, exponentials, or transcendentals, or combinations of any of the above. Once a set of models (from one to any number) is determined to fit in the data set, a probability may be assigned to each data point to measure the likelihood that a given point "belongs to" a particular function—i.e., that it is consistent with that particular model. From these probabilities, a classification of the points may be derived.
Risks inherent in optimization research include the understanding that approaches can only be as good as the data on which they are evaluated. However, in this application, real data for IED attacks and terrorist events in the Middle East are utilized. Real crime data available for local U.S. jurisdictions are also used to exercise the modeling algorithm, so there are several sources of good data.
This application develops a new framework for unsupervised learning that addresses shortcomings associated with existing methods. An object oriented approach is taken to design a model. Fundamental operations identified include construction of a model from a list of points and the computation of residual error for a point, given the model parameters. This approach provides a way to build a model with a randomly selected set of data points and to compute an optimal approximation, such as computation of a least squares optimal approximation or some other known optimal approximation paradigm, according to standard mathematical practices, from a set of points identified as probabilistically belonging with that model, and re-compute the residuals for all points. The computation of a residual has a somewhat different meaning for every class of function, but for many functions the computation of a residual is merely an abstraction of a distance function with geometric meaning in the space. For functions such as Gaussians, the natural residual associated with such a function (the Z-score) is a convenient implementation. Implicitly, there is a metric for whether a particular value of the residual is deemed sufficiently close to an instance to warrant inclusion. Thus every function must have a way—given access to the input data—to build a random instance of itself, compute residuals to its current instance, evaluate which points are worthy of further consideration, and build optimal approximations from those points.
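A sketch of the model abstraction described above, using a hypothetical `LineModel` as one function class; the class name, the least-squares refit, and the residual threshold are illustrative assumptions rather than the disclosed implementation.

```python
import numpy as np

class LineModel:
    """One shape of model: knows how to instantiate itself at random, compute
    residuals, judge which points support it, and refit from supporting points."""

    def __init__(self, max_residual=0.5):
        self.max_residual = max_residual
        self.params = None                            # (slope, intercept)

    def build_random(self, pts, rng):
        i, j = rng.choice(len(pts), size=2, replace=False)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        m = (y2 - y1) / (x2 - x1 + 1e-12)
        self.params = (m, y1 - m * x1)

    def residuals(self, pts):
        m, b = self.params
        return np.abs(pts[:, 1] - (m * pts[:, 0] + b))   # geometric distance surrogate

    def supporting(self, pts):
        return self.residuals(pts) < self.max_residual

    def refit(self, pts):
        sup = pts[self.supporting(pts)]
        if len(sup) >= 2:                                # least-squares optimal approximation
            m, b = np.polyfit(sup[:, 0], sup[:, 1], 1)
            self.params = (m, b)
```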
Algorithms used build on the RANSAC framework for estimation of good candidate models. However, in exemplary embodiments, the algorithms are altered in important (but simple) ways. Not just the single best answer (by any one metric) is sought; rather, reasonable models are generated as candidate hypotheses for the underlying cause(s) behind the input data. Thus, an internal list of candidates is maintained (via a priority queue), rather than a single candidate. Models are then generated iteratively at random, their quality is assessed, and they are compared against models previously generated. The lowest-quality candidate among the current set may be discarded when a better candidate is discovered.
In exemplary embodiments, an analogous change in the stopping conditions for RANSAC is made. The original specification included two criteria: a maximum number of attempts to find the best model and a minimum amount of support required. The first criterion is maintained: the number of models tested is limited (but multiple possible solutions are kept). The second criterion requires some adjustment, since multiple underlying models are responsible for the data. Thus, the number of points supporting at least one candidate model is counted to determine whether a sufficient portion of the data has been modeled. It is important to note that a point is not prevented from supporting two completely independent models. This requires some global accounting to ensure that points are not counted multiple times when determining how many points in the data are modeled by the candidate functions.
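A sketch of this altered loop, assuming the hypothetical `LineModel` from the earlier sketch is in scope; the candidate-set size, trial budget, and coverage threshold are illustrative choices, not values from the disclosure.

```python
import numpy as np

def discover_candidates(pts, max_models=5, max_trials=500, coverage=0.9, seed=0):
    """Keep a small set of candidate models; stop on trial budget or data coverage."""
    # Assumes the LineModel class from the earlier sketch is defined in scope.
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(max_trials):
        m = LineModel()
        m.build_random(pts, rng)
        m.refit(pts)
        candidates.append((int(m.supporting(pts).sum()), m))
        candidates.sort(key=lambda c: -c[0])        # crude priority queue by support
        candidates = candidates[:max_models]        # discard the lowest-quality candidate
        # Global accounting: a point may support several models but is counted once.
        covered = np.zeros(len(pts), dtype=bool)
        for _, cand in candidates:
            covered |= cand.supporting(pts)
        if covered.mean() >= coverage:
            break
    return [m for _, m in candidates]
```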
To complete the implementation, a way to compare the quality of one model to another is needed. This is much more straightforward for models of the same type, and thus one loop for each type of model is selected.
Two metrics inherent in the RANSAC framework provide a basis for comparing models. Mean residuals for the models are examined as the first criterion. If these are quite close (within a few percent of the domain's size), then the support size for an individual model is examined as the second criterion. Candidates with significantly greater numbers of points within the maximum residual are considered better candidates; these are kept in the candidate set for further consideration. While nothing in this framework would prevent comparisons of mean residuals for models of different types, this is not currently done, since the geometric validity of such comparisons has not been determined. Within this framework, linear and Gaussian models are implemented. These metrics will be extended.
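A sketch of this two-key comparison, assuming the model interface from the earlier sketch; the "quite close" tolerance and the domain size are illustrative assumptions.

```python
def better_model(a, b, pts, close_tol=0.02, domain_size=1.0):
    """Return the better of two candidates: lower mean residual wins unless the
    residuals are within a few percent of the domain size, then larger support wins."""
    # Assumes models expose residuals(pts) and supporting(pts), as sketched earlier.
    res_a = a.residuals(pts)[a.supporting(pts)].mean()
    res_b = b.residuals(pts)[b.supporting(pts)].mean()
    if abs(res_a - res_b) > close_tol * domain_size:
        return a if res_a < res_b else b
    return a if a.supporting(pts).sum() >= b.supporting(pts).sum() else b
```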
In exemplary embodiments, a final mixed model has a residual for each point against each of the component models. This serves two purposes. First, it gives a fuzzy assignment of every point in the original data to each of the models in the final candidate set; this allows a point to support multiple models and demonstrates that multiple hypotheses might explain a particular data point. Second, it enables identification of outlier points that are not explained by any of the candidate models. Parameters are identified through which a user may control the performance of the algorithm. Typically, a minimum amount of support is required for a model to be considered valid. This prevents generation of an extremely high number of models and increases the likelihood that the models generated will be meaningful. However, raising this number too high can force the algorithm to miss a valid model that explains a smaller number of points.
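A sketch of the per-point residual matrix and fuzzy assignment described above, again assuming the earlier model interface; the decay scale and outlier cutoff are illustrative assumptions.

```python
import numpy as np

def fuzzy_assignment(models, pts, scale=0.5, outlier_cut=0.05):
    """Residual of every point against every model -> soft memberships and outliers."""
    residual = np.stack([m.residuals(pts) for m in models], axis=1)   # (n_points, n_models)
    membership = np.exp(-(residual / scale) ** 2)     # fuzzy assignment, not a hard label
    outliers = membership.max(axis=1) < outlier_cut   # explained by none of the candidates
    return membership, outliers
```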
Current algorithms use variance and statistical independence metrics. In exemplary embodiments, metrics are more intuitive for non-experts in statistics (like GIS analysts). Comparisons against current methods are conducted, using various datasets. New metrics for such comparisons are herein described. Speed improvements are also important; similar algorithms have seen performance gains commensurate with the degree of parallelism, which is currently as high as 128 for graphics hardware. (Some minor overhead reduces the gain).
A larger model library is one extension of the current implementation. Integrating the model generation into a single loop, which at random tests any function type for inclusion as a model, requires equating the geometric meaning of a residual of each type. An exponential decay function of the distance from a geometric model is used to compare against the Gaussian model. Improved efficiency in both computation and memory is implemented. One important property of the algorithm is that generating models does not require mapping the points to a new space; thus only one copy of the input data is ever required.
The expanded model library and integrated loop in turn require improved metrics for comparing models. In exemplary embodiments, a separate loop for each model type is implemented. With only two model types, this is a minor inconvenience; with more, it will be a source of inefficiency. Current metrics include the number of points that support a model and the mean residual for those points. In practice, the number of points that do not yet support any model is also determined. This provides a multi-key priority queue for pruning models from the list of candidates. Requiring many unique points implies that fewer candidate models will be proposed; this may be an advantage or a disadvantage to a user. The equivalence of the mean residual metric must also be maintained: distance and z-score are both valid, but not necessarily in the same space. Density and bounding volume overlap are not limited by the space in which each is defined.
Of critical interest is implementation of intuitive control parameters for the analyst; a recent JASON study identified this as a critical element of all stages of the data pipeline. As noted above, the number of (unique) points required to accept a model is one control mechanism that could be passed on to the human operator. Automated methods for computing the best value could be developed based on the current number of unlabeled points. In exemplary embodiments, control parameters that set thresholds for the residuals are enabled. The analyst is also enabled to affect the weighting between these two keys through a mechanism that sets the threshold for how much higher a mean residual can be and still win over a lower residual. Although the algorithm does not require any initial input, the algorithm is adaptable to enable input models based on previously observed data.
In exemplary embodiments, the algorithm for discovering mixtures of models within data and probabilistic classification of data according to the model mixture addresses practical issues of robustness to outliers, noise, and missing information. The last leads to concerns of overfitting and underfitting and of sensitivity of the extracted models to new data points. The algorithm's metrics enable it to limit these concerns. An important theoretical concern is identifiability and uniqueness of the solution; in complex real-world data sets, a given solution will not in fact be unique. Thus a goal of the algorithm is to allow multiple, independent, competing hypotheses to emerge. The overlap parameter directly affects the ability of the algorithm to consider non-unique solutions. The key issue is balancing the two metrics of overlap in support and residual; an automated method of determining this balance is developed. In this respect, a hybrid algorithm with PCA or ICA would be of great value in determining the nature of such an overlap. Determining whether the right number of models has been found is important to the robustness of the algorithm. An analysis of the theoretical asymptotic performance of the algorithm, as well as measurement of such performance with real and synthetic data sets, is also implemented in the exemplary embodiments.
Exemplary embodiments include forming hybrids of the various algorithms with the algorithm for discovering mixtures of models within data and probabilistic classification of data according to the model mixture. The value of reducing dimensions is certainly applicable to any algorithm that attempts to mine data for underlying causative models; it guides this algorithm to more efficient searching. Many operations in this algorithm (as well as in PCA, ICA, and k-means) are easily parallelizable. Thus, parallel architectures available on graphics processors are implemented in the exemplary embodiments to provide significant performance improvements. These improvements apply not only to this algorithm, but to the components of the other algorithms as well. Many of these other algorithms are computationally expensive (depending on the input data and/or the input parameters), but have many parallel sub-routines. Thus, in exemplary embodiments, double-precision graphics processors are used, exploiting memory coherence and the parallel nature of operations that rely heavily on matrix and vector computations (which are the basic operations for graphics processors as well).
In exemplary embodiments, the learning algorithm for discovering mixtures of models within data and probabilistic classification of data according to the model mixture is evaluated against other clustering and classification algorithms: PCA, ICA, and k-means. Metrics are required to compare such algorithms, since they are not quite identical in nature. One metric is the number of components required to capture a specified percentage of the information in the data. PCA does this cleanly through its ordering of dimensions; the variance captured in those dimensions provides a suitable metric. Similarly, for ICA, the independence of the components gives a basis for determining the information gained by including another component. Both k-means and this algorithm use residual errors; however, the equivalence of k-means and PCA is used to convert the former's metric. Such a transformation is adapted and applied to non-Gaussian groupings for evaluating this algorithm. The ordering of models inherent in this algorithm can also be used to determine the increase in the mean residual incurred by discarding a model; this would more directly equate to the PCA and ICA methods. This discussion also emphasizes the applicability of all of these algorithms to developing a sparse data representation and reducing the number of dimensions, wherein performance limitations of this algorithm are identified.
In exemplary embodiments, the following datasets are used for testing. Because the described techniques apply broadly, nearly any data is useful for testing. For development, synthetic data as well as classic data sets from the UCI MACHINE LEARNING REPOSITORY (http://archive.ics.uci.edu/ml/) are used. U.S. CENSUS data and Washington, D.C. crime statistics are used. ARC/GIS software and associated data are used. The IEEE SYMPOSIUM ON VISUAL ANALYTICS SCIENCE AND TECHNOLOGY CHALLENGE (http://www.cs.umd.edu/hcil/VASTchallenge09) provides test data and ground truth. An unclassified version of the ONI MERCHANT SHIP CHARACTERISTIC DATABASE, as well as NRL classified data facilities, are used as well.
The algorithm for discovering mixtures of models within data and probabilistic classification of data according to the model mixture extracts a sparse representation of input data by distilling from samples a small number of functions considered for further analysis. This approach differs from those that attempt to find a new basis set of dimensions or select some number of the original dimensions, but it offers many of the same benefits. Note that in the description above, there is nothing that restricts the algorithm to any number of dimensions. The algorithm is able to operate and does freely operate in a lower-dimensional subspace of the input domain, and in doing so, it implicitly selects dimensions that are of great interest; however, such a goal would constrain the options of the algorithm unnecessarily. The algorithm directly finds a few functions that explain the recognizable patterns within the data, pulling out a sparse set of functions that serves the purpose of the algorithm.
Ongoing work in the NRL Information Technology Division involves the use of GIS data for predictive analytics: determination of probability of future events in areas of ongoing conflict. Initial products are in use by DoD and DHS clients. This algorithm, in conjunction with GIS analysts, is a tool for mining vast data, and it is tested on proxy data sets from non-sensitive (different region or older) scenarios as well. This algorithm is useful in analysis of merchant ship data from ONI, applicable to pressing problems in the area of maritime domain awareness. Other data sets can be tested with this algorithm as opportunities arise for formal evaluation, with both qualitative and quantitative methods; some of that work was performed for ONR 311 on graphics systems (augmented reality).
In exemplary embodiments, consider a data set that consists of a cluster of points in one region of a domain and a set of points distributed near a line that extends across the domain. A mixture model would operate to discover the best parameters of the line that describe the latter points and the mean and variance of the cluster that describe the former points. Since in a general data set these points will be displaced from their true positions due to noise, and there will in general be noisy points that do not appear to fit either of these models, best-fit approximations to the models must be computed. Then, each point may be tested against the two models to determine the probability that each model could be the underlying mechanism that generated that point. This gives a confidence rating that a point belongs to each learned model.
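A worked sketch of this example with synthetic data: a Gaussian cluster and a noisy line are fit, and every point receives a confidence toward each learned model. The synthetic parameters, the given grouping used for fitting, and the choice of Mahalanobis and point-to-line residuals are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
cluster = rng.normal(loc=[2.0, 6.0], scale=0.3, size=(60, 2))            # compact cluster
xs = np.linspace(0.0, 10.0, 80)
line = np.column_stack([xs, 0.7 * xs + 1.0 + rng.normal(0, 0.25, 80)])   # noisy line
pts = np.vstack([cluster, line])

# Best-fit approximations to each hypothesized model.
mu, cov = cluster.mean(axis=0), np.cov(cluster.T)
m, b = np.polyfit(line[:, 0], line[:, 1], 1)

# Test every point against both models and turn residuals into confidence ratings.
z = np.sqrt(np.sum((pts - mu) @ np.linalg.inv(cov) * (pts - mu), axis=1))   # Mahalanobis distance
d = np.abs(pts[:, 1] - (m * pts[:, 0] + b)) / np.hypot(m, 1.0)              # point-to-line distance
conf_cluster = np.exp(-0.5 * z ** 2)
conf_line = np.exp(-0.5 * (d / 0.25) ** 2)
```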
In exemplary embodiments, in contrast to the above algorithms, the algorithm for discovering mixtures of models within data and probabilistic classification of data according to the model mixture uses an abstraction of the model that includes a generating function (which underlies all the computations), a notion of the residual (or error) for a point, and a measure of the support for a model among the input data. This enables direct comparisons of qualitatively different models and of different parameterizations of a single shape of model. Thus, there are no assumptions made about the number or shape of the models that should be found in the data set. The algorithm merely requires a user settable threshold for the amount of support there should be for a model in the input data, which in turn provides new features for how a user may interact with the computations, in regard to the amount of support and the maximum (or mean) residual error for points to be associated with a model.
In additional exemplary embodiments, an iterative approach using expectation maximization includes using initial guesses for the parameters, where expectation values for the membership values of each data point are computed; then estimates are computed for the distribution parameter (e.g. mean and variance for a Gaussian, or slope and intercept for a line). This is done in a way that maximizes the fit, which may be equivalently conceived as minimizing an error function.
Further exemplary embodiments include implementing Markov-chain Monte Carlo methods which deduce parameters by randomly assigning points to a particular model (an example of Monte Carlo sampling). The parameters are at first initial guesses, but are iteratively refined as points are assigned. Estimators then determine the quality of fit, and points that do not fit may be put back into the pool of points to be randomly assigned.
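A loose sketch of that random-assignment loop (a Monte Carlo assignment between two line hypotheses rather than a full Markov-chain sampler); the residual threshold, iteration count, and use of line models are illustrative assumptions.

```python
import numpy as np

def monte_carlo_lines(x, y, iters=200, thresh=0.5, seed=0):
    """Randomly assign points between two line models, refit, and release misfits."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=len(x))        # initial random assignment
    params = [(0.0, 0.0), (0.0, 0.0)]
    for _ in range(iters):
        for k in (0, 1):
            sel = labels == k
            if sel.sum() >= 2:
                params[k] = tuple(np.polyfit(x[sel], y[sel], 1))   # refine this model
        # Points with a poor residual under their current model go back to the pool
        # and are re-assigned at random on the next pass.
        for k in (0, 1):
            m, b = params[k]
            bad = (labels == k) & (np.abs(y - (m * x + b)) > thresh)
            labels[bad] = rng.integers(0, 2, size=int(bad.sum()))
    return labels, params
```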
Additional exemplary embodiments include implementation of spectral methods based on singular value decomposition of a matrix embodying the data points. This enables a projection of each point onto a linear subspace (spanned by singular vectors); in this space, points that are generated by different underlying distributions should be distant from each other.
In additional exemplary embodiments, a stand-alone program suitable for analysis and product testing is available. In further exemplary embodiments, the algorithm for discovering mixtures of models within data and probabilistic classification of data according to the model mixture is also useful as a tool incorporating all competing algorithms with radically faster (parallel) implementations. A method, system and program product including instructions implemented in a computer readable and computer executable program on a computer processor are described herein as discovering mixtures of models within data and probabilistic classification of data according to model mixtures.
In exemplary embodiments, the system 200 and the method 100 are illustrated in the accompanying figures.
In exemplary embodiments, the system(s) 200 can be implemented with a general purpose digital computer designated as the computer processor 206. The computer processor 206 is a hardware device for executing software implementing the method 100, as well as the method 300. The computer processor 206 can be any custom made or commercially available, off-the-shelf processor, a central processing unit (CPU), one or more auxiliary processors, a semiconductor based microprocessor in the form of a microchip or chip set, a macroprocessor, or generally any device for executing software instructions. The system(s) 200, when implemented in hardware, can include discrete logic circuits having logic gates for implementing logic functions upon data signals, or the system(s) 200 can include an application specific integrated circuit (ASIC).
In exemplary embodiments, an object oriented approach is taken to design a model. Fundamental operations identified were construction of a model from a list of points and the computation of residual error for a point, given the model parameters. This approach provides a way to build a model with a randomly selected set of data points and to compute an optimal approximation, such as computation of a least squares optimal approximation or some other known optimal approximation paradigm, according to standard mathematical practices, from a set of points identified as probabilistically belonging with that model, and re-compute the residuals for all points. The computation of a residual has a somewhat different meaning for every class of function, but for many functions the computation of a residual is merely an abstraction of a distance function with geometric meaning in the space. For functions such as Gaussians, the natural residual associated with such a function (the Z-score) is a convenient implementation. Implicitly, there is a metric for whether a particular value of the residual is deemed sufficiently close to an instance to warrant inclusion. Thus, every function must have a way—given access to the input data—to build a random instance of itself, compute residuals to its current instance, evaluate which points are worthy of further consideration, and build optimal approximations from those points.
The operation 316 is performed by further determining a first probability that the first learned mixture model corresponds to each point of the cluster of points in the first region of the domain and determining a second probability that the second learned mixture model corresponds to each point of the set of points distributed near the first line. The determining of the first and second probabilities is performed by testing each point. The determining of the first and second probabilities eliminates a requirement for a fit of each point displaced from a true position. Setting a minimum number of points for each of the first and second learned mixture models distinguishes the first and second learned mixture models from a combination learned mixture model formed from parameters of the first and second learned mixture models. The determining of the first and second probabilities includes assigning a fixed percent probability, up to about fifty percent, for points of a line, depending on a residual error fit of the first and second learned mixture models, and the learning algorithm probabilistically determines whether a series of Gaussian mixture models is found by combining a number of points of the first and second learned mixture models with an average residual of points to be excluded, and by repeating the probabilistic assignment of points to multiple models of the plurality of models for each function determined to exist within the data.
In exemplary embodiments, an iterative approach using expectation maximization includes using initial guesses for the parameters, where expectation values for the membership values of each data point are computed; then estimates are computed for the distribution parameter (e.g. mean and variance for a Gaussian, or slope and intercept for a line). This is done in a way that maximizes the fit, which may be equivalently conceived as minimizing an error function.
All references cited herein, including issued U.S. patents, or any other references, are each entirely incorporated by reference herein, including all data, tables, figures, and text presented in the cited references. Also, it is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance presented herein, in combination with the knowledge of one of ordinary skill in the art.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein.
Livingston, Mark Alan, Palepu, Aditya Maruti