Methods and systems predict parameters in a dataset of an identified piece of information technology (“IT”) equipment. An automated method identifies datasets of IT equipment in the same category of IT equipment as a piece of IT equipment identified as having incomplete dataset information. The IT equipment parameters of each dataset are used to construct generalized linear models of different classes of IT equipment within the category of IT equipment. The class of the identified IT equipment is determined. A predicted equipment parameter for the incomplete information of the identified piece of IT equipment is computed using the generalized linear model associated with the class. The predicted equipment parameter can be used to complete the dataset of the identified piece of IT equipment.
|
11. A non-transitory computer-readable medium encoded with machine-readable instructions that implement a method carried out by one or more processors of a computer system to perform the operations of
forming equipment parameters from the configuration parameters and non-parametric information for each piece of (“information technology”) IT equipment in a category of IT equipment, each equipment parameter corresponding to a data point in a multi-dimensional space;
computing clusters of data points to determine classes of IT equipment, each piece of IT equipment in the category belonging to one of the classes of IT equipment;
computing a generalized linear model for each class of IT equipment based on the equipment parameters of the IT equipment in the class of IT equipment;
determining a class of an identified piece of IT equipment with incomplete equipment parameters as a minimum of squared distances between equipment parameters of the identified piece of IT equipment and the equipment parameters in each class; and
computing a predicted equipment parameter of the identified piece of IT equipment using the generalized linear model associated with the class of IT equipment the identified piece of IT equipment belongs to, the predicted equipment parameter completing the dataset of the identified piece of IT equipment.
1. An automated method stored in one or more data-storage devices and executed using one or more processors of a computer system to predict parameters in datasets of reference library database of (“information technology”) IT equipment, the method comprising:
forming equipment parameters from the configuration parameters and non-parametric information for each piece of IT equipment in a category of IT equipment, each equipment parameter corresponding to a data point in a multi-dimensional space;
computing clusters of data points to determine classes of IT equipment, each piece of IT equipment in the category belonging to one of the classes of IT equipment;
computing a generalized linear model for each class of IT equipment based on the equipment parameters of the IT equipment in the class of IT equipment;
determining a class of an identified piece of IT equipment with incomplete equipment parameters as a minimum of squared distances between equipment parameters of the identified piece of IT equipment and the equipment parameters in each class; and
computing a predicted equipment parameter of the identified piece of IT equipment using the generalized linear model associated with the class of IT equipment the identified piece of IT equipment belongs to, the predicted equipment parameter completing the dataset of the identified piece of IT equipment.
6. A system to predict parameters in datasets of reference library database of (“information technology”) IT equipment, the system comprising:
one or more processors;
one or more data-storage devices; and
machine-readable instructions stored in the one or more data-storage devices that when executed using the one or more processors controls the system to carry out
forming equipment parameters from the configuration parameters and non-parametric information for each piece of IT equipment in a category of IT equipment, each equipment parameter corresponding to a data point in a multi-dimensional space;
computing clusters of data points to determine classes of IT equipment, each piece of IT equipment in the category belonging to one of the classes of IT equipment;
computing a generalized linear model for each class of IT equipment based on the equipment parameters of the IT equipment in the class of IT equipment;
determining a class of an identified piece of IT equipment with incomplete equipment parameters as a minimum of squared distances between equipment parameters of the identified piece of IT equipment and the equipment parameters in each class; and
computing a predicted equipment parameter of the identified piece of IT equipment using the generalized linear model associated with the class of IT equipment the identified piece of IT equipment belongs to, the predicted equipment parameter completing the dataset of the identified piece of IT equipment.
2. The method of
identifying datasets in the reference library database in a same category of IT equipment as a piece of IT equipment identified as having incomplete or inaccurate dataset information, each dataset comprising configuration parameters and non-parametric information of each piece of IT equipment; and
encoding non-parametric information in each dataset into encoded parameters that represent the non-parametric information.
3. The method of
applying k-means clustering to equipment parameters based on an initial set of cluster centers to assign each equipment parameters to one of k clusters of equipment parameters; and
for each cluster of equipment parameters,
testing cluster of equipment parameters for fit to a Gaussian distribution,
replacing cluster center with two child cluster centers when cluster of equipment parameters do not fit a Gaussian distribution, and
applying k-means clustering to cluster of equipment parameters based on two child cluster.
4. The method of
for each class of IT equipment,
partitioning the equipment parameters associated with the class into training data and validation data;
iteratively computing predictor coefficients of a generalized linear model of the class of IT equipment based on the training data;
computing approximate response parameters using the generalized linear model applied to the validation data associated with the class, the approximate response parameters to approximate the actual response parameter of the validation data; and
discarding the predictor coefficients when a difference between the approximate response parameters and corresponding response parameters of the validation data exceed a threshold.
5. The method of
computing a squared distance between the identified piece of IT equipment and each piece of IT equipment based on the incomplete equipment parameters of the identified piece of IT equipment and corresponding equipment parameters of each piece of IT equipment;
determining a minimum squared distance of the squared distances; and
assigning the identified piece of IT equipment to the class of IT equipment with the piece of IT equipment having the minimum squared distance to the identified piece of IT equipment.
7. The system of
identifying datasets in the reference library database in a same category of IT equipment as a piece of IT equipment identified as having incomplete or inaccurate dataset information, each dataset comprising configuration parameters and non-parametric information of each piece of IT equipment; and
encoding non-parametric information in each dataset into encoded parameters that represent the non-parametric information.
8. The system of
applying k-means clustering to equipment parameters based on an initial set of cluster centers to assign each equipment parameters to one of k clusters of equipment parameters; and
for each cluster of equipment parameters,
testing cluster of equipment parameters for fit to a Gaussian distribution,
replacing cluster center with two child cluster centers when cluster of equipment parameters do not fit a Gaussian distribution, and
applying k-means clustering to cluster of equipment parameters based on two child cluster.
9. The system of
for each class of IT equipment,
partitioning the equipment parameters associated with the class into training data and validation data;
iteratively computing predictor coefficients of a generalized linear model of the class of IT equipment based on the training data;
computing approximate response parameters using the generalized linear model applied to the validation data associated with the class, the approximate response parameters to approximate the actual response parameter of the validation data; and
discarding the predictor coefficients when a difference between the approximate response parameters and corresponding response parameters of the validation data exceed a threshold.
10. The system of
computing a squared distance between the identified piece of IT equipment and each piece of IT equipment based on the incomplete equipment parameters of the identified piece of IT equipment and corresponding equipment parameters of each piece of IT equipment;
determining a minimum squared distance of the squared distances; and
assigning the identified piece of IT equipment to the class of IT equipment with the piece of IT equipment having the minimum squared distance to the identified piece of IT equipment.
12. The medium of
identifying datasets in the reference library database in a same category of IT equipment as a piece of IT equipment identified as having incomplete or inaccurate dataset information, each dataset comprising configuration parameters and non-parametric information of each piece of IT equipment; and
encoding non-parametric information in each dataset into encoded parameters that represent the non-parametric information.
13. The medium of
applying k-means clustering to equipment parameters based on an initial set of cluster centers to assign each equipment parameters to one of k clusters of equipment parameters; and
for each cluster of equipment parameters,
testing cluster of equipment parameters for fit to a Gaussian distribution,
replacing cluster center with two child cluster centers when cluster of equipment parameters do not fit a Gaussian distribution, and
applying k-means clustering to cluster of equipment parameters based on two child cluster.
14. The medium of
for each class of IT equipment,
partitioning the equipment parameters associated with the class into training data and validation data;
iteratively computing predictor coefficients of a generalized linear model of the class of IT equipment based on the training data;
computing approximate response parameters using the generalized linear model applied to the validation data associated with the class, the approximate response parameters to approximate the actual response parameter of the validation data; and
discarding the predictor coefficients when a difference between the approximate response parameters and corresponding response parameters of the validation data exceed a threshold.
15. The medium of
computing a squared distance between the identified piece of IT equipment and each piece of IT equipment based on the incomplete equipment parameters of the identified piece of IT equipment and corresponding equipment parameters of each piece of IT equipment;
determining a minimum squared distance of the squared distances; and
assigning the identified piece of IT equipment to the class of IT equipment with the piece of IT equipment having the minimum squared distance to the identified piece of IT equipment.
|
Benefit is claimed under 35 U.S.C. 119(a)-(d) to Foreign Application Serial No. 201741042282 filed in India entitled “METHODS AND SYSTEMS TO PREDICT PARAMETERS IN A DATABASE OF INFORMATION TECHNOLOGY EQUIPMENT”, on Nov. 24, 2017, by VMware, Inc., which is herein incorporated in its entirety by reference for all purposes.
This disclosure is directed to computational systems and methods for predicting parameters in a database of information technology equipment.
In recent years, enterprises have shifted much of their computing needs from enterprise owned and operated computer systems to cloud-computing providers. Cloud-computing providers charge enterprises for use of information technology (“IT”) services over a network, such as storing and running an enterprise's applications on the hardware infrastructure, and allow enterprises to purchase and scale use of IT services in much the same way utility customers purchase a service from a public utility. IT services are provided over a cloud-computing infrastructure made up of geographically distributed data centers. Each data center comprises thousands of server computers, switches, routers, and mass data-storage devices interconnected by local-area networks, wide-area networks, and wireless communications.
Because of the tremendous size of a typical data center, cloud-computing providers rely on automated IT financial management tools to determine the cost of IT services, project future costs of IT services, and determine the financial health of a data center. A typical automated management tool determines current and projected cost of IT services based on a reference database of actual data center equipment inventory and corresponding invoice data. But typical management tools do not have access to the latest invoice data for data center equipment. Management tools may deploy automated computer programs, called web crawling agents, that automatically collect information from a variety of vendor web sites and write the information to the reference database. However, agents are not able to identify errors in web pages and may not be up-to-date with the latest format changes to web sites. As a result, agents often write incorrect information regarding data center equipment to reference databases. Management tools may also compute approximate costs of unrecorded equipment based on equipment currently recorded in a reference database. For example, the cost of an unrecorded server computer may be approximated by computing a mean cost of server computers recorded in the reference database with components that closely match the components of the unrecorded server computer and assigning the mean cost as the approximate cost of the unrecorded server computer. However, this technique for determining the cost of data center equipment is typically unreliable, with errors ranging from 12% to 45%. Cloud-computing providers and data center managers seek more accurate tools to determine the cost of IT equipment in order to more accurately determine the cost of IT services and project the future cost of IT services.
Methods and systems described herein may be used to predict parameters in a dataset of an identified piece of IT equipment stored in a reference library database. An automated method identifies datasets in the reference library database in the same category of IT equipment as a piece of IT equipment identified as having incomplete or inaccurate dataset information. Each dataset comprises configuration parameters, non-parametric information, and cost of a piece of IT equipment of a data center. The non-parametric information in each dataset is encoded into encoded parameters that represent the non-parametric information. The configuration parameters, encoded parameters, and cost of each piece of IT equipment in the category are identified as equipment parameters. Each set of equipment parameters corresponds to a data point in a multi-dimensional space. Clustering is applied to the data points to determine classes of IT equipment such that each piece of IT equipment in the category belongs to one of the classes. A generalized linear model is computed for each class of IT equipment based on the equipment parameters of the IT equipment in the class. Methods then determine the class of the identified piece of IT equipment as the class that gives the minimum of the squared distances between equipment parameters of the identified piece of IT equipment and the equipment parameters in each class. A predicted equipment parameter of the identified piece of IT equipment is computed using the generalized linear model associated with the class of IT equipment to which the identified piece of IT equipment belongs. The predicted equipment parameter can be used to complete the dataset of the identified piece of IT equipment.
There are many different types of computer-system architectures deployed in a data center. System architectures differ from one another in the number of different memories, including different types of hierarchical cache memories, the number of processors and the connectivity of the processors with other system components, the number of internal communications busses and serial links, and in many other ways.
Datasets of component information, non-parametric information, and costs associated with each piece of IT equipment deployed in a data center are stored in a reference library database.
A piece of IT equipment to be deployed in the data center or already deployed in the data center may have incomplete dataset information. The identified piece of IT equipment can be a server computer, a workstation, a desktop computer, a network switch, or a router. Methods and systems described below predict parameters in a dataset of the identified piece of IT equipment based on datasets of the same category of IT equipment stored in a reference library database. Datasets of IT equipment that are in the same category of IT equipment as the identified piece of IT equipment are determined. Non-parametric information entries in each dataset are identified and encoded into numerical values called “encoded parameters.”
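The encoding step can be illustrated with a minimal Python sketch. The description does not state how non-parametric entries are mapped to numbers, so the integer label encoding used here is an assumption, as are the function name `encode_non_parametric` and the field names `vendor`, `cpu_cores`, and `memory_gb`.

```python
def encode_non_parametric(datasets, fields):
    """Replace each non-numeric entry in the named fields with an integer code.

    Assumption: a simple label encoding, one codebook per field.
    """
    codebooks = {f: {} for f in fields}
    encoded = []
    for record in datasets:
        row = dict(record)
        for f in fields:
            book = codebooks[f]
            # The first time a value is seen it receives the next unused integer.
            row[f] = book.setdefault(record[f], len(book))
        encoded.append(row)
    return encoded, codebooks


# Hypothetical datasets for three server computers.
servers = [
    {"vendor": "VendorA", "cpu_cores": 16, "memory_gb": 64},
    {"vendor": "VendorB", "cpu_cores": 32, "memory_gb": 128},
    {"vendor": "VendorA", "cpu_cores": 8, "memory_gb": 32},
]
encoded, codebooks = encode_non_parametric(servers, ["vendor"])
```

After encoding, every entry of a dataset is numeric, so the dataset can be treated as a tuple of equipment parameters.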
In general, an M-tuple of equipment parameters associated with a piece of IT equipment corresponds to a data point in an M-dimensional space. Let N be the number of pieces of IT equipment of the same category deployed in a data center. The categories of IT equipment include server computers, workstations, routers, network switches, data-storage devices or any other type of equipment deployed in a data center. The M-tuples of N pieces of the IT equipment form N data points in the M-dimensional space.
$$\vec{X}_n = (X_{n,1}, X_{n,2}, \ldots, X_{n,M}, Y_n) \tag{1}$$
The full set of data points associated with the category of IT equipment is given by:
$$X = \{\vec{X}_n\}_{n=1}^{N} \tag{2}$$
As shown in the Example of
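$$C_i^{(m)} = \{\vec{X}_n : \|\vec{X}_n - \vec{q}_i^{(m)}\| \le \|\vec{X}_n - \vec{q}_j^{(m)}\| \;\; \forall j,\; 1 \le j \le k\} \tag{3}$$
where $C_i^{(m)}$ is the i-th cluster, $i = 1, 2, \ldots, k$; $\vec{q}_i^{(m)}$ is the center of the i-th cluster; and $m$ is the iteration index.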
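The value of the cluster center $\vec{q}_i^{(m)}$ is the mean value of the data points in the i-th cluster, which is computed as follows:
$$\vec{q}_i^{(m+1)} = \frac{1}{|C_i^{(m)}|} \sum_{\vec{X}_n \in C_i^{(m)}} \vec{X}_n \tag{4}$$
where $|C_i^{(m)}|$ is the number of data points in the i-th cluster.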
For each iteration m, Equation (3) is used to determine whether each data point belongs to the i-th cluster, and the cluster center is then computed according to Equation (4). The computational operations represented by Equations (3) and (4) are repeated for each value of m until the data points assigned to the k clusters do not change. The resulting clusters are represented by:
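$$C_i = \{\vec{X}_p\}_{p=1}^{N_i} \tag{5}$$
where $N_i$ is the number of data points in the cluster $C_i$, $i = 1, 2, \ldots, k$, and $p$ is a data point index within the cluster.
The numbers of data points in the clusters sum to N (i.e., $N = N_1 + N_2 + \ldots + N_k$).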
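The alternation between the assignment of Equation (3) and the center update of Equation (4) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the disclosed implementation: the function names, the `max_iter` safeguard, and the choice to keep the old center when a cluster becomes empty are all assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iter=100):
    """Repeat the Equation (3) assignment and the Equation (4) update
    until the assignment of data points to clusters stops changing."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Equation (3): each point joins the cluster with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Equation (4): each center moves to the mean of its cluster.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = mean_point(members)
    return assignment, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centers = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

In this toy example the first two points form one cluster and the last two form the other, and the loop stops on the second pass because the assignment no longer changes.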
Each cluster is then tested to determine whether the data assigned to the cluster are distributed according to a Gaussian distribution about the corresponding cluster center. A significance level, $\alpha$, is selected for the test. For each cluster $C_i$ with cluster center $\vec{q}_i$, two child cluster centers are initialized as follows:
$$\vec{q}_i^{+} = \vec{q}_i + \vec{m} \tag{6a}$$
$$\vec{q}_i^{-} = \vec{q}_i - \vec{m} \tag{6b}$$
In one implementation, the vector $\vec{m}$ is an M-dimensional randomly selected vector with the constraint that the length $\|\vec{m}\|$ is small compared to distortion in the data points of the cluster. In another implementation, principal component analysis is applied to data points in the cluster $C_i$ to determine the eigenvector, $\vec{s}$, with the largest eigenvalue $\lambda$. The eigenvector $\vec{s}$ points in the direction of greatest spread in the cluster of data points and is identified by the corresponding largest eigenvalue $\lambda$. In this implementation, the vector $\vec{m} = \vec{s}\sqrt{2\lambda/\pi}$.
K-means clustering, as described above with reference to Equations (3) and (4), is then applied to data points in the cluster $C_i$ for the two child cluster centers $\vec{q}_i^{+}$ and $\vec{q}_i^{-}$. The two child cluster centers are relocated to identify two sub-clusters of the original cluster $C_i$. When the final iteration of k-means clustering applied to data points in the cluster $C_i$ is complete, the final relocated child cluster centers are denoted by $\vec{q}_i^{+\prime}$ and $\vec{q}_i^{-\prime}$, and an M-dimensional vector is formed between the relocated child cluster centers as follows:
$$\vec{v} = \vec{q}_i^{+\prime} - \vec{q}_i^{-\prime} \tag{7}$$
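The data points in the cluster $C_i$ are projected onto a line defined by the vector $\vec{v}$ between the relocated child cluster centers as follows:
$$X'_p = \frac{\langle \vec{X}_p, \vec{v} \rangle}{\|\vec{v}\|^2} \tag{8}$$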
The set of projected data points is given by:
$$C'_i = \{X'_p\}_{p=1}^{N_i} \tag{9}$$
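The projected data points lie along the vector formed between the relocated child cluster centers. The projected data points are transformed to zero mean and a variance of one by applying Equation (10) as follows:
$$X'_{(p)} = \frac{X'_p - \mu'}{\sigma'} \tag{10}$$
The mean of the projected data points is given by
$$\mu' = \frac{1}{N_i} \sum_{p=1}^{N_i} X'_p \tag{11}$$
The variance of the projected data points is given by:
$$\sigma'^2 = \frac{1}{N_i} \sum_{p=1}^{N_i} (X'_p - \mu')^2 \tag{12}$$
The set of projected data points with zero mean and variance of one is given by:
$$C'_{(i)} = \{X'_{(p)}\}_{p=1}^{N_i} \tag{13}$$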
The cumulative distribution function for a normal distribution with zero mean and variance one, N(0,1), is applied to the projected data points in Equation (13) to compute a distribution of projected data points:
A statistical test value is computed for the distribution of projected data points:
When the statistical test value is less than the significance level, as represented by the condition
$$A_*^2(Z_{(i)}) < \alpha \tag{16}$$
the relocated child cluster centers $\vec{q}_i^{+\prime}$ and $\vec{q}_i^{-\prime}$ are rejected and the original cluster center $\vec{q}_i$ is accepted. On the other hand, when the condition in Equation (16) is not satisfied, the original cluster center $\vec{q}_i$ is rejected and the relocated child cluster centers $\vec{q}_i^{+\prime}$ and $\vec{q}_i^{-\prime}$ are accepted as the cluster centers of two sub-clusters of the original cluster.
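The projection, standardization, and normality test described above can be sketched as follows. The statistic computed here is the Anderson-Darling statistic with the small-sample correction used in G-means clustering; that choice is an assumption, since the surviving text does not name the test. The function names and the example vector `v`, which stands in for the vector between relocated child cluster centers, are also hypothetical. The sketch only computes the statistic and leaves the comparison against a threshold to the caller.

```python
import math
import random


def project(points, v):
    """Project each point onto the line defined by v (scalar coordinate)."""
    norm_sq = sum(c * c for c in v)
    return [sum(x * c for x, c in zip(p, v)) / norm_sq for p in points]


def anderson_darling_normal(values):
    """Corrected Anderson-Darling statistic for a 1-D sample.

    The sample is standardized to zero mean and unit variance, sorted,
    and passed through the N(0,1) cumulative distribution function.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    ordered = sorted((x - mean) / std for x in values)
    cdf = [0.5 * (1.0 + math.erf(z / math.sqrt(2.0))) for z in ordered]
    tiny = 1e-12  # guard against log(0) for extreme values
    cdf = [min(max(c, tiny), 1.0 - tiny) for c in cdf]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    # Correction for mean and variance estimated from the data (assumed form).
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)


rng = random.Random(0)
gaussian_cluster = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
bimodal_cluster = [(-5.0 + 0.01 * i, 0.0) for i in range(100)] + [
    (5.0 + 0.01 * i, 0.0) for i in range(100)
]
v = (1.0, 0.0)  # hypothetical vector between relocated child centers
stat_gaussian = anderson_darling_normal(project(gaussian_cluster, v))
stat_bimodal = anderson_darling_normal(project(bimodal_cluster, v))
```

A cluster made of two well-separated groups yields a much larger statistic than a cluster drawn from a single Gaussian, which is the signal used to decide whether a cluster should be split.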
Each cluster $C_i$ of $N_i$ data points is partitioned into training data with L data points and validation data with $N_i - L$ data points, with the validation data set having fewer data points. Each cluster may be partitioned by randomly selecting data points to serve as training data while the remaining data points are used as validation data. For example, in certain implementations, each cluster of data points may be partitioned into 70% training data and 30% validation data. In other implementations, each cluster of data points may be partitioned into 80% training data and 20% validation data. In still other implementations, each cluster of data points may be partitioned into 90% training data and 10% validation data.
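A random partition of one cluster into training and validation data might look like the following sketch. The function name `partition_cluster`, the fixed `seed`, and the rounding rule for the split index are assumptions made for illustration.

```python
import random


def partition_cluster(cluster, train_fraction=0.7, seed=0):
    """Randomly split a cluster into training data and validation data."""
    rng = random.Random(seed)
    shuffled = list(cluster)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    split = int(round(train_fraction * len(shuffled)))
    return shuffled[:split], shuffled[split:]


cluster = list(range(10))  # stand-in for ten data points
training, validation = partition_cluster(cluster, train_fraction=0.7)
```

With ten data points and a 70% training fraction, seven points are used for training and three for validation.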
The L training data points are used to construct a generalized linear model for each class (i.e., cluster) of IT equipment.
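A generalized linear model is represented by
$$h(\mu_l) = \beta_0 + \beta_1 X_{l,1} + \beta_2 X_{l,2} + \ldots + \beta_M X_{l,M} \tag{17}$$
where $\beta_0, \beta_1, \ldots, \beta_M$ are predictor coefficients; $X_{l,1}, X_{l,2}, \ldots, X_{l,M}$ are the regressors of the l-th training data point, $l = 1, 2, \ldots, L$; $\mu_l$ is the mean of the response parameter $Y_l$; and $h$ is a link function.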
The response parameters $Y_1, Y_2, \ldots, Y_L$ are dependent variables that are distributed according to a particular distribution, such as the normal distribution, binomial distribution, Poisson distribution, or Gamma distribution, just to name a few. The mean $\mu_l$ is the expected value of the response parameter, and the link function relates it to the linear predictor $\eta_l = h(\mu_l)$ on the right-hand side of Equation (17). The mean is given by:
$$\mu_l = E(Y_l) \tag{18}$$
Examples of link functions are listed in the following Table:
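Link function | $\eta_l = h(\mu_l)$ | $\mu_l = h^{-1}(\eta_l)$ |
Identity | $\mu_l$ | $\eta_l$ |
Log | $\ln(\mu_l)$ | $e^{\eta_l}$ |
Inverse | $\mu_l^{-1}$ | $\eta_l^{-1}$ |
Inverse-square | $\mu_l^{-2}$ | $\eta_l^{-1/2}$ |
Square-root | $\sqrt{\mu_l}$ | $\eta_l^{2}$ |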
For example, when the response parameters are distributed according to a Poisson distribution, the link function is the log function. When the response parameters are distributed according to a Normal distribution, the link function is the identity function.
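The link functions and their inverses can be written down directly, which makes it easy to check that each pair round-trips. The dictionary name `LINKS` and the helper `round_trip` are hypothetical names chosen for this sketch.

```python
import math

# Each entry maps a link name to the pair (h, h_inverse).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "inverse_square": (lambda mu: mu ** -2, lambda eta: eta ** -0.5),
    "square_root": (math.sqrt, lambda eta: eta ** 2),
}


def round_trip(name, mu):
    """Apply a link function and then its inverse to a positive mean."""
    h, h_inverse = LINKS[name]
    return h_inverse(h(mu))


recovered = {name: round_trip(name, 2.5) for name in LINKS}
```

Each recovered value equals the original mean of 2.5 up to floating-point rounding, confirming that the second and third columns of the table are inverses of each other for positive means.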
The system of equations in
$$\beta_m^{(r+1)} = \beta_m^{(r)} + S(\beta_m^{(r)})\,E(H(\beta_m^{(r)})) \tag{19}$$
where
The predictor coefficients can be computed iteratively using iterative weighted least squares. The validation data is used to validate the iteratively computed predictor coefficients. Consider a set of predictor coefficients $\beta_0^j, \beta_1^j, \beta_2^j, \ldots, \beta_M^j$ obtained for the j-th cluster using the training data of the j-th cluster. Let a validation data point in the j-th cluster be represented by the regressors $X_1^j, X_2^j, \ldots, X_M^j$ and a response parameter $Y^j$. The regressors are substituted into the generalized linear model to obtain an approximate response parameter as follows:
$$Y_0^j = h^{-1}(\beta_0^j + \beta_1^j X_1^j + \beta_2^j X_2^j + \ldots + \beta_M^j X_M^j) \tag{20a}$$
where $Y_0^j$ is an approximation of the actual response parameter $Y^j$.
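The iterative weighted least squares computation can be sketched in plain Python for one concrete case. The sketch assumes Poisson-distributed responses with the log link, for which the weights equal the current means; the function names `solve` and `fit_poisson_glm`, the fixed iteration count, and the synthetic data are assumptions for illustration only.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_poisson_glm(regressors, responses, iterations=25):
    """Iteratively reweighted least squares for a log-link Poisson model."""
    rows = [[1.0] + list(x) for x in regressors]  # leading 1 for beta_0
    p = len(rows[0])
    beta = [math.log(sum(responses) / len(responses))] + [0.0] * (p - 1)
    for _ in range(iterations):
        eta = [sum(b * v for b, v in zip(beta, row)) for row in rows]
        mu = [math.exp(e) for e in eta]
        # Working response; for the log link the weights are the means mu.
        z = [e + (y - m) / m for e, y, m in zip(eta, responses, mu)]
        xtwx = [
            [sum(m * row[i] * row[j] for m, row in zip(mu, rows)) for j in range(p)]
            for i in range(p)
        ]
        xtwz = [
            sum(m * row[i] * zi for m, row, zi in zip(mu, rows, z)) for i in range(p)
        ]
        beta = solve(xtwx, xtwz)
    return beta


# Synthetic training data whose means follow exp(0.5 + 0.3 x) exactly.
xs = [(float(x),) for x in range(10)]
ys = [math.exp(0.5 + 0.3 * x) for x in range(10)]
beta = fit_poisson_glm(xs, ys)
```

Because the synthetic responses lie exactly on the model curve, the fitted coefficients recover the intercept 0.5 and slope 0.3 used to generate them.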
The operation of Equation (20a) is repeated for the regressors of each of the $N_j - L$ validation data points in the j-th cluster to obtain a set of corresponding approximate response parameters
$$\vec{Y}_0 = \{Y_{0,1}, Y_{0,2}, \ldots, Y_{0,N_j-L}\}$$
The set of actual response parameters of the regressors in the validation data is given by
$$\vec{Y} = \{Y_1, Y_2, \ldots, Y_{N_j-L}\}$$
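When the approximate response parameters for the validation data satisfy the condition
$$\|\vec{Y}_0 - \vec{Y}\| < \varepsilon \tag{20b}$$
where $\varepsilon$ is a threshold, the predictor coefficients are accepted for the generalized linear model of the j-th cluster. Otherwise, the predictor coefficients are discarded.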
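The validation check of Equations (20a) and (20b) can be sketched as follows. The Euclidean norm is assumed for the difference between approximate and actual response parameters, and the function names, the default log link, and the example values are hypothetical.

```python
import math


def predict(beta, regressors, inverse_link=math.exp):
    """Equation (20a): apply the inverse link to the linear predictor."""
    eta = beta[0] + sum(b * x for b, x in zip(beta[1:], regressors))
    return inverse_link(eta)


def validate(beta, validation_points, epsilon, inverse_link=math.exp):
    """Equation (20b): accept the coefficients when the error is below epsilon.

    validation_points is a list of (regressors, actual_response) pairs.
    Assumption: the norm is the Euclidean norm.
    """
    approx = [predict(beta, x, inverse_link) for x, _ in validation_points]
    actual = [y for _, y in validation_points]
    error = math.sqrt(sum((a - y) ** 2 for a, y in zip(approx, actual)))
    return error < epsilon, error


validation_points = [((1.0,), math.exp(0.8)), ((2.0,), math.exp(1.1))]
accepted_good, error_good = validate([0.5, 0.3], validation_points, epsilon=0.1)
accepted_bad, error_bad = validate([0.0, 0.0], validation_points, epsilon=0.1)
```

Coefficients that reproduce the validation responses are accepted, while coefficients that miss them by more than the threshold are rejected and would be discarded.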
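The predictor coefficients and link function can be used to compute an unknown response parameter of an identified piece of IT equipment in a category of IT equipment. For each class of IT equipment, a sum of square distances is computed from the known regressor parameters of the identified piece of IT equipment to the regressor parameters of each piece of IT equipment in the class as follows:
$$D_n = \sum_{m=1}^{M} (X_m^u - X_{n,m})^2 \tag{21}$$
where $X_m^u$ is the m-th known regressor parameter of the identified piece of IT equipment, and $X_{n,m}$ is the m-th regressor parameter of the n-th piece of IT equipment, $n = 1, 2, \ldots, N$.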
The square distances between the identified piece of IT equipment with an unknown response parameter and the pieces of IT equipment in the category are denoted by $\{D_1, D_2, \ldots, D_N\}$. The square distances can be rank ordered to determine the minimum square distance in the set of square distances, denoted by:
$$D_j = \min\{D_1, D_2, \ldots, D_N\} \tag{22}$$
The identified piece of IT equipment belongs to the class of IT equipment with data points in the j-th cluster $C_j$. An approximation of the unknown response parameter of the piece of IT equipment is computed from the predictor coefficients of the j-th cluster $C_j$ as follows:
$$Y_0^u = h^{-1}(\beta_0^j + \beta_1^j X_1^u + \beta_2^j X_2^u + \ldots + \beta_M^j X_M^u) \tag{23}$$
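The class assignment by minimum square distance and the subsequent prediction can be sketched together. The class labels, regressor values, predictor coefficients, and the identity link used here are all hypothetical values chosen so the arithmetic is easy to follow.

```python
import math


def squared_distance(a, b):
    """Sum of square differences between corresponding regressors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_class(x_u, classes):
    """Return the class containing the piece of equipment nearest to x_u."""
    best_label, best_distance = None, math.inf
    for label, members in classes.items():
        for member in members:
            d = squared_distance(x_u, member)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label, best_distance


def predict_response(x_u, beta, inverse_link=lambda eta: eta):
    """Apply the class model to the known regressors of the identified piece."""
    eta = beta[0] + sum(b * x for b, x in zip(beta[1:], x_u))
    return inverse_link(eta)


# Hypothetical regressors (CPU cores, memory in GB) for two classes of servers.
classes = {
    "small": [(8.0, 32.0), (16.0, 64.0)],
    "large": [(64.0, 512.0), (96.0, 768.0)],
}
# Hypothetical predictor coefficients (beta_0, beta_1, beta_2) per class.
coefficients = {"small": [100.0, 10.0, 2.0], "large": [500.0, 12.0, 3.0]}

x_u = (12.0, 48.0)  # known regressors of the identified server; cost unknown
label, distance = assign_class(x_u, classes)
predicted_cost = predict_response(x_u, coefficients[label])
```

The identified server is closest to members of the "small" class, so the coefficients of that class are used to compute its predicted cost.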
For example, suppose configuration and encoded parameters are known for a server computer, but the cost of the server computer is unknown.
It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Gaurav, Kumar, Jha, Chandrashekhar, Singh, Jusvinder, George, Jobin, Sahu, Prateek
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Dec 13 2017 | JHA, CHANDRASHEKHAR | VMWARE, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 044951 | /0142 | |
Dec 13 2017 | GEORGE, JOBIN | VMWARE, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 044951 | /0142 | |
Dec 13 2017 | GAURAV, KUMAR | VMWARE, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 044951 | /0142 | |
Dec 13 2017 | SINGH, JUSVINDER | VMWARE, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 044951 | /0142 | |
Feb 14 2018 | SAHU, PRATEEK | VMWARE, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 044951 | /0142 | |
Feb 16 2018 | VMware, Inc. | (assignment on the face of the patent) | / | |||
Nov 21 2023 | VMWARE, INC | VMware LLC | CHANGE OF NAME SEE DOCUMENT FOR DETAILS | 067102 | /0395 |
Date | Maintenance Fee Events |
Feb 16 2018 | BIG: Entity status set to Undiscounted (note the period is included in the code). |
Nov 22 2023 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Date | Maintenance Schedule |
Jun 09 2023 | 4 years fee payment window open |
Dec 09 2023 | 6 months grace period start (w surcharge) |
Jun 09 2024 | patent expiry (for year 4) |
Jun 09 2026 | 2 years to revive unintentionally abandoned end. (for year 4) |
Jun 09 2027 | 8 years fee payment window open |
Dec 09 2027 | 6 months grace period start (w surcharge) |
Jun 09 2028 | patent expiry (for year 8) |
Jun 09 2030 | 2 years to revive unintentionally abandoned end. (for year 8) |
Jun 09 2031 | 12 years fee payment window open |
Dec 09 2031 | 6 months grace period start (w surcharge) |
Jun 09 2032 | patent expiry (for year 12) |
Jun 09 2034 | 2 years to revive unintentionally abandoned end. (for year 12) |