A method for diagnosing system health with system event logs is provided. The method includes receiving a plurality of event logs and health indicator states from a system; transducing the plurality of event logs into numeric-based metrics of the system; and deriving, based on the transduced numeric-based metrics, at least one model of the system that correlates the plurality of event logs to the corresponding health indicator states.
Claims
1. A method for diagnosing system health with system event logs, the method comprising:
receiving a plurality of text event logs and health indicator states from a system;
transducing the plurality of text event logs into numeric-based metrics of the system, wherein transducing includes organizing textual event messages of the plurality of text event logs into clusters based on a similarity between the textual event messages and based on whether the similarity is greater than a similarity threshold; and
deriving, based on the transduced numeric-based metrics, at least one model of the system that correlates the plurality of text event logs to the corresponding health indicator states.
19. A computer readable non-transitory medium on which is encoded computer executable programming code that includes computer execution instructions to:
receive a plurality of text event logs and health indicator states from a system;
transduce the plurality of text event logs into numeric-based metrics of the system, including organizing textual event messages of the plurality of text event logs into clusters based on a similarity between the textual event messages and based on whether the similarity is greater than a similarity threshold; and
derive, based on the transduced numeric-based metrics, at least one model of the system that correlates the plurality of text event logs to the corresponding health indicator states.
12. A system for providing automated health diagnosis of a computing system, comprising:
a metrics transducer module that operates to receive a plurality of text event logs and health indicator states of the computing system and to transduce the plurality of text event logs into numeric-based metrics of the system, wherein the metrics transducer module organizes textual event messages of the plurality of text event logs into clusters based on a similarity between the textual event messages and based on whether the similarity is greater than a similarity threshold; and
a model building engine, executed by a processor, that operates to derive, based on the transduced numeric-based metrics, at least one model of the system that correlates the plurality of text event logs to the corresponding health indicator states.
2. The method of claim 1, wherein transducing comprises:
transducing the textual event messages in the plurality of text event logs into the numeric-based metrics indicating one or more conditions of the system or at least one application executing therein.
3. The method of claim 1, wherein transducing comprises:
computing one of the numeric-based metrics for each of the clusters based on the organized clusters.
4. The method of claim 1, wherein organizing the textual event messages into clusters comprises:
providing a distance function for clustering the plurality of text event logs;
providing the similarity threshold;
providing a cluster set;
computing a distance between each of the textual event messages and each cluster found in the cluster set, based on the provided distance function;
comparing the computed distance with the provided similarity threshold;
responsive to each of the computed distances being smaller than the provided similarity threshold, adding the associated textual event message as a new cluster in the cluster set.
5. The method of claim 4, further comprising:
responsive to the computed distance being greater than or equal to the provided similarity threshold, adding a count to the cluster associated with the computed distance.
6. The method of claim 5, further comprising:
computing a value for each one of the numeric-based metrics for each cluster found in the cluster set by aggregating the number of counts in each cluster.
7. The method of claim 6, wherein deriving comprises:
deriving the at least one system model correlating each of the computed numeric-based metrics with one of the health indicator states.
8. The method of claim 4, further comprising:
initializing the cluster set as an empty set.
9. The method of claim 1, wherein receiving comprises:
receiving the plurality of text event logs and health indicator states over a plurality of predefined time periods, wherein there is at least one of the plurality of text event logs and one of the health indicator states corresponding to each of the plurality of predefined time periods.
10. The method of
11. The method of claim 1, wherein transducing further comprises:
distilling, from the organizing of textual event messages, a set of prototypical event messages from the plurality of text event logs;
counting a number of times each of the prototypical event messages appears in a predefined time period; and
setting the count for each of the prototypical event messages as one of the transduced numeric-based metrics.
13. The system of claim 12, wherein
each of the transduced numeric-based metrics includes an identification of the each transduced numeric-based metric and a numerical value for the each transduced numeric-based metric.
14. The system of
15. The system of
16. The system of
17. The system of
18. The system of
20. The computer-readable non-transitory medium of
Description
The complexity of current computing systems and the applications provided therein is quickly outgrowing the human ability to manage them at an economic cost. For example, it is common to find data centers with thousands of host computing systems servicing hundreds to thousands of applications and components that provide web, computation, and other services. In such distributed environments, diagnosis of failures and performance problems is an extremely difficult task for human operators. To facilitate diagnosis, commercial and open source management tools have been developed to measure and collect data from systems, networks, and applications in the form of system metrics (i.e., data measurements), application metrics, and system and application event logs. However, with the large amounts of data collected, the operator is faced with the daunting, increasingly unmanageable task of manually going through the data. These challenges have led researchers to propose the use of automated machine learning and statistical learning theory methods to aid with the detection, diagnosis, and repair of distributed systems and applications.
As referred herein, system and application event logs (hereinafter, "event logs" or "logs") are records of system (both hardware and software) and application (software) events that have taken place in a system. Examples of event logs include but are not limited to failures to start a component or complete an action, system or application performance reaching predetermined thresholds, system or application errors, security events, and network connection events. Each event entry typically includes a date stamp, a time stamp, and a message detailing the event. Unlike system metrics and application metrics, which contain structured numeric data, event logs are semi-structured and typically contain free text information. Event logs are essentially text messages written by the developers of the system and application. There are potentially many different messages. For example, more than 280,000 distinct event messages (after removing timestamps and fields containing numerical symbols only) were found in the event logs collected on one instance of an Information Technology (IT) system over a 9-month period.
Some prior solutions for diagnosing and repairing distributed systems and applications involve the use of search engines (e.g., as available from the Splunk Company of San Francisco, Splunk.com) or analysis modules (e.g., as available from LogLogic, Inc. of San Jose, Calif., loglogic.com) to perform indexing and parsing of the logs, whereby users have to provide adequate search queries to find desired information about the system or application health in the logs. Other prior solutions simply provide analyses of logs without correlating them with defined application or system health, and typically require a priori knowledge of the log structures and types of log messages. This leads to finding many types of data patterns in the logs that may not be important for diagnosing or forecasting system or application behavior.
Embodiments are illustrated by way of example and not by limitation in the following figures, in which like numerals indicate like elements.
For simplicity and illustrative purposes, the principles of the embodiments are described by referring mainly to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one of ordinary skill in the art, that the embodiments may be practiced without limitation to these specific details. In other instances, well known methods and structures have not been described in detail so as not to unnecessarily obscure the embodiments.
Because of the sheer size and number of types of event logs that may be generated in an IT system, there is a need for a systematic approach to distill a smaller set of "prototypical" or exemplary feature messages or clusters (hereinafter, "PM set") from the event logs to simplify the monitoring of such logs. Once the PM set is defined, it is used to transduce the text event logs into numeric-based metrics for input into learning probabilistic classifier models (hereinafter, "classifier models" or "models") that capture the correlation between the numeric-based metrics and predefined system health indicators. One type of such classifier models is described in U.S. Patent Application Publication No. 2006/0188011 (hereinafter, "Publication PAP-011"), with publication date of Aug. 24, 2006, of U.S. patent application Ser. No. 10/987,611, filed Nov. 12, 2004, which is herein incorporated by reference in its entirety. Alternative embodiments are contemplated wherein other models for correlating the numeric-based metrics and the system health states are applicable as well. As referred herein, numeric-based metrics of a system are numeric data measurements (as opposed to textual messages in event logs) indicating conditions of the system and applications operating therein. As also referred herein, a system health indicator provides a status or state of the system in accordance with predefined acceptable thresholds. A system health indicator may have one or more states to indicate different levels of system health. An example of a system health indicator is a service level objective (SLO) typically found in a service level agreement (SLA). A system SLO may have two states, compliance or violation, to indicate whether such an SLO has been complied with or violated in the system. It also may have more than two states to indicate different levels of system health, such as violation, 50% compliance, full compliance, etc.
Described herein are methods and systems that provide an efficient representation of event logs of an IT system and applications therein that is amenable to modeling techniques that produce diagnosis or forecasting of the system health. As referred herein, and as understood in the art, information technology, or IT, encompasses all forms of technology used to create, store, exchange, and utilize information in its various forms, including but not limited to business data, conversations, still images, motion pictures, and multimedia presentations, as well as the design, development, installation, and implementation of hardware and software information or computing systems and software applications. IT distributed environments may be employed, for example, by Internet Service Providers (ISP), web merchants, and web search engines to provide IT applications and services to users.
System
The IT system 110 is instrumented to generate, in any manner known in the art, system event logs, which include event logs of both the system hardware and the software applications therein, and monitored values for the predefined system health indicators. For example, commercially available data collection tools such as the OpenView application by the Hewlett Packard® Company and Microsoft NT 4.0 Performance Counters by Microsoft® may be used to monitor the IT system 110.
The metrics transducer module 120 is operable to receive the event logs generated by the system 110 and transduce the event logs into metrics. In one embodiment, the metrics transducer module 120 distills a PM set from the received event logs by performing text clustering. The metrics transducer module 120 performs text clustering by combining similar event messages in the event logs to form a cluster. For example, messages generated by the same fprintf statements with slightly different parameters may be organized or classified into a single cluster. In effect, message clustering reverse engineers the "templates" that were used to generate the event messages and ignores the minor differences. In one embodiment, the message clustering is sequentially performed in an incremental fashion because, over the lifetime of the system 110, code changes may be pushed into production that result in new messages appearing. Alternatively, it is possible to wait until all possible event messages are found in the collected event logs before they are batch clustered. The sequential clustering methodology is now described. However, it should be understood by one skilled in the art, based on the present disclosure herein, that minor modifications may be made in order to apply such a methodology to batch clustering.
According to one embodiment, the similarity between two text messages found in the event logs is measured with a cosine distance function:

$$D_{\cos}(A,B) \;=\; \frac{1}{\sqrt{|A|\,|B|}} \sum_{i=1}^{\min(|A|,|B|)} \mathbf{1}[a_i = b_i] \qquad (1)$$

where A and B are the messages, |·| represents the number of words in a message, and $a_i$ and $b_i$ are the i'th words in messages A and B, respectively. The cosine distance is a number between 0 and 1. When $D_{\cos}=1$, the two messages A and B are identical, and when $D_{\cos}=0$, the two messages are completely different. Upon seeing a new message, the clustering method compares the new message with the existing clusters (each cluster representing a prototypical feature message). If there exists a cluster to which the cosine distance is larger than a predefined threshold (e.g., 0.85), then the message is added to the existing cluster count. Otherwise, a new cluster is created with the new message. For example, the following event messages:
java.net.connectexception: db server connection refused; error host001; and
java.net.connectexception: db server connection refused; error code
are clustered together because their cosine distance is 0.857 (>0.85).
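The following is a minimal Python sketch of this sequential clustering, assuming whitespace tokenization and the 0.85 threshold from the example above; the function and variable names are illustrative, not taken from the patent:

```python
from collections import OrderedDict

def cosine_distance(a, b):
    """Position-wise cosine distance between two messages (1.0 = identical)."""
    wa, wb = a.split(), b.split()
    matches = sum(x == y for x, y in zip(wa, wb))  # matching words, position-wise
    return matches / (len(wa) * len(wb)) ** 0.5

def cluster_messages(messages, threshold=0.85):
    """Sequentially assign each message to the nearest prototype cluster,
    or start a new cluster when no prototype is similar enough."""
    clusters = OrderedDict()  # prototypical feature message -> count
    for msg in messages:
        best_proto, best_dist = None, -1.0
        for proto in clusters:
            d = cosine_distance(msg, proto)
            if d > best_dist:
                best_proto, best_dist = proto, d
        if best_dist > threshold:   # similar enough: add to the existing cluster count
            clusters[best_proto] += 1
        else:                       # new prototypical feature message
            clusters[msg] = 1
    return clusters

logs = [
    "java.net.connectexception: db server connection refused; error host001",
    "java.net.connectexception: db server connection refused; error code",
]
print(cluster_messages(logs))  # one cluster with count 2 (distance 0.857 > 0.85)
```

Note that the first message automatically seeds the first cluster, since the cluster set starts empty, matching the behavior described at step 325 below.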
The metrics transducer module 120 then counts the number of times each prototypical feature message appears in a given time interval (set to match the interval of the predefined system health indicators) and uses these counts as the input metrics for the classifier models. It should be noted that the statistical properties of these feature-message-based metrics are different from those of system metrics or application metrics. In one embodiment, a different distribution for these input metrics is used in the classifier models. For system metrics, the normal distribution is used; whereas, for feature-message-based metrics, a modified Gamma distribution is used, which the inventors have observed to fit better than the normal and other distributions. Formally, the modified Gamma distribution follows:

$$P(x) \;=\; p\,\mathbf{1}[x = 0] \;+\; (1-p)\,\frac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}\,\mathbf{1}[x > 0] \qquad (2)$$

where p is the weight of the point mass at zero and α and β are the shape and scale of the Gamma component. The value of x is always a non-negative integer. The modified Gamma distribution fits the feature message counts better because these counts exhibit a heavy tail with an additional large concentration of 0 counts.
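As an illustration only, the following Python sketch fits such a zero-inflated ("modified") Gamma of the form assumed in Equation 2, using SciPy's ordinary Gamma for the positive counts; the exact estimator used by the inventors is not specified here, so this is one plausible fit, not the patent's method:

```python
import numpy as np
from scipy.stats import gamma

def fit_modified_gamma(counts):
    """Fit a point mass at zero plus a Gamma density over the positive counts."""
    counts = np.asarray(counts, dtype=float)
    p_zero = np.mean(counts == 0)                 # weight of the spike at x = 0
    positive = counts[counts > 0]
    alpha, _, beta = gamma.fit(positive, floc=0)  # shape and scale of the heavy tail
    return p_zero, alpha, beta

def modified_gamma_pdf(x, p_zero, alpha, beta):
    """Likelihood of a count under the fitted modified Gamma distribution."""
    x = np.asarray(x, dtype=float)
    tail = np.zeros_like(x)
    pos = x > 0
    tail[pos] = (1.0 - p_zero) * gamma.pdf(x[pos], alpha, scale=beta)
    return np.where(x == 0, p_zero, tail)

# Hypothetical feature-message counts per window: many zeros, heavy tail.
counts = [0, 0, 0, 1, 0, 2, 0, 0, 14, 0, 1, 0, 37, 0, 0, 3]
print(modified_gamma_pdf([0, 1, 14], *fit_modified_gamma(counts)))
```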
The model building engine 130 is operable to receive the input metrics from the metrics transducer module 120 and the monitored values for the predefined system health indicators from the system 110 (directly from the system 110 or through the metrics transducer module 120). It then derives or generates classifier models that correlate the input metrics as transduced from the event logs to the monitored system health indicators, as described in the Publication PAP-011.
The computer system 200 includes one or more processors, such as processor 202, providing an execution platform for executing software. Thus, the computer system 200 includes one or more single-core or multi-core processors from any of a number of computer processor makers, such as Intel, AMD, and Cyrix. As referred herein, a computer processor may be a general-purpose processor, such as a central processing unit (CPU) or any other multi-purpose processor or microprocessor. A computer processor also may be a special-purpose processor, such as a graphics processing unit (GPU), an audio processor, a digital signal processor, or another processor dedicated to one or more processing purposes. Commands and data from the processor 202 are communicated over a communication bus 204 or through point-to-point links with other components in the computer system 200.
The computer system 200 also includes a main memory 206 where software is resident during runtime, and a secondary memory 208. The secondary memory 208 may also be a computer-readable medium (CRM) that may be used to store software programs, applications, or modules that implement the method 300 (as described later), or parts thereof. The main memory 206 and secondary memory 208 (and an optional removable storage unit 214) each includes, for example, a hard disk drive and/or a removable storage drive 212 representing a floppy diskette drive, a magnetic tape drive, a compact disk drive, etc., or a nonvolatile memory where a copy of the software is stored. In one example, the secondary memory 208 also includes ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), or any other electronic, optical, magnetic, or other storage or transmission device capable of providing a processor or processing unit with computer-readable instructions. The computer system 200 includes a display 220 connected via a display adapter 222, and user interfaces comprising one or more input devices 218, such as a keyboard, a mouse, a stylus, and the like. However, the input devices 218 and the display 220 are optional. A network interface 230 is provided for communicating with other computer systems via, for example, a network.
Process
At 310, the text event logs are generated and values of predefined system health indicators, e.g., SLO states (compliance or violation), of the system 110 are monitored by the system 110, using any commercially available data collection tools, such as OpenView software available from the Hewlett Packard® Company and Microsoft NT 4.0 Performance Counters available from Microsoft®. In one embodiment, for each predefined time period, window, or epoch (e.g., 5-minute intervals), there are one or more generated event logs, with a plurality of event messages therein, and a corresponding SLO state $S \in \{s^+, s^-\}$ of the system 110. The generated event logs and monitored SLO states are received or obtained by the metrics transducer module 120.
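For illustration, here is a short Python sketch of this windowing, assuming hypothetical parsed (timestamp, message) records and 5-minute epochs; in the patent, each window would also carry its monitored SLO state:

```python
from datetime import datetime, timedelta

EPOCH = timedelta(minutes=5)  # the predefined time window (e.g., 5 minutes)

def window_of(timestamp, origin):
    """Index of the 5-minute epoch containing this timestamp (illustrative)."""
    return int((timestamp - origin) // EPOCH)

# Hypothetical (timestamp, message) records parsed from the event logs.
origin = datetime(2007, 4, 3, 0, 0)
records = [
    (datetime(2007, 4, 3, 0, 1, 12), "db server connection refused; error host001"),
    (datetime(2007, 4, 3, 0, 4, 55), "db server connection refused; error code"),
    (datetime(2007, 4, 3, 0, 7, 30), "component foo failed to start"),
]
by_window = {}
for ts, msg in records:
    by_window.setdefault(window_of(ts, origin), []).append(msg)
print(by_window)  # {0: [two messages], 1: [one message]} — one list per epoch
```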
At 320, the transducer module 120 transduces the event logs into numeric-based metrics by distilling a PM set from the event logs. It then counts the number of times each prototypical feature message appears in a given time interval and uses these counts as the input metrics for the learning probabilistic classifier models.
At 330, the model building engine 130 receives the transduced numeric-based metrics from the metrics transducer module 120 and the monitored values for the predefined system health indicators from the system 110 (directly from the system 110 or through the metrics transducer module 120). It then computes or derives classifier models that correlate the numeric-based metrics, and thus the generated event logs, to the monitored system health indicators. In one embodiment, the model building engine 130 builds a classifier model, such as a Naïve Bayes model, based on the transduced numeric-based metrics and a corresponding system health indicator state for each predefined time period, as described in the Publication PAP-011.
At 321, a distance function for sequentially clustering the event logs into a PM set is set or provided in the metrics transducer module 120. This distance function may be defined as desired by a user of the environment 100 or any component therein. An example of the distance function is as described earlier in Equation 1.
At 322, a threshold for identifying a similarity between event messages based on the distance function is set or provided in the metrics transducer module 120. As with the distance function, the similarity threshold may be defined as desired by a user. For example, if the cosine distance function as described in Equation 1 is employed, the similarity threshold may be a value near the maximum value of 1 (which indicates the two compared messages are identical). Thus, whenever the calculated distance between two messages is equal to or greater than such a threshold value, the two messages are deemed similar for clustering.
At 323, the PM set is initialized to empty.
At 324, for each predefined time window or period (e.g., each 5-minute interval), the metrics transducer module 120 employs the predefined distance function to compute, in order, a distance between each of the event messages found in the event logs received for such a predefined time window and each prototypical feature message found in the PM set.
At 325, if the computed distances between an event message and all prototypical feature messages are smaller than (or either equal to or smaller than) the predefined similarity threshold, the metrics transducer module 120 designates such an event message as a prototypical feature message in the PM set, i.e., a new member in the PM set, for comparison with other event messages. It should be noted that the first event message is automatically designated as a prototypical feature message because there is initially no other prototypical feature message to compare it against. Accordingly, the PM set is dynamically created for each predefined time window.
At 326, if the computed distance between an event message and a prototypical feature message is greater than or equal to (or just greater than) the predefined similarity threshold, the metrics transducer module 120 maps such an event message to the particular prototypical feature message and increments by one the count of the particular prototypical feature message.
At 327, the metrics transducer module 120 aggregates or counts the number of times each prototypical feature message appears in each predefined period and uses these counts as the input metrics for the learning probabilistic classifier models.
Accordingly, for each predefined time window, the metrics transducer module 120 generates a pair of a vector $\vec{M}$ of values of the transduced numeric-based metrics and a corresponding SLO state $S \in \{s^+, s^-\}$ (compliance or violation, respectively) of the system 110. Each element $m_i$ of the vector $\vec{M}$ contains a value indicating the total number of occurrences of a particular prototypical feature message found in the received event logs for each predefined time window. Thus, for multiple predefined time windows, there are multiple pairs $\langle \vec{M}, S \rangle$. These pairs are input to the model building engine 130 to create a model for each SLO state relating each state to different values and patterns of metrics that are collected and received by the metrics transducer module 120 from the measured system 110.
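As a rough sketch of this last step, the following Python code trains a stand-in Naïve Bayes classifier on hypothetical $\langle \vec{M}, S \rangle$ pairs; the actual model of Publication PAP-011, with its modified Gamma likelihoods, is substituted here by scikit-learn's GaussianNB purely to keep the example short:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training pairs: one metrics vector M (prototype message counts)
# and one SLO state S per 5-minute window; 1 = compliance (s+), 0 = violation (s-).
M = np.array([
    [0, 1, 0],   # counts of each prototypical feature message in the window
    [0, 0, 1],
    [7, 2, 0],
    [9, 0, 4],
])
S = np.array([1, 1, 0, 0])

# Stand-in classifier: GaussianNB rather than the modified-Gamma Naive Bayes
# described in the text, used here only for brevity.
model = GaussianNB().fit(M, S)
print(model.predict([[8, 1, 1]]))  # diagnose the SLO state of a new window
```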
To recap, the systems and methods described herein are operable to provide compact representations of raw textual data in system event logs and transform such representations into numeric-based metrics for system modeling techniques that can produce diagnosis or forecasting of the system health.
What has been described and illustrated herein is an embodiment along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Those skilled in the art will recognize that many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Cohen, Ira, Karlsson, Magnus, Huang, Chengdu
Cited By
Patent | Priority | Assignee | Title |
10002066, | Mar 03 2015 | International Business Machines Corporation | Targeted multi-tiered software stack serviceability |
10318503, | Jul 20 2012 | Ool LLC | Insight and algorithmic clustering for automated synthesis |
10402428, | Apr 29 2013 | Dell Products L P | Event clustering system |
10423624, | Sep 23 2014 | MICRO FOCUS LLC | Event log analysis |
10496465, | Oct 15 2009 | NEC Corporation | System operations management apparatus, system operations management method and program storage medium |
10496900, | May 08 2013 | Xyratex Technology Limited | Methods of clustering computational event logs |
10656979, | Mar 31 2015 | International Business Machines Corporation | Structural and temporal semantics heterogeneous information network (HIN) for process trace clustering |
11120033, | May 16 2018 | NEC Corporation | Computer log retrieval based on multivariate log time series |
11205103, | Dec 09 2016 | The Research Foundation for The State University of New York | Semisupervised autoencoder for sentiment analysis |
11216428, | Jul 20 2012 | Ool LLC | Insight and algorithmic clustering for automated synthesis |
11961015, | Jun 20 2016 | International Business Machines Corporation | System, method, and recording medium for distributed probabilistic eidetic querying, rollback, and replay |
8209567, | Jan 28 2010 | MICRO FOCUS LLC | Message clustering of system event logs |
8533193, | Nov 17 2010 | Hewlett Packard Enterprise Development LP | Managing log entries |
8635617, | Sep 30 2010 | Microsoft Technology Licensing, LLC | Tracking requests that flow between subsystems using transaction identifiers for generating log data |
8700953, | Sep 18 2008 | NEC Corporation | Operation management device, operation management method, and operation management program |
8930757, | Jan 24 2011 | NEC Corporation | Operations management apparatus, operations management method and program |
8959401, | Oct 15 2009 | NEC Corporation | System operations management apparatus, system operations management method and program storage medium |
8983963, | Jul 07 2011 | International Business Machines Corporation | Techniques for comparing and clustering documents |
9158606, | Jan 22 2009 | International Business Machines Corporation | Failure repetition avoidance in data processing |
9336302, | Jul 20 2012 | Ool LLC | Insight and algorithmic clustering for automated synthesis |
9384079, | Oct 15 2009 | NEC Corporation | System operations management apparatus, system operations management method and program storage medium |
9607023, | Jul 20 2012 | Ool LLC | Insight and algorithmic clustering for automated synthesis |
9934123, | Mar 03 2015 | International Business Machines Corporation | Targeted multi-tiered software stack serviceability |
References Cited
Patent | Priority | Assignee | Title |
5991806, | Jun 09 1997 | Dell USA, L.P. | Dynamic system control via messaging in a network management system |
6592627, | Jun 10 1999 | GOOGLE LLC | System and method for organizing repositories of semi-structured documents such as email |
6662171, | Oct 30 2000 | Hewlett Packard Enterprise Development LP | Automated diagnostic metric loop |
7171590, | Jul 29 2002 | NEC Corporation | Multi-processor system that identifies a failed node based on status information received from service processors in a partition |
7302618, | Sep 19 2001 | Juniper Networks, Inc. | Diagnosis of network fault conditions |
7668953, | Nov 13 2003 | Cisco Technology, Inc. | Rule-based network management approaches |
7747083, | Mar 27 2006 | R2 SOLUTIONS LLC | System and method for good nearest neighbor clustering of text |
20030101385
20050010323
20060074597
20060143291
20060173863
20060188011
20060195356
20070234426
20070255979
20080010680
20080103736
20080162982
Executed on | Assignor | Assignee | Conveyance | Reel/Frame
Mar 30 2007 | HUANG, CHENGDU | Hewlett-Packard Development Company, LP | Assignment of assignors interest (see document for details) | 019195/0778
Apr 02 2007 | COHEN, IRA | Hewlett-Packard Development Company, LP | Assignment of assignors interest (see document for details) | 019195/0778
Apr 02 2007 | KARLSSON, MAGNUS | Hewlett-Packard Development Company, LP | Assignment of assignors interest (see document for details) | 019195/0778
Apr 03 2007 | Hewlett-Packard Development Company, L.P. | (assignment on the face of the patent)
Oct 27 2015 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | Hewlett Packard Enterprise Development LP | Assignment of assignors interest (see document for details) | 037079/0001
Date | Maintenance Fee Events |
Jun 26 2015 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Jun 24 2019 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
Aug 28 2023 | REM: Maintenance Fee Reminder Mailed. |
Feb 12 2024 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |