A managing apparatus includes: a storing unit configured to store fault co-occurrence information storing the number of detection times that a first message pattern indicating a message group including messages received from an information processing system in a predetermined duration is detected when a fault has occurred; a determining unit configured to detect the first message pattern, to read the number of detection times from the fault co-occurrence information, to calculate a co-occurrence probability of the fault and the first message pattern based on the number of detection times, and to determine that the fault has occurred if the co-occurrence probability is equal to or higher than a threshold value; and an updating unit configured to create a second message pattern indicating a message group obtained by excluding a message output by a changed component from the first message pattern, and to update the first message pattern to the second message pattern.
8. A managing method for managing an information processing system including a plurality of components, the managing method comprising:
detecting a first message pattern from one or two or more messages received from the information processing system in a predetermined duration;
reading the number of detection times that the first message pattern is detected when a fault has occurred in the information processing system from fault co-occurrence information stored in a storing unit configured to store the fault co-occurrence information storing the number of detection times, and calculating a co-occurrence probability of the fault and the first message pattern based on the number of detection times;
determining that the fault has occurred when the co-occurrence probability is equal to or higher than a threshold value; and
creating a second message pattern indicating a message group obtained by excluding a message output by a changed component from the first message pattern upon detecting that the component has been changed, and updating the first message pattern stored in the fault co-occurrence information to the second message pattern,
calculating, when the first message is updated, the co-occurrence probability of the fault and the updated first message.
9. A storage medium on which is stored a program for causing a computer to execute a process for managing an information processing system including a plurality of components, the process comprising:
detecting a first message pattern from one or two or more messages received from the information processing system in a predetermined duration;
reading the number of detection times that the first message pattern is detected when a fault has occurred in the information processing system from fault co-occurrence information stored in a storing unit configured to store the fault co-occurrence information storing the number of detection times, and calculating a co-occurrence probability of the fault and the first message pattern based on the number of detection times;
determining that the fault has occurred when the co-occurrence probability is equal to or higher than a threshold value; and
creating a second message pattern indicating a message group obtained by excluding a message output by a changed component from the first message pattern upon detecting that the component has been changed, and updating the first message pattern stored in the fault co-occurrence information to the second message pattern,
calculating, when the first message is updated, the co-occurrence probability of the fault and the updated first message.
1. A managing apparatus for managing an information processing system including a plurality of components, the managing apparatus comprising:
a storing unit configured to store fault co-occurrence information storing the number of detection times that a first message pattern indicating a message group including one or two or more messages received from the information processing system in a predetermined duration is detected when a fault has occurred in the information processing system;
a determining unit configured to detect the first message pattern from the one or two or more messages received from the information processing system in the predetermined duration, to read the number of detection times from the fault co-occurrence information stored in the storing unit, to calculate a co-occurrence probability of the fault and the first message pattern based on the number of detection times, and to determine that the fault has occurred when the co-occurrence probability is equal to or higher than a threshold value; and
an updating unit configured to create a second message pattern indicating a message group obtained by excluding a message output by a changed component from the first message pattern upon detecting that the component has been changed, and to update the first message pattern stored in the fault co-occurrence information to the second message pattern,
wherein the determining unit calculates the co-occurrence probability of the fault and the first message updated by the updating unit.
2. The managing apparatus according to
the determining unit detects the first message pattern according to message pattern information storing messages included in a message pattern for each message pattern indicating the message group including one or two or more messages received from the information processing system in the predetermined duration.
3. The managing apparatus according to
the updating unit excludes the message output by the changed component from the message pattern information upon detecting that the component has been changed, and updates the message pattern information to new message pattern information by merging message patterns that become identical as a result of excluding the message.
4. The managing apparatus according to
the updating unit merges message patterns that are included in the fault co-occurrence information and become identical as a result of excluding the message output by the changed component from the message pattern information, and updates the fault co-occurrence information to new fault co-occurrence information by summing up the number of detection times.
5. The managing apparatus according to
the updating unit identifies a message output by a component according to message information stored by being associated with component information indicating the component that outputs the message for each message.
6. The managing apparatus according to
a configuration information attaching unit configured to extract the component information included in the message upon receipt of the message from the information processing system, and to store the component information in the message information by associating the component information with the message.
7. The managing apparatus according to
a learning unit configured to create a third message pattern indicating a message group by reading the message group including one or two or more messages received in the predetermined duration from a message log storing messages received from the information processing system, and to store the number of detection times that the third message pattern is detected when a fault has occurred in the information processing system in the predetermined duration.
This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2010-275215, filed on Dec. 10, 2010, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a managing apparatus and a managing method, which are intended to manage a system including one or two or more information processing devices.
In recent years, a utilization form of ICT (Information and Communication Technology) called cloud computing has been known. Cloud computing is a utilization form of ICT in which ICT resources are used via a network.
ICT resources include various elements such as a network, a server and a storage, which are interconnected by a network, middleware running on a server, and the like.
In an environment for realizing cloud computing, namely, in a cloud environment, there are many systems having an identical or similar configuration in some cases. Moreover, in the cloud environment, a configuration of the ICT resources included in the cloud environment is dynamically changed by a replacement of hardware, an addition of a server, a version upgrade of an application, or the like. Accordingly, a heavy workload is imposed on the management of the cloud environment, such as fault detection or the like.
In relation to the above described technique, an apparatus for generating an anomalous state signal by collecting and processing warning message signals from a communication network is known.
Additionally, a fault detection system that has a peripheral device fault pattern file and a node device fault pattern file, to which a characteristic of a fault MSG (message) is preregistered, and determines a peripheral device fault MSG or the like by making a comparison between an MSG and an individual pattern of the pattern file is known.
Furthermore, a fault monitoring system for registering maintenance information of a newly connected device to a fault dictionary upon detection of the newly connected device, and for determining a fault of the newly connected device if a notified log message is registered to the fault dictionary is known.
Still further, a relay server for adding, to shared resources information, message information about a notification message for notifying that information or the like of a relay group or the like has been changed, and for automatically deleting the message information when a predetermined duration elapses after the message information about the notification message is added to the shared resources information is known.
According to an aspect of the embodiment, a managing apparatus is an apparatus for managing an information processing system including a plurality of components, and includes the following configuration.
A storing unit is a storing device configured to store fault co-occurrence information storing the number of detection times that a first message pattern indicating a message group including one or two or more messages received from the information processing system in a predetermined duration is detected when a fault has occurred in the information processing system.
A determining unit detects the first message pattern from the one or two or more messages received from the information processing system in the predetermined duration. In this case, the determining unit reads the number of detection times from the fault co-occurrence information stored in the storing unit, calculates a co-occurrence probability of the fault and the first message pattern based on the number of detection times, and determines that the fault has occurred if the co-occurrence probability is equal to or higher than a threshold value.
An updating unit detects that a component has been changed. In this case, the updating unit creates a second message pattern indicating a message group obtained by excluding a message output by the changed component from the first message pattern, and updates the first message pattern stored in the fault co-occurrence information to the second message pattern.
The object and advantages of the embodiment will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the embodiment, as claimed.
The managing apparatus classifies messages stored in a message dictionary 101 among messages stored in a log of past messages output from devices under a cloud environment into messages that have occurred in each predetermined duration.
The managing apparatus generates a message pattern by collecting classified messages that have occurred in each predetermined duration. This predetermined duration is referred to as a window width.
In the meantime, the managing apparatus may know that a fault has occurred at a particular time based on past fault cases.
In the example of
Each of the message patterns illustrated in
The managing apparatus classifies messages stored in the message dictionary 101 among messages received from devices under a cloud environment into messages output in each predetermined duration.
The managing apparatus generates a message pattern by collecting classified messages for each window width.
The managing apparatus makes a comparison between the generated message pattern and message patterns stored in the message pattern dictionary 103, for example, the message patterns indicated by the events attribute illustrated in
Alternatively, the managing apparatus may determine that the fault has not occurred, namely, the state of the system is normal if the message pattern that matches the message pattern stored in the message pattern dictionary 103 is not detected, or if the fault occurrence probability is lower than the threshold value.
However, in the devices under the cloud environment, there are a plurality of systems having an identical or similar configuration, such as hardware, a server, an application and the like. The identical or similar configuration is frequently changed in a life cycle. For example, a configuration of a device is changed from day to day by a replacement of hardware, a version upgrade of an application, or the like. Moreover, a new server is added or a server is removed under the cloud environment.
Additionally, if a conventionally output message is no longer output, for example, due to a replacement of a device, a setting change of an application, or the like, a message pattern partially lacks messages compared with a learned message pattern. Moreover, if a replaced device or an application whose setting has been changed still outputs a message as before but the content of the message differs from the conventional one, the message pattern becomes partially different from the conventional one.
As described above, if the configuration or the setting of a device under the cloud environment is changed, the fault occurrence detection illustrated in
Messages received from the devices are classified into messages output in each predetermined duration as described with reference to
In this case, the message pattern 502 does not match the message pattern 501 stored in the message pattern dictionary 103. As a result, a conventionally detectable problem may not be detected in some cases. Accordingly, learning needs to be newly performed after discarding the learned message patterns and fault occurrence probabilities, which are stored in the message pattern dictionary 103.
Detecting a message pattern similar to a learned message pattern stored in the message pattern dictionary 103 by obtaining a correlation between message patterns, for example, with a vector distance between the message patterns, is also conceivable. In this case, however, it becomes difficult to statistically calculate the fault occurrence probability of the similar message pattern.
Embodiments are described below with reference to
The managing apparatus 600 illustrated in
The information processing system 605 is a system to be managed by the managing apparatus 600 according to this embodiment. The information processing system 605 is, for example, an information processing system that provides a cloud environment. The information processing system 605 includes one or two or more devices. The devices are communicatively connected via a network or the like. The devices include information processing devices such as a server, a SAN (Storage Area Network), a NAS (Network Attached Storage), a CAS (Content Aware Storage) and the like. In this embodiment, an element, such as a device, hardware included in a device, software running on a device or on hardware included in a device, or the like, which may be an entity that outputs a message, is referred to as a component.
The storing unit 601 is a storage device for storing fault co-occurrence information including the number of times that a first message pattern indicating a message group including one or two or more messages received from the information processing system 605 in a predetermined duration is detected when a fault has occurred in the information processing system 605. The storing unit 601 may be a volatile storage device such as a RAM (Random Access Memory) or the like, or may be a nonvolatile storage device such as a magnetic disk device or the like.
The determining unit 602 detects the first message pattern from one or two or more messages received from the information processing system 605 in the predetermined duration. In this case, the determining unit 602 reads the number of detection times from the fault co-occurrence information stored in the storing unit 601, and calculates a co-occurrence probability of the fault and the first message pattern based on the read number of detection times. Then, the determining unit 602 determines that the fault has occurred if the co-occurrence probability is equal to or higher than a threshold value.
The updating unit 603 detects that a component included in the information processing system 605 has been changed. In this case, the updating unit 603 creates a second message pattern indicating a message group obtained by excluding a message output by the changed component from the first message pattern, and updates the first message pattern stored in the fault co-occurrence information to the second message pattern.
The determining unit 602 and the updating unit 603 may be realized by causing a CPU (Central Processing Unit) included in an information processing device to execute a predetermined program.
In the above described configuration, the updating unit 603 updates the first message pattern stored in the fault co-occurrence information to the second message pattern obtained by excluding the message output by the changed component from the first message pattern, if the component included in the information processing system 605 has been changed.
As a result, the determining unit 602 may read the number of detection times from the fault co-occurrence information stored in the storing unit 601 even upon detecting the second message pattern from the information processing system 605, and may calculate a co-occurrence probability of the fault and the second message pattern based on the read number of detection times. Then, the determining unit 602 determines that the fault has occurred if the co-occurrence probability is equal to or higher than the threshold value.
Consequently, even if the information processing system 605 does not output the first message pattern output so far because a component included in the information processing system 605 has been changed, a co-occurrence probability may be obtained by using the fault co-occurrence information, and a fault of the information processing system 605 may be detected. As a result, a workload imposed on the fault management for the information processing system 605 may be lightened.
The information processing system 700 illustrated in
The device 1, the device 2, . . . , the device N may respectively include an information processing device such as a server, a SAN, a NAS, a CAS or the like.
The device 1, the device 2, . . . , the device N, and hardware respectively included in the device 1, the device 2, . . . , the device N output a message to the managing apparatus 701 as needed. “The message output by hardware” may be considered as a message output by a program that controls the hardware and may be regarded as being integral with the hardware. An application running on the device 1, the device 2, . . . , the device N, and an application running on the hardware respectively included in the device 1, the device 2, . . . , the device N also output a message to the managing apparatus 701 as needed.
In this embodiment, an element, such as a device included in the information processing system 700, hardware included in a device, software running on hardware included in a device, or other elements, which may be an entity that outputs a message, is referred to as a component.
The managing apparatus 701 may be implemented by using a general information processing device as illustrated in
The managing apparatus 701 collects messages output from the components included in the information processing system 700. Then, the managing apparatus 701 manages the state of the information processing system 700, for example, by determining whether or not a fault has occurred based on the collected messages.
The managing apparatus 701 includes a message pattern dictionary 801, a message dictionary 802 and a message pattern detecting unit 803. Moreover, the managing apparatus 701 may include a message pattern learning unit 804. The managing apparatus 701 may also include a configuration information storing unit 805, a configuration information attaching unit 806 and a message pattern updating unit 807.
The message pattern dictionary 801 is a storage device for storing a message pattern table 900 and a co-occurrence probability table 1000. A message pattern means a message group including one or two or more messages. In this embodiment, a message group including messages output from the information processing system 700 to be managed in a predetermined duration is used as a message pattern. For the message pattern in this embodiment, the order of output messages does not matter. For example, a message pattern including the messages 1, 2 and 3 output in this order, and a message pattern including the messages 3, 2 and 1 output in this order are handled as identical message patterns.
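The order-insensitive handling of message patterns described above can be illustrated with a minimal sketch. It assumes that messages are identified by integer message IDs and uses a Python frozenset as the pattern representation; the function name make_pattern is hypothetical and is not taken from the embodiment.

```python
def make_pattern(message_ids):
    """Build an order-insensitive message pattern from message IDs.

    A frozenset ignores both ordering and repetition, so patterns built
    from the same messages compare equal and can be used as dictionary keys.
    """
    return frozenset(message_ids)


# Messages 1, 2, 3 output in this order and in the reverse order
# are handled as identical message patterns.
pattern_a = make_pattern([1, 2, 3])
pattern_b = make_pattern([3, 2, 1])
print(pattern_a == pattern_b)
```

Because a set also ignores repetition, a message output twice in the same duration is treated the same as a message output once, which is an assumption of this sketch rather than a statement of the embodiment.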
The message pattern table 900 is information including a message pattern that is extracted from a past message log and past fault cases and characterizes a fault. The co-occurrence probability table 1000 is information including a fault occurrence frequency for each message pattern. The message pattern table 900 and the co-occurrence probability table 1000 will be described later respectively with reference to examples of
The message dictionary 802 is a storage device for storing a message table 1100. The message table 1100 is information including messages to be managed, namely, messages desired to be extracted as a message pattern. The message table 1100 will be described later with reference to an example of
The message pattern detecting unit 803 collects messages 813 output from the components included in the information processing system 700. The message pattern detecting unit 803 classifies one or two or more messages 813 output in a predetermined duration into messages to be managed and messages other than the messages to be managed for each predetermined duration. For example, the message pattern detecting unit 803 may determine a message 813 as a message to be managed if the message 813 matches any of the messages stored in the message table 1100. Moreover, the message pattern detecting unit 803 creates a message pattern by classifying the messages to be managed into one message group for each window width.
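The classification by window width described above might be sketched as follows. This is only an illustration under several assumptions: each message carries a timestamp in seconds, a message is treated as a message to be managed when its text exactly equals a registered message, and the windows are fixed, non-overlapping, and aligned to time zero. The function name detect_patterns and the sample message texts are hypothetical.

```python
from collections import defaultdict


def detect_patterns(messages, registered, window_width):
    """Group messages into one message pattern per window.

    messages:     iterable of (timestamp_seconds, message_text)
    registered:   dict mapping registered message text -> message ID
                  (a stand-in for the message table)
    window_width: length of the predetermined duration, in seconds
    """
    windows = defaultdict(set)
    for timestamp, text in messages:
        message_id = registered.get(text)
        if message_id is None:
            continue  # not a message to be managed, so it is ignored
        windows[int(timestamp // window_width)].add(message_id)
    # One order-insensitive pattern per window index.
    return {index: frozenset(ids) for index, ids in sorted(windows.items())}


registered = {"link down": 1, "disk error": 2}
messages = [(0, "link down"), (5, "heartbeat"), (9, "disk error"), (12, "link down")]
patterns = detect_patterns(messages, registered, window_width=10)
```

In this usage example, the first window yields a pattern with the message IDs 1 and 2, the second window yields a pattern with the message ID 1, and the unregistered "heartbeat" message does not appear in any pattern.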
The message pattern detecting unit 803 calculates a co-occurrence probability of a fault for each message pattern based on the co-occurrence probability table 1000. Then, the message pattern detecting unit 803 detects whether or not a fault has occurred based on the calculated co-occurrence probability of the fault. For example, if the co-occurrence probability of the fault exceeds a threshold value, the message pattern detecting unit 803 determines that the fault has occurred.
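A minimal sketch of this determination is given below. The formula is an assumption: the co-occurrence probability of a fault and a message pattern is taken to be the number of detection times for that fault divided by the total number of detection times of the pattern. The table layout, the function names, and the sample counts are hypothetical, and the comparison uses "equal to or higher than" the threshold value, as stated for the determining unit.

```python
def co_occurrence_probability(entry, fault):
    """Assumed formula: detections with this fault / total detections."""
    total = entry["total"]
    if total == 0:
        return 0.0
    return entry["per_fault"].get(fault, 0) / total


def detect_faults(table, pattern, threshold):
    """Return the faults judged to have occurred for a detected pattern."""
    entry = table.get(pattern)
    if entry is None:
        return []  # unknown pattern: no fault is determined
    return [
        fault
        for fault in entry["per_fault"]
        if co_occurrence_probability(entry, fault) >= threshold
    ]


# Hypothetical counts for one learned message pattern.
table = {
    frozenset({1, 2}): {"total": 10, "per_fault": {"fault 1": 8, "fault 2": 2}},
}
detected = detect_faults(table, frozenset({1, 2}), threshold=0.8)
```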
The message pattern learning unit 804 classifies messages stored in the message dictionary 802 among messages stored in a message log within the message log storing unit 811 into messages that have occurred in each predetermined duration. Then, the message pattern learning unit 804 generates a message pattern by collecting the classified messages for each window width. The message pattern learning unit 804 stores the generated message pattern in the message pattern table 900.
Additionally, the message pattern learning unit 804 counts the number of detection times that the generated message pattern is detected when a fault has occurred based on fault cases stored in the fault case storing unit 810, and stores the counted number in the co-occurrence probability table 1000.
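The counting performed by the message pattern learning unit 804 could look roughly like the following sketch. It assumes that each fault case provides a fault type and an occurrence time in seconds, and that a message pattern is regarded as "detected when a fault has occurred" when the fault time falls in the same window as the pattern. The function name learn_counts and the data layout are hypothetical.

```python
from collections import defaultdict


def learn_counts(window_patterns, fault_times, window_width):
    """Count, per message pattern, how often it co-occurs with each fault.

    window_patterns: dict mapping window index -> frozenset of message IDs
    fault_times:     list of (fault_type, timestamp_seconds) from fault cases
    window_width:    length of the predetermined duration, in seconds
    """
    faults_by_window = defaultdict(set)
    for fault_type, timestamp in fault_times:
        faults_by_window[int(timestamp // window_width)].add(fault_type)

    counts = defaultdict(lambda: defaultdict(int))
    for index, pattern in window_patterns.items():
        # .get avoids creating empty entries for windows without faults.
        for fault_type in faults_by_window.get(index, ()):
            counts[pattern][fault_type] += 1
    return {pattern: dict(per_fault) for pattern, per_fault in counts.items()}


window_patterns = {0: frozenset({1, 2}), 1: frozenset({1}), 2: frozenset({1, 2})}
fault_times = [("HDD malfunction", 3), ("HDD malfunction", 25)]
learned = learn_counts(window_patterns, fault_times, window_width=10)
```

In this usage example, the pattern with the message IDs 1 and 2 is counted twice for the fault type "HDD malfunction", and the pattern detected in the window without a fault is not counted.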
The configuration information storing unit 805 is a storage device for storing information about components included in the information processing system 700, namely, configuration information.
The configuration information attaching unit 806 identifies a component at a transmission source of a message stored in the message dictionary 802 based on the configuration information stored in the configuration information storing unit 805, and stores the identified component in the message table 1100 by associating the component with the message.
The message pattern updating unit 807 receives configuration change information 812 including information about a changed component. In this case, based on the message patterns stored in the message pattern table 900, the message pattern updating unit 807 generates a new message pattern table 900′ (not illustrated) from which the messages whose transmission source is the changed component are deleted. Moreover, the message pattern updating unit 807 generates a co-occurrence probability table 1000′ (not illustrated) for the new message pattern table 900′.
The fault case storing unit 810 is a storage device for storing past fault cases that occurred in the components included in the information processing system 700. The message log storing unit 811 is a storage device for storing messages output from the components included in the information processing system 700 as a log.
The message pattern table 900 is a table for storing a bit string that indicates whether or not there is a message included in a message pattern for each message pattern. This bit string has a bit width equivalent to the number of message IDs. If the bit is “0”, this indicates that the message indicated by the message ID corresponding to the bit is not included in the message pattern. Alternatively, if the bit is “1”, this indicates that the message indicated by the message ID corresponding to the bit is included in the message pattern.
For example, a bit corresponding to a message ID “1” is set to “1” in the pattern 1. This indicates that the message having the message ID “1” is included in the message pattern of the pattern 1.
Similarly, bits corresponding to the message IDs “1” and “2” are set to “1” in the pattern 3. This indicates that messages respectively having the message IDs “1” and “2” are included in the message pattern of the pattern 3.
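The bit string representation can be sketched as follows. The sketch assumes four message IDs and assumes that the leftmost character of the bit string corresponds to the message ID "1"; neither the bit order nor the number of message IDs is fixed by the description above, and the function names are hypothetical.

```python
def pattern_to_bits(message_ids, num_message_ids):
    """Encode a set of message IDs as a bit string.

    Assumption: the leftmost character corresponds to message ID 1.
    """
    return "".join(
        "1" if message_id in message_ids else "0"
        for message_id in range(1, num_message_ids + 1)
    )


def bits_to_pattern(bits):
    """Decode a bit string back into the set of message IDs it contains."""
    return {position + 1 for position, bit in enumerate(bits) if bit == "1"}


pattern_1 = pattern_to_bits({1}, 4)     # only message ID 1 is included
pattern_3 = pattern_to_bits({1, 2}, 4)  # message IDs 1 and 2 are included
```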
The co-occurrence probability table 1000 is a table including the total number of detection times and the number of detection times for each message pattern. The total number of detection times is the total of the numbers of times that a corresponding message pattern is detected when a fault 1 to a fault j have occurred. The number of detection times is the number of times that a message pattern is detected for a fault when the fault has occurred.
For example, according to the co-occurrence probability table 1000 illustrated in
The message table 1100 is a table including a registered message and CI (Configuration Item) for each message ID.
The message ID is an ID of a message to be classified as a message included in a message pattern among messages output from the components included in the information processing system 700. Accordingly, even if a message different from a conventional one is output because a component included in the information processing system 700 has been changed, the message is not to be classified unless it is registered to the message table 1100 as a registered message. In this case, the message pattern detecting unit 803 may execute a process similar to that executed when a message is not output because a component has been removed or changed.
The CI is information indicating a component at the transmission source of a message among the components included in the information processing system 700.
The configuration information 1200 is information including a component ID, a component type, a component name, a description and an administrator.
The component ID is information for identifying a component included in the information processing system 700. The component type is information indicating the type of a component indicated by the component ID. For example, “Network” illustrated in
The configuration information 1200 may include at least one or more of the component ID, the component type and the component name as needed. Moreover, for example, as the CI illustrated in
In the fault case storing unit 810, one or two or more fault cases respectively including a fault ID, a fault type, and an individual case are stored.
The fault ID is identification information for identifying a fault case. The fault type is information indicating a fault type of a fault case, such as an HDD (Hard Disk Drive) malfunction, a network card problem or the like. The individual case is information including one or two or more cases of the same fault type. For example, the individual case of the fault case 1304 illustrated in
(1) The configuration information attaching unit 806 reads a message 1401 from a message log stored in the message log storing unit 811. Then, the configuration information attaching unit 806 extracts a particular character string from a character string included in the message 1401. From which position in the message 1401 the particular character string is extracted may be predetermined according to, for example, the type of a message log.
(2) The configuration information attaching unit 806 references the configuration information 1200 stored in the configuration information storing unit 805, and obtains a component type of a component having a component name that matches the extracted particular character string. For instance, in the example of
(3) The configuration information attaching unit 806 stores the component type obtained from the configuration information 1200 in the message table 1100 as a CI of the message. As a result, the message is associated with the CI indicating the component at the transmission source of the message and stored.
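Steps (1) to (3) above might be sketched as follows. The sketch assumes that the particular character string is the first whitespace-delimited field of the message and that it holds the component name; the actual position depends on the type of the message log. The function name, the dictionary layout, and the sample values such as "switch01" are hypothetical.

```python
def attach_configuration_item(message, configuration, message_table,
                              message_id, name_field=0):
    """Associate a message with the CI of its transmission source.

    message:       text of one log message
    configuration: list of dicts with "component_name" and "component_type"
    message_table: dict mapping message ID -> registered message and CI
    name_field:    index of the field assumed to hold the component name
    """
    fields = message.split()
    if name_field >= len(fields):
        return None
    component_name = fields[name_field]  # step (1): extract the string
    for item in configuration:
        if item["component_name"] == component_name:  # step (2): look up
            message_table[message_id] = {             # step (3): store the CI
                "registered_message": message,
                "ci": item["component_type"],
            }
            return item["component_type"]
    return None  # no matching component: nothing is stored


configuration = [
    {"component_id": "C001", "component_type": "Network",
     "component_name": "switch01"},
]
message_table = {}
ci = attach_configuration_item("switch01 link down on port 3",
                               configuration, message_table, message_id=1)
```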
(1) Upon receipt of the configuration change information 812, the message pattern updating unit 807 extracts a component name of a changed component from a character string included in the configuration change information 812.
Note that the configuration change information 812 may include only the component name of a changed component. In this case, the message pattern updating unit 807 may merely obtain the component name from the configuration change information 812.
(2) The message pattern updating unit 807 obtains a component type of a component having the component name that matches the component name extracted from the configuration change information 812 by referencing the configuration information 1200 stored in the configuration information storing unit 805. For instance, in the example of
The message pattern updating unit 807 identifies a message ID of a message having a CI that matches the component type extracted from the configuration information 1200 by referencing the message table 1100 stored in the message dictionary 802. In the example of
(3) The message pattern updating unit 807 creates a message pattern table 900′ by excluding a bit corresponding to the identified message ID from the message pattern table 900 stored in the message pattern dictionary 801. In the example of
Here, for example, a case where the message output by the changed component is a message having the message ID “1” is considered.
Accordingly, if the bit corresponding to the message ID “1” is excluded from the message pattern table 900, duplicate message patterns like a message pattern having a pattern 2 and a message pattern having a pattern 3 in
Therefore, the message pattern updating unit 807 deletes the bit corresponding to the message ID “1” from the message pattern table 900, and merges the pattern 2 and the pattern 3, which duplicates the pattern 2 after the deletion, into a pattern 2′. With this merging, the message pattern table 900′ is created.
Additionally, the message pattern updating unit 807 creates a co-occurrence probability table 1000′ where the pattern 2 and the pattern 3 are merged into the pattern 2′ in the co-occurrence probability table 1000 recorded in the message pattern dictionary 801.
In the example of the co-occurrence probability table 1000 illustrated in
For example, as illustrated in
Then, the message pattern updating unit 807 calculates a logical AND between the mask pattern and the pattern 1, and a logical AND between the mask pattern and the pattern 2. If the logical AND between the mask pattern and the pattern 1 and that between the mask pattern and the pattern 2 match, the message pattern updating unit 807 determines that the pattern 1 and the pattern 2 are identical. In this case, the message pattern updating unit 807 determines that the pattern 1 and the pattern 2 are message patterns to be merged. The process for making a comparison between logical ANDs of a mask pattern in this way is referred to as a mask operation hereinafter.
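The mask operation described above may be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `masked_equal`, the use of Python integers as bit strings, and the convention that the least significant bit corresponds to the message ID “1” are all assumptions made for the example.

```python
def masked_equal(pattern_a: int, pattern_b: int, mask: int) -> bool:
    """Return True if both patterns agree on every bit that the mask keeps.

    Each argument is a bit string held in a Python int. A mask bit of 0
    means "ignore this message"; a mask bit of 1 means "compare it".
    """
    return (pattern_a & mask) == (pattern_b & mask)


# Hypothetical 4-bit example (assumed layout: bit 0 <-> message ID 1).
mask = 0b1110        # bit for message ID 1 cleared, other bits set
pattern_a = 0b0110   # messages 2 and 3
pattern_b = 0b0111   # messages 1, 2 and 3
pattern_c = 0b1010   # messages 2 and 4

same_ab = masked_equal(pattern_a, pattern_b, mask)  # differ only in the masked bit
same_ac = masked_equal(pattern_a, pattern_c, mask)  # differ in unmasked bits
```

In this sketch, `pattern_a` and `pattern_b` would be treated as message patterns to be merged, whereas `pattern_a` and `pattern_c` would not.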
In step S1801, the message pattern learning unit 804 references the message log storing unit 811. Then, the message pattern learning unit 804 obtains one or two or more messages output in a predetermined duration from a message log stored in the message log storing unit 811. The predetermined duration is referred to as a classification duration.
Upon detection of the end of the message log in step S1802 (“YES” in step S1802), the message pattern learning unit 804 terminates the learning process (step S1807).
Alternatively, if the end of the message log is not detected in step S1802 (“NO” in step S1802), the flow goes to step S1803. In this case, the message pattern learning unit 804 obtains the message ID of each message obtained in step S1801 by referencing the message table 1100 stored in the message dictionary 802 (step S1803).
If a message that is not stored in the message table 1100 is included in the messages obtained in step S1801, the message pattern learning unit 804 adds that message to the message table 1100 along with a new message ID.
In step S1804, the message pattern learning unit 804 creates a bit string that represents the message pattern, and stores the created bit string in the message pattern table 900 as a bit pattern. The message pattern represented by the created bit string is hereinafter referred to as a target message pattern.
For example, the target message pattern may be represented with a bit string whose width in bits equals the number of messages stored in the message table 1100, as illustrated in
However, if the same message pattern as that created in step S1804 is already stored in the message pattern table 900, the message pattern learning unit 804 does not store the target message pattern in the message pattern table 900.
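As a rough sketch of step S1804, the following code encodes a set of message IDs as a bit string and stores it only if no identical pattern is already stored. The helper names `to_bit_pattern` and `store_if_new`, the list used as a stand-in for the message pattern table 900, and the mapping of the message ID k to bit k−1 are assumptions for illustration only.

```python
def to_bit_pattern(message_ids, width):
    """Encode message IDs (1..width) as an int whose bit k-1 is set for ID k."""
    pattern = 0
    for mid in message_ids:
        if not 1 <= mid <= width:
            raise ValueError(f"message ID {mid} is outside 1..{width}")
        pattern |= 1 << (mid - 1)
    return pattern


def store_if_new(pattern_table, pattern):
    """Store the pattern unless an identical one exists; return its index."""
    if pattern in pattern_table:
        return pattern_table.index(pattern)
    pattern_table.append(pattern)
    return len(pattern_table) - 1


# Hypothetical table with four known messages.
table = []
first = store_if_new(table, to_bit_pattern({1, 3}, width=4))
again = store_if_new(table, to_bit_pattern({3, 1}, width=4))  # same pattern, not stored twice
```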
In step S1805, the message pattern learning unit 804 references fault cases stored in the fault case storing unit 810, and extracts faults that have occurred in the classification duration. For example, the message pattern learning unit 804 references the occurrence time and the end time of each individual case included in a fault case. If the occurrence duration delimited by the occurrence time and the end time is included partially or entirely in the classification duration, the message pattern learning unit 804 extracts the fault case as a fault that has occurred in the classification duration.
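The extraction in step S1805 amounts to an interval overlap test. The sketch below assumes that fault cases are held as a dictionary mapping a fault name to a list of (occurrence time, end time) pairs, and that both intervals are closed; the function name `faults_in_window` and the sample data are hypothetical.

```python
from datetime import datetime


def faults_in_window(fault_cases, window_start, window_end):
    """Return the faults having at least one case that overlaps the window.

    A case overlaps when its occurrence duration is partially or entirely
    inside [window_start, window_end] (closed intervals are an assumption).
    """
    hits = []
    for fault, cases in fault_cases.items():
        if any(start <= window_end and end >= window_start for start, end in cases):
            hits.append(fault)
    return hits


# Hypothetical fault cases and a ten-minute classification duration.
cases = {
    "fault 1": [(datetime(2011, 1, 1, 9, 55), datetime(2011, 1, 1, 10, 2))],
    "fault 2": [(datetime(2011, 1, 1, 10, 20), datetime(2011, 1, 1, 10, 30))],
}
extracted = faults_in_window(
    cases, datetime(2011, 1, 1, 10, 0), datetime(2011, 1, 1, 10, 10)
)
```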
In step S1806, the message pattern learning unit 804 references the co-occurrence probability table 1000, adds the number of detection times, which corresponds to the target message pattern, for each fault extracted in step S1805, and also updates the total number of detection times.
For example, if the target message pattern is a pattern i and the fault extracted in step S1805 is “fault j”, the message pattern learning unit 804 references the co-occurrence probability table 1000, and increments, by 1, the number of detection times “Cij” of the “pattern i” when the “fault j” has occurred. The message pattern learning unit 804 also increments, by 1, the total number “Ei” of detection times of the “pattern i”.
Upon termination of the above described process, the flow goes to step S1801. Then, the message pattern learning unit 804 obtains one or two or more messages output in the next classification duration from the message log stored in the message log storing unit 811. Then, the message pattern learning unit 804 executes the processes in steps S1802 to S1806.
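The counting performed in step S1806 may be sketched as follows. The class `CooccurrenceTable` is a hypothetical in-memory stand-in for the co-occurrence probability table 1000. The sketch assumes that the total number of detection times Ei is incremented once per classification duration, including durations in which no fault was extracted; the description does not state this explicitly.

```python
from collections import defaultdict


class CooccurrenceTable:
    """Hypothetical stand-in for the co-occurrence probability table 1000."""

    def __init__(self):
        self.total = defaultdict(int)      # E_i: total detections of pattern i
        self.per_fault = defaultdict(int)  # C_ij: keyed by (pattern i, fault j)

    def record(self, pattern, faults):
        """Record one detection of `pattern` together with the extracted faults."""
        self.total[pattern] += 1           # assumption: once per classification duration
        for fault in faults:
            self.per_fault[(pattern, fault)] += 1


# Three classification durations in which the same pattern was detected.
learned = CooccurrenceTable()
learned.record(0b0101, ["fault 1"])
learned.record(0b0101, ["fault 1", "fault 2"])
learned.record(0b0101, [])
```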
In step S1901, the message pattern detecting unit 803 obtains messages output from the components included in the information processing system 700 in the classification duration.
In step S1902, the message pattern detecting unit 803 references the message table 1100 stored in the message dictionary 802, and obtains message IDs of the messages obtained in step S1901. Note that the message pattern detecting unit 803 executes subsequent processes only for a message having a message ID that may be obtained from the message table 1100. Accordingly, the message pattern detecting unit 803 does not execute the subsequent processes for a message that is not stored in the message table 1100.
In step S1903, the message pattern detecting unit 803 creates a bit string where a bit corresponding to the message having the message ID obtained in step S1902 is set to “1” and the other bits are set to “0”. Then, the message pattern detecting unit 803 references the message pattern table 900, and identifies a message pattern that matches the created bit string. The identified message pattern is hereinafter referred to as a target message pattern.
In step S1904, the message pattern detecting unit 803 references the co-occurrence probability table 1000, and calculates a co-occurrence probability of each fault upon detection of the message pattern identified in step S1903.
For example, a case where the message pattern created in step S1903 corresponds to the “pattern i” within the co-occurrence probability table 1000 illustrated in
If a co-occurrence probability that exceeds a predetermined threshold value is included in the co-occurrence probabilities calculated in step S1904 (“YES” in step S1905), the message pattern detecting unit 803 outputs a notification that the fault has occurred to a terminal device or the like of an administrator (step S1906). In this case, the message pattern detecting unit 803 may also output the fault type of the fault, the co-occurrence probability of which exceeds a threshold value, to the terminal or the like of the administrator. Then, the flow goes to step S1901.
Alternatively, if the co-occurrence probability that exceeds the predetermined threshold value is not included in the co-occurrence probabilities calculated in step S1904 (“NO” in step S1905), the flow goes to step S1901. Then, the message pattern detecting unit 803 obtains messages output from the components included in the information processing system 700 in the next classification duration. Next, the message pattern detecting unit 803 executes the processes in steps S1902 to S1906.
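Steps S1904 and S1905 may be sketched as follows. The sketch assumes that the co-occurrence probability of the “fault j” for the “pattern i” is the ratio Cij/Ei of the counters described for the learning process; this ratio is an assumption for illustration, as are the function names and the dictionary layout. The comparison uses “exceeds” as in step S1905.

```python
def cooccurrence_probabilities(pattern, total, per_fault):
    """Return {fault: C_ij / E_i} for the given pattern (assumed formula)."""
    e_i = total.get(pattern, 0)
    if e_i == 0:
        return {}  # pattern never learned: nothing to report
    return {
        fault: count / e_i
        for (p, fault), count in per_fault.items()
        if p == pattern
    }


def detect_faults(pattern, total, per_fault, threshold):
    """Return the faults whose co-occurrence probability exceeds the threshold."""
    probs = cooccurrence_probabilities(pattern, total, per_fault)
    return {fault: p for fault, p in probs.items() if p > threshold}


# Hypothetical learned counters for one pattern.
total = {0b0101: 4}
per_fault = {(0b0101, "fault 1"): 3, (0b0101, "fault 2"): 1}
alerts = detect_faults(0b0101, total, per_fault, threshold=0.5)
```

Here only “fault 1” would be reported to the administrator, because 3/4 exceeds the threshold value while 1/4 does not.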
In step S2001, the configuration information attaching unit 806 references the message log storing unit 811. Then, the configuration information attaching unit 806 obtains one message from a message log stored in the message log storing unit 811. Assume that the message is obtained from the beginning of the message log. The obtained message is hereinafter referred to as a target message.
If the end of the message log is detected in step S2002 (“YES” in step S2002), the configuration information attaching unit 806 terminates the configuration information attachment process (step S2006).
Alternatively, if the end of the message log is not detected in step S2002 (“NO” in step S2002), the flow goes to step S2003. In this case, the configuration information attaching unit 806 extracts a component name of a component at a transmission source of the target message from the target message (step S2003).
The position within a message at which the component name of the transmission source is inserted may be known in advance depending on the type of the message log. Accordingly, the configuration information attaching unit 806 may identify the position within the target message, in which the component name is inserted, depending on the type of the message log stored in the message log storing unit 811, and may extract the component name from the identified position.
In step S2004, the configuration information attaching unit 806 references the configuration information storing unit 805. Then, the configuration information attaching unit 806 identifies the component type at the transmission source of the target message from the component name extracted in step S2003.
In step S2005, the configuration information attaching unit 806 stores the component type identified in step S2004 as the CI of the target message in the message table 1100 stored in the message dictionary 802.
Upon termination of the above described process, the flow goes to step S2001. Then, the configuration information attaching unit 806 obtains the next target message from the message log stored in the message log storing unit 811, and executes the processes in steps S2002 to S2005.
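The loop of steps S2001 to S2005 may be sketched as follows. The log format (whitespace-separated fields with the component name in a known field), the dictionary layouts, and the function name `attach_configuration_info` are all hypothetical; an actual message log would dictate where the component name is found.

```python
def attach_configuration_info(message_log, configuration, message_table, name_field=0):
    """Attach a component type (CI) to each known message in the table.

    message_log:   iterable of log lines (assumed whitespace-separated)
    configuration: {component name: component type}
    message_table: {message ID: {"text": str, "ci": str or None}}
    """
    ids_by_text = {row["text"]: mid for mid, row in message_table.items()}
    for line in message_log:
        fields = line.split()
        if len(fields) <= name_field:
            continue
        component_name = fields[name_field]                  # step S2003
        component_type = configuration.get(component_name)   # step S2004
        body = " ".join(fields[:name_field] + fields[name_field + 1:])
        message_id = ids_by_text.get(body)
        if component_type is not None and message_id is not None:
            message_table[message_id]["ci"] = component_type  # step S2005


# Hypothetical data.
log = ["host01 disk error", "db01 connection lost"]
config = {"host01": "OS", "db01": "DB"}
messages = {
    1: {"text": "disk error", "ci": None},
    2: {"text": "connection lost", "ci": None},
}
attach_configuration_info(log, config, messages)
```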
Upon receipt of the configuration change information 812, the message pattern updating unit 807 starts the message pattern update process (step S2100).
In step S2101, the message pattern updating unit 807 extracts a component name of a changed component from the configuration change information 812. A user may input the configuration change information 812 to the managing apparatus 701 by using an input device 2203 to be described later, or a message or the like output from a component included in the information processing system 700 may be used.
In step S2102, the message pattern updating unit 807 references the configuration information storing unit 805. Then, the message pattern updating unit 807 identifies the component type of the changed component from the component name extracted in step S2101.
In step S2103, the message pattern updating unit 807 references the message table 1100 stored in the message dictionary 802. Then, the message pattern updating unit 807 extracts a message ID having a component type that matches the component type identified in step S2102 among CIs stored in the message table 1100.
In step S2104, the message pattern updating unit 807 creates a mask pattern where a bit corresponding to the message ID extracted in step S2103 is set to “0” and the other bits are set to “1”.
In step S2105, the message pattern updating unit 807 references the message pattern table 900 stored in the message pattern dictionary 801. Then, the message pattern updating unit 807 performs a mask operation for all message patterns included in the message pattern table 900. The mask operation was described earlier with reference to
In step S2106, the message pattern updating unit 807 identifies message patterns that may be determined as being identical as a result of the mask operation.
In step S2107, the message pattern updating unit 807 creates a new message pattern table 900′ by merging the message patterns identified as being identical in step S2106.
In step S2108, the message pattern updating unit 807 copies the co-occurrence probability table 1000 stored in the message pattern dictionary 801.
In step S2109, the message pattern updating unit 807 calculates the total number of detection times after the message patterns are merged by adding the total numbers of detection times of the message patterns identified as being identical in step S2106. Moreover, the message pattern updating unit 807 calculates the number of detection times for each fault after the message patterns identified as being identical in step S2106 are merged by adding the numbers of detection times of the message patterns for each fault.
In step S2110, the message pattern updating unit 807 reflects the calculation results of step S2109 on the co-occurrence probability table 1000′ copied in step S2108.
Specifically, the following process is executed.
Initially, the message pattern updating unit 807 merges the message patterns identified in step S2106 among message patterns included in the co-occurrence probability table 1000′ copied in step S2108. Then, the message pattern updating unit 807 reflects the total number of detection times and the number of detection times for each fault, which are calculated in step S2109, on the co-occurrence probability table 1000′.
Upon termination of the above described process, the message pattern updating unit 807 terminates the message pattern update process (step S2111).
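The core of steps S2104 to S2110 may be sketched as follows. The sketch builds a mask pattern from the message IDs of the changed component, groups message patterns by their masked value, and adds the counters of the patterns in each group. The function names, the dictionary layout, and the mapping of the message ID k to bit k−1 are assumptions. New dictionaries are returned so that the original counters remain untouched, in the spirit of the copy made in step S2108.

```python
def build_mask(changed_ids, width):
    """Mask with bit k-1 cleared for each changed message ID k, others set."""
    mask = (1 << width) - 1
    for mid in changed_ids:
        mask &= ~(1 << (mid - 1))
    return mask


def merge_patterns(total, per_fault, mask):
    """Merge patterns that are identical under the mask and add their counters."""
    new_total, new_per_fault = {}, {}
    for pattern, count in total.items():
        key = pattern & mask
        new_total[key] = new_total.get(key, 0) + count
    for (pattern, fault), count in per_fault.items():
        key = (pattern & mask, fault)
        new_per_fault[key] = new_per_fault.get(key, 0) + count
    return new_total, new_per_fault


# Hypothetical counters: two patterns that differ only in the bit for message ID 1.
total = {0b010: 4, 0b011: 6, 0b100: 2}
per_fault = {(0b010, "fault 1"): 2, (0b011, "fault 1"): 3, (0b100, "fault 2"): 1}

mask = build_mask({1}, width=3)
merged_total, merged_per_fault = merge_patterns(total, per_fault, mask)
```

In this example the two patterns that differ only in the excluded bit collapse into one entry whose total number of detection times is 4 + 6 and whose number of detection times for “fault 1” is 2 + 3.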
In this embodiment, the message pattern updating unit 807 creates the message pattern table 900′ from the message pattern table 900. This is equivalent to an update of the message pattern table 900 to contents of the message pattern table 900′.
Similarly, the message pattern updating unit 807 creates the co-occurrence probability table 1000′ from the co-occurrence probability table 1000 in this embodiment. This is equivalent to an update of the co-occurrence probability table 1000 to contents of the co-occurrence probability table 1000′.
The managing apparatus 701 illustrated in
The CPU 2201 is an arithmetic unit that executes a program implementing the fault detection in the embodiments, in addition to programs for controlling peripheral devices and various types of software.
The memory 2202 is a volatile storage device used to execute a program. For example, a RAM or the like is available as the memory 2202.
The input device 2203 is a unit for externally inputting data. For example, a keyboard, a mouse or the like is available as the input device 2203.
The output device 2204 is a device for outputting data and the like to a display device or the like. The output device 2204 may also include a display device.
The external storage device 2205 is a nonvolatile storage device for storing a program that implements the fault detection in the embodiments in addition to a program and data, which the managing apparatus 701 needs to run. For example, a magnetic disk storage device or the like is available as the external storage device 2205.
The medium driving device 2206 is a device for outputting data of the memory 2202 or the external storage device 2205 to a portable storage medium 2207 such as a floppy disk, an MO disk, a CD-R, a DVD-R or the like, or for reading a program, data and the like from the portable storage medium 2207.
The network connecting device 2208 is a device for making a connection to a network 702.
Note that a non-transitory medium is available as a storage medium, such as the memory 2202, the external storage device 2205, the portable storage medium 2207 or the like, which may be read by an information processing device.
Additionally,
A co-occurrence probability table 2300 illustrated in
Each <probability> tag illustrated in
For example, as illustrated in
Then, the message pattern updating unit 807 deletes, from the co-occurrence probability table 2300, the message IDs of the identified messages, namely, message IDs “10” and “118” enclosed with a square in
As described above, if the messages having the message IDs “10” and “118” output so far are not output anymore because the OS of Host XXX has been changed, the message IDs “10” and “118” are deleted also from the co-occurrence probability table 2300. Accordingly, the co-occurrence probability of the message pattern that does not include the messages having the message IDs “10” and “118” due to the change of the OS of Host XXX may be obtained from the co-occurrence probability table 2300.
As a result, even if a component such as the OS of Host XXX is changed, the fault management may be performed for the information processing system 700 by using, for example, the message pattern table 900 and the co-occurrence probability table 1000, which are results of learning, without discarding them.
Additionally, already learned results may be used even if a component is changed. Therefore, even in an environment such as a cloud environment where a component is frequently changed, the need for newly executing the learning process each time a component is changed is eliminated. Accordingly, a workload imposed on the fault management is reduced even in an environment such as a cloud environment where a component is frequently changed.
Furthermore, the fault management of the information processing system 700 may be performed by using already learned results such as the message pattern table 900 and the co-occurrence probability table 1000 while the learning process is newly executed. Therefore, the fault management may be continuously performed, leading to improvements in the reliability of the fault management.
Still further, in the message pattern update process in this embodiment, merging of duplicate message patterns in the message pattern table 900, and the additions of the total number of detection times and the number of detection times in the co-occurrence probability table 1000 are the main processes as described with reference to
Still further, with the message pattern update process described with reference to
In the above provided description, the pattern 1 to the pattern (2^m−1) represented in the message pattern table 900 may be cited as one example of the first message pattern. Moreover, the co-occurrence probability table 1000 may be cited as one example of the fault co-occurrence information. Furthermore, the message pattern dictionary 801 may be cited as one example of the storing unit. Still further, the pattern 2′ illustrated in
As described above, with the disclosed managing apparatus, a workload imposed on the fault management may be lightened.
The process procedures represented with the flowcharts illustrated in
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present inventions have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Matsubara, Masazumi, Matsumoto, Yasuhide, Watanabe, Yukihiro, Sekiguchi, Atsuji
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Oct 21 2011 | WATANABE, YUKIHIRO | Fujitsu Limited | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 027372 | /0608 | |
Oct 21 2011 | MATSUMOTO, YASUHIDE | Fujitsu Limited | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 027372 | /0608 | |
Oct 21 2011 | SEKIGUCHI, ATSUJI | Fujitsu Limited | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 027372 | /0608 | |
Oct 25 2011 | MATSUBARA, MASAZUMI | Fujitsu Limited | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 027372 | /0608 | |
Nov 09 2011 | Fujitsu Limited | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Nov 23 2017 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Jan 31 2022 | REM: Maintenance Fee Reminder Mailed. |
Jul 18 2022 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |