A method is provided for lossily compressing time series data that is collected from a storage area network. The method includes: determining the resolution of an output device that may be used to output a representation of the time series data; determining a sampling block size that is in part based on the resolution of the output device; partitioning the plurality of data points into a plurality of data blocks in accordance with the sampling block size; and applying retention criteria to each of the plurality of data blocks, thereby compressing the time series data to be output by the output device.
10. A method for lossily compressing time series data having a plurality of data points, comprising:
determining a sampling block size; partitioning the plurality of data points into a plurality of data blocks in accordance with the sampling block size; and applying a retention criterion to each of the plurality of data blocks, where the step of applying the retention criterion further includes identifying an extrema pair within a given data block, retaining the extrema pair for the given data block and discarding the remaining data points in the given data block.
1. A method for lossily compressing time series data having a plurality of data points, comprising:
determining a resolution of an output device that may be used to output a representation of the time series data, determining a sampling block size that is in part based on the resolution of the output device; partitioning the plurality of data points into a plurality of data blocks in accordance with the sampling block size; and applying retention criteria to each of the plurality of data blocks, thereby compressing the time series data to be output by the output device.
2. The method of
3. The method of
4. The method of
determining the number of data points associated with the time series data, prior to the step of determining a sampling block size; and performing the determining a sampling block size, partitioning and applying retention criteria steps when the number of data points exceeds the resolution of the output device.
5. The method of
6. The method of
determining an expected time interval for the plurality of data blocks; and applying the expected time interval to the data points in a given data block, where each of the plurality of data points is defined to include a measured component and a timestamp component indicative of the time at which the measured component was taken; and applying the retention criteria to the data points in the given data block that fall within the expected time interval.
7. The method of
8. The method of
9. The method of
11. The method of
12. The method of
13. The method of
determining the number of data points associated with the time series data, prior to the step of determining a sampling block size; and performing the determining a sampling block size, partitioning and applying retention criteria steps when the number of data points exceeds the resolution of the output device.
14. The method of
15. The method of
determining an expected time interval for the plurality of data blocks; and applying the expected time interval to the data points in a given data block, where each of the plurality of data points is defined to include a measured component and a timestamp component indicative of the time at which the measured component was taken; and applying the retention criterion to the data points in the given data block that fall within the expected time interval.
16. The method of
The present invention relates generally to a software-implemented diagnostic tool for assessing performance of storage area networks and, more particularly, to a method for lossily compressing time series data collected from storage area networks.
Software-implemented diagnostic tools are currently being developed to assess the performance of storage area networks and other network topologies. Many factors can affect network performance. Some of these factors include hardware parameters, such as the amount of cache and the number of control processors; connectivity issues; software issues; and a multitude of network configuration settings, including RAID levels, LUN layout, LUN size, and LUN contention. Since network performance may be affected by one or more of these factors at the same time, diagnostic tools must assess each of the factors that may possibly affect performance.
To perform such an assessment, diagnostic tools collect and analyze large amounts of performance measurement data indicative of these different factors. In many instances, the performance measurement data is in the form of time series data that needs to be compressed into more manageable amounts of data.
Although techniques for lossily compressing time series data are generally known, it is desirable to provide an improved method for lossily compressing time series data that is collected from a storage area network.
In accordance with the present invention, a method is provided for lossily compressing time series data that is collected from a storage area network. The method includes: determining the resolution of an output device that may be used to output a representation of the time series data; determining a sampling block size that is in part based on the resolution of the output device; partitioning the plurality of data points into a plurality of data blocks in accordance with the sampling block size; and applying retention criteria to each of the plurality of data blocks, thereby compressing the time series data to be output by the output device.
For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.
A software-implemented diagnostic tool 30 may be used to collect and assess performance data associated with the storage area network as shown in FIG. 2. In operation, the diagnostic tool 30 systematically isolates performance bottlenecks in the storage area network and, if applicable, implements corrective actions to relieve the isolated bottlenecks. In addition, the diagnostic tool 30 generates performance reports that are output by an output device 32 associated with the diagnostic tool. The output device 32 is preferably graphical plotting software that assists in visualizing quantitative data by converting data values into pixels for a visual display device or a software representation for a hardcopy device. However, it is envisioned that other types of output devices, such as a visual display device or a printer, are also within the scope of the present invention.
To perform such an assessment, the diagnostic tool 30 receives time series data 32 from numerous input data sources associated with the network. Time series data is understood to be any volatile performance parameter that is repeatedly monitored or collected over a period of time at a particular sampling frequency. The time series data preferably includes a measured component and a timestamp component indicative of the time at which the measured component was taken. The timestamp component may be expressed in different formats (e.g., U.S. date format, European date format, neutral format, etc.) as is well known in the art.
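As an illustration of a data point having a measured component and a timestamp component, the following Python sketch parses timestamps written in differing date styles into one common representation. It is a minimal sketch under stated assumptions: the `DataPoint` type, the `parse_data_point` function and the three format strings are hypothetical and are not taken from the specification, which names the date styles without defining their layout.

```python
from datetime import datetime
from typing import NamedTuple

# Hypothetical format strings; the description names the styles
# (U.S., European, neutral) but not their exact layout.
TIMESTAMP_FORMATS = {
    "us": "%m/%d/%Y %H:%M:%S",
    "european": "%d/%m/%Y %H:%M:%S",
    "neutral": "%Y-%m-%d %H:%M:%S",
}


class DataPoint(NamedTuple):
    """One sample: the time it was taken and the measured value."""
    timestamp: datetime
    value: float


def parse_data_point(timestamp_text, value, style):
    """Build a DataPoint from a timestamp string in the named style."""
    fmt = TIMESTAMP_FORMATS[style]
    return DataPoint(datetime.strptime(timestamp_text, fmt), float(value))


us_point = parse_data_point("03/28/2002 14:05:00", 87.5, "us")
eu_point = parse_data_point("28/03/2002 14:05:00", 87.5, "european")
```

Because a string such as 03/04/2002 is ambiguous between the U.S. and European styles, the sketch requires the caller to name the style rather than guessing it.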
In the context of assessing network performance, exemplary time series data may include performance measurements from each of the RAID controllers, such as the percentage utilization of its processor. In another example, time series data may include performance measurements from each of the disk drive units, such as the total number of read and write I/O operations, total number of sequential I/O operations, total number of random I/O operations, the total number of random read operations, the total number of random write operations, the total number of sequential read operations, and/or the total number of sequential write operations each taken over some predetermined period of time. It is readily understood that other types and sources of time series data, including but not limited to cached I/O operations, cache hit percentages, cache to drive I/O operations, drive to cache I/O operations, bus adapter utilization, switch port utilization, and array controller utilization, are within the scope of the present invention.
In accordance with the present invention, a method is provided for lossily compressing time series data that may be received by the diagnostic tool 30. An exemplary embodiment of the methodology is depicted in FIG. 3. Although the methodology is preferably implemented as a compression module 32 within the diagnostic software tool, it is to be understood that only the relevant steps of the methodology are discussed in FIG. 3 and that other software-implemented instructions may be needed to control and manage the overall operation of the diagnostic tool. In addition, it is also to be understood that the compression methodology of the present invention may be employed independently or in conjunction with other known compression algorithms.
First, the diagnostic tool 30 determines if it is necessary to perform the compression algorithm for a given set of time series data. To do so, the total number of data points in the given set of time series data is determined at step 42. Since the amount of time series data typically exceeds the granularity of most conventional output devices, this determination is tied to the resolution of the output device 32 employed by the diagnostic software 30. Thus, the resolution of the output device that is to be used to represent the time series data is determined at step 44. Of particular interest, the resolution is determined along the axis of the output device used to represent time. One skilled in the art will readily recognize that various techniques may be employed to determine the resolution, including (but not limited to) retrieving a predetermined resolution value from a look-up table and programmatically querying the output device for a resolution value.
The total number of data points is then compared to the resolution of the output device at step 46. When the number of data points exceeds the resolution of the output device, it is necessary to lossily compress the time series data. On the other hand, when the number of data points does not exceed the resolution of the output device, the compression algorithm is not performed.
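The decision made at steps 42 through 46 can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the look-up table, its device names and its resolution values are hypothetical and serve only to illustrate one of the techniques mentioned for obtaining a resolution.

```python
# Hypothetical look-up table of time-axis resolutions, in pixels.
# The names and values are illustrative assumptions only.
RESOLUTION_TABLE = {"screen_plot": 400, "hardcopy_plot": 2400}


def needs_compression(data_points, device_name, table=RESOLUTION_TABLE):
    """Return True when the data set has more points than the device can show."""
    total_points = len(data_points)    # step 42: count the data points
    resolution = table[device_name]    # step 44: resolution along the time axis
    return total_points > resolution   # step 46: compare the two
```

For example, a set of 20,000 points plotted on the hypothetical 400-pixel device would be compressed, whereas a set of exactly 400 points would be passed through unchanged.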
The compression algorithm of the present invention also accounts for the resolution of the output device. For instance, the sampling block size is in part based on the resolution of the output device as shown at step 48. The sampling block size is also based on the number of data points that are retained in accordance with the subsequently applied retention criteria. In one exemplary embodiment, the sampling block size is calculated as follows: sampling block size = (total number of data points / resolution of the output device) * number of data points retained in accordance with the retention criteria. This results in a direct correlation between the total number of retained data points and the resolution of the output device. In a non-divisible scenario, the sampling block size is rounded up to the nearest integer. In the case where the last partition of time series data is too small to satisfy the retention criteria, the sampling block size is increased to the next largest integer value. At step 50, the sampling block size is further used to partition the time series data into a plurality of data blocks.
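The calculations of steps 48 and 50 can be sketched in Python as follows. This is a minimal sketch rather than the patented implementation: the function names are hypothetical, the block size follows the arithmetic of the 20,000-point example in this description ((20,000/400) * 4 = 200), and the handling of an undersized last partition is only one possible reading of the text.

```python
def sampling_block_size(total_points, resolution, retained_per_block):
    """Block size = (total / resolution) * retained, rounded up to an integer."""
    if min(total_points, resolution, retained_per_block) <= 0:
        raise ValueError("all arguments must be positive")
    if total_points < retained_per_block:
        raise ValueError("too few data points to satisfy the retention criteria")
    # Integer ceiling division avoids floating-point rounding surprises.
    size = -(-(total_points * retained_per_block) // resolution)
    # Assumed interpretation: grow the block until the trailing partition,
    # if there is one, holds at least as many points as are to be retained.
    while 0 < total_points % size < retained_per_block:
        size += 1
    return size


def partition(points, block_size):
    """Split the points into consecutive blocks of block_size (step 50)."""
    return [points[i:i + block_size] for i in range(0, len(points), block_size)]


blocks = partition(list(range(20_000)), sampling_block_size(20_000, 400, 4))
```

With 13 points, a resolution of 7 and 3 retained points per block, the initial size of 6 would leave a final block of one point, so the sketch increases the size to 7.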
Retention criteria are then applied to each of the data blocks at step 52. The retention criteria identify the data points that are to be retained from a given data block, whereas the remaining data points in the data block are discarded. Exemplary retention criteria may include identifying an extrema pair (i.e., maximum value and minimum value) for a given data block, randomly selecting one or more representative data points from a given data block, determining an average value for a given data block, or determining a modal value for a given data block. It is envisioned that other retention criteria, such as a median value and an accumulation value, are also within the broader aspects of the present invention.
An overall retention criterion may be developed from one or more of the retention criteria described above. Since it is important to identify the performance aberrations within the network, retaining the extrema pair data is preferably one of the retention criteria used in the overall retention criterion that is applied to each of the data blocks. However, it is to be understood that the overall retention criterion may vary depending on the type of time series data being compressed. By applying the overall retention criterion to each of the data blocks, the time series data is lossily compressed.
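An overall retention criterion that combines an extrema pair, an average value and a modal value can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name, the representation of a data point as a (timestamp, value) tuple and the dictionary return shape are hypothetical choices made for illustration, and ties for the modal value are resolved in favor of the value seen first.

```python
from collections import Counter
from statistics import fmean


def apply_retention(block):
    """Keep the extrema pair, the average and the modal value of one block.

    block is a non-empty list of (timestamp, value) tuples; every other
    data point in the block is discarded by the caller.
    """
    values = [value for _, value in block]
    return {
        "min": min(block, key=lambda point: point[1]),  # original data point
        "max": max(block, key=lambda point: point[1]),  # original data point
        "mean": fmean(values),                          # derived value
        "mode": Counter(values).most_common(1)[0][0],   # derived value
    }


retained = apply_retention([(0, 5), (1, 9), (2, 5), (3, 1)])
```

The extrema are kept as original data points so that the time of each performance aberration survives compression, while the average and modal values are derived quantities with no timestamp of their own in this sketch.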
To further illustrate the compression algorithm of the present invention, a specific example is set forth. In this example, the time series data includes 20,000 data points which are to be displayed on a display device having 400 pixels of resolution. The overall retention criterion includes identifying an extrema pair, determining an average value and determining a modal value. In other words, four (4) data points are to be retained from each sampled data block.
In accordance with the exemplary embodiment, the sampling block size is computed to be ((20,000/400) * 4 =) 200 data points. Accordingly, the time series data is partitioned into (20,000/200=) 100 data blocks. The overall retention criterion is applied to each of these data blocks. Since four data points are retained from each of the data blocks, the time series data is lossily compressed to 400 data points.
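The arithmetic of this example can be checked with a few lines of Python. The variable names are hypothetical; the numbers are those of the example.

```python
total_points = 20_000   # data points in the time series
resolution = 400        # pixels along the time axis of the display device
retained = 4            # extrema pair + average value + modal value

block_size = total_points // resolution * retained   # (20,000 / 400) * 4
num_blocks = total_points // block_size              # 20,000 / 200
compressed_points = num_blocks * retained            # 100 blocks * 4 points
```

The compressed count equals the resolution, which is the direct correlation between retained data points and output resolution that the block size calculation is intended to produce.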
In another aspect of the present invention, the compression algorithm may optionally apply a filter to each of the data blocks before applying the retention criterion. The filter is intended to ensure that each of the data points in a given data block falls within an expected time interval. As will be further described below, data points falling outside of the expected time interval may be shifted to an adjacent data block.
Next, the time series data is partitioned into a plurality of data blocks at step 64. At step 66, an expected time interval, te, is determined for the plurality of data blocks. For a given data block, its time interval may be computed by subtracting the timestamp component for a first data point in the data block from the timestamp component for the last data point in the data block. An expected time interval for each of the data blocks is then derived from the actual time interval associated with one or more of the data blocks. For instance, the expected time interval may be set to the actual time interval from a randomly selected data block. Additional data blocks may be randomly selected to verify the accuracy of the expected time interval. Although this technique is presently preferred, other techniques for determining an expected time interval for the data blocks are also within the scope of the present invention.
The expected time interval is then used at step 68 as a filter for the data points in each of the data blocks. Preferably, the timestamp value for the first data point, t1, in the first data block is used as the starting point. Data points associated with the first data block should have a timestamp value falling within the time interval defined between t1 and t1+te. In step 70, one or more of the data points falling outside of this time interval are shifted to the next data block. Conversely, one or more data points associated with the second data block but having a timestamp value that falls within this time interval are shifted from the second data block to the first data block. Lastly, the retention criterion is applied as described above to the data points falling within this time interval as noted at step 72. It should be noted that in the case of the second data block, the applicable time interval is defined between t1+te and t1+2te. Similar time intervals are defined for each of the remaining data blocks, but otherwise the above-described process is repeated for each of the data blocks. In this way, the time series data undergoes an additional integrity check as part of the compression algorithm.
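The filtering of steps 68 and 70 can be sketched in Python as follows. This is a minimal sketch and a simplification, not the patented implementation: instead of shifting points between neighboring blocks one pair at a time, it reassigns every point directly to the interval that contains its timestamp. The function name is hypothetical, timestamps are assumed to be plain numbers (for example, seconds), the expected time interval te is supplied by the caller, each interval is treated as half-open (start included, end excluded), and points earlier than t1 are placed in the first block.

```python
def rebin_by_expected_interval(blocks, te):
    """Reassign (timestamp, value) points so block k covers [t1 + k*te, t1 + (k+1)*te)."""
    if te <= 0:
        raise ValueError("expected time interval must be positive")
    points = [point for block in blocks for point in block]
    if not points:
        return []
    t1 = points[0][0]  # timestamp of the first point in the first block
    binned = {}
    for timestamp, value in points:
        # Index of the interval containing this timestamp; never below zero.
        k = max(0, int((timestamp - t1) // te))
        binned.setdefault(k, []).append((timestamp, value))
    # Emit every block up to the last occupied one, each in time order.
    return [sorted(binned.get(k, [])) for k in range(max(binned) + 1)]


original = [[(0, 1.0), (10, 2.0), (35, 3.0)], [(25, 4.0), (40, 5.0), (50, 6.0)]]
filtered = rebin_by_expected_interval(original, 30)
```

In the usage shown, the point stamped 35 moves from the first block to the second and the point stamped 25 moves from the second block to the first, after which the retention criterion would be applied to each filtered block.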
Although the above filtering process is presently preferred, it is envisioned that other filters may also be applied to the time series data. For instance, the time series data may be sorted by time of day, where the date portion of the timestamp component is ignored. The compression algorithm as set forth above can then be applied to the sorted time series data. In this way, the compressed time series data is grouped around time-of-day for further assessment.
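Sorting by time of day can be sketched in Python as follows. This is a minimal sketch: the function name is hypothetical, and data points are assumed to be (timestamp, value) tuples whose timestamps are `datetime` objects without time zone information.

```python
from datetime import datetime


def sort_by_time_of_day(points):
    """Order the points by clock time only, ignoring the calendar date."""
    return sorted(points, key=lambda point: point[0].time())


samples = [
    (datetime(2002, 3, 28, 14, 0), 1),
    (datetime(2002, 3, 29, 9, 30), 2),
    (datetime(2002, 3, 27, 23, 15), 3),
]
ordered = sort_by_time_of_day(samples)
```

The sorted sequence can then be partitioned and compressed in the same manner as unsorted data, so that each data block gathers measurements taken at similar times on different days.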
While the invention has been described in its presently preferred form, it will be understood that the invention is capable of modification without departing from the spirit of the invention as set forth in the appended claims.
Wong, Joseph D., Truesdale, Scott F.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Mar 28 2002 | Hewlett-Packard Development Company, LP. | (assignment on the face of the patent) | / | |||
May 29 2002 | WONG, JOSEPH D | Hewlett-Packard Company | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 013129 | /0885 | |
Jun 10 2002 | TRUESDALE, SCOTT F | Hewlett-Packard Company | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 013129 | /0885 | |
Jan 31 2003 | Hewlett-Packard Company | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 013776 | /0928 | |
Jun 05 2003 | Hewlett-Packard Company | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 014142 | /0757 | |
Oct 27 2015 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | Hewlett Packard Enterprise Development LP | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 037079 | /0001 |
Date | Maintenance Fee Events |
Jan 16 2007 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Nov 30 2010 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
Dec 24 2014 | M1553: Payment of Maintenance Fee, 12th Year, Large Entity. |
Date | Maintenance Schedule |
Jul 15 2006 | 4 years fee payment window open |
Jan 15 2007 | 6 months grace period start (w surcharge) |
Jul 15 2007 | patent expiry (for year 4) |
Jul 15 2009 | 2 years to revive unintentionally abandoned end. (for year 4) |
Jul 15 2010 | 8 years fee payment window open |
Jan 15 2011 | 6 months grace period start (w surcharge) |
Jul 15 2011 | patent expiry (for year 8) |
Jul 15 2013 | 2 years to revive unintentionally abandoned end. (for year 8) |
Jul 15 2014 | 12 years fee payment window open |
Jan 15 2015 | 6 months grace period start (w surcharge) |
Jul 15 2015 | patent expiry (for year 12) |
Jul 15 2017 | 2 years to revive unintentionally abandoned end. (for year 12) |