The present invention relates to collecting statistics automatically for data in a database. There is provided a method for automated statistics collection comprising determining a likelihood that statistics for data have changed; and collecting statistics for data in response to the likelihood. Indicators of the likelihood that statistics have changed may be useful to trigger automated statistics collection. tables having statistics that change significantly may be collected more often than statistics of tables that are stable. A preferred model is provided to facilitate the collection of statistics that are more relevant: a table is scheduled for collection in accordance with observed patterns of table activity; a table is considered for collection if it meets a threshold level of activity; and a table is sampled to predict whether the statistics to be collected have changed. When collecting statistics, throttling and lock contention can minimize impact on a database user's response experience.
|
1. A data processing system implemented method of collecting statistics associated with data stored in a database, the database operatively coupled to a data processing system, the data processing system implemented method comprising:
assembling a list of tables of said data, said tables being scheduled for periodic automatic statistics collection;
for each of the tables, determining a likelihood that currently computed statistics associated with the data in each of the tables have changed;
removing from the list tables for which said determined likelihood is low;
collecting updated statistics for the data in tables which remain in said list after said removing; and
updating the scheduled periodic automatic statistics collection for each of the tables, said periodic automatic statistics collection comprising periodically performing subsequent collections of statistics associated with the data in each of the tables, wherein said updating comprises scheduling, for each of the tables, said subsequent collections of statistics more often or less often based on another likelihood that the updated statistics have changed for each of the tables,
wherein said another likelihood that the updated statistics have changed is determined by comparing said updated statistics with said computed statistics.
13. A data processing system for collecting statistics associated with data stored in a database, the database operatively coupled to the data processing system operating on a computer, the data processing system comprising:
a processor;
an assembling module, executed on the processor, for assembling a list of tables of said data, said tables being scheduled for periodic automatic statistics collection;
a determining module for determining, for each of the tables, a likelihood that currently computed statistics associated with the data have changed;
a removing module for removing from the list tables for which said determined likelihood is low;
a collecting module for collecting updated statistics for the data in tables which remain in said list after said removing module removes tables for which said determined likelihood is low; and
an updating module for updating the scheduled periodic automatic statistics collection for each of the tables, said periodic automatic statistics collection comprising periodically performing subsequent collections of statistics associated with the data in each of the tables, wherein said updating comprises scheduling, for each of the tables, said subsequent collections of statistics more often or less often based on another likelihood that the updated statistics have changed for each of the tables,
wherein said another likelihood that the updated statistics have changed is determined by comparing said updated statistics with said computed statistics.
18. An article of manufacture for directing a data processing system to collect statistics associated with data stored in a database, the database operatively coupled to the data processing system, the article of manufacture comprising:
a program usable storage medium embodying one or more instructions executable by a processor of the data processing system, the one or more instructions comprising:
data processing system executable instructions for assembling a list of tables of said data, said tables being scheduled for periodic automatic statistics collection;
data processing system executable instructions for determining, for each of the tables, a likelihood that currently computed statistics associated with the data in each of the tables have changed;
data processing system executable instructions for removing from the list tables for which said determined likelihood is low;
data processing system executable instructions for collecting updated statistics for the data in tables which remain in said list after said removing; and
data processing system executable instructions for updating the scheduled periodic automatic statistics collection for each of the tables, said periodic automatic statistics collection comprising periodically performing subsequent collections of statistics associated with the data in each of the tables, wherein said updating comprises scheduling, for each of the tables, said subsequent collections of statistics more often or less often based on another likelihood that the updated statistics have changed for each of the tables,
wherein said another likelihood that the updated statistics have changed is determined by comparing said updated statistics with said computed statistics.
2. The data processing system implemented method of
sampling a portion of the data to generate a sample result; and
comparing the sample result to statistics previously collected for the data whereby said likelihood is responsive to said comparing.
3. The data processing system implemented method of
4. The data processing system implemented method of
5. The data processing system implemented method of
6. The data processing system implemented method of
7. The data processing system implemented method of
8. The data processing system implemented method of
9. The method of
10. The data processing system implemented method of
11. The data processing system implemented method of
12. The data processing system implemented method of
14. The data processing system of
15. The data processing system of
16. The data processing system of
17. The data processing system of
19. The article of
data processing system executable instructions for sampling a portion of the data to generate a sample result; and
data processing system executable instructions for comparing the sample result to statistics previously collected for the data whereby said likelihood is responsive to said comparing.
20. The article of
21. The article of
22. The article of
23. The article of
24. The article of
25. The article of
26. The article of
27. The article of
|
The invention relates to database management systems, particularly to a data processing system implemented method, a data processing system and an article of manufacture for collecting statistics associated with data stored in a database.
Many database management systems (DBMS) are configured to collect various statistics for data tables. Such statistics may be later used by a query optimizer for selecting a plan of execution for operation on the database from a plurality of plans for optimizing query access of data or for other handling of the data tables. Database administrators (DBA) often configure their databases to collect statistics using hard coded scripts run during system maintenance hours to avoid disrupting operation of the database for users. In many database management systems not having a DBA, statistics are often not collected.
In addition to DBS-controlled statistics collection, automated statistics collection is available in some DBMS. However, currently known automated operations may result in severe impact to the performance of the database. For example, automated statistics collection operations may be initiated during heavy production periods impacting user response.
Therefore a statistics collection solution that addresses at least one of these needs is desired.
The present invention relates to collecting statistics automatically for data in a database. In accordance with an aspect of the invention, there is provided a method for automated statistics collection for data in a database. The method comprises determining a likelihood that statistics for the data have changed; and collecting statistics for the data in response to the likelihood. The step of determining may comprise sampling a portion of the data to generate a sample result; and comparing the sample result to statistics previously collected for the data. The likelihood is thus responsive to the comparing. The step of determining may comprise examining metadata maintained for the data whereby the likelihood is responsive to the examining. Metadata may comprise at least one of: an indicator of a change in a size of the data; an indicator of a change in a content of the data; and an indicator of a recent use of the data.
Preferably, in accordance with a feature of the method, the data is scheduled for periodic automated statistics collection and the step of determining is responsive to the scheduling. Scheduling may be modified in response to the likelihood that statistics for the data have changed or in response to a change to the statistics collected for the data.
When collecting statistics, the collecting may be throttled to reduce a consumption of resources in response to a contending need for the resources. Further, collecting statistics may comprise rejecting a resource lock for the collecting in response to a contending need for the resource lock. When the data is stored in tables formed from columns and rows, the steps of determining and collecting may be performed for individual tables.
Other aspects of the invention such as apparatus, computer system, database management system and computer readable media aspects, among others, will be apparent to those of ordinary skill in the art.
Advantageously, indicators of the likelihood that statistics have changed may be useful to trigger automated statistics collection. Tables having statistics that change significantly over time may be collected more often than statistics of tables that are stable. A preferred model is provided to facilitate the collection of statistics that are more relevant: a table is scheduled for statistics collection in accordance with observed patterns of table activity; a table is considered for collection if it meets a threshold level of activity; and a table is sampled to predict whether the statistics to be collected have changed. When all three criteria are met, automated statistics collection is indicated.
If any of the evaluated criteria (e.g. a low level of table activity or stable sample statistics) indicate a low likelihood that statistics have changed, statistics are not collected for the table and the results of the evaluations may be reflected in the observed patterns of table activity for purposes of future scheduling.
The above model seeks to minimize the number of statistic maintenance tasks by ensuring that collected statistics are very likely to have changed. An approach which only considers a level of table activity may unnecessarily collect statistics that have not changed. Indeed, if the statistics have changed, a high level of table activity must be satisfied. However, a high level of activity alone itself does not ensure that statistics have changed.
When collecting statistics, two additional methods of operation may be useful to reduce system overhead. Utility throttling may be employed to adaptively reduce the aggressiveness of the statistics collection operations and minimize impact on a database user's response experience by reducing resource consumption (e.g. processor (CPU time) and input/output resources). Lock contention avoidance, whereby all resource locks obtained by the automated statistics collection operations have a lower priority than such locks obtained for other workload operations (e.g. user initiated activity), may also be performed to lessen impact on the user experience. If lock wait or deadlock situations occur involving locks held by automated statistics collection operations, such locks are rejected and statistics collection rolled back in preference to contending needs for the resource locks.
In an aspect of the invention, there is provided a data processing system implemented method of collecting statistics associated with data stored in a database, the database operatively coupled to a data processing system, the data processing system implemented method including determining a likelihood that currently computed statistics associated with the data have changed, and collecting updated statistics for the data in response to the determined likelihood.
In another aspect of the invention, there is provided a data processing system for collecting statistics associated with data stored in a database, the database operatively coupled to the data processing system, the data processing system including a determining module for determining a likelihood that currently computed statistics associated with the data have changed, and a collecting module for collecting updated statistics for the data in response to the determined likelihood.
In another aspect of the invention, there is provided an article of manufacture for directing a data processing system to collect statistics associated with data stored in a database, the database operatively coupled to the data processing system, the article of manufacture including a program usable medium embodying one or more instructions executable by the data processing system, the one or more instructions including data processing system executable instructions for determining a likelihood that currently computed statistics associated with the data have changed, and data processing system executable instructions for collecting updated statistics for the data in response to the determined likelihood.
The present invention will now be described in greater in reference to the figures referred to below wherein:
In the following description of the embodiments, reference to the accompanying drawings are by way of illustration of an example by which the invention may be practiced. It will be understood that other embodiments may be made without departing from the scope of the invention disclosed.
The present invention will now be described with reference to
In accordance with features of the present embodiment, a 3-tier model to detect tables of data (typically stored in rows and columns) in a database that require statistics collection is shown. The 3-tier model may be supplemented with scheduling and prioritizing operations for periodically considering a particular table for statistics collection and ordering the collection of statistics for those scheduled tables that indicate a likelihood that their respective statistics have change and thus need collecting.
The first tier of the model examines change activity on tables since the last time statistics the tables' respective statistics were collected. Table metadata maintained for the table such as a count of row changes (updates, deletions insertions), current and prior table size, etc. may be examined. The second tier examines table distribution changes (e.g. column histograms), for example, through sampling table data. A third tier collects statistics in an asynchronous low overhead manner, exploiting utility throttling and lock contention avoidance to minimize an impact on user experience for contending resource needs.
With reference to
If the table has not been accessed recently, via No branch to step 206, operations continue to prune the table from the list, removing the table from further current collection of statistics. If the table has been accessed, via Yes branch to step 208, operations examine one or more measures of table activity (e.g. metadata for the table) with a view to determining the extent of recent activity in the form of significant changes to the table. Metadata that may be conveniently maintained for the table by a database management system or other means for monitoring table access may include a count of changes to rows of the table. Updates, insertions and deletions may be counted in response to user queries for example. The count may be normalized relative to table size. A threshold for changes may be defined (e.g. 10% of the number of table rows) to determine significant changes.
Similarly, a record of current and prior table size may be maintained to provide an indication of major table changes. If significant table changes have not occurred the table may be pruned at step 206 via No branch from step 208. If significant changes have occurred, via Yes branch to step 210, the table may be sampled to generate sampling results, for example, sample statistics similar to those defined for collection over the entire table. At step 212 the results are examined for significant changes relative to previously collected statistics for the table.
The sample statistics are compared to the table statistics, and if the two sets are found to be similar (the difference may be measured by combining several tests), then statistics collection is postponed for the table. Only when the sample and previous statistics sets are different are table statistics collected and updated for the table. Histograms collected for each column (collecting histograms is de-facto standard for all database servers) may be used for comparison purposes. However, the present invention does not put any restrictions on the methods used to decide whether the sampled and previous statistics are different or not. For example, table cardinality, the set of most frequent values could be potentially used to determine the differences.
If the sample statistics do not indicate a likelihood that the statistics have changed, via No branch to step 206, the table is pruned. Otherwise, via Yes branch to step 216, operations end and the table remains in the list for current statistics collection.
From step 206, scheduling information for the table may be updated at step 214 in response to the likelihood that statistics have not changed. This scheduling information may be used to schedule the table less frequently in accordance with operations for scheduling tables.
A workload management system typically allows the user to partition tasks on their system so that quality of service objectives may be met. Typically this is done by controlling or limiting how the system resources are allocated amongst tasks. On a DBMS with integrated workload management capabilities with a high degree of control for specific tasks—this may involve an automated task manager always invoking the task with the lowest workload priority class. On a DBMS without workload management, this may involve using an operating system capability to lower the priority of a process/thread performing the work for the task. Yet another mechanism controls the utilities resource consumption by forcing the utility to sleep (or yield) for a portion of its time. Either mechanism can be used to limit the impact of the collection utility. The mechanism selected to limit the impact of the collection utility should not impede its progress when it is not impacting the production workload. Only restricting a utility when necessary to preserve the impact policy is often referred to as “throttling” the utility.
The configuration options will vary depending on the mechanism for limiting the impact of the collection utility. This is often referred to as a workload “policy” definition. One preferred embodiment for regulating the impact of non-production work uses a single parameter to limit how the production workload may be impacted. For example, setting an impact parameter to 10% means that the cumulative impact of background utilities on the production workload should not exceed 10%. In other words, the impact parameter defines the maximum slowdown in the workload performance a user is willing to tolerate. The advantage of this approach is simplicity—the DBA can confidently categorize tasks as either production or background with guarantees on the impact of background tasks on the production workload.
However, there may be drawbacks to a single parameter strategy since automated or autonomic utilities (such as statistics collection) or tasks may be initiated at any time. The policy for limiting the impact of background (i.e. automated or autonomic) tasks on production work may be too lenient for autonomically invoked jobs. Existing policies may have been defined by the DBA with the conscious knowledge of when the background utilities would be executed. For example, the DBA has set the policy too leniently because they know that they only explicitly execute maintenance tasks (i.e. backups, reorganizations) at night during off-peak hours. The problem is complicated because existing policies may have been defined in circumstances when automatic tasks may not have been a consideration. (i.e. An RDBMS upgrade introduces new autonomic tasks which did not exist in prior versions.) However, with autonomic utilities, a task may execute at any time—even during peak periods making the existing policy inadequate for limiting the tasks impact on the production workload. In accordance with one embodiment of the present approach, a second class of workload policies for automated jobs is used in addition to an explicitly defined user policy. If there are only automated tasks running on the system then the effective policy is the automated policy. Alternatively, the effective policy may be selected as the minimum of the explicitly defined policy and the autonomic policy. If there are non-automated tasks then the effective policy is the user-defined policy. Some DBMS implementations may elect to expose a parameter which allows the user to explicitly set how aggressively an autonomic task executes while others may choose to hide this configuration detail from the user.
At step 310, lock contention mechanisms are preferably employed to resolve contending needs for locked database objects and other resources. This mechanism may be operated to free locks held by the collection utility to respect a priority in favour of the lock needs of user-initiated operations. Preferably, incomplete collection utility operations resulting from the lock release(s) are rolled back and may be re-initiated.
At step 312, once the statistics are collected for a table, a determination may be made if a further table is in the list. If so, via Yes branch to step 306, statistics are collected for the next table. If not, via No branch to step 314, statistics collection operations 108 end.
At the end of the table processing, operations 112 to modify scheduling for the next check may be performed.
Persons of ordinary skill in the art will appreciate that the embodiment described is an example and may be modified without departing from the scope of the invention. For example, though a scheduling mechanism is described to periodically schedule tables for collection in an adaptively variable manner, tables may be selected for collection in other ways. For example, the scheduling need not be adaptive. As well tables may be pre-grouped for different periodic scheduling based on anticipated collection frequencies or limited observed behaviour. As well a step of ranking tables to order the collection of tables indicated for selection is optional. Though preferred to enhance collection for prioritized tables, situations may indicate ranking provides few or limited advantages.
When pruning a table, a step of examining table artefacts indicating recent use may be omitted. Though helpful to indicate a likelihood of high table activity, such a step may be difficult in certain implementations of a DBMS. Persons of ordinary skill in the art will appreciate that various metadata maintained for a table to indicate the extent of table activity may be used. Preferably such metadata is maintained in a low overhead and easily accessible manner that will not impact database performance. Metadata may be stored in a memory closely coupled to the database management system, for example.
The model described herein advantageously reduces table operations to automatically collect statistics. The model provides an approach which will “learn” that although a table may experience significant table activity, its statistics may not be changing. Despite the table activity, the model will minimize the maintenance time to collect the tables statistics. Adaptively scheduling collection may further reduce maintenance time. Moreover, collection operations may be throttled and operated to prioritize non-maintenance (i.e. production) activities to enhance user experience.
The above-described embodiments of the invention are intended to be examples of the present invention, and alterations and modifications may be affected thereto, by a person of skill in the art, without departing from the scope of the invention which is defined by the claims appended hereto.
Rose, Kevin R., Lightstone, Sam Sampson, Popivanov, Ivan
Patent | Priority | Assignee | Title |
10372578, | Mar 30 2015 | International Business Machines Corporation | Dynamically adjusting statistics collection time in a database management system |
10387283, | Mar 30 2015 | International Business Machines Corporation | Dynamically adjusting statistics collection time in a database management system |
10430411, | Apr 30 2013 | MICRO FOCUS LLC | Incrementally updating statistics |
11099960, | Mar 30 2015 | International Business Machines Corporation | Dynamically adjusting statistics collection time in a database management system |
11704216, | Mar 30 2015 | International Business Machines Corporation | Dynamically adjusting statistics collection time in a database management system |
8078652, | Dec 06 2007 | Oracle International Corporation | Virtual columns |
Patent | Priority | Assignee | Title |
5542089, | Jul 26 1994 | CISCO TECHNOLOGY, INC , A CORPORATION OF CALIFORNIA | Method and apparatus for estimating the number of occurrences of frequent values in a data set |
5781735, | Nov 16 1994 | HALL, RODERICK C M | Method for monitoring and managing operational characteristics of workstations on a network without user network impact |
5813002, | Jul 31 1996 | International Business Machines Corporation | Method and system for linearly detecting data deviations in a large database |
5870752, | Aug 21 1997 | NCR Voyix Corporation | Incremental maintenance of an approximate histogram in a database system |
5950185, | May 20 1996 | THE CHASE MANHATTAN BANK, AS COLLATERAL AGENT | Apparatus and method for approximating frequency moments |
6029163, | Feb 10 1997 | Oracle International Corporation | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
6434570, | Aug 26 1998 | Lucent Technologies Inc | Method and apparatus for estimating a percentile for a value |
6477523, | Dec 03 1999 | TERADATA US, INC | Selectivity prediction with compressed histograms in a parallel processing database system |
6529901, | Jun 29 1999 | Microsoft Technology Licensing, LLC | Automating statistics management for query optimizers |
6535870, | Feb 09 2000 | International Business Machines Corporation | Method of estimating an amount of changed data over plurality of intervals of time measurements |
7302422, | Apr 14 2004 | GOOGLE LLC | Query workload statistics collection in a database management system |
20020087518, | |||
20020124001, | |||
20040034643, | |||
WO104798, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 07 2004 | POPIVANOV, IVAN | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 015750 | /0596 | |
Sep 07 2004 | ROSE, KEVIN R | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 015750 | /0596 | |
Sep 08 2004 | LIGHTSTONE, SAM SAMPSON | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 015750 | /0596 | |
Nov 22 2004 | International Business Machines Corporation | (assignment on the face of the patent) | / | |||
Dec 30 2013 | International Business Machines Corporation | TWITTER, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 032075 | /0404 | |
Oct 27 2022 | TWITTER, INC | MORGAN STANLEY SENIOR FUNDING, INC | SECURITY INTEREST SEE DOCUMENT FOR DETAILS | 061804 | /0001 |
Date | Maintenance Fee Events |
May 28 2010 | ASPN: Payor Number Assigned. |
Jan 24 2014 | REM: Maintenance Fee Reminder Mailed. |
Feb 27 2014 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Feb 27 2014 | M1554: Surcharge for Late Payment, Large Entity. |
Jan 29 2018 | REM: Maintenance Fee Reminder Mailed. |
Mar 29 2018 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
Mar 29 2018 | M1555: 7.5 yr surcharge - late pmt w/in 6 mo, Large Entity. |
Dec 15 2021 | M1553: Payment of Maintenance Fee, 12th Year, Large Entity. |
Date | Maintenance Schedule |
Jun 15 2013 | 4 years fee payment window open |
Dec 15 2013 | 6 months grace period start (w surcharge) |
Jun 15 2014 | patent expiry (for year 4) |
Jun 15 2016 | 2 years to revive unintentionally abandoned end. (for year 4) |
Jun 15 2017 | 8 years fee payment window open |
Dec 15 2017 | 6 months grace period start (w surcharge) |
Jun 15 2018 | patent expiry (for year 8) |
Jun 15 2020 | 2 years to revive unintentionally abandoned end. (for year 8) |
Jun 15 2021 | 12 years fee payment window open |
Dec 15 2021 | 6 months grace period start (w surcharge) |
Jun 15 2022 | patent expiry (for year 12) |
Jun 15 2024 | 2 years to revive unintentionally abandoned end. (for year 12) |