A computer-implemented method, computer program product, and system for data quality monitoring include measuring a data quality of loaded data relative to a predefined data quality metric. The measuring of the data quality includes identifying delta changes in at least one of the loaded data and the data quality rules relative to a previous measurement of the data quality of the loaded data. A logical calculus defined in the data quality rules is applied to the identified delta changes.
1. A computer-implemented method for data quality monitoring, the method comprising:
measuring, by a computer, a data quality of loaded data relative to a predefined data quality metric, the measuring comprising:
identifying delta changes in at least one of the loaded data and data quality rules relative to a previous measurement of the data quality of the loaded data, wherein a delta change of the data quality rules comprises at least one change in at least one term of at least one data quality rule, said at least one data quality rule containing a concatenation of terms; and
applying a logical calculus defined in the data quality rules to the identified delta changes;
applying the changed term to one of the loaded data and the delta change of the loaded data;
creating a plurality of columnar data objects, with each of the loaded data; and
caching intermediary results of application of each of the terms of the data quality rules to individual data columns, along a dependency path gained by a decomposition of the data quality rules into their individual terms and embodied as a tree structure;
wherein the application of the data quality rules is performed based on one of an initial data load and a delta data load.
15. A system for data quality monitoring, the system comprising:
a computer processor, the system configured to perform a method comprising:
measuring a data quality of loaded data relative to a predefined data quality metric, the measuring comprising:
identifying delta changes in at least one of the loaded data and data quality rules relative to a previous measurement of the data quality of the loaded data, wherein a delta change of the data quality rules comprises at least one change in at least one term of at least one data quality rule, said at least one data quality rule containing a concatenation of terms; and
applying a logical calculus defined in the data quality rules to the identified delta changes;
applying the changed term to one of the loaded data and the delta change of the loaded data;
creating a plurality of columnar data objects, with each of the loaded data; and
caching intermediary results of application of each of the terms of the data quality rules to individual data columns, along a dependency path gained by a decomposition of the data quality rules into their individual terms and embodied as a tree structure;
wherein the application of the data quality rules is performed based on one of an initial data load and a delta data load.
11. A computer program product for data quality monitoring, the computer program product comprising:
a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to perform:
measuring a data quality of loaded data relative to a predefined data quality metric, the measuring comprising:
identifying delta changes in at least one of the loaded data and data quality rules relative to a previous measurement of the data quality of the loaded data, wherein a delta change of the data quality rules comprises at least one change in at least one term of at least one data quality rule, said at least one data quality rule containing a concatenation of terms; and
applying a logical calculus defined in the data quality rules to the identified delta changes;
applying the changed term to one of the loaded data and the delta change of the loaded data;
creating a plurality of columnar data objects, with each of the loaded data; and
caching intermediary results of application of each of the terms of the data quality rules to individual data columns, along a dependency path gained by a decomposition of the data quality rules into their individual terms and embodied as a tree structure;
wherein the application of the data quality rules is performed based on one of an initial data load and a delta data load.
2. The computer-implemented method of
extracting the data set;
loading the data set; and
transforming the data set into a columnar format with an individual data column per element of the data set.
3. The computer-implemented method of
depicting one column cache of each data column as a leaf node;
depicting one column cache of each term of the data quality rules as a branch node;
depicting one column cache of each subset of individual terms of the data quality rules as a branch node; and
depicting one column cache of each representation of an overall result of the applied one or more data quality rules as a root node.
4. The computer-implemented method of
5. The computer-implemented method of
applying an update strategy based on a type of operator in the data quality rules based on the dependency; and
minimizing a number of updates to the intermediary column caches based on the update strategy.
6. The computer-implemented method according to
scalar operations;
multi column scalar operations;
aggregation operations; and
multi-column, multi-value operations.
7. The computer-implemented method according to
direct update of individual intermediary column caches having a dependency on the updated individual data columns;
marking affected rows of individual intermediary column caches and re-computing the affected rows later; and
invalidating individual intermediary column caches.
8. The computer-implemented method of
9. The computer-implemented method of
decomposing a selected quality rule into single terms and projecting them onto a tree structure with each node of the tree structure corresponding to one individual term of the selected quality rule;
applying each individual data column to the tree structure resulting in an intermediary column cache for each node containing an intermediate result of a selected operation;
updating, based on the delta change of the loaded data, the intermediary column cache of a node based on determining that the delta change of the loaded data affects content of the intermediary column cache of said node; and
updating, based on the delta change of the data quality rule, the tree structure with the updated data quality rule and updating the intermediary column cache based on determining that the delta change of the data quality rule affects the content of the intermediary column cache of said node.
10. The computer-implemented method of
12. The computer program product of
extracting the data set;
loading the data set; and
transforming the data set into a columnar format with an individual data column per element of the data set.
13. The computer program product of
propagating the update of a data column to intermediary column caches that have a dependency on an individual related data column based on the tree structure.
14. The computer program product of
identifying, using a change data capture mechanism, the delta changes of the loaded data.
16. The system of
extracting the data set;
loading the data set; and
transforming the data set into a columnar format with an individual data column per element of the data set.
17. The system of
The present application claims priority to European Patent Application No. 11172421.7, filed on Jul. 1, 2011, and all the benefits accruing therefrom under 35 U.S.C. §119, the contents of which are incorporated herein by reference in their entirety.
The present invention relates generally to data profiling during extract-transform-load (ETL) processes, and more particularly, to data quality monitoring by running data quality rules and comparing the results against previous data quality results to determine whether or not data quality has changed.
In data integration projects for master data management (MDM) systems, data warehousing (DW) systems, business application consolidation systems, and the like, data quality issues are identified using data profiling techniques and/or data cleansing approaches in ETL processes. These issues are identified so that only high-quality data is loaded during the initial load phase of these systems. However, when data quality degrades in business applications, data quality issues become a cost factor for enterprises and may even break the business processes entirely. Periodically measuring the data quality using data quality rules is one way to detect the speed of degradation and/or the change of data quality over time. Periodic measuring may also be used as a prompt for an action if certain minimal required data quality key performance indicators (KPIs) are no longer being met. For MDM systems, a data governance (DG) program is established alongside the deployment of the MDM system to control the creation, maintenance, and use of master data and master data quality throughout its life cycle. Data stewards working in the data governance organization apply data profiling measurements periodically in order to control compliance with data quality KPIs for the master data. A measurement is often done using semantic rules, which is one of many data profiling techniques. Data quality monitoring includes defining data quality KPIs, creating semantic rules, creating a first baseline measurement during initial load, periodically executing the semantic rules, and comparing the results against a baseline result.
Data profiling and data quality monitoring are input/output (I/O) intensive and time consuming operations. Therefore, for data quality profiling and data quality monitoring, data is typically extracted into a staging area in order to avoid performance degradation of an application due to the additional I/O requirements of data extraction. In some instances applications do not allow direct access to the underlying database without using an application specific mechanism. Another reason that the data is extracted into a staging area is to avoid functional issues for the application due to structured query language (SQL) statement concurrency issues caused by conflicts between SQL created by the application and SQL created by the data profiling tool operating in the application database at the same time.
The initial full data extraction required for systems such as DW or business application consolidation often requires a full weekend, which may cause performance degradation of the application for an extended period of time due to the increased I/O requirements. For some systems it may be possible to periodically perform the extract over a weekend. For applications such as e-commerce systems which operate permanently, or for other critical systems, finding a good time to perform the data extract may be difficult. If data quality monitoring is not done, degradation of data quality will remain undetected until the business processes break down or other business issues arise.
Currently known data quality monitoring techniques process all of the data that is in scope for the baseline measurement when performing additional measurements. The volume of data is also constantly growing and the time window between two measurements is shrinking. As a result, one measurement might not be complete by the time the next one is scheduled to begin, which makes data quality monitoring difficult to perform.
Embodiments include a system, a computer-implemented method, and a computer program product for data quality monitoring. The data quality monitoring includes measuring a data quality of loaded data relative to a predefined data quality metric. The measuring includes identifying delta changes in at least one of the loaded data and the data quality rules relative to a previous measurement of the data quality of the loaded data. A logical calculus defined in the data quality rules is applied to the identified delta changes.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments and, together with the description, serve to explain the principles of the present invention. Embodiments of the invention are explained in detail with reference to the drawings.
Embodiments described herein monitor the quality of data over time by comparing a current data quality analysis result against a baseline data quality result. In embodiments, in order to optimize the operational technique, the data profiling analysis is performed only on the delta changes of data, only on the delta changes of quality rules, and/or on a combination thereof. In embodiments, the data quality monitoring method is based on a change data capture (CDC) mechanism, such as a transaction-log-based mechanism or trigger-based replication, exploiting cross-reference tables during delta loads. Columnar store techniques are used to minimize the execution time of data quality rules on the data by a decomposition of the data quality rules, as well as to minimize additional memory through high compression and a propagation strategy based on differentiating the operations of the quality rules.
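The following sketch illustrates the decomposition idea on a simple rule such as 'ucase(Col 1)+ucase(Col 2)': the rule is split into individual terms that form a dependency tree, and each node keeps a column cache with its intermediary result. This is only an illustrative assumption; the class and method names (TermNode, evaluate) are not part of the described embodiment.

    class TermNode:
        """One term of a decomposed data quality rule; caches its intermediary result."""
        def __init__(self, op, children=(), column=None):
            self.op = op                  # 'col', 'ucase' or 'concat'
            self.children = list(children)
            self.column = column          # leaf nodes reference a data column
            self.cache = None             # intermediary column cache: row ID -> value

        def evaluate(self, columns):
            if self.cache is not None:    # reuse the cached intermediary result
                return self.cache
            if self.op == 'col':          # leaf node: the raw data column
                result = dict(columns[self.column])
            elif self.op == 'ucase':      # scalar operation on one child
                child = self.children[0].evaluate(columns)
                result = {rid: value.upper() for rid, value in child.items()}
            elif self.op == 'concat':     # multi-column scalar operation on two children
                left = self.children[0].evaluate(columns)
                right = self.children[1].evaluate(columns)
                result = {rid: left[rid] + right[rid] for rid in left}
            else:
                raise ValueError('unsupported term: ' + self.op)
            self.cache = result
            return result

    # 'ucase(Col 1)+ucase(Col 2)' as a dependency tree: a root node depending on two
    # branch nodes, which in turn depend on the leaf data columns.
    rule = TermNode('concat', [
        TermNode('ucase', [TermNode('col', column='Col 1')]),
        TermNode('ucase', [TermNode('col', column='Col 2')]),
    ])
    columns = {'Col 1': {1: 'abc', 2: 'def'}, 'Col 2': {1: 'xy', 2: 'zw'}}
    print(rule.evaluate(columns))   # {1: 'ABCXY', 2: 'DEFZW'}

When only a delta of the loaded data changes, the caches of unaffected nodes can be reused instead of re-evaluating the full rule.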
A change data capture (CDC) mechanism, CDC process 104, operates between the data source 100 and the staging area 106 to identify the delta changes that will be loaded using a delta load process 112. During the delta load process 112, a new set of operations is applied to the data in order to minimize the impact to the columnar structures 122. The new process includes using data columns and column caches in the main memory 120. In an embodiment, the results of the delta quality monitoring performed by the DDQM 118 are visualized by using a data quality monitoring application deployed on the application server 116 and invoked by the application client 115. The quality rules, which define a logical calculus to be applied to the data, are also stored and versioned in the metadata repository 114, which the application server 116 accesses to run on the main memory 120. The metadata repository 114 includes the columnar structures 122, including the data columns and the column caches. After the execution of the quality rules completes, the results of the quality measurement are stored and may be used at a later time for data trend analysis investigation in the profiling mart 124.
Referring now to
The following drawings illustrate embodiments for evaluating quality rules on loaded data that is transformed into the columnar format.
Once an operation has been computed (e.g., a ‘ucase(Col 1)’ 506 in
After all quality rule operations have been evaluated, the main memory 120 contains columnar structures 122 that include data columns, column caches, and dependencies between the column caches, as shown in
The DDQM 118 stores the dependent relations between the data columns and the column caches containing the operations of the quality rules, thus storing the information about which column caches have to be updated or invalidated in response to a modification of the loaded data. For scalar functions such as 'ucase(Col 1)' 506 and 'ucase(Col 2)' 508, only the values of the column caches are updated. Therefore, only the new distinct values have to be applied to the single terms of the quality rules, which shortens the calculation time. A modification of the column cache for 'ucase(Col 1)' 506a and 'ucase(Col 2)' 508a propagates a modification or invalidation to the column caches that depend on them. For a multi-column scalar operation such as 'ucase(Col 1)+ucase(Col 2)' 510a, the computation of the new value for row 2 may be more complex than for a simple scalar function because the same row IDs must be found in 'ucase(Col 1)' 506a and 'ucase(Col 2)' 508a before the new concatenated value is computed. The DDQM 118 may simply decide to re-assign the row ID 2 to the special value 'unknown', indicating that the next time an execution of a quality rule requests this column cache, all row IDs assigned to the 'unknown' value have to be recomputed.
The update and/or invalidation of column caches containing operations of the quality rules propagate other updates and/or invalidations to other column caches that contain operations of the quality rules. In an embodiment, this update and/or invalidation propagation follows the dependency path as given by the decomposition tree of the data rule. If the immediate update of the dependent columnar cache is not possible, only the modified data values are marked as being ‘unknown’ and not the whole columnar cache. The column caches may remain in this partially unknown status until the next execution of a selected quality rule requires one of these column caches. When this happens, only the row IDs marked as ‘unknown’ have to be recomputed, therefore speeding up the evaluation of the values for all entries in the source data set.
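A minimal sketch of this 'unknown' marking and lazy re-computation follows, assuming a column cache keyed by row ID; the class and method names (ColumnCache, mark_unknown, get) are illustrative assumptions, not those of the described embodiment.

    UNKNOWN = object()   # sentinel for row IDs whose cached value is stale

    class ColumnCache:
        """Column cache that propagates invalidation along its dependency path."""
        def __init__(self, compute_row, dependents=()):
            self.values = {}                 # row ID -> cached value
            self.compute_row = compute_row   # recomputes a single row on demand
            self.dependents = list(dependents)

        def mark_unknown(self, row_id):
            # Invalidate one row only and propagate to dependent column caches.
            self.values[row_id] = UNKNOWN
            for dependent in self.dependents:
                dependent.mark_unknown(row_id)

        def get(self):
            # Recompute only the row IDs marked 'unknown' before returning values.
            for row_id, value in self.values.items():
                if value is UNKNOWN:
                    self.values[row_id] = self.compute_row(row_id)
            return self.values

    # Example: the cache for 'ucase(Col 1)' feeds the cache for the concatenation.
    col1 = {1: 'abc', 2: 'def'}
    col2 = {1: 'xy', 2: 'zw'}
    concat_cache = ColumnCache(lambda rid: col1[rid].upper() + col2[rid].upper())
    ucase_cache = ColumnCache(lambda rid: col1[rid].upper(), dependents=[concat_cache])
    ucase_cache.values = {1: 'ABC', 2: 'DEF'}
    concat_cache.values = {1: 'ABCXY', 2: 'DEFZW'}

    col1[2] = 'ghi'               # a delta change of the loaded data
    ucase_cache.mark_unknown(2)   # propagates the invalidation to concat_cache
    print(concat_cache.get())     # {1: 'ABCXY', 2: 'GHIZW'} - only row 2 is recomputed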
In an embodiment, when operations are aggregated, a change in the value of one row may result in re-computing the whole column cache that contains the operation, with its dependent subordinate columnar caches having to be completely recomputed the next time the execution of a quality rule requires them. Performance is still improved, however, because another change to the column cache will not be needed, since the operations were aggregated.
Column caches containing operations of terms of the data quality rules which are not dependent on the modified data columns remain unchanged.
Minimal Effort Strategy for Updating Data Columns and Column Caches
In an embodiment, a minimal update strategy is used for updating data columns and column caches. As described above, when a cache becomes invalid due to the underlying data being changed, it triggers all dependent caches to also become invalid. However, following the dependency graph and re-computing the cache and all of its subsequent caches is cost- and time-intensive. Furthermore, a re-computation of all the caches is not always necessary, and it may be possible to update only the single cache. In an embodiment, in order to determine whether an update on a cache is possible or a full re-computation is needed, the operation categories must be differentiated. For the following examples, it is assumed that all expressions are decomposed so that only atomic expressions remain.
Example: Col 1 + Col 2 + Col 3 → (Col 1 + Col 2) + (Col 3)
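A simple recursive split into atomic (binary) expressions could look like the following sketch; the function name and representation are assumptions for illustration only.

    # Decompose a flat expression over several columns into nested atomic
    # (binary) expressions, e.g. Col 1 + Col 2 + Col 3 -> (Col 1 + Col 2) + Col 3.
    def decompose(operands, op='+'):
        if len(operands) == 1:
            return operands[0]
        return (decompose(operands[:-1], op), op, operands[-1])

    print(decompose(['Col 1', 'Col 2', 'Col 3']))
    # (('Col 1', '+', 'Col 2'), '+', 'Col 3')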
Overview of operation categories and their scope in accordance with an embodiment:
Operator category | Scope | Description | Examples
Scalar operations | Individual values of a data set | Operations that work on individual scalar values of exactly one data set. Constant values may be used. | ucase(Col 1); sqrt(Col 2); abs(Col 1 * 19,5583); Col 1 + 10; Col 1/2.0
Multi-column scalar operations | Individual values of multiple data sets | Operations that work on individual scalar values of multiple data sets. Constant values may be used. | Col 1 + Col 2; trim(Col 1 + Col 2); Col 1/Col 2; Col 1 + Col 2 * 1,95583
Aggregation operations | Multiple/all values of a data set | In contrast to scalar operations, aggregation operations do not work on individual values; instead they use the complete data set to calculate their results. | min(Col 1); max(Col 2); avg(Col 1); median(Col 2); count(Col 3)
Multi-column, multi-value operations (set operations) | Multiple/all values of multiple data sets | One value of a data set depends on, or is related to, one or many values of a different data set. Relationships between data tables and their check/lookup tables; Cartesian products. | Inner join; left outer join
In an embodiment, the scope of data that an operator category works on determines whether the operation's cache can be updated in place or whether the complete cache must be marked as invalid and all results recomputed. The following table gives an overview of the update strategy taken on a cache in case the underlying data set has changed.
Cache update strategy for operator category:

Operation on data set | Scalar operations | Multi-column scalar operations | Aggregation operations | Multi-column, multi-value operations
Example | UCASE(col1) | Col 1 + Col 2 | avg(Col 3) | Inner join
addNewValue | addNewValue(f(v)) or addRidToValue(RID, f(v)) | markInvalid(RID) | markAllInvalid( ) | markAllInvalid( )
addRowIdToValue | addRidToValue(RID, f(v)) | markInvalid(RID) | markAllInvalid( ) | markAllInvalid( )
removeRowId | remove(RID, f(v)) | remove(RID, f(v)) | markAllInvalid( ) | markAllInvalid( )
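The dispatch summarized in the table above could be expressed roughly as follows; this is a sketch in which the category keys and the lookup function are assumptions, while the strategy names mirror the table.

    # Map (change on the data set, operator category) to the cache update strategy
    # from the table above. The dispatch function itself is an illustrative sketch.
    STRATEGIES = {
        ('addNewValue',     'scalar'):                   'addNewValue(f(v)) or addRidToValue(RID, f(v))',
        ('addNewValue',     'multi_column_scalar'):      'markInvalid(RID)',
        ('addNewValue',     'aggregation'):              'markAllInvalid()',
        ('addNewValue',     'multi_column_multi_value'): 'markAllInvalid()',
        ('addRowIdToValue', 'scalar'):                   'addRidToValue(RID, f(v))',
        ('addRowIdToValue', 'multi_column_scalar'):      'markInvalid(RID)',
        ('addRowIdToValue', 'aggregation'):              'markAllInvalid()',
        ('addRowIdToValue', 'multi_column_multi_value'): 'markAllInvalid()',
        ('removeRowId',     'scalar'):                   'remove(RID, f(v))',
        ('removeRowId',     'multi_column_scalar'):      'remove(RID, f(v))',
        ('removeRowId',     'aggregation'):              'markAllInvalid()',
        ('removeRowId',     'multi_column_multi_value'): 'markAllInvalid()',
    }

    def update_strategy(change, operator_category):
        """Return the cache update strategy for a given change and operator category."""
        return STRATEGIES[(change, operator_category)]

    print(update_strategy('addNewValue', 'scalar'))
    # addNewValue(f(v)) or addRidToValue(RID, f(v))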
Adding a New Value and Associated Row IDs to a Column
In an embodiment, for scalar operations the cache may be updated directly. If the result of applying the scalar operation to the new value already exists in the cache, the new row ID is linked to that result value. If the result value does not yet exist in the cache, it is added and linked to the new row ID.
In an embodiment, for multi-column scalar operations, the cache that contains the operation's result is not as easy to update as for scalar operations. The update is performed by flagging the row IDs being affected as invalid and recalculating the corresponding values the next time the cache is accessed.
In an embodiment, for aggregation and multi-column, multi-value operations such as joins, there is no straightforward way to update the cache. Therefore, the complete cache is marked as invalid and is either recomputed immediately or the next time it is accessed.
Adding a New Row ID to an Existing Value in the Column
For scalar operations the cache may be updated by linking the new row ID to the result value of the scalar operation.
For multi-column operations, no easy cache update is possible so the corresponding row ID should be flagged as “invalid” and updated the next time the cache is accessed.
For aggregation and multi-column, multi-value operations, the behavior is the same as when adding a new value: there is no straightforward way to update the cache, so the complete cache is marked as invalid and is either recomputed immediately or the next time it is accessed.
Remove a Row ID for an Existing Value in the Column
For scalar operations, the row ID is removed for the result of the scalar operation on the original value. If the result value has no associated row IDs the value can be removed from the column cache.
For multi-column scalar operations, the behavior is similar to scalar operations and the column store cache is corrected.
For aggregation and multi-column, multi-value operations, the complete cache is marked as invalid and is either recomputed immediately or the next time it is accessed.
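As a sketch of these three maintenance operations on a scalar-operation column cache that maps each distinct result value to its row IDs; the class and method names are assumptions chosen for illustration, not the described embodiment.

    # Sketch: a scalar-operation column cache stored as result value -> set of row IDs,
    # with the three maintenance operations described above. Illustrative only.
    class ScalarOpCache:
        def __init__(self, fn):
            self.fn = fn                  # the scalar operation, e.g. str.upper
            self.value_to_rids = {}       # f(v) -> {row IDs}

        def add_new_value(self, row_id, raw_value):
            """Add a new value and its row ID; create the result entry if needed."""
            result = self.fn(raw_value)
            self.value_to_rids.setdefault(result, set()).add(row_id)

        def add_row_id_to_value(self, row_id, raw_value):
            """Link an additional row ID to an already-cached result value."""
            self.value_to_rids[self.fn(raw_value)].add(row_id)

        def remove_row_id(self, row_id, raw_value):
            """Remove a row ID; drop the result value once no row IDs remain."""
            result = self.fn(raw_value)
            rids = self.value_to_rids[result]
            rids.discard(row_id)
            if not rids:
                del self.value_to_rids[result]

    cache = ScalarOpCache(str.upper)      # e.g. the cache for ucase(Col 1)
    cache.add_new_value(1, 'abc')
    cache.add_row_id_to_value(2, 'abc')   # same distinct value, new row ID
    cache.remove_row_id(1, 'abc')
    print(cache.value_to_rids)            # {'ABC': {2}}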
In an embodiment, the DDQM 118 may update column caches in the background if they do not belong to data quality rules that are currently being executed. This is determined by anticipating their needs according to a pre-set execution schedule.
In an embodiment, measuring the data quality KPIs and comparing them between two runs provides the ability to determine whether data quality has increased or decreased over time. In an embodiment, the impact on application system performance is minimized and the time for performing data quality monitoring is shortened. An embodiment may overcome the problems currently existing in data profiling and data quality monitoring by performing the data quality analysis only on delta changes of data, only on delta changes of quality rules, and/or on a combination thereof.
The embodiments provided describe only certain combinations of the various features discussed herein for purposes of clarity. It will be appreciated by those of ordinary skill in the art that various other combinations are evident to a skilled person studying this description.
While the foregoing has been described with reference to particular embodiments, it will be appreciated by those skilled in the art that changes may be made to these embodiments.
Seifert, Jens, Saillet, Yannick, Oberhofer, Martin, Nelke, Sebastian