A computer readable storage medium includes executable instructions to analyze a categorical dimension of multi-dimensional data as a function of entropy to form entropy results. The entropy results are plotted as a stacked bar chart. A user graphically navigates through the stacked bar chart.
1. A non-transitory computer readable media, comprising executable instructions for one or more data processors to:
analyze a categorical dimension of multi-dimensional data as a function of entropy to form entropy results;
plot the entropy results as a stacked bar chart; and
allow a user to graphically navigate through the stacked bar chart to display different views of the stacked bar chart, each view of the different views displaying data structured according to respective recalculated entropy results associated with each view;
wherein entropy H(x) of a categorical dimension is determined using:
$$H(x) = \sum_{i=1}^{n} p_i \log_2\left(\frac{1}{p_i}\right)$$
where $p_i$ is the probability of each category, from i=1 to n, occurring in the categorical dimension and is calculated as the frequency or distinct count of the value for each category divided by the sum of frequencies for all values in the category.
15. A computer-implemented method for implementation by one or more data processors comprising:
analyzing, by at least one data processor, a categorical dimension of multi-dimensional data as a function of entropy to form entropy results;
plotting, by at least one data processor, the entropy results as a stacked bar chart; and
allowing, by at least one data processor, a user to graphically navigate through the stacked bar chart to display different views of the stacked bar chart, each view of the different views displaying data structured according to respective recalculated entropy results associated with each view;
wherein entropy H(x) of a categorical dimension is determined using:
$$H(x) = \sum_{i=1}^{n} p_i \log_2\left(\frac{1}{p_i}\right)$$
where $p_i$ is the probability of each category, from i=1 to n, occurring in the categorical dimension and is calculated as the frequency or distinct count of the value for each category divided by the sum of frequencies for all values in the category.
16. A computer system comprising:
a central processing unit;
memory coupled to the central processing unit storing executable programs to cause the central processing unit to perform operations comprising:
analyzing a categorical dimension of multi-dimensional data as a function of entropy to form entropy results;
plotting the entropy results as a stacked bar chart; and
allowing a user to graphically navigate through the stacked bar chart to display different views of the stacked bar chart, each view of the different views displaying data structured according to respective recalculated entropy results associated with each view,
wherein entropy H(x) of a categorical dimension is determined using:
$$H(x) = \sum_{i=1}^{n} p_i \log_2\left(\frac{1}{p_i}\right)$$
where $p_i$ is the probability of each category, from i=1 to n, occurring in the categorical dimension and is calculated as the frequency or distinct count of the value for each category divided by the sum of frequencies for all values in the category.
2. The non-transitory computer readable media of
3. The non-transitory computer readable media of
4. The non-transitory computer readable media of
5. The non-transitory computer readable media of
6. The non-transitory computer readable media of
7. The non-transitory computer readable media of
8. The non-transitory computer readable media of
9. The non-transitory computer readable media of
10. The non-transitory computer readable media of
11. The non-transitory computer readable media of
12. The non-transitory computer readable media of
13. The non-transitory computer readable media of
14. The non-transitory computer readable media of
17. The non-transitory computer readable media of
18. The non-transitory computer readable media of
19. The method of
20. The system of
This invention relates generally to multidimensional databases. More particularly, this invention relates to techniques for fast and informative navigation through the data of a multidimensional database.
Business Intelligence (BI) generally refers to software tools used to improve business enterprise decision-making. These tools are commonly applied to financial, human resource, marketing, sales, customer and supplier analyses. More specifically, these tools can include: reporting and analysis tools to present information, content delivery infrastructure systems for delivery and management of reports and analytics, data warehousing systems for cleansing and consolidating information from disparate sources, and data management systems, such as relational databases or On Line Analytic Processing (OLAP) systems used to collect, store, and manage raw data.
OLAP tools are a subset of business intelligence tools. There are a number of commercially available OLAP tools including Business Objects Voyager™ which is available from Business Objects Americas of San Jose, Calif. An OLAP tool is a report generation tool that is configured for ad hoc analyses. OLAP generally refers to a technique of providing fast analysis of shared information stored in a multidimensional database. OLAP systems provide a multidimensional conceptual view of data, including full support for hierarchies and multiple hierarchies. This framework is used because it is a logical way to analyze businesses and organizations. In some OLAP tools the data is arranged in a schema which simulates a multidimensional schema. The multidimensional schema means redundant information is stored, but it allows for users to initiate queries without the need to know how the data is organized.
There are other report generation tools, including tools that couple to a metadata layer that overlies a data source. The metadata layer can be a semantic metadata layer, or semantic layer, which includes metadata about the type of data within the data source. Some metadata layers map the data source fields into familiar terms, such as, product, customer, or revenue. The metadata layer can provide a multidimensional view of information in a data source. There are a number of commercially available report generation tools that are characterized by a semantic layer, including Business Objects Web Intelligence™, which is available from Business Objects Americas of San Jose, Calif.
There are known techniques for graphically portraying quantitative information. The techniques are used in the fields of statistical graphics, data visualization, and the like. A visualization is a graphic display of quantitative information; types of visualizations include charts, tables, and maps. Visualizations are produced from data in a data source (e.g., an OLAP cube, relational database) and can reveal insights into the relationships between data. The data within an OLAP cube may comprise categorical dimensions, numerical measure dimensions, and time dimensions. A categorical dimension is a data element that categorizes each item in a data set into non-overlapping regions. A numerical measure dimension comprises data defined by a computation, such as a sum or average. For example, an OLAP cube of Beverages might have categorical dimensions such as Product, Country, Color, Volume, Alcohol Level, and Sweetness and numerical measures such as Revenue and Profit Margin. The time dimension comprises data grouped in accordance with a time metric. For example, time dimensions may include Quarter 1, Quarter 2, Quarter 3, and Quarter 4. Multidimensional databases undertake to provide fast navigation and informative presentation of data inside an OLAP cube.
However, existing multidimensional databases are limited in their ability to deliver these results. Existing multidimensional databases are user driven and give little direction for effective navigation of the data therein. The problem has been compounded as the data volumes within OLAP cubes have increased, making data navigation even more complex.
In view of the foregoing, it would be highly desirable to provide an improved technique for guided navigation through the data within an OLAP cube. In particular, it would be highly desirable to provide a method for guided graphical navigation through the categorical, numerical measures, and time dimensions of an OLAP cube.
The invention includes a computer readable storage medium with executable instructions to analyze a categorical dimension of multi-dimensional data as a function of entropy to form entropy results. The entropy results are plotted as a stacked bar chart. A user graphically navigates through the stacked bar chart.
The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The CPU 108 is also connected to a memory 112 via the bus 110. The memory 112 stores a set of executable programs. One executable program is the categorical dimension module 116. The categorical dimension module 116 includes executable instructions to access a data source to construct a chart characterizing the categorical dimensions in an OLAP cube. By way of example, the data source may be database 114 resident in memory 112. The data source may be located anywhere in the network 126. The categorical dimension module 116 also includes executable instructions to allow the user to graphically navigate through the chart.
As shown in
While the various components of memory 112 are shown residing in the single computer 102, it should be recognized that such a configuration is not required in all applications. For instance, the categorical dimension module 116 may reside in a separate computer (not shown in
Entropy is a concept from information theory that may be used as a measure of the uncertainty associated with a specific categorical dimension, and thus the value of the information in that categorical dimension. Entropy may be considered a measure of the amount of information that is missing. Claude Shannon devised an entropy measure to characterize the amount of information transmitted in a message.
In one embodiment of the invention, the formula for entropy, H(x), of a categorical dimension is
$$H(x) = \sum_{i=1}^{n} p_i \log_2\left(\frac{1}{p_i}\right)$$
where $p_i$ is the probability of each category, from i=1 to n, occurring in the categorical dimension, and is calculated as the frequency or distinct count of the value of each category divided by the sum of frequencies for all values in the category. The term $\log_2(1/p_i)$ is commonly referred to as the surprisal (i.e., the degree to which you are surprised to see the result).
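By way of illustration only, the entropy formula above may be sketched in Python as follows. The function name `entropy` and the sample column `colors` are hypothetical and do not appear in the described embodiment; the sketch assumes that the categorical dimension is available as a simple sequence of category labels.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    # p_i = frequency of category i divided by the sum of all frequencies
    probabilities = [count / total for count in counts.values()]
    # Sum of p_i * log2(1 / p_i), i.e. the probability-weighted surprisal
    return sum(p * math.log2(1 / p) for p in probabilities)


# Hypothetical column with two equally likely categories
colors = ["red", "red", "white", "white"]
print(entropy(colors))  # 1.0
```

A dimension with a single category yields an entropy of zero, reflecting that observing its value conveys no information.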
Maximum entropy occurs when all outcomes in a categorical dimension are equally likely, so that $p_i = 1/n$ for every category and:
$$H_{max}(x) = \log_2 n$$
Relying upon the example of
The value of the entropy calculation is a function of the probability distribution of outcomes, pi, and the number of outcomes, n. Therefore, one criterion for ordering the categorical dimensions in the chart constructed by the categorical dimension module 116 is to calculate the entropy values as a percent of the maximum entropy value. Consider an OLAP cube with the following dimensions:
A and members A1, A2
B and members B1, B2, B3
C and members C1, C2
Measures M1, M2
Time H1, H2, Total
This cube of information may be flattened into a two-dimensional file, where each dimension and measure is represented in a column and each row represents a unique combination of the categorical dimensions in the OLAP cube. This processing results in the flattened file of
Now, consider an OLAP cube with categorical dimensions: product, country, color, volume, alcohol level and sweetness. The cube has measures of revenue and profit margin. Time is grouped in accordance with quarter 1, quarter 2, quarter 3, quarter 4 and year. This cube may be flattened into a two dimensional file in which each dimension and measure is represented in a column and each row represents each unique combination of the categorical dimensions in the OLAP cube. Entropy calculations may then be applied to the flattened file. To illustrate,
Once the probability for each category is found, the categorical dimension module 116 may then calculate the entropy associated with the categorical dimension. In this example, the total entropy 508 is 2.3649, the maximum entropy 510 of the categorical dimension is 4.3923, and the entropy percent 512 (i.e., total entropy 508 divided by maximum entropy 510, expressed as a percentage) is 53.84%.
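The arithmetic of this example may be checked with the short sketch below. The function names are hypothetical. The sketch assumes that maximum entropy is the base-2 logarithm of the number of categories; under that assumption a maximum of 4.3923 would correspond to 21 categories, a count that is inferred here rather than stated in the text.

```python
import math


def max_entropy(n_categories):
    """Entropy, in bits, when all n categories are equally likely."""
    return math.log2(n_categories)


def entropy_percent(total_entropy, maximum_entropy):
    """Total entropy expressed as a percentage of maximum entropy."""
    return 100.0 * total_entropy / maximum_entropy


# Values quoted in the example above
print(round(entropy_percent(2.3649, 4.3923), 2))  # 53.84

# Inferred, not stated: 21 equally likely categories give the same maximum
print(round(max_entropy(21), 4))  # 4.3923
```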
Categorical dimensions may be charted in the order of increasing entropy percent.
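One possible way to derive such an ordering from a flattened file is sketched below. The rows, the function names, and the choice to take the number of distinct observed categories as n are all assumptions made for illustration; the categorical dimension module 116 is not limited to this approach.

```python
import math
from collections import Counter


def entropy_percent(values):
    """Entropy of a column as a percentage of its maximum entropy."""
    counts = Counter(values)
    total = sum(counts.values())
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Assumption: n is the number of distinct categories observed
    h_max = math.log2(len(counts)) if len(counts) > 1 else 0.0
    return 100.0 * h / h_max if h_max else 0.0


def order_dimensions(rows, dimensions):
    """Return (dimension, entropy percent) pairs in increasing order."""
    scores = {d: entropy_percent([row[d] for row in rows]) for d in dimensions}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical flattened file: one row per combination of categories
rows = [
    {"Product": "Beer", "Country": "UK"},
    {"Product": "Beer", "Country": "France"},
    {"Product": "Beer", "Country": "UK"},
    {"Product": "Wine", "Country": "France"},
]

for name, percent in order_dimensions(rows, ["Product", "Country"]):
    print(name, round(percent, 2))
```

In this hypothetical data the Product column is unevenly distributed and therefore has a lower entropy percent than the evenly split Country column, so Product is charted first.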
The user may now navigate through the categorical dimension of the chart. In one embodiment of the invention, this is accomplished with executable instructions of the categorical dimension module 116. The categorical dimension module 116 allows the user to select a specific category within a categorical dimension to give a new graphical visualization for all of the categorical dimensions in the OLAP cube. For example, as shown in
Accordingly, the table 900 presents the recalculated entropy values for the Country 902 dimension when the Product 802 dimension is limited to the Beer 804 category. Similarly, the same analysis is applied to other categorical dimensions within the OLAP cube.
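The recalculation that follows a user selection may be sketched as a filter over the flattened file followed by a fresh entropy calculation. The rows and the helper `drill_down` are hypothetical; only the Product, Country, and Beer names are taken from the example above.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def drill_down(rows, dimension, category):
    """Keep only the rows in which the dimension equals the category."""
    return [row for row in rows if row[dimension] == category]


# Hypothetical flattened file
rows = [
    {"Product": "Beer", "Country": "UK"},
    {"Product": "Beer", "Country": "UK"},
    {"Product": "Beer", "Country": "France"},
    {"Product": "Wine", "Country": "France"},
    {"Product": "Wine", "Country": "Italy"},
    {"Product": "Wine", "Country": "Italy"},
]

beer_rows = drill_down(rows, "Product", "Beer")

# Entropy of Country before and after limiting Product to Beer
print(round(entropy([row["Country"] for row in rows]), 4))
print(round(entropy([row["Country"] for row in beer_rows]), 4))
```

Each further selection would apply `drill_down` to the already filtered rows, so that the entropy values driving the next chart always reflect the current selection.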
The user may then continue to navigate through the data by selecting another categorical dimension or by choosing to move back to any previous visualization. For instance, the user may choose to select the category United Kingdom 1112, having 48 rows, in the dimension Country 1104 to navigate.
In the same way, the user may continue to navigate through the categorical dimensions by continuing to select specific categories within a dimension or choosing to return to a previous visualization. Each time the user navigates to an alternate visualization, new entropy values are calculated based on the user selection to determine the ordering of the next chart. Ultimately, the user may reach a point where an additional selection cannot be made. To illustrate,
Returning to
As discussed above, the user may graphically navigate through the categorical dimensions in an OLAP cube. Accordingly, as the user navigates through the categorical dimensions, the contents of the vectors for the numerical measures will change. Similar to the categorical dimensions, the user is able to navigate through the continuous numeric data within the numerical measures dimension with the aid of a suggested navigation path. Selections from the numerical measures dimension would conversely change the contents of the categorical dimensions.
One embodiment for the graphical representation of the numerical measures dimension is the box plot (i.e., whisker plot). The box plot of the numerical measures dimension is displayed in such a way as to identify a suggested path for navigation. In order to create a box plot for a member of the numerical measures dimension, the following criteria should be determined from the member's vector of continuous numeric values: the median, the upper quartile (“UQ”) (i.e., the 75th percentile), the lower quartile (“LQ”) (i.e., the 25th percentile), the inter quartile range (“IQR”) (i.e., the UQ−the LQ), the upper inner fence (i.e., the UQ+1.5*IQR), the lower inner fence (i.e., the LQ−1.5*IQR), the upper outer fence (i.e., the UQ+3.0*IQR), the lower outer fence (i.e., the LQ−3.0*IQR), the first value above the lower inner fence, and the first value below the upper inner fence. Values outside of the outer fences are referred to herein as probable or extreme outliers. Values between the inner and outer fences are referred to herein as suspect or possible outliers.
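The box plot criteria listed above may be computed as in the following sketch. The function names and the sample vector are hypothetical, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the description does not specify how percentiles are interpolated.

```python
import statistics


def box_plot_stats(values):
    """Median, quartiles, fences, and whisker ends for a numeric vector."""
    lq, median, uq = statistics.quantiles(values, n=4, method="inclusive")
    iqr = uq - lq
    stats = {
        "median": median, "LQ": lq, "UQ": uq, "IQR": iqr,
        "lower_inner": lq - 1.5 * iqr, "upper_inner": uq + 1.5 * iqr,
        "lower_outer": lq - 3.0 * iqr, "upper_outer": uq + 3.0 * iqr,
    }
    # Whisker ends: first value above the lower inner fence and
    # first value below the upper inner fence
    stats["lower_whisker"] = min(v for v in values if v >= stats["lower_inner"])
    stats["upper_whisker"] = max(v for v in values if v <= stats["upper_inner"])
    return stats


def classify(value, stats):
    """Label a value as 'extreme', 'suspect', or 'regular'."""
    if value < stats["lower_outer"] or value > stats["upper_outer"]:
        return "extreme"
    if value < stats["lower_inner"] or value > stats["upper_inner"]:
        return "suspect"
    return "regular"


# Hypothetical vector of continuous numeric values
sample = [1, 2, 3, 4, 5, 6, 7, 15, 20]
stats = box_plot_stats(sample)
for value in sample:
    print(value, classify(value, stats))
```

For this sample the LQ is 3 and the UQ is 7, so the upper inner fence is 13 and the upper outer fence is 19; the value 15 is therefore a suspect outlier and the value 20 an extreme outlier.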
By way of example, assume that the vector of numeric values for the measure Revenue is 10, 11, 10, 9, 10, 24, 11, 12, 10, 6, 1, 11, 16, 13, and 12.
Expanding on the foregoing example, consider the following profit margin values: 18, 14, 16, 18, 15, 18, 19, 10, 8, 6, 31, 12, 16, 8, and 10. These values result in the calculations shown in
TABLE 1
Outlier | Period | Value | Distance from Median | Distance from Median % | Distance from Median Absolute | Distance from Median Absolute % |
Suspect outliers below the median | 11 | 1 | −9 | −75.00 | 9 | 75.00 |
Extreme outliers above the median | 6 | 25 | 15 | 150.00 | 15 | 150.00 |
Criteria | Value |
Total number of outliers | 2 |
Total number of extreme outliers | 1 |
Total number of outliers % total number of values | 13.33 |
Total number of extreme outliers % total number of values | 6.67 |
Highest absolute distance from median % | 150.00 |
Table 2 shows a summary of various calculations associated with the outliers identified in the numerical measure Profit Margin.
TABLE 2
Outlier | Period | Value | Distance from Median | Distance from Median % | Distance from Median Absolute | Distance from Median Absolute % |
Suspect outliers above the median | 11 | 31 | 16 | 106.67 | 16 | 106.67 |
Criteria | Value |
Total number of outliers | 1 |
Total number of extreme outliers | 0 |
Total number of outliers % total number of values | 6.67 |
Total number of extreme outliers % total number of values | 0.00 |
Highest absolute distance from median % | 106.67 |
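The Profit Margin figures of Table 2 may be reproduced with the sketch below, which applies the fence definitions given earlier to the profit margin values listed above. The function name `outlier_summary` and the dictionary keys are hypothetical, and the "inclusive" quartile convention of `statistics.quantiles` is an assumption; other conventions may classify values differently.

```python
import statistics

# Profit margin values as listed in the description, periods 1 to 15
profit_margin = [18, 14, 16, 18, 15, 18, 19, 10, 8, 6, 31, 12, 16, 8, 10]


def outlier_summary(values):
    """Identify outliers and summarize them relative to the median."""
    lq, median, uq = statistics.quantiles(values, n=4, method="inclusive")
    iqr = uq - lq
    lower_inner, upper_inner = lq - 1.5 * iqr, uq + 1.5 * iqr
    lower_outer, upper_outer = lq - 3.0 * iqr, uq + 3.0 * iqr

    outliers = []
    for period, value in enumerate(values, start=1):
        if lower_inner <= value <= upper_inner:
            continue  # inside the inner fences: not an outlier
        extreme = value < lower_outer or value > upper_outer
        distance = value - median
        outliers.append({
            "period": period,
            "value": value,
            "kind": "extreme" if extreme else "suspect",
            "distance": distance,
            "distance_pct": round(100.0 * distance / median, 2),
        })

    n_extreme = sum(1 for o in outliers if o["kind"] == "extreme")
    return {
        "outliers": outliers,
        "total_outliers": len(outliers),
        "total_extreme": n_extreme,
        "outliers_pct": round(100.0 * len(outliers) / len(values), 2),
        "extreme_pct": round(100.0 * n_extreme / len(values), 2),
        "highest_abs_distance_pct": max(
            (abs(o["distance_pct"]) for o in outliers), default=0.0
        ),
    }


summary = outlier_summary(profit_margin)
print(summary["outliers"])
print(summary["total_outliers"], summary["outliers_pct"])
```

Under these assumptions the only outlier is the value 31 in period 11, a suspect outlier lying 16 above the median of 15, which agrees with the entries of Table 2.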
Accordingly, as a higher number of total outliers was identified in the member Revenue than in the member Profit Margin, the Revenue box plot 1902 is ordered first along the x-axis in
Various criteria may be used to determine the value of information associated with numerical measures. For example, alternative criteria may be the spread of values in a measure characterized by the skewness and kurtosis of the set of values in a numerical measure. Skewness is a measure of the asymmetry of the values in a distribution and could therefore be used to analyze a numerical measure. A positive skew indicates that the bulk of the distribution is concentrated on the left, with a longer tail extending to the right. A negative skew indicates that the bulk of the distribution is concentrated on the right, with a longer tail extending to the left. Kurtosis is a measure of the peakedness of a distribution and is expressed here as excess kurtosis, i.e., relative to the normal distribution. A distribution with zero excess kurtosis is called mesokurtic. The most prominent example of a mesokurtic distribution is the normal distribution. A distribution with positive excess kurtosis is called leptokurtic. A leptokurtic distribution has a more acute peak around the mean than the normal distribution. A distribution with negative excess kurtosis is called platykurtic. A platykurtic distribution has a lower, broader peak around the mean. The criteria for ordering measures along the x-axis could therefore be the degree of peakedness or, conversely, the degree of flatness.
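As a non-limiting sketch, skewness and excess kurtosis may be computed from the central moments of a measure's values as follows. The function name and the sample vectors are hypothetical, and the population (biased) moment estimators are assumed; the description does not state which estimator would be used.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a numeric vector."""
    n = len(values)
    mean = sum(values) / n
    # Second, third, and fourth central moments (population form)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    # Subtracting 3 makes the normal distribution score zero (mesokurtic)
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical measures: one symmetric, one with a long right tail
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_tailed))
```

The symmetric vector has zero skewness and negative excess kurtosis (platykurtic), while the right-tailed vector has positive skewness. Measures could then be ordered along the x-axis by either value.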
The user may now navigate through the box plots by selecting: a specific outlier, the specific set of values in the box (i.e., between the LQ and UQ) of the plot, or the specific values between the upper and lower fences of the plot. Additionally, if more percentiles were plotted in the box plot, the user may select a specific percentile range to navigate into. For example,
Returning to
As shown in
An embodiment of the present invention relates to a computer storage product with a computer-readable medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention as defined by the appended claims. In addition, many modifications may be made to adapt to a particular situation, material, composition of matter, method, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto. In particular, while the methods disclosed herein have been described with reference to particular steps performed in a particular order, it will be understood that these steps may be combined, sub-divided, or re-ordered to form an equivalent method without departing from the teachings of the present invention. Accordingly, unless specifically indicated herein, the order and grouping of the steps is not a limitation of the present invention.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jun 29 2007 | BUSINESS OBJECTS SOFTWARE LIMITED | (assignment on the face of the patent) | / | |||
Oct 12 2007 | MACGREGOR, JOHN MALCOLM | BUSINESS OBJECTS, S A | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 019996 | /0597 | |
Oct 31 2007 | BUSINESS OBJECTS, S A | Business Objects Software Ltd | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 020156 | /0411 |