Provided are an aggregation index structure and an aggregation index method for improving aggregation query efficiency. The aggregation index partitions streaming data along two dimensions, grouping and slicing, and then aggregates the partitioned data. The structure includes index metadata, a sliced data list and a detailed data store; the method includes three parts: aggregation index definition, aggregation index creation and aggregation index query. The aggregation index structure and method provided by the present disclosure can greatly improve the efficiency of aggregation queries, support ad hoc aggregation queries over PB-level data, complete aggregation queries over large amounts of data in seconds, support the insertion of new data at any time, and return query results with a latency of minutes or even seconds when the query range conditions change.

Patent: 11928113
Priority: May 21 2020
Filed: Jul 17 2022
Issued: Mar 12 2024
Expiry: Sep 22 2040
Assg.orig Entity: Small
currently ok
1. A data index apparatus for improving aggregation query efficiency, the apparatus comprising a memory and at least one processor, data to be analyzed being arranged in a structure of an aggregation index, wherein the aggregation index partitions streaming data by two dimensions of grouping and slicing, and then aggregates partitioned data, of which the structure comprises index metadata, a sliced data list and a detailed data store, wherein data arrangement, partition and aggregation are performed by the at least one processor according to instructions stored in the memory;
the index metadata contain definition information of the aggregation index, comprising a grouping field groupby, a slice field sliceby, a slice starting point Start, a slice ending point Stop, a slice step length step, an aggregation field and an aggregation function aggregation;
the sliced data list consists of intermediate-state data of all slices belonging to a same group; the intermediate-state data of each slice contains a current slice range and an aggregation result; in addition, the intermediate-state data of each slice further contains a storage location of the detailed data corresponding to the slice, so as to implement more accurate query and operations of adding new data later;
the detailed data store stores the streaming detailed data in units of slices; the memory, a local file system or a distributed file system including Hadoop Distributed file system HDFS is selected as a storage medium of streaming detailed data according to different data volumes; the streaming detailed data store stores a value of the aggregation field or all fields of streaming details.
4. A method of an aggregation index for improving aggregation query efficiency, comprising the following steps:
(1) definition of the aggregation index; defining the aggregation index to declare establishment rules of the index, comprising a grouping field, a slice field and method, an aggregation field and an aggregation function;
(1.1) grouping streaming data by the grouping field, and further slicing and segmenting the streaming data with a same corresponding value in the grouping field and then aggregating;
(1.2) partitioning, by the slice field and method, the grouped streaming data with finer granularity, wherein the grouped streaming data will be further segmented according to a value of the slice field and the slice method defined by a user; the slice method has three parameters: a slice starting point, a slice ending point and a slice step length; by setting the above three parameters, streaming data of a same group can be partitioned into a limited number of segments; if a range of slice field is discrete distribution, there is no need to set the slice starting point, the slice ending point or the slice step length;
(1.3) performing aggregation calculation on streaming data belonging to a same slice through the aggregation field and the aggregation function; specifically, aggregating the data in the aggregation field according to the specified aggregation function, and recording an aggregation result in a form of intermediate-state data;
(2) creation of the aggregation index: after the definition of the aggregation index is completed, establishing corresponding index metadata according to the definition wherein the index metadata comprise a grouping field, a slice field, a slice starting point, a slice ending point, a slice step length, an aggregation field and an aggregation function; constructing the aggregation index by using original streaming data, wherein the original streaming data is added into the aggregation index in sequence to the following steps:
(2.1) determining the corresponding sliced data list according to a value of the grouping field of the original streaming data;
(2.2) determining slice intermediate-state data in the sliced data list according to a value of the slice field of the original streaming data;
(2.3) updating the slice intermediate-state data according to a value of the aggregation field of the streaming data;
(2.4) positioning a storage location of the corresponding detailed data according to the sliced data and storing the streaming data;
(3) aggregation index query: using the intermediate-state data aggregation of the slice in the aggregation index, and quickly returning an aggregation query result; wherein the specific steps are as follows:
(3.1) confirming that the current query conforms to the established index data, and the query fails if the following conditions exist:
a. the index data is not grouped according to the grouping field of a query statement;
b. the index data is not sliced according to the slice field of the query statement;
(3.2) determining whether the current query hits the index data; wherein if the aggregation field and the aggregation function of the current query are consistent with the aggregation field and the aggregation function in the index data, then step (3.3) is executed; if there is any inconsistency between the aggregation function and the aggregation method, it is necessary to find all the detailed data of the slice involved in the current query, and calculate the required aggregation result according to the aggregation field and aggregation function in the query;
(3.3) determining whether the slice range of the current query is consistent with the slice intermediate-state data in the index data; wherein if the slice range of the current query can be directly combined with the slice range of one or more slice intermediate-state data in the index data, the query result can be aggregated by the aggregation result of the slice intermediate-state data; if the slice range of the current query cannot be directly combined with the slice range of the slice intermediate-state data in the index data, it is necessary to traverse and aggregate the detailed data of adjacent slices to form a query result.
2. The data index apparatus for improving aggregation query efficiency according to claim 1, wherein in the partitioning of the streaming data into two dimensions of grouping and slicing, the streaming data is first partitioned into different groups according to the grouping field, then the streaming data of each group is partitioned into limited segments according to the slice field, and finally the value corresponding to the aggregation field in the streaming data is aggregated by the aggregation function.
3. The data index apparatus for improving aggregation query efficiency according to claim 2, wherein the aggregation function comprises summation, maximum, minimum, count and other functions.
5. The method of an aggregation index for improving aggregation query efficiency according to claim 4, wherein a process of inserting new data into the aggregation index is consistent with the process of adding streaming data in sequence in the process of creating the aggregation index.

The present application is a continuation of International Application No. PCT/CN2020/116654, filed on Sep. 22, 2020, which claims priority to Chinese Application No. 202010436039.3, filed on May 21, 2020, the contents of both of which are incorporated herein by reference in their entireties.

The present disclosure relates to the field of big data analysis, and improves the efficiency of exploratory ad hoc queries over big data through an aggregation index structure and an aggregation index method for improving aggregation query efficiency.

Exploratory ad hoc query of big data is an important branch in the field of big data analysis. It helps users to mine data characteristics, understand business conditions and summarize business rules through interactive and flexible query of large amounts of data in seconds and minutes, and thus it has important application value in the fields of finance, e-commerce, logistics, telecommunications, transportation, public security, military industry and so on. Big data exploratory ad hoc query has the following characteristics:

Based on the above characteristics, the current technical scheme cannot fully meet the requirements of an exploratory ad hoc query of big data in terms of query data quantity, data update, query condition change, result return time and so on.

Traditional relational databases (RDBMS) are mostly used in OLTP scenarios, where they are tuned for transaction performance under ACID constraints. Their aggregation query ability is very poor. In big data ad hoc query scenarios with over 100 million records and tens of thousands of rows in the returned result set, results usually take hours to return, and memory may even overflow so that no result can be returned.

Search engines such as ElasticSearch offer high insertion and query performance, and support PB-level original data, inverted indexes and various filtered aggregation queries. However, like traditional DBMSs, they are optimized for indexing and querying detailed data, so they encounter serious performance bottlenecks when running aggregation queries over big data and cannot return results quickly.

Big data processing systems, such as Spark and HBase, can store and query large-scale data through the MapReduce mechanism, can process PB-level data, and can perform aggregation queries almost without limitation. However, because they lack an effective indexing mechanism, their query efficiency is extremely low and cannot meet the latency requirements of ad hoc queries. In addition, big data processing systems such as Spark are mainly designed for offline batch processing, so importing new data is inconvenient and the latest data cannot be processed quickly.

Kylin and Druid are OLAP tools with predefined logic and precomputed results. They support the insertion of new data and can answer queries quickly according to the logic defined in advance, which meets the requirements to a certain extent. However, they do not store detailed data, so they cannot accommodate changing query conditions in exploratory scenarios.

The purpose of the present disclosure is to provide an aggregation index structure and an aggregation index method for improving aggregation query efficiency, so as to solve the problem of exploratory ad hoc aggregation queries over big data. The structure and method are suitable for the technical fields of general databases and big data analysis, and support OLTP and OLAP application scenarios in fields including finance, e-commerce, logistics, telecommunications, transportation, public security and military industry.

The present disclosure is realized by the following technical solution: structure of an aggregation index for improving aggregation query efficiency, wherein the aggregation index partitions streaming data by two dimensions of grouping and slicing, and then aggregates the partitioned data, and its structure includes index metadata, a sliced data list and a detailed data store.

The index metadata records definition information of the aggregation index, including a grouping field GroupBy, a slice field SliceBy, a slice starting point Start, a slice ending point Stop, a slice step length Step, an aggregation field and an aggregation function Aggregation.

The sliced data list consists of intermediate-state data of all slices belonging to a same group; the intermediate-state data of each slice contains a current slice range and an aggregation result; in addition, the intermediate-state data of each slice also contains the storage location of the detailed data corresponding to the slice, so as to implement more accurate query and addition of new data later.

The detailed data store stores the streaming detailed data in units of slices; a memory, a local file system or a distributed file system such as HDFS can be selected as the storage medium of the streaming detailed data according to the data volume. The detailed data store stores either the value of the aggregation field only or all fields of the streaming details. Storing only the value of the aggregation field saves space and improves efficiency, while still supporting other query operations on the aggregation field. Storing all fields of the streaming details allows fields other than the aggregation field to be queried and analyzed, trading storage space for query flexibility.

Furthermore, in the partitioning of the streaming data into two dimensions of grouping and slicing, the streaming data is first partitioned into different groups according to the grouping field, then the streaming data of each group is partitioned into limited segments according to the slice field, and finally the value corresponding to the aggregation field in the streaming data is aggregated by the aggregation function.

Furthermore, the aggregation function includes summation, maximum, minimum, count and other functions.

Furthermore, if the range of slice field is discrete distribution, there is no need to set the slice starting point, the slice ending point or the slice step length.

A method of an aggregation index for improving aggregation query efficiency includes the following steps:

Furthermore, a process of inserting new data into the aggregation index is consistent with the process of adding the streaming data in sequence in the process of creating the aggregation index.

The aggregation index structure and method provided by the present disclosure can greatly improve the efficiency of aggregation queries, support ad hoc aggregation queries over PB-level data, complete aggregation queries over large amounts of data in seconds, support the insertion of new data at any time, return query results with a latency of minutes or even seconds when the query range conditions change, and meet the functional and performance requirements of exploratory ad hoc queries over big data in all respects.

FIG. 1 is a diagram of an aggregation index structure.

FIG. 2 is an exemplary diagram of an aggregation index.

FIG. 3 is a flow chart of aggregation index retrieval.

FIG. 4 is a schematic diagram of an example sentence of an aggregation index query.

Hereinafter, the specific embodiments of the present disclosure will be described in further detail with reference to the accompanying drawings.

FIG. 1 shows the aggregation index structure for improving aggregation query efficiency provided by the present disclosure. In the aggregation index method of the present disclosure, massive streaming data is processed, corresponding aggregation index data is established, and aggregation queries over the streaming data are accelerated. The aggregation index deals with structured streaming data, that is, each piece of streaming data includes multiple fields, and each field has a corresponding name and field value. The aggregation index structure for improving aggregation query efficiency includes index metadata, a sliced data list and a detailed data store.

The aggregation index partitions the streaming data by grouping and slicing. The streaming data is first partitioned into different groups according to a grouping field, and then the streaming data of each group is partitioned into limited segments according to slice field. Finally, the value corresponding to the aggregation field in the streaming data is aggregated by the aggregation function. Its structure includes index metadata, a sliced data list and a detailed data store.
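The two-dimensional partitioning described above can be illustrated with a minimal sketch. The function names `slice_of` and `partition_and_aggregate`, the field names `group`, `key` and `value`, and the choice to skip records outside the slice range are assumptions made for illustration only; summation stands in for the aggregation function.

```python
from collections import defaultdict


def slice_of(value, start, stop, step):
    """Return the half-open slice [lo, hi) containing value, or None if out of range."""
    if not (start <= value < stop):
        return None
    lo = start + ((value - start) // step) * step
    return (lo, min(lo + step, stop))


def partition_and_aggregate(records, group_by, slice_by, agg_field, start, stop, step):
    """Group by one field, slice by another, then sum agg_field per (group, slice)."""
    totals = defaultdict(int)
    for rec in records:
        rng = slice_of(rec[slice_by], start, stop, step)
        if rng is None:
            continue  # assumption: records outside [start, stop) are ignored
        totals[(rec[group_by], rng)] += rec[agg_field]
    return dict(totals)


# Hypothetical records; the field names are illustrative only.
records = [
    {"group": "A", "key": 21, "value": 21},
    {"group": "A", "key": 22, "value": 22},
    {"group": "A", "key": 35, "value": 5},
    {"group": "B", "key": 21, "value": 7},
]
result = partition_and_aggregate(
    records, "group", "key", "value", start=0, stop=100, step=10
)
```

With these inputs, the two records of group A whose key falls in [20, 30) are aggregated into a single partial result, separate from group B and from the other slices of group A.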

The index metadata records definition information of the aggregation index, including a grouping field GroupBy, a slice field SliceBy, a slice starting point Start, a slice ending point Stop, a slice step length Step, an aggregation field and an aggregation function. The value range of the grouping field is a finite set of discrete distribution. The aggregation field represents the field of an actual data store.
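As a rough illustration, the index metadata could be modeled as a small immutable record. The class name `IndexMetadata`, its method names and the validation rule are assumptions, not part of the disclosure; the field names `Origin`, `UserLevel` and `AmtDue` are borrowed from the FIG. 2 example. Leaving `start`, `stop` and `step` unset models a slice field with a discrete value range, for which the disclosure states that these parameters need not be set.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexMetadata:
    group_by: str       # grouping field (GroupBy)
    slice_by: str       # slice field (SliceBy)
    agg_field: str      # aggregation field
    aggregation: str    # aggregation function, e.g. "sum", "max", "min", "count"
    start: Optional[float] = None  # slice starting point (Start)
    stop: Optional[float] = None   # slice ending point (Stop)
    step: Optional[float] = None   # slice step length (Step)

    def __post_init__(self):
        # Assumption: the three slice parameters are either all set or all unset.
        given = [p is not None for p in (self.start, self.stop, self.step)]
        if any(given) and not all(given):
            raise ValueError("start, stop and step must be set together")

    def is_discrete(self) -> bool:
        """True when the slice field takes discrete values and needs no range."""
        return self.start is None

    def slice_count(self) -> Optional[int]:
        """Number of slices per group, or None for a discrete slice field."""
        if self.is_discrete():
            return None
        return math.ceil((self.stop - self.start) / self.step)


orders_meta = IndexMetadata("Origin", "UserLevel", "AmtDue", "sum")
ranged_meta = IndexMetadata("group", "key", "value", "sum", start=0, stop=100, step=10)
```

Because the three slice parameters bound the number of slices, each group is partitioned into a limited number of segments, which is what `slice_count` reports.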

The sliced data list consists of the intermediate-state data of all slices belonging to the same group. The intermediate-state data of each slice contains the range of the current slice and an aggregation result. In addition, the intermediate-state data of each slice also records the storage location of the detailed data contained in the slice, so that more precise queries and later additions of new data can be supported. Taking the aggregation index in FIG. 1 as an example, slice [20, 30) contains two items of data, i.e., 21 and 22. Therefore, in the intermediate-state data, the slice range is [20, 30), and the value of the aggregation result is 21+22=43. At the same time, the slice also records the storage location Slice_Detail_04 of its corresponding detailed data, which is convenient for adding new data and traversing data.
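The slice [20, 30) example can be reproduced with a short sketch of a sliced data list for one group. The class `SliceState`, the helper `find_slice` and the two placeholder storage locations are hypothetical; only `Slice_Detail_04` comes from the example above. The sketch also assumes, for illustration, that each item's value serves as its slice key and that the aggregation function is summation.

```python
import bisect
from dataclasses import dataclass


@dataclass
class SliceState:
    """Intermediate-state data of one slice: range, result and detail location."""
    lo: int
    hi: int
    detail_location: str
    result: int = 0


# Sliced data list of a single group, kept sorted by lower bound.
sliced_list = [
    SliceState(10, 20, "hypothetical_location_a"),
    SliceState(20, 30, "Slice_Detail_04"),
    SliceState(30, 40, "hypothetical_location_b"),
]


def find_slice(slices, key):
    """Binary-search the slice whose half-open range [lo, hi) contains key."""
    lows = [s.lo for s in slices]
    i = bisect.bisect_right(lows, key) - 1
    if i >= 0 and slices[i].lo <= key < slices[i].hi:
        return slices[i]
    return None


# Add the two items of the example to their slice.
for value in (21, 22):
    find_slice(sliced_list, value).result += value

target = find_slice(sliced_list, 25)
```

After the loop, `target` is the [20, 30) slice, holding the aggregation result 43 and the location of its detailed data.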

The detailed data store stores the streaming detailed data in units of slices; a memory, a local file system or a distributed file system such as HDFS can be selected as the storage medium of the streaming detailed data according to the data volume. The detailed data store stores either the value of the aggregation field only or all fields of the streaming details. Storing only the value of the aggregation field saves space and improves efficiency, while still supporting other query operations on the aggregation field. Storing all fields of the streaming details allows fields other than the aggregation field to be queried and analyzed, trading storage space for query flexibility. In actual operation, different storage schemes can be selected according to business requirements. For example, the aggregation index shown in FIG. 2 supports analysis of the total order amount of users from different sources and at different levels, with the source Origin as the grouping field and UserLevel as the slice field. If only the order amount AmtDue is stored in the streaming details, only queries related to the order amount can be answered later. If other fields, such as the order time, are also stored in the details, more flexible queries become possible, for example, the total amount of orders placed in the morning by L4 users whose Origin equals "Online".
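The trade-off between the two storage schemes can be shown with a small sketch. The helper `make_detail`, the field `OrderHour` and all order values are hypothetical; `Origin`, `UserLevel` and `AmtDue` follow the FIG. 2 example.

```python
def make_detail(record, agg_field, store_all_fields):
    """Return what the detail store keeps for one record under each scheme."""
    if store_all_fields:
        return dict(record)
    return {agg_field: record[agg_field]}


# Hypothetical orders that all fall into the slice Origin="Online", UserLevel="L4".
orders = [
    {"Origin": "Online", "UserLevel": "L4", "AmtDue": 120, "OrderHour": 9},
    {"Origin": "Online", "UserLevel": "L4", "AmtDue": 80, "OrderHour": 15},
    {"Origin": "Online", "UserLevel": "L4", "AmtDue": 50, "OrderHour": 10},
]

slim = [make_detail(o, "AmtDue", store_all_fields=False) for o in orders]
full = [make_detail(o, "AmtDue", store_all_fields=True) for o in orders]

# Aggregation-field-only details still support other operations on that field.
max_amt = max(d["AmtDue"] for d in slim)

# Only full details can answer "total amount of orders placed in the morning".
morning_total = sum(d["AmtDue"] for d in full if d["OrderHour"] < 12)

# The slim scheme has no order time to filter on.
slim_has_time = any("OrderHour" in d for d in slim)
```

The slim scheme keeps one number per record, whereas the full scheme keeps every field and therefore costs more storage in exchange for the extra filter.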

The present disclosure provides an aggregation index method for improving aggregation query efficiency based on an aggregation index structure, which comprises the following steps:

The process of inserting new data to update the aggregation index is consistent with the process of adding streaming data in sequence during index creation. Once the index has been searched, the slice located, the intermediate-state data updated and the details stored, the aggregation index data is up to date.
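The insertion path, which follows steps (2.1) to (2.4) recited in claim 4, might look like the following in-memory sketch. The class `AggregationIndex`, its attribute names, the storage-location naming scheme and the rejection of out-of-range values are all assumptions; summation stands in for the aggregation function.

```python
class AggregationIndex:
    """Hypothetical in-memory aggregation index with a sum aggregation."""

    def __init__(self, group_by, slice_by, agg_field, start, stop, step):
        self.group_by, self.slice_by, self.agg_field = group_by, slice_by, agg_field
        self.start, self.stop, self.step = start, stop, step
        self.groups = {}   # group value -> {slice lower bound -> intermediate state}
        self.details = {}  # storage location -> list of detailed records

    def insert(self, record):
        group = record[self.group_by]
        key = record[self.slice_by]
        if not (self.start <= key < self.stop):
            # Assumption: values outside the slice range are rejected.
            raise ValueError("slice field value outside [start, stop)")
        # (2.1) determine the sliced data list from the grouping field value
        sliced_list = self.groups.setdefault(group, {})
        # (2.2) determine the slice intermediate-state data from the slice field value
        lo = self.start + ((key - self.start) // self.step) * self.step
        state = sliced_list.get(lo)
        if state is None:
            state = {
                "range": (lo, min(lo + self.step, self.stop)),
                "result": 0,
                "location": f"{group}/{lo}",  # hypothetical location naming
            }
            sliced_list[lo] = state
        # (2.3) update the intermediate state with the aggregation field value
        state["result"] += record[self.agg_field]
        # (2.4) store the detailed record at the slice's storage location
        self.details.setdefault(state["location"], []).append(record)
        return state


# Index creation: add the original records in sequence.
index = AggregationIndex("group", "key", "value", start=0, stop=100, step=10)
for rec in [{"group": "A", "key": 21, "value": 21}, {"group": "A", "key": 22, "value": 22}]:
    index.insert(rec)
built = index.groups["A"][20]["result"]

# Later insertion of new data goes through exactly the same path.
index.insert({"group": "A", "key": 27, "value": 7})
updated = index.groups["A"][20]["result"]
```

Because creation and later insertion share one code path, only the affected slice's intermediate state and detail list change when a new record arrives.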

Definition of the query conditions includes the following parts:

The specific steps of the aggregation index query are as follows:
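Based on steps (3.1) to (3.3) recited in claim 4, the query flow can be sketched as below. This is a simplified illustration for a single group, not the disclosed implementation: the layout of `index`, the function `query`, the `REDUCE` and `MERGE` tables and the returned path label are assumptions, and the details are stored inline rather than at a separate storage location.

```python
REDUCE = {"sum": sum, "max": max, "min": min, "count": len}  # raw values -> result
MERGE = {"sum": sum, "max": max, "min": min, "count": sum}   # partial results -> result

# Hypothetical prebuilt index: sum of "v", grouped by "g", sliced by "k".
index = {
    "meta": {"group_by": "g", "slice_by": "k", "agg_field": "v", "aggregation": "sum"},
    "groups": {
        "A": [
            {"range": (10, 20), "result": 15,
             "details": [{"k": 12, "v": 5}, {"k": 18, "v": 10}]},
            {"range": (20, 30), "result": 43,
             "details": [{"k": 21, "v": 21}, {"k": 22, "v": 22}]},
            {"range": (30, 40), "result": 5, "details": [{"k": 35, "v": 5}]},
        ]
    },
}


def query(index, group_by, group, slice_by, lo, hi, agg_field, aggregation):
    """Aggregate agg_field over slice keys in [lo, hi); return (result, path used)."""
    meta = index["meta"]
    # (3.1) the query must use the index's grouping field and slice field
    if group_by != meta["group_by"] or slice_by != meta["slice_by"]:
        raise ValueError("query does not conform to the index")
    slices = [s for s in index["groups"].get(group, [])
              if s["range"][0] < hi and s["range"][1] > lo]
    # (3.2) different aggregation field or function: fall back to the details
    if agg_field != meta["agg_field"] or aggregation != meta["aggregation"]:
        values = [d[agg_field] for s in slices for d in s["details"]
                  if lo <= d[slice_by] < hi]
        return (REDUCE[aggregation](values) if values else None), "details"
    # (3.3) combine whole slices; traverse details only for partly covered slices
    partials, path = [], "index"
    for s in slices:
        s_lo, s_hi = s["range"]
        if lo <= s_lo and s_hi <= hi:
            partials.append(s["result"])
        else:
            path = "index+details"
            values = [d[agg_field] for d in s["details"] if lo <= d[slice_by] < hi]
            if values:
                partials.append(REDUCE[aggregation](values))
    return (MERGE[aggregation](partials) if partials else None), path


aligned = query(index, "g", "A", "k", 10, 30, "v", "sum")
unaligned = query(index, "g", "A", "k", 15, 30, "v", "sum")
other_fn = query(index, "g", "A", "k", 10, 40, "v", "max")
```

The aligned query is answered from intermediate results alone, the unaligned one reads the details of the one partly covered slice, and the query with a different aggregation function scans the details of every slice it touches.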

It should be noted that when the data index apparatus provided in the foregoing embodiment performs data indexing, division into the foregoing functional modules is used only as an example for description. In an actual application, the foregoing functions can be allocated to and implemented by different functional modules as required, that is, the inner structure of the apparatus is divided into different functional modules to implement all or some of the functions described above. For details about a specific implementation process, refer to the method embodiment. Details are not described herein again.

All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used for implementation, all or some of the embodiments may be implemented in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or a terminal, all or some of the procedures or functions according to the embodiments of this application are generated. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a web site, computer, server, or data center to another web site, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a server or a terminal, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disk (DVD)), or a semiconductor medium (for example, a solid-state drive).

The above embodiments are used to explain the present disclosure, but not intended to limit the present disclosure. Any modifications and changes made to the present disclosure within the spirit of the present disclosure and the protection scope of the claims shall fall into the protection scope of the present disclosure.

Gao, Yang; Chen, Wei; Wang, Xinyu; Huang, Tao; Lu, Ping; Jin, Lu; Wang, Xingen

Executed on | Assignor | Assignee | Conveyance | Frame/Reel
Jun 24 2022 | WANG, XINGEN | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jun 24 2022 | WANG, XINYU | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jun 24 2022 | CHEN, CHUN | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jun 24 2022 | JIN, LU | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jun 24 2022 | CHEN, WEI | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jun 24 2022 | GAO, YANG | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jun 24 2022 | LU, PING | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jun 24 2022 | HUANG, TAO | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0605860783
Jul 17 2022 | ZHEJIANG BANGSUN TECHNOLOGY CO., LTD. | (assignment on the face of the patent)
Date Maintenance Fee Events
Jul 17 2022 | BIG: Entity status set to Undiscounted (note the period is included in the code).
Jul 27 2022 | SMAL: Entity status set to Small.


Date Maintenance Schedule
Mar 12 2027 | 4 years fee payment window open
Sep 12 2027 | 6 months grace period start (w surcharge)
Mar 12 2028 | patent expiry (for year 4)
Mar 12 2030 | 2 years to revive unintentionally abandoned end. (for year 4)
Mar 12 2031 | 8 years fee payment window open
Sep 12 2031 | 6 months grace period start (w surcharge)
Mar 12 2032 | patent expiry (for year 8)
Mar 12 2034 | 2 years to revive unintentionally abandoned end. (for year 8)
Mar 12 2035 | 12 years fee payment window open
Sep 12 2035 | 6 months grace period start (w surcharge)
Mar 12 2036 | patent expiry (for year 12)
Mar 12 2038 | 2 years to revive unintentionally abandoned end. (for year 12)