A data integration system is disclosed that incorporates one or more techniques for eases the design and maintenance of a mapping. As components are added to an existing design, the data integration system removes the need to specify all input and output attributes. In one aspect, components types are implement that allow entity relationships to be added and/or edited in a flow view of a logical design. Therefore, attributes of components representing datasets can be added and propagated to downstream components with minimal effort on the part of a map designer.
|
19. A system for generating a data integration scenario between a data source and a data target using a combination of a data flow extract-transform-load (ETL) and an entity relationship extract-transform-load (ETL), the system comprising:
a processor; and
a memory storing instructions which when executed by the processor configure the processor to:
receive a set of extract-transform-load (ETL) entity relationships defined via a component of a logical design of the data integration scenario, the set of ETL entity relationships having a relationship defined between one or more attributes of a first entity in a dataset and one or more attributes of a second entity in the dataset;
automatically determine an extract-transform-load (ETL) data flow model using the set of ETL entity relationships, wherein the ETL data flow model is configured to provide semantic definitions for the logical design, wherein the semantic definitions define entity relationships in the dataset that identify how columns of the data target are populated from attributes of the data source and from one of the one or more attributes of the first entity in the dataset and the one or more attributes of the second entity in dataset; and
generate information indicative of the ETL data flow model in a view of the logical design, wherein a set of attributes are exposed to downstream components of the logical design that includes the one or more attributes of the first entity and the one or more attributes of the second entity.
12. A non-transitory computer-readable medium storing computer-executable code for generating a data integration scenario between a data source and a data target using a combination of a data flow extract-transform-load (ETL) and an entity relationship extract-transform-load (ETL), the non-transitory computer-readable medium comprising:
code for receiving a set of extract-transform-load (ETL) entity relationships defined via a component of a logical design of the data integration scenario, the set of ETL entity relationships having a relationship defined between one or more attributes of a first entity in a dataset and one or more attributes of a second entity in the dataset;
code for automatically determining an extract-transform-load (ETL) data flow model using the set of ETL entity relationships, wherein the ETL data flow model is configured to provide semantic definitions for the logical design, wherein the semantic definitions define entity relationships in the dataset that identify how columns of the data target are populated from attributes of the data source and from one of the one or more attributes of the first entity in the dataset and the one or more attributes of the second entity in the dataset; and
code for generating information indicative of the ETL data flow model in a view of the logical design, wherein a set of attributes are exposed to downstream components of the logical design that includes the one or more attributes of the first entity and the one or more attributes of the second entity.
1. A method for generating a data integration scenario between a data source and a data target using a combination of a data flow extract-transform-load (ETL) and an entity relationship extract-transform-load (ETL), the method comprising:
receiving, at one or more computer systems, a set of extract-transform-load (ETL) entity relationships defined via a component of a logical design of the data integration scenario, the set of ETL entity relationships having a relationship defined between one or more attributes of a first entity in a dataset and one or more attributes of a second entity in the dataset;
automatically determining, with one or more processors associated with the one or more computer systems, an extract-transform-load (ETL) data flow model using the set of ETL entity relationships, wherein the ETL data flow model is configured to provide semantic definitions for the logical design, wherein the semantic definitions define entity relationships in the dataset that identify how columns of the data target are populated from attributes of the data source and from one of the one or more attributes of the first entity in the dataset and the one or more attributes of the second entity in the dataset; and
generating, with the one or more processors associated with the one or more computer systems, information indicative of the ETL data flow model in a view of the logical design, wherein a set of attributes are exposed to downstream components of the logical design that includes the one or more attributes of the first entity and the one or more attributes of the second entity.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
7. The method of
receiving, at the one or more computer systems, a change in the logical design by introducing the set of ETL entity relationships; and
determining, with the one or more processors associated with the one or more computer systems, an updated data flow using the introduced set of ETL entity relationships.
8. The method according to
9. The method according to
10. The method according to
11. The method according to
13. The non-transitory computer-readable medium of
14. The non-transitory computer-readable medium of
15. The non-transitory computer-readable medium of
16. The non-transitory computer-readable medium of
17. The non-transitory computer-readable medium of
18. The non-transitory computer-readable medium of
code for receiving a change in the logical design by introducing the set of ETL entity relationships; and
code for determining an updated data flow using the introduced set of ETL entity relationships.
20. The system of
21. The system of
22. The system of
23. The system of
24. The system of
25. The system of
receive a change in the logical design by introducing the set of ETL entity relationships; and
determine an updated data flow using the introduced set of ETL entity relationships.
|
This application claims the benefit of and priority to:
In today's increasingly fast-paced business environment, organizations need to use more specialized software applications. Additionally, organizations need to ensure the coexistence of these applications on heterogeneous hardware platforms and systems and guarantee the ability to share data between applications and systems.
Accordingly, what is desired is to solve problems relating to developing data integration scenarios, some of which may be discussed herein. Additionally, what is desired is to reduce drawbacks relating to developing data integration scenarios, some of which may be discussed herein.
The following portion of this disclosure presents a simplified summary of one or more innovations, embodiments, and/or examples found within this disclosure for at least the purpose of providing a basic understanding of the subject matter. This summary does not attempt to provide an extensive overview of any particular embodiment or example. Additionally, this summary is not intended to identify key/critical elements of an embodiment or example or to delineate the scope of the subject matter of this disclosure. Accordingly, one purpose of this summary may be to present some innovations, embodiments, and/or examples found within this disclosure in a simplified form as a prelude to a more detailed description presented later.
In various embodiments, a data integration system enables users to create a logical design which is platform and technology independent. The user can create a logical design that defines, at a high level, how a user wants data to flow between sources and targets. The tool can analyze the logical design, in view of the user's infrastructure, and create a physical design. The logical design can include a plurality of components corresponding to each source and target in the design, as well as operations such as joins or filters, and access points. Each component when transferred to the physical design generates code to perform operations on the data. Depending on the underlying technology (e.g., SQL Server, Oracle, Hadoop, etc.) and the language used (SQL, pig, etc.) the code generated by each component may be different.
In one aspect, a user of data integration system is not required to specify all data attributes at each component in the logical design, from start to end. The data integration system provides a plurality of component types, such as projector and selector types, that avoid the need to fully declare the information that flows through the logical design. The data integration system is able to decide what attributes are needed at operations represented by predetermined component types. This simplifies both the design and maintenance.
In one embodiment, a method facilitating generation of a data mapping includes receiving information specifying a set of entity relationships as a component of the logical design. An equivalent data flow model is determined based on the set of entity relationships. Information is then generated indicative of the equivalent data flow model in the logical flow design. One or more attributes of a dataset representing the set of entity relationships may be derived based on information declaring relationships between attributes of data sources.
In further embodiments, information may be received specifying one or more components of the logical design that includes information indicative of an operation that changes shape of the information flowing through the logical design. Information may be received specifying one or more components of the logical design that includes information indicative of an operation that controls the flow of information flowing through the logical design but does not change shape of the information flowing through the logical design. Information may be received specifying one or more components of the logical design that includes information indicative of a target component having one or more attributes of data to be stored in a target datastore.
In one aspect, generating the information indicative of the equivalent data flow model in the logical flow design may include exporting a list of attributes to a downstream component. In another, a change in the logical design may be received through the introduction of one or more relationships. An updated equivalent data flow model may then be determined.
In one embodiment, a non-transitory computer-readable medium storing computer-executable code for facilitating generation of a data mapping includes code for receiving information specifying a set of entity relationships as a component of the logical design, code for determining an equivalent data flow model based on the set of entity relationships, and code for generating information indicative of the equivalent data flow model in the logical flow design.
In a further embodiment, a system facilitating generation of a data mapping includes a processor and a memory storing instructions which when executed by the processor configure the processor to receive information specifying a set of entity relationships as a component of the logical design, determine an equivalent data flow model based on the set of entity relationships, and generate information indicative of the equivalent data flow model in the logical flow design.
A further understanding of the nature of and equivalents to the subject matter of this disclosure (as well as any inherent or express advantages and improvements provided) should be realized in addition to the above section by reference to the remaining portions of this disclosure, any accompanying drawings, and the claims.
In order to reasonably describe and illustrate those innovations, embodiments, and/or examples found within this disclosure, reference may be made to one or more accompanying drawings. The additional details or examples used to describe the one or more accompanying drawings should not be considered as limitations to the scope of any of the claimed inventions, any of the presently described embodiments and/or examples, or the presently understood best mode of any innovations presented within this disclosure.
Introduction
In various embodiments, a data integration system enables users to create a logical design which is platform and technology independent. The user can create a logical design that defines, at a high level, how a user wants data to flow between sources and targets. The tool can analyze the logical design, in view of the user's infrastructure, and create a physical design. The logical design can include a plurality of components corresponding to each source and target in the design, as well as operations such as joins or filters, and access points. Each component when transferred to the physical design generates code to perform operations on the data. Depending on the underlying technology (e.g., SQL Server, Oracle, Hadoop, etc.) and the language used (SQL, pig, etc.) the code generated by each component may be different.
In one aspect, a user of data integration system is not required to specify all data attributes at each component in the logical design, from start to end. The data integration system provides a plurality of component types, such as projector and selector types, that avoid the need to fully declare the information that flows through the logical design. The data integration system is able to decide what attributes are needed at operations represented by predetermined component types. This simplifies both the design and maintenance.
In one embodiment, system 100 includes one or more user computers 110 (e.g., computers 110A, 110B, and 110C). User computers 110 can be general purpose personal computers (including, merely by way of example, personal computers and/or laptop computers running any appropriate flavor of Microsoft Corp.'s Windows™ and/or Apple Corp.'s Macintosh™ operating systems) and/or workstation computers running any of a variety of commercially-available UNIX™ or UNIX-like operating systems. These user computers 110 can also have any of a variety of applications, including one or more applications configured to perform methods of the invention, as well as one or more office applications, database client and/or server applications, and web browser applications.
Alternatively, user computers 110 can be any other electronic device, such as a thin-client computer, Internet-enabled mobile telephone, and/or personal digital assistant, capable of communicating via a network (e.g., communications network 120 described below) and/or displaying and navigating web pages or other types of electronic documents. Although the exemplary system 100 is shown with three user computers, any number of user computers or devices can be supported.
Certain embodiments of the invention operate in a networked environment, which can include communications network 120. Communications network 120 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of commercially-available protocols, including without limitation TCP/IP, SNA, IPX, AppleTalk, and the like. Merely by way of example, communications network 120 can be a local area network (“LAN”), including without limitation an Ethernet network, a Token-Ring network and/or the like; a wide-area network; a virtual network, including without limitation a virtual private network (“VPN”); the Internet; an intranet; an extranet; a public switched telephone network (“PSTN”); an infra-red network; a wireless network, including without limitation a network operating under any of the IEEE 802.11 suite of protocols, the Bluetooth™ protocol known in the art, and/or any other wireless protocol; and/or any combination of these and/or other networks.
Embodiments of the invention can include one or more server computers 130 (e.g., computers 130A and 130B). Each of server computers 130 may be configured with an operating system including without limitation any of those discussed above, as well as any commercially-available server operating systems. Each of server computers 130 may also be running one or more applications, which can be configured to provide services to one or more clients (e.g., user computers 110) and/or other servers (e.g., server computers 130).
Merely by way of example, one of server computers 130 may be a web server, which can be used, merely by way of example, to process requests for web pages or other electronic documents from user computers 110. The web server can also run a variety of server applications, including HTTP servers, FTP servers, CGI servers, database servers, Java servers, and the like. In some embodiments of the invention, the web server may be configured to serve web pages that can be operated within a web browser on one or more of the user computers 110 to perform methods of the invention.
Server computers 130, in some embodiments, might include one or more file and or/application servers, which can include one or more applications accessible by a client running on one or more of user computers 110 and/or other server computers 130. Merely by way of example, one or more of server computers 130 can be one or more general purpose computers capable of executing programs or scripts in response to user computers 110 and/or other server computers 130, including without limitation web applications (which might, in some cases, be configured to perform methods of the invention).
Merely by way of example, a web application can be implemented as one or more scripts or programs written in any programming language, such as Java, C, or C++, and/or any scripting language, such as Perl, Python, or TCL, as well as combinations of any programming/scripting languages. The application server(s) can also include database servers, including without limitation those commercially available from Oracle, Microsoft, IBM and the like, which can process requests from database clients running on one of user computers 110 and/or another of server computers 130.
In some embodiments, an application server can create web pages dynamically for displaying the information in accordance with embodiments of the invention. Data provided by an application server may be formatted as web pages (comprising HTML, XML, Javascript, AJAX, etc., for example) and/or may be forwarded to one of user computers 110 via a web server (as described above, for example). Similarly, a web server might receive web page requests and/or input data from one of user computers 110 and/or forward the web page requests and/or input data to an application server.
In accordance with further embodiments, one or more of server computers 130 can function as a file server and/or can include one or more of the files necessary to implement methods of the invention incorporated by an application running on one of user computers 110 and/or another of server computers 130. Alternatively, as those skilled in the art will appreciate, a file server can include all necessary files, allowing such an application to be invoked remotely by one or more of user computers 110 and/or server computers 130. It should be noted that the functions described with respect to various servers herein (e.g., application server, database server, web server, file server, etc.) can be performed by a single server and/or a plurality of specialized servers, depending on implementation-specific needs and parameters.
In certain embodiments, system 100 can include one or more databases 140 (e.g., databases 140A and 140B). The location of the database(s) 140 is discretionary: merely by way of example, database 140A might reside on a storage medium local to (and/or resident in) server computer 130A (and/or one or more of user computers 110). Alternatively, database 140B can be remote from any or all of user computers 110 and server computers 130, so long as it can be in communication (e.g., via communications network 120) with one or more of these. In a particular set of embodiments, databases 140 can reside in a storage-area network (“SAN”) familiar to those skilled in the art. (Likewise, any necessary files for performing the functions attributed to user computers 110 and server computers 130 can be stored locally on the respective computer and/or remotely, as appropriate). In one set of embodiments, one or more of databases 140 can be a relational database that is adapted to store, update, and retrieve data in response to SQL-formatted commands. Databases 140 might be controlled and/or maintained by a database server, as described above, for example.
Data Integration Overview
In this embodiment, data integration system 200 includes information sources 202, information integration 204, and information destinations 206. In general, information flows from information sources 202 to information integration 204 whereby the information may be consumed, made available, or otherwise used by information destinations 206. Data flows may be unidirectional or bidirectional. In some embodiments, one or more data flows may be present in data integration system 200.
Information sources 202 are representative of one or more hardware and/or software elements configured to source data. Information sources 202 may provide direct or indirect access to the data. In this embodiment, information sources 202 include one or more applications 208 and one or more repositories 210.
Applications 208 are representative of traditional applications, such as desktop, hosted, web-based, or cloud-based applications. Applications 208 may be configured to receive, process, and maintain data for one or more predetermined purposes. Some examples of applications 208 include customer relationship management (CRM) applications, financial services applications, government and risk compliance applications, human capital management (HCM), procurement applications, supply chain management applications, project or portfolio management applications, or the like. Applications 208 may include functionality configured for manipulating and exporting application data in a variety of human-readable and machine-readable formats, as is known in the art. Applications 208 may further access and store data in repositories 210.
Repositories 210 are representative of hardware and/or software elements configured to provide access to data. Repositories 210 may provide logical and/or physical partitioning of data. Repositories 210 may further provide for reporting and data analysis. Some examples of repositories 210 include databases, data warehouses, cloud storage, or the like. A repository may include a central repository created by integrating data from one or more applications 208. Data stored in repositories 210 may be uploaded from an operational system. The data may pass through additional operations before being made available in a source.
Information integration 204 is representative of one or more hardware and/or software elements configured to provide data integration services. Direct or indirect data integration services can be provided in information integration 204. In this embodiment, information integration 204 includes data migration 212, data warehousing 214, master data management 216, data synchronization 218, federation 220, and real-time messaging 222. It will be understood that information integration 204 can include one or more modules, services, or other additional elements than those shown in here that provide data integration functionality.
Data migration 212 is representative of one or more hardware and/or software elements configured to provide data migration. In general, data migration 212 provides one or more processes for transferring data between storage types, formats, or systems. Data migration 212 usually provides for manual or programmatic options to achieve a migration. In a data migration procedure, data on or provided by one system is mapped to another system providing a design for data extraction and data loading. A data migration may involve one or more phases, such a design phase where one or more designs are created that relate data formats of a first system to formats and requirements of a second system, a data extraction phase where data is read from the first system, a data cleansing phase, and a data loading phase where data is written to the second system. In some embodiments, a data migration may include a data verification phases to determine whether data is accurately processed in any of the above phases.
Data warehousing 214 is representative of one or more hardware and/or software elements configured to provide databases used for reporting and data analysis. A data warehouse is typically viewed as a central repository of data which is created by integrating data from one or more disparate sources. Data warehousing 214 may include the current storage of data as well as storage of historical data. Data warehousing 214 may include typical extract, transform, load (ETL)-based data warehouse whereby staging, data integration, and access layers house key functions. In one example, a staging layer or staging database stores raw data extracted from each of one or more disparate source data systems. An integration layer integrates disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store (ODS) database. The integrated data is then moved to yet another database, often called the data warehouse database. The data can be arranged into hierarchical groups (often called dimensions) and into facts and aggregate facts. An access layer may be provided to help users or other systems retrieve data. Data warehouses can be subdivided into data marts whereby each data mart stores subsets of data from a warehouse. In some embodiments, data warehousing 214 may include business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
Master data management 216 is representative of one or more hardware and/or software elements configured to manage a master copy of data. Master data management 216 may include a set of processes, governance, policies, standards and tools that consistently define and manage master data. Master data management 216 may include functionality for removing duplicates, standardizing data, and incorporating rules to eliminate incorrect data from entering a system in order to create an authoritative source of master data. Master data management 216 may provide processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of information.
Data synchronization 218 is representative of one or more hardware and/or software elements configured to synchronize data. Data synchronization 218 may provide for establishing consistency among data from a source to a target and vice versa. Data synchronization 218 may further provide for the continuous harmonization of the data over time.
Federation 220 is representative of one or more hardware and/or software elements configured to consolidate a view of data from constituent sources. Federation 220 may transparently map multiple autonomous database systems into a single federated database. The constituent databases maybe interconnected via a computer network and may be geographically decentralized. Federation 220 provides an alternative to merging several disparate databases. A federated database, or virtual database, for example, may provide a composite of all constituent databases. Federation 220 may not provide actual data integration in the constituent disparate databases but only in the view.
Federation 220 may include functionality that provides a uniform user interface, enabling users and clients to store and retrieve data in multiple noncontiguous databases with a single query—even if the constituent databases are heterogeneous. Federation 220 may include functionality to decompose a query into subqueries for submission to relevant constituent data sources and composite the result sets of the subqueries. Federation 220 can include one or more wrappers to the subqueries to translate them into appropriate query languages. In some embodiments, federation 220 is a collection of autonomous components that make their data available to other members of the federation through the publication of an export schema and access operations.
Real-time messaging 222 is representative of one or more hardware and/or software elements configured to provide messaging services subject to a real-time constraint (e.g., operational deadlines from event to system response). Real-time messaging 222 may include functionality that guarantees an action or response within strict time constraints. In one example, real-time messaging 222 may be tasked with taking some orders and customer data from one database, combining it with some employee data held in a file, and then loading the integrated data into a Microsoft SQL Server 2000 database. Because orders need to be analyzed as they arrive, real-time messaging 222 may pass the orders through to a target database in as close to real time as possible and extract only the new and changed data to keep the workload as small as possible.
Information destinations 206 are representative of one or more hardware and/or software elements configured to store or consume data. In this embodiment, information destinations 206 may provide direct or indirect access to the data. In this embodiment, information destinations 206 include one or more applications 224 and one or more repositories 226.
Applications 224 are representative of traditional applications, such as desktop, hosted, web-based, or cloud-based applications. Applications 224 may be configured to receive, process, and maintain data for one or more predetermined purposes. Some examples of applications 224 include customer relationship management (CRM) applications, financial services applications, government and risk compliance applications, human capital management (HCM), procurement applications, supply chain management applications, project or portfolio management applications, or the like. Applications 224 may include functionality configured for manipulating and importing application data in a variety of human-readable and machine-readable formats, as is known in the art. Applications 224 may further access and store data in repositories 226.
Repositories 226 are representative of hardware and/or software elements configured to provide access to data. Repositories 226 may provide logical and/or physical partitioning of data. Repositories 226 may further provide for reporting and data analysis. Some examples of repositories 226 include databases, data warehouses, cloud storage, or the like. A repository may include a central repository created by integrating data from one or more applications 226. Data stored in repositories 226 may be uploaded or imported through information integration 204. The data may pass through additional operations before being made available at a destination.
Data Integration System
In various embodiments, data integration system 200 provides a new declarative design approach to defining data transformation and integration processes, resulting in faster and simpler development and maintenance. Data integration system 200 thus separates declarative rules from the implementation details. Data integration system 200 further provides a unique E-LT architecture (Extract-Load Transform) for the execution of data transformation and validation processes. This architecture in embodiments eliminates the need for a standalone ETL server and proprietary engine. In some embodiments, data integration system 200 instead leverages the inherent power of RDBMS engines.
In some embodiments, data integration system 200 integrates in one or more middleware software packages, such as the ORACLE FUSION MIDDLEWARE platform and becomes a component of the middleware stack. As depicted in
In this example, one component of data integration system 200 is repositories 302. Repositories 302 are representative of hardware and/or software elements configured to store configuration information about an IT infrastructure, metadata of all applications, projects, scenarios, and execution logs. In some aspects, multiple instances of repositories 302 can coexist in an IT infrastructure, for example Development, QA, User, Acceptance, and Production. Repositories 302 are configured to allow several separated environments that exchange metadata and scenarios (for example: Development, Test, Maintenance and Production environments). Repositories 302 further are configured to act as a version control system where objects are archived and assigned a version number.
In this example, repositories 302 is composed of at least one master repository 304 and one or more work repositories 306. Objects developed or configured for use within data integration system 200 may be stored in one of these repository types. In general, master repository 304 stores the following information: security information including users, profiles and rights, topology information including technologies, server definitions, schemas, contexts, languages and so forth, and versioned and archived objects. The one or more work repositories 306 may contain actual developed objects.
Several work repositories may coexist in data integration system 200 (for example, to have separate environments or to match a particular versioning life cycle). The one or more work repositories 306 store information for models, including schema definition, data stores structures and metadata, fields and columns definitions, data quality constraints, cross references, data lineage, and so forth. The one or more work repositories 306 may further store projects, including business rules, packages, procedures, folders, knowledge modules, variables and so forth, and scenario execution, including scenarios, scheduling information and logs. In some aspects, the one or more work repositories 306 may contain only execution information (typically for production purposes), and be designated as an execution repository.
In various embodiments, repositories 302 store one or more ETL projects. An ETL project defines or otherwise specifies one or more data models that model data attributes of data in a source or target. An ETL project further provides for data quality control as well as defining mappings to move and transform data. Data integrity control ensures the overall consistency of the data. Application data is not always valid for the constraints and declarative rules imposed by a particular source or target. For example, orders may be found with no customer, or order lines with no product, and so forth. Data integration system 200 provides a working environment to detect these constraint violations and to store them for recycling or reporting purposes.
In some embodiments of data integration system 200, there are two different types of controls: Static Control and Flow Control. Static Control implies the existence of rules that are used to verify the integrity of application data. Some of these rules (referred to as constraints) may already be implemented in data servers (using primary keys, reference constraints, etc.) Data integration system 200 allows for the definition and checking of additional constraints, without declaring them directly in a source. Flow Control relates to targets of transformation and integration processes that implement their own declarative rules. Flow Control verifies an application's incoming data according to these constraints before loading the data into a target. Flow control procedures are general referred to as mappings.
An ETL project can be automated into a package that can be deployed for execution in a runtime environment. Accordingly, the automation of data integration flows is achieved by sequencing the execution of the different steps (mappings, procedures, and so forth) in a package and by producing a production scenario containing ready-to-use code for each of these steps. A package is typically made up of a sequence of steps organized into an execution diagram. Packages are the main objects used to generate scenarios for production. They represent the data integration workflow and can perform jobs, such as for example: start a reverse-engineering process on a datastore or a model, send an email to an administrator, download a file and unzip it, define the order in which mappings must be executed, and define loops to iterate over execution commands with changing parameters.
A scenario is designed to put a source component (mapping, package, procedure, variable) into production. A scenario results from the generation of code (SQL, shell, and so forth) for this component. Once generated, the code of the source component is frozen and the scenario is stored inside repositories 302, such as one or more of work repositories 306. A scenario can be exported and then imported into different production environments.
In various embodiments, data integration system 200 is organized around repositories 302 in a modular fashion accessed by Java graphical modules and scheduling agents. Graphical modules can be used to design and build one or more integration processes stored in repositories 302. Administrators, Developers and Operators may use a development studio to access repositories 302. Agents can be used to schedule and coordinate a set of integration tasks associated with an integration process stored in repositories 302. For example, at runtime, an agent deployed on a desktop, web services, or otherwise in communication with a source coordinates the execution of one or more integration processes. The agent may retrieve code stored in master repository 304, connect to various source and target systems, and orchestrate an overall data integration process or scenario.
In this embodiment, data integration system 200 includes desktop 308 that may include one or more of the above discussed graphical modules and/or agents. Desktop 308 is representative of one or more desktop or workstation computing devices, such as personal computers, laptops, netbooks, tablets, and the like. Desktop 308 includes a Java virtual machine (JVM) 310 and Oracle Data Integrator (ODI) Studio 312. Java virtual machine (JVM) 310 is a virtual machine that can execute Java bytecode. JVM 310 is most often implemented to run on an existing operating system, but can also be implemented to run directly on hardware. JVM 310 provides a run-time environment in which Java bytecode can be executed, enabling features such as runtime web service (WS) 314 and agent 316. JVM 310 may include a Java Class Library, a set of standard class libraries (in Java bytecode) that implement the Java application programming interface (API), and other elements that form a Java Runtime Environment (JRE).
Agent 316 is configured to schedule and coordinate a set of integration tasks associated with one or more integration processes stored in repositories 302. For example, at runtime, an agent coordinates the execution of integration processes. The agent may retrieve code stored in master repository 304, connect to various source and target systems, and orchestrate an overall data integration process or scenario.
Referring again to
Operator module 320 is a module configured to view and manage production integration jobs. Operator module 320, thus, manages and monitors data integration processes in production and may show execution logs with error counts, the number of rows processed, execution statistics, the actual code that is executed, and so on. At design time, developers can also use operator module 320 for debugging purposes in connection with designer module 318.
Topology module 322 is a module configured to create and manage connections to datasources and agents. Topology module 322 defines the physical and logical architecture of the infrastructure. Infrastructure or projects administrators may register servers, database schemas and catalogs, and agents in a master repository through topology module 322. Security module 324 is a module configured to manage users and their repository privileges.
In general, a user or process interacts with designer module 318 to create a data integration project having one or more data integration processes for sources and targets 326. Each data integration process includes at least one data integration task. In some embodiments, a data integration tasks is defined by a set of business rules indicative of what bit of data is to be transformed and combined with other bits as well as technical specifics of how the data is actually extracted, loaded, and so on. In preferred embodiments, a data integration tasks is specified using a declarative approach to build data mappings. A mapping is an object that populates one datastore, called the target, which data coming from one or more other datastores, known as sources. In general, columns in the source datastore are linked to the columns in the target datastore through mapping. A mapping can be added into a package as a package step. As discussed above, a package defines a data integration job. A package is created under a project and is made up of an organized sequence of steps, each of which can be a mapping or a procedure. A package can have one entry point and multiple exit points.
In some embodiments, when creating a new mapping, a developer or technical business user interacts with designer 318 to first define which data is integrated and which business rules should be used. For example, the developer may specify what tables are to be joined, filters to be applied, and SQL expressions to be used to transform data. The particular dialect of SQL that is used is determined by the database platform on which the code is to be executed. Then, in a separate step, technical staff can interact with designer 318 to choose the most efficient way to extract, combine, and then integrate this data. For example, the technical staff may use database-specific tools and design techniques such as incremental loads, bulk-loading utilities, slowly changing dimensions, and changed-data capture.
In this embodiment, mappings can be created for sources and targets 326. Sources and targets 326 may include one or more legacy applications 328, one or more files/XML documents 330, one or more applications 332, one or more data warehouses (DW), business intelligence (BI) tools and applications, and enterprise process management (EPM) tools and applications 334, and one or more JVMs 336 (including runtime web service 340 and agent 342).
Orders application 406 is representative of an application for tracking customer orders. An “Orders Application” data model is created to represent data stored in Orders application 406 as well as any data integrity controls or conditions. For example, the “Orders Application” data model may be based on a Hyper Structured Query Language (HSQL) interface and include five datastores, SRC_CITY, SRC_CUSTOMER, SRC_ORDERS, SRC_ORDER_LINES, SRC_PRODUCT, and SRC_REGION.
Parameter file 408 is representative of a flat file (e.g., ASCII) issued from a production system containing a list of sales representatives and the segmentation of ages into age ranges. In this example, a “Parameter” data model is created to represent the data in the flat file. For example, the “Parameter” data model may be based on a file interface and include two datastores, SRC_SALES_PERSON and SRC_AGE_GROUP.
Sales administration application 410 is representative of an application for tracking sales. The sales administration application 410 may be a data warehouse populated with transformations of data from orders application 406 and parameter file 408. A “Sales Administration” data model is created to represent data stored in sales administration application 410 as well as any data integrity controls or conditions or transformations. For example, the “Sales Administration” data model may be based on a Hyper Structured Query Language (HSQL) interface and include six datastores, TRG_CITY, TRG_COUNTRY, TRG_CUSTOMER, TRG_PRODUCT, TRG_PROD_FAMILY, TRG_REGION, and TRG_SALE.
In accordance with an embodiment, the present invention leverages the user's existing infrastructure by enabling the user to customize a data integration process according to the user's particular needs. For example, when a data integration plan is designed, it can be divided into discrete portions which are executable by a single system, referred to as execution units. Once a data integration plan has been divided into a plurality of execution units, the user can be presented with a physical plan based on the user's infrastructure and system resources. This plan can be further customized by the user to change which user systems execute which execution units. For example, a user may be presented with a plan in which a join operation is executed on a first database, and the user may customize the plan by moving the join operation to a second database.
As shown in
Referring again to
In this embodiment, data integration system 200 includes application server 344 that may include one or more of the above discussed agents. Application server 344 is representative of one or more application servers, web-servers, or hosted applications. In this example, application server 344 includes FMW console 346, servlet container 348, web services container 350, and data sources connection pool 352.
FMW console 346 is representative of one or more hardware and/or software elements configured to manage aspects of application server 344, such as information related to servlet container 348, web services container 350, and data sources connection pool 334. For example, FMW console 346 may be a browser-based, graphical user interface used to manage an Oracle WebLogic Server domain. FMW console 346 may include functionality to configure, start, and stop WebLogic Server instances, configure WebLogic Server clusters, configure WebLogic Server services, such as database connectivity (JDBC) and messaging (JMS), configure security parameters, including creating and managing users, groups, and roles, configure and deploy Java EE applications, monitor server and application performance, view server and domain log files, view application deployment descriptors, and edit selected run-time application deployment descriptor elements. In some embodiments, FMW console 346 includes ODI plug-in 354 providing FMW console 346 with access to data integration processes in production and may show execution logs with error counts, the number of rows processed, execution statistics, the actual code that is executed, and so forth.
Servlet container 348 is representative of one or more hardware and/or software elements configured to extend the capabilities of application server 344. Servlets are most often used to process or store data that was submitted from an HTML form, provide dynamic content such as the results of a database query, and manage state information that does not exist in the stateless HTTP protocol, such as filling the articles into the shopping cart of the appropriate customer. A servlet is typically a Java class in Java EE that conforms to the Java Servlet API, a protocol by which a Java class may respond to requests. To deploy and run a servlet, servlet container 348 is used as a component of a web server that interacts with servlets. Accordingly, servlet container 348 may extend functionality provided by public web service 356 and data services 358 of web services container 350 as well as access to data pools provided by data sources connection pool 352. Servlet container 348 is also responsible for managing the lifecycle of servlets, mapping a URL to a particular servlet and ensuring that the URL requester has the correct access rights.
In this example, servlet container 348 includes Java EE application 360 associated with ODI SDK 362, ODI console 364, and runtime web service 366 associated with Java EE agent 368. ODI SDK 362 provides a software development kit (SDK) for data integration and ETL design. ODI SDK 362 enables automation of work that is common and very repetitive allowing a user to script repetitive tasks.
ODI console 364 is a Java Enterprise Edition (Java EE) application that provides Web access to repositories 302. ODI console 364 is configured to allow users to browse Design-Time objects, including projects, models, and execution logs. ODI console 364 may allow users to view flow maps, trace the source of all data, and even drill down to the field level to understand the transformations used to build the data. In addition, end users can launch and monitor scenario s execution through ODI console 364. In one aspect, ODI console 364 provides administrators with the ability to view and edit Topology objects such as Data Servers, Physical and Logical Schemas as well as to manage repositories 302.
Data Scenario Design and Development
As discussed above, a scenario is designed to put a source component (mapping, package, procedure, variable) into production. A scenario results from the generation of code (SQL, shell, and so forth) for this component. A scenario can be exported and then imported into different production environments.
In various embodiments, a user may initiate a session with designer module 318 of ODI Studio 312 and connect to repositories 302. The user may interact with one or more user interface features to create a new data integration project or select from existing data integration projects stored in, for example, master repository 304. In general, designer module 318 is used to manage metadata, to design data integrity checks, and to build transformations. In various embodiments, the main objects handled through designer module 318 are models and projects. Data models contain all of the metadata in a data source or target (e.g., tables, columns, constraints, descriptions, cross-references, etc.). Projects contain all of the loading and transformation rules for a source or target (e.g., mappings, procedures, variables, etc.)
In step 820, one or more data models are created. In step 830, one or more projects are created.
In step 1020, target datastore information is received. For example, a user may interact with one or more user interface features of designer module 318 to provide target datastore information. In one embodiment, the user may drag and drop target datastore information comprising one or more data models from navigation panel 910 onto a mapping or flow panel that visually represents aspects of a selected data model and any associated transforms or data integrity checks.
In step 1030, source datastore information is received. For example, a user may interact with one or more user interface features of designer module 318 to provide source datastore information. In one embodiment, the user may drag and drop source datastore information comprising one or more data models from navigation panel 910 onto the same mapping or flow panel of the target datastore information that visually represents aspects of a selected data model and any associated transforms or data integrity checks.
In various embodiments, the source datastore information and the target data store information may be composed of one or more data models and optionally operations. Some examples of operations can include one or more data set operations (e.g., unions, joins, intersections, etc.), data transformations, data filter operations, constraints, descriptions, cross-references, integrity checks, or the like. In further embodiments, some of these operations may be preconfigured and visually represented in designer module 318. In other embodiments, custom operations may be provided allowing the user to specify logic, mappings, and the like that implement an operation.
In step 1040, mapping information is received. For example, a user may interact with one or more user interface features of designer module 318 to map the source datastore information to the target datastore information. In one embodiment, the user may visually connect attributes of data elements in the source datastore information with attributes of data elements in the target datastore information. This may be done by matching column names of tables in the source datastore information and the target datastore information. In further embodiments, one or more automatic mapping techniques may be used to provide mapping information.
Referring again to
For example, one or more knowledge modules may be proposed for the flow. A knowledge module (KM) is a component that implements reusable transformation and ELT (extract, load, and transform) strategies across different technologies. In one aspect, knowledge modules (KMs) are code templates. Each KM can be dedicated to an individual task in an overall data integration process. The code in KMs appears in nearly the form that it will be executed with substitution methods enabling it to be used generically by many different integration jobs. The code that is generated and executed is derived from the declarative rules and metadata defined in the designer module 318. One example of this is extracting data through change data capture from Oracle Database 10 g and loading the transformed data into a partitioned fact table in Oracle Database 11 g, or creating timestamp-based extracts from a Microsoft SQL Server database and loading this data into a Teradata enterprise data warehouse.
The power of KMs lies in their reusability and flexibility—for example, a loading strategy can be developed for one fact table and then the loading strategy can be applied to all other fact tables. In one aspect, all mappings that use a given KM inherit an changes made to the KM. In some embodiments, five different types of KMs are provided, each of them covering one phase in a transformation process from source to target, such as an integration knowledge module (IKM), a loading knowledge module (LKM), and a check knowledge module CKM.
Referring to
In step 1060, data integration strategies are received. After defining the loading phase, the user defines a strategy to adopt for the integration of the loaded data into a target. To define the integration strategies, the user may select a target object and select a IKM SQL Incremental Update. An IKM is in charge of writing the final, transformed data to a target. When an IKM is started, it assumes that all loading phases for remote servers have already carried out their tasks, such as having all remote source data sets loaded by LKMs into a staging area, or the source datastores are on the same data server as the staging area.
In step 1070, data control strategies are received. In general, an CKM is in charge of checking that records of a data set are consistent with defined constraints. An CKM may be used to maintain data integrity and participates in overall data quality initiative. A CKM can be used in 2 ways. First, to check the consistency of existing data. This can be done on any datastore or within interfaces. In this case, the data checked is the data currently in the datastore. In a second case, data in the target datastore is checked after it is loaded. In this case, the CKM simulates the constraints of the target datastore on the resulting flow prior to writing to the target.
In step 1080, an interface is generated.
Data Integration Scenario Packages and Deployment
As discussed above, automation of data integration flows can be achieved in data integration system 200 by sequencing the execution of the different steps (mappings, procedures, and so forth) in a package and by producing a production scenario containing the ready-to-use code for each of these steps. A package is made up of a sequence of steps organized into an execution diagram. Packages are the main objects used to generate scenarios for production. A scenario is designed to put a source component (mapping, package, procedure, variable) into production. A scenario results from the generation of code (SQL, shell, and so forth) for this component. A scenario can be exported and then imported into different production environments.
In step 1320, step information is received. Package step information includes information identifying a step, elements, properties, components, and the like. In one example, a user may interact with one or more user interface features of designer module 318 to create, identify, or otherwise specify one or more steps for a package. In one embodiment, one or more components are selected and placed on a diagram. These components appear as steps in the package.
In step 1330, step sequence information is received. Package step sequence information includes information identifying an ordering for a step, dependencies, and the like. Once steps are created, the steps are ordered or reorder into a data processing chain. In one example, a user may interact with one or more user interface features of designer module 318 to provide sequencing or ordering for one or more steps of a package. A data processing chain may include a unique step defined as a first step. Generally, each step has one or more termination states, such as success or failure. A step in some states, such as failure or success, can be followed by another step or by the end of the package. In one aspect, in case of some states, such as failure, sequence information may define a number of retries. In another aspect, a package may have but several possible termination steps.
In step 1340, a package is generated.
As discussed above, the automation of data integration flows can be achieved by sequencing the execution of different steps (mappings, procedures, and so forth) in a package. The package can then be produced for a production scenario containing the ready-to-use code for each of the package's steps. In various embodiments, the package is deployed to run automatically in a production environment.
In step 1520, an integration scenario is retrieved. In one embodiment, a package is retrieved from repositories 302. In step 1530, the integration scenario is deployed to one or more agents. In step 1540, the integration scenario is executed by the one or more agents. In one aspect, the integration scenario can be executed in several ways, such as from ODI Studio 312, from a command line, or from a web service. Scenario execution can be viewed and monitored, for example, via operator module 320 and the like as discussed above.
Combined Flow-Based ETL and Entity Relationship-Based ETL
In most data integration systems, a mapping requires an explicit definition of all input and output attributes that form part of a map. In typical flow based ETL tools, connectors are made at the attribute level. This results in a very concise mapping model. However, this also generates a huge number of objects and makes constructing and maintaining maps cumbersome due to the number of attribute level connectors.
In various embodiments, data integration system 200 incorporates one or more techniques for easing the design and maintenance of a mapping. Components can be added to an existing design simply without the need to specify all input and output attributes and allowing component level connectors to be rerouted. In one aspect, a combination of datasets and flow oriented design is provided to handle complexity along with change. Entity relationships can be specified within a logical view of the design, thus allowing datastores, joins, filters and lookups to be added or removed without requiring change to a map in general.
A dataset as used herein generally represents a data flow coming from a group of datastores. Several datasets can be merged into an interface target datastore using operations, such as set-based operators like Union and Intersect. In various embodiments, datasets can be added, removed, and ordered in the logical view of a design. Accordingly, data integration system 200 enables users to combine flow-based ETL and entity relationship-based ETL in a single view. Therefore, data integration system 200 greatly eases the design and maintenance of a mapping. Data integration system 200 further makes adding in components to an existing design simple, typically just needing component level connectors to be rerouted.
In traditional data integration systems, mapping 1600 requires an explicit definition of all input and output attributes that form part of component 1640 representing the JOIN. In contrast, in various embodiments, a map developer can define entity relationships in dataset 1620 to provide how columns of data target TGT_EMPDEPT are populated directly from attributes of data source SRC_EMP represented by component 1610 and attributes of DATASET represented by dataset 1620 that flow through component 1640 and are thus visible to component 1630.
In step 1720, one or more components are received. As discussed above, some types of components influence the shape of the data that flows through a map while other types of components control the flow of the data but don't fundamentally change the shape of the flow. In step 1730, one or more datasets are received. For example, a map designer may add, edit, or remove datasets from a design. A map designer may interact with a relationship editor to specify entity relationships between various attributes in a dataset. In one aspect, data integration system 200 is configured to extract defined entity relationships to determine the attributes that will be exposed to downstream components of a design. In step 1740, a map is generated based on the components and the datasets. In various embodiments, logical views and physical views of a design can be updated to reflect changes to the components and datasets. In various aspects, data integration system 200 automatically generates a physical design based on deriving relationships in a flow's dataset view.
Data integration system 200 further makes adding in components and other datasets to an existing design simple, typically just needing component level connectors to be rerouted. For example, if a filter component were added into a design, changing component level connectors would not require changes attribute assignments of certain downstream components. In another example, adding another dataset allows a map designer to specify or declare entity relationships directly from within the design view of the map.
19A and 19B are simplified block diagrams of logical and physical designs for a combined flow-based and entity-based mapping in one embodiment according to the present invention. In this example, view 1910 of
In this example, view 1920 of
In step 2020, a component definition is received. For example, a component definition may include rules, operations, procedures, variables, sequences, and the like. In step 2030, a dataset definition is received. For example, a map designer may add or edit entity relationships within a flow view of a logical design. In step 2040, a physical design is generated based on deriving relationship information from the flow design.
Accordingly, data integration system 200 enables users to create a logical design which is platform and technology independent. The user can create a logical design that defines, at a high level, how a user wants data to flow between sources and targets. The tool can analyze the logical design, in view of the user's infrastructure, and create a physical design. The logical design can include a plurality of components corresponding to each source and target in the design, as well as operations such as joins or filters, and access points. Each component when transferred to the physical design generates code to perform operations on the data. Depending on the underlying technology (e.g., SQL Server, Oracle, Hadoop, etc.) and the language used (SQL, pig, etc.) the code generated by each component may be different.
Thus, a user of data integration system is not required to pre-define dataset components in the logical design. Data integration system 200 provides tools that allow map designers to declare entity relationships in a dataset view of a logical design. Data integration system 200 is able to decide what attributes are needed at operations represented by predetermined component types. This simplifies both the design and maintenance.
Entity Relational Modeling
Relational database design has been based on entity relational modeling, or E-R modeling. Traditionally, E-R design has been used for describing the static configuration of a problem domain. The more dynamic aspects, such as extracting data out of the data stores and “massaging” them into shape, are generally considered a different problem. Since the mid-1990s, there have been steady efforts toward these so-called “ETL tools.” An ETL tool can help human designers create specifications about dynamic data flows, which are generally called ETL models.
There are many benefits from automating the ETL design process using the E-R model. One such is the productivity of the ETL designer. The E-R model can be more easily made correct than an ETL process. The E-R model also has a standard notational system understood by database engineers. But same cannot be said for any ETL tool. Most, if not all, require steep learning curves on the part of designers. Another benefit is the better adaptability to change. Without the “middleman,” when the E-R model is finalized, so is the ETL process.
In various embodiments, techniques are disclosed to provide an automatic conversion from an E-R model to an ETL model. This is based on the observation that when a database engineer reads an E-R diagram, a data flow model is usually built in in his mind. Using this silent data flow model, the engineer can understand the E-R model and is able to communicate with others. The engineer even creates software based on this model. This phenomenon is more apparent when the E-R model becomes complex. Accordingly, the inventors recognize that there can be one or more hidden data flow models in every E-R model. In one aspect, the equivalent data flow model is provided for the E-R model that has been proven to be accurate in guiding the creation of an automatic conversion system.
As shown in
The “equivalent data flow” model as used herein represents a semantic model for the E-R model. A semantic model is used to define unambiguously what a logical model means. The semantic model can be expressed in vastly different ways, such as natural language, set theoretic notations, algebraic equations, mathematical logic, or algorithmic notations (generally known as operational semantic). In various embodiments, the semantic model for the E-R model is an operational semantic model, referred to as a “CFO model.” In one aspect, defining meaning in operational semantic format provides a twofold benefit, the first being that the operational semantic model is already in a step-by-step form consistent with the data flow model. Second, the operational semantic model is easy for human to understand, compared with other formal semantic models; and it is more rigorous than natural language explanations.
Binary relationships in an E-R model (or diagram) can be trivially mapped to joins in an ETL model. However, a multi-way relationship needs some work, because there is a common misconception around it.
In practice, however, standard E-R notation is not used. Instead, it is more common to see the so-called “Crow's Feet” notation. The difference between the two notational systems is only superficial. Referring to
One common misconception is the mistake of equating a multi-way relationship to a series of binary relationships.
Both of these models share the same problem in that they do not require that the two binary relationships must always hold at the same time. For example, an instance in PET, call it “Pet A” can relate to an instance in PET_TYPE, call it “PT A”, but it is not required that “Pet A” must also relate to an instance from PET_OWNER.
It is, however, possible to model the fact that a pet must participate in two binary relationships at the same time.
In one aspect, there are two special cases where it is possible for the PET entity to absorb the associative entity. First, one possibility is that each PET instance participates in no more than one relationship instance. Second, another possibility is if PET is a weak entity. (The formal definition of a weak entity is an entity that does not have its own primary key.) Suppose PET was a strong entity, its own primary key would have to be used to only identify pets. It could not also be used to identify the relationship instances. For example, if a pet instance participated in more than one relationship instance, there would be a primary key violation in the strong PET entity. On the other hand, if PET is a weak entity, its partial key (not unique) can be combined with the key (either partial or unique) of the ternary relationship. In this case, PET can absorb the ternary relationship.
Thus, without making additional assumptions,
Suppose p, t, o are instances of PET, PET_TYPE, and PET_OWNER, respectively. The following are possible instances in the “Pet-of-Type-and-Owner” relationship:
Here <missing> represents the absence of value from an entity. Whether these candidate tuples are valid relationship instances is determined by a three-way join condition, defined as:
Note: the value <missing> is able to match any other value. So in this example, the tuple (<missing>, t, o) is a valid instance of the relationship because the following condition evaluates to true.
The three example tables for the entities are created by the following statements.
create table PET (
id number,
name varchar2(30),
tid number,
oid number);
create table PET_TYPE (
id number,
name varchar2(30));
create table PET_OWNER (
id number,
name varchar2(30));
The user can also mark entity PET as optional. This is equivalent to having input an E-R model shown in
Since each ANSI join is pair-wise, to join three tables, two joins are needed. Also, since PET is an optional entity, at least one of the two joins must be an outer join. Furthermore, which two tables are joined first is also a factor to consider. Putting all these considerations together, nine permutations are encountered, corresponding to nine possible implementations for the multi-way join using ANSI syntax. These cases are drawn using data flow charts, and shown with their SQL statements and results in Table 1 below.
Flow Chart
SQL
Result
1
##STR00001##
select from
p.name “pet”, t.name “type”, o.name “owner” PET p join PET_TYPE t
pet ----------
type -----------
owner ----------- Jeff
on (p.tid = t.id)
(Comment: Type is nullified.
right outer join PET_OWNER o
This is not a good result.)
on (p.oid = o.id)
2
##STR00002##
select from
p.name “pet”, t.name “type”, o.name “owner” PET p right outer join PET_TYPE t
no rows selected
on (p.tid = t.id)
join PET_OWNER o
on (p.oid = o.id)
3
##STR00003##
select from
p.name “pet”, t.name “type”, o.name “owner” PET p right outer join PET_TYPE t
pet ----------
type -----------
owner ----------- Jeff
on (p.tid = t.id)
(Comment: Type is nullified.
right outer join PET_OWNER o
This is not a good result.)
on (p.oid = o.id)
4
##STR00004##
select from
p.name “pet”, t.name “type”, o.name “owner” PET p join PET_OWNER o
pet ----------
type ----------- Cat
owner -----------
on (p.oid = o.id)
(Comment: Owner is nullified.
right outer join PET_TYPE t
This is not a good result.)
on (p.tid = t.id)
5
##STR00005##
select from
p.name “pet”, t.name “type”, o.name “owner” PET p right outer join PET_OWNER o
no rows selected
on (p.oid = o.id)
join PET_TYPE t
on (p.tid = t.id)
6
##STR00006##
select from
p.name “pet”, t.name “type”, o.name “owner” PET p right outer join PET_OWNER o
pet ----------
type ----------- Cat
owner -----------
on (p.oid = o.id)
(Comment: Owner is nullified.
right outer join PET_TYPE t
This is not a good result.)
on (p.tid = t.id)
7
##STR00007##
select from
p.name “pet”, t.name “type”, o.name “owner” PET_TYPE t join PET_OWNER o
pet ----------
type ----------- Cat
owner ----------- Jeff
on (1 = 1)
(Comment: This is the result we expect.
left outer join PET p
Note that there is no direct join
on (p.tid = t.id and p.oid = o.id)
condition between PET_TYPE and
PET_OWNER. It defaults to true.)
8
##STR00008##
select from
p.name “pet”, t.name “type”, o.name “owner” PET_TYPE t full outer join PET_OWNER o
no rows selected
on (1 = 1)
join PET p
on (p.tid = t.id and p.oid = o.id)
9
##STR00009##
select from
p.name “pet”, t.name “type”, o.name “owner” PET_TYPE t full outer join PET_OWNER o
pet ----------
type ----------- Cat
owner ----------- Jeff
on (1 = 1)
(Comment: The generated SQL is
left outer join PET p
equivalent to the one in implementation
on (p.tid = t.id and p.oid = o.id)
#7.)
From the examination of all the possible implementations, implementation #7 appears to match the expectation for the three-way relationship. Accordingly, in general, multi-way relationship is not equivalent to a series of binary relationships. However, in various embodiments, a multi-way relationship may be implemented using binary joins. Therefore, one aspect, a model is created that is understandable for (casual) human users yet rigorous for use in generating correct data flow implementations. As discussed above, the equivalent data flow model fits in the category of “operational semantic models.” Operating semantic model to unambiguously describe the meaning/intention of a system have created before, however, one for equivalently representing the E-R model provides new opportunities as discussed herein.
The Connect phase: Performs Cartesian product of all the input entities. If an entity is an optional entity (defined shortly), a special tuple with all columns of value <missing> is first added as an extra member of the entity before the Cartesian product is carried out.
The Filter phase: In the filter phase, all the tuples coming out of the Connect phase are categorized into three groups:
Intuitively, Group S1 includes rows that scored straight success. And group S2 passed the join condition because of ignorable missing values from optional entity.
The Output phase: output final result, which is a set of tuples, using the following rule:
A tuple is considered having no material contribution to the final result if it matches one of the tuples in the result set. In checking if two tuples match, we assume <missing> value matches any other value. For example the following two tuples match.
Intuitively, the final output phase performs deduplication on tuples in Groups S1 and S2.
Using the example data in
Here, we use pet_100 to represent the row in PET table with id=100. Notice that the value <missing> is treated as a “valid” pet because PET is an optional entity.
In the second phase, the multi-way join condition
The final phase is trivial for this example since there is no need to do any dedup.
The reason for the implementation #7 in Table 1 being able to return the correct result is it does Cartesian product of all tables before starting evaluating the join condition. It is the only implementation that is consistent with the defined operational semantic model defined. By making sure Cartesian product operation is well completed before rows are filtered in the data flow, the simultaneity property inherent in multi-way relationship is protected against the potentially destructive binary joins.
In some embodiments, users may be able to visually create models by drawing lines between entities that indicate the need for the Connect phase operation. For example, suppose a user only drew a connection between PET and PET_OWNER, but he entered a relationship condition as follows:
Upon seeing the above 3-way relationship condition, a connection may be automatically determined and created between PET and PET_TYPE. This is because the Connect phase operation requires the Cartesian product of all entities involved. The join condition on this derived join is 1=1 only for achieving Cartesian product.
Supposing also that the user drew an additional line, from PET_OWNER to PET_TYPE, forming a circle among the three entities. In one aspect, the new line that created the circle can be ignored, since all the entities in the relationship have been sufficiently connected. To a human user, he may think a line means “a binary relationship,” but by staying truthful to the operational semantic model
After the entities are connected, the diagram can then be transformed into a tree of binary joins where the join node for PET and PET_TYPE carries the 1=1 condition. And the multi-way join condition is delayed to the last join node. In the whole process, the join condition is not dissembled, but rather, it is maximally delayed to ensure all the rows from all tables have opportunities to interact.
In contrast, if the join condition is split into two parts (which is syntactically permitted), and assigned the two subconditions to two join nodes, the operational semantic model would be violated because the Filter phase would be started before the Connect phase has completed.
Accordingly, since the operational semantic model is specified in a detailed, step-by-step manner, it can be easily transformed into a programmatic implementation using any existing programming languages. There is no need to only use SQL to implement it.
Conclusion
Bus subsystem 3020 provides a mechanism for letting the various components and subsystems of computer system 3000 communicate with each other as intended. Although bus subsystem 3020 is shown schematically as a single bus, alternative embodiments of the bus subsystem may utilize multiple busses.
Storage subsystem 3030 may be configured to store the basic programming and data constructs that provide the functionality of the present invention. Software (code modules or instructions) that provides the functionality of the present invention may be stored in storage subsystem 3030. These software modules or instructions may be executed by processor(s) 3010. Storage subsystem 3030 may also provide a repository for storing data used in accordance with the present invention. Storage subsystem 3030 may comprise memory subsystem 3040 and file/disk storage subsystem 3050.
Memory subsystem 3040 may include a number of memories including a main random access memory (RAM) 3042 for storage of instructions and data during program execution and a read only memory (ROM) 3044 in which fixed instructions are stored. File storage subsystem 3050 provides persistent (non-volatile) storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a Compact Disk Read Only Memory (CD-ROM) drive, a DVD, an optical drive, removable media cartridges, and other like storage media.
Input devices 3060 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a barcode scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and mechanisms for inputting information to computer system 3000.
Output devices 3070 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices, etc. The display subsystem may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device. In general, use of the term “output device” is intended to include all possible types of devices and mechanisms for outputting information from computer system 3000.
Network interface subsystem 3080 provides an interface to other computer systems, devices, and networks, such as communications network 3090. Network interface subsystem 3080 serves as an interface for receiving data from and transmitting data to other systems from computer system 3000. Some examples of communications network 3090 are private networks, public networks, leased lines, the Internet, Ethernet networks, token ring networks, fiber optic networks, and the like.
Computer system 3000 can be of various types including a personal computer, a portable computer, a workstation, a network computer, a mainframe, a kiosk, or any other data processing system. Due to the ever-changing nature of computers and networks, the description of computer system 3000 depicted in
Although specific embodiments of the invention have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of the invention. The described invention is not restricted to operation within certain specific data processing environments, but is free to operate within a plurality of data processing environments. Additionally, although the present invention has been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the present invention is not limited to the described series of transactions and steps.
Further, while the present invention has been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also within the scope of the present invention. The present invention may be implemented only in hardware, or only in software, or using combinations thereof.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.
Various embodiments of any of one or more inventions whose teachings may be presented within this disclosure can be implemented in the form of logic in software, firmware, hardware, or a combination thereof. The logic may be stored in or on a machine-accessible memory, a machine-readable article, a tangible computer-readable medium, a computer-readable storage medium, or other computer/machine-readable media as a set of instructions adapted to direct a central processing unit (CPU or processor) of a logic machine to perform a set of steps that may be disclosed in various embodiments of an invention presented within this disclosure. The logic may form part of a software program or computer program product as code modules become operational with a processor of a computer system or an information-processing device when executed to perform a method or process in various embodiments of an invention presented within this disclosure. Based on this disclosure and the teachings provided herein, a person of ordinary skill in the art will appreciate other ways, variations, modifications, alternatives, and/or methods for implementing in software, firmware, hardware, or combinations thereof any of the disclosed operations or functionalities of various embodiments of one or more of the presented inventions.
The disclosed examples, implementations, and various embodiments of any one of those inventions whose teachings may be presented within this disclosure are merely illustrative to convey with reasonable clarity to those skilled in the art the teachings of this disclosure. As these implementations and embodiments may be described with reference to exemplary illustrations or specific figures, various modifications or adaptations of the methods and/or specific structures described can become apparent to those skilled in the art. All such modifications, adaptations, or variations that rely upon this disclosure and these teachings found herein, and through which the teachings have advanced the art, are to be considered within the scope of the one or more inventions whose teachings may be presented within this disclosure. Hence, the present descriptions and drawings should not be considered in a limiting sense, as it is understood that an invention presented within a disclosure is in no way limited to those embodiments specifically illustrated.
Accordingly, the above description and any accompanying drawings, illustrations, and figures are intended to be illustrative but not restrictive. The scope of any invention presented within this disclosure should, therefore, be determined not with simple reference to the above description and those embodiments shown in the figures, but instead should be determined with reference to the pending claims along with their full scope or equivalents.
Allan, David, Lau, Kwok-hung (Thomas), Gong, Yu (Jeff)
Patent | Priority | Assignee | Title |
10620923, | Aug 22 2016 | Oracle International Corporation | System and method for dynamic, incremental recommendations within real-time visual simulation |
10620924, | Aug 22 2016 | Oracle International Corporation | System and method for ontology induction through statistical profiling and reference schema matching |
10705812, | Aug 22 2016 | Oracle International Corporation | System and method for inferencing of data transformations through pattern decomposition |
10740346, | Sep 18 2017 | Agile Handover and Automation Solutions, LLC | System and method for automating information handover from facility project to operations/maintenance |
10776086, | Aug 22 2016 | Oracle International Corporation | System and method for metadata-driven external interface generation of application programming interfaces |
10783161, | Dec 15 2017 | International Business Machines Corporation | Generating a recommended shaping function to integrate data within a data repository |
11137987, | Aug 22 2016 | Oracle International Corporation | System and method for automated mapping of data types for use with dataflow environments |
11347482, | Aug 22 2016 | Oracle International Corporation | System and method for dynamic lineage tracking, reconstruction, and lifecycle management |
11526338, | Aug 22 2016 | Oracle International Corporation | System and method for inferencing of data transformations through pattern decomposition |
11537369, | Aug 22 2016 | Oracle International Corporation | System and method for dynamic, incremental recommendations within real-time visual simulation |
11537370, | Aug 22 2016 | Oracle International Corporation | System and method for ontology induction through statistical profiling and reference schema matching |
11537371, | Aug 22 2016 | Oracle International Corporation | System and method for metadata-driven external interface generation of application programming interfaces |
11734238, | May 07 2021 | Bank of America Corporation | Correcting data errors for data processing fault recovery |
11789967, | May 07 2021 | Bank of America Corporation | Recovering from data processing errors by data error detection and correction |
RE49254, | Jul 10 2014 | U S BANK TRUST COMPANY, NATIONAL ASSOCIATION | System and method for master data management |
Patent | Priority | Assignee | Title |
5193182, | Apr 27 1990 | Computer Associates Think, Inc | Computer system for defining logical operations on design data including retrieve entity-set, send, receive, signal, when, reference to entity-set, reference to entity method, connect and disconnect |
5659723, | Dec 20 1991 | International Business Machines Corporation | Entity/relationship to object oriented logical model conversion method |
5937410, | Oct 16 1997 | Johnson Controls Technology Company | Method of transforming graphical object diagrams to product data manager schema |
7343585, | Jan 30 2002 | Oracle International Corporation | Operator approach for generic dataflow designs |
7350191, | Apr 22 2003 | Noetix Corporation | Computer implemented system and method for the generation of data access applications |
7849438, | May 27 2004 | Sprint Communications Company L.P. | Enterprise software development process for outsourced developers |
9020945, | Jan 25 2013 | HUMANA INC | User categorization system and method |
20050050068, | |||
20050228808, | |||
20050278152, | |||
20060064178, | |||
20060265611, | |||
20070067711, | |||
20070130180, | |||
20070203923, | |||
20070214171, | |||
20080133455, | |||
20080243767, | |||
20080270350, | |||
20080281849, | |||
20100005366, | |||
20100174583, | |||
20110047525, | |||
20110055147, | |||
20110295792, | |||
20120054147, | |||
20120096426, | |||
20130103705, | |||
20130111375, | |||
20130132854, | |||
20130151491, | |||
20130191306, | |||
20130227573, | |||
20140022257, | |||
20140344817, | |||
CN100373297, | |||
CN102349081, | |||
CN1763696, | |||
EP2997513, | |||
IN7688CHENP2015, | |||
JP2005327232, | |||
JP2006031687, | |||
JP2011034246, | |||
JP2012027690, | |||
WO2007083371, | |||
WO2014186058, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Oct 02 2013 | Oracle International Corporation | (assignment on the face of the patent) | / | |||
Oct 08 2013 | ALLAN, DAVID | Oracle International Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 031373 | /0101 | |
Oct 08 2013 | LAU, KWOK-HUNG THOMAS | Oracle International Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 031373 | /0101 | |
Oct 08 2013 | GONG, YU JEFF | Oracle International Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 031373 | /0101 |
Date | Maintenance Fee Events |
Aug 10 2022 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Date | Maintenance Schedule |
Feb 26 2022 | 4 years fee payment window open |
Aug 26 2022 | 6 months grace period start (w surcharge) |
Feb 26 2023 | patent expiry (for year 4) |
Feb 26 2025 | 2 years to revive unintentionally abandoned end. (for year 4) |
Feb 26 2026 | 8 years fee payment window open |
Aug 26 2026 | 6 months grace period start (w surcharge) |
Feb 26 2027 | patent expiry (for year 8) |
Feb 26 2029 | 2 years to revive unintentionally abandoned end. (for year 8) |
Feb 26 2030 | 12 years fee payment window open |
Aug 26 2030 | 6 months grace period start (w surcharge) |
Feb 26 2031 | patent expiry (for year 12) |
Feb 26 2033 | 2 years to revive unintentionally abandoned end. (for year 12) |