Techniques are provided for automatic placement of cache operations in a dataflow. An exemplary method obtains a graph representation of a dataflow of operations; determines a number of executions and a computational cost of the operations, and a computational cost of a caching operation to cache a dataset generated by an operation; establishes a dataflow state structure recording values for properties of the dataflow operations for a number of variations of caching various dataflow operations; determines a cache gain factor for dataflow operations as an estimated reduction in the accumulated cost of the dataflow by caching an output dataset of a given operation; determines changes in the dataflow state structure by caching an output dataset of a different operation in the dataflow; and searches the dataflow state structures to determine the output datasets to cache based on a total dataflow execution cost.
1. A method, comprising the steps of:
obtaining a representation of a dataflow comprised of a plurality of operations as a directed graph, wherein vertices in said directed graph represent operations and edges in said directed graph represent data dependencies between said operations;
determining, using at least one processing device, a number of executions of said operations, a computational cost of said operations and a computational cost of a caching operation to cache a given dataset generated by at least one of said operations based on a size of said given dataset and a cost of said caching operation, wherein said computational cost of a given operation comprises an individual cost of executing the given operation itself and an accumulated cost of additional operations required to generate an input dataset for the given operation, wherein said given operation is represented in a data structure comprising said individual cost of executing the given operation itself, said accumulated cost of additional operations required to generate an input dataset for the given operation and said computational cost of said caching operation to cache said given dataset generated by said given operation;
establishing, using said at least one processing device, a dataflow state structure for each of a number of variations of caching one or more operations of the dataflow, wherein a given dataflow state structure records values for a plurality of properties of the operations in the dataflow, given zero or more existing cached operations in a given dataflow state, wherein said given dataflow state structure comprises a list of the accumulated costs of the operations in the given dataflow state, a list of the number of executions of said operations in the given dataflow state, and a list of a cache cost of the operations in the given dataflow state;
determining, using said at least one processing device, for each of said dataflow states, a cache gain factor for each operation in the dataflow as an estimated reduction in the accumulated cost of the dataflow by caching an output dataset of the given operation;
determining, using said at least one processing device, a change in said dataflow state structure by caching an output dataset of a different operation in said dataflow, wherein said change propagates changes in said list of the accumulated costs of the operations in the changed dataflow state structure and said list of the number of executions of said operations in the changed dataflow state structure, and applies the cache cost of the different operation to said list of said cache cost of the operations in the changed dataflow state structure; and
dynamically searching, using said at least one processing device, a plurality of said dataflow state structures to automatically determine a combination of said output datasets of a subset of said operations in said dataflow to cache based on a total execution cost for the dataflow.
13. A system, comprising:
a memory; and
at least one processing device, coupled to the memory, operative to implement the following steps:
obtaining a representation of a dataflow comprised of a plurality of operations as a directed graph, wherein vertices in said directed graph represent operations and edges in said directed graph represent data dependencies between said operations;
determining, using said at least one processing device, a number of executions of said operations, a computational cost of said operations and a computational cost of a caching operation to cache a given dataset generated by at least one of said operations based on a size of said given dataset and a cost of said caching operation, wherein said computational cost of a given operation comprises an individual cost of executing the given operation itself and an accumulated cost of additional operations required to generate an input dataset for the given operation, wherein said given operation is represented in a data structure comprising said individual cost of executing the given operation itself, said accumulated cost of additional operations required to generate an input dataset for the given operation and said computational cost of said caching operation to cache said given dataset generated by said given operation;
establishing, using said at least one processing device, a dataflow state structure for each of a number of variations of caching one or more operations of the dataflow, wherein a given dataflow state structure records values for a plurality of properties of the operations in the dataflow, given zero or more existing cached operations in a given dataflow state, wherein said given dataflow state structure comprises a list of the accumulated costs of the operations in the given dataflow state, a list of the number of executions of said operations in the given dataflow state, and a list of a cache cost of the operations in the given dataflow state;
determining, using said at least one processing device, for each of said dataflow states, a cache gain factor for each operation in the dataflow as an estimated reduction in the accumulated cost of the dataflow by caching an output dataset of the given operation;
determining, using said at least one processing device, a change in said dataflow state structure by caching an output dataset of a different operation in said dataflow, wherein said change propagates changes in said list of the accumulated costs of the operations in the changed dataflow state structure and said list of the number of executions of said operations in the changed dataflow state structure, and applies the cache cost of the different operation to said list of said cache cost of the operations in the changed dataflow state structure; and
dynamically searching, using said at least one processing device, a plurality of said dataflow state structures to automatically determine a combination of said output datasets of a subset of said operations in said dataflow to cache based on a total execution cost for the dataflow.
9. A computer program product, comprising a non-transitory machine-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by at least one processing device perform the following steps:
obtaining a representation of a dataflow comprised of a plurality of operations as a directed graph, wherein vertices in said directed graph represent operations and edges in said directed graph represent data dependencies between said operations;
determining, using at least one processing device, a number of executions of said operations, a computational cost of said operations and a computational cost of a caching operation to cache a given dataset generated by at least one of said operations based on a size of said given dataset and a cost of said caching operation, wherein said computational cost of a given operation comprises an individual cost of executing the given operation itself and an accumulated cost of additional operations required to generate an input dataset for the given operation, wherein said given operation is represented in a data structure comprising said individual cost of executing the given operation itself, said accumulated cost of additional operations required to generate an input dataset for the given operation and said computational cost of said caching operation to cache said given dataset generated by said given operation;
establishing, using said at least one processing device, a dataflow state structure for each of a number of variations of caching one or more operations of the dataflow, wherein a given dataflow state structure records values for a plurality of properties of the operations in the dataflow, given zero or more existing cached operations in a given dataflow state, wherein said given dataflow state structure comprises a list of the accumulated costs of the operations in the given dataflow state, a list of the number of executions of said operations in the given dataflow state, and a list of a cache cost of the operations in the given dataflow state;
determining, using said at least one processing device, for each of said dataflow states, a cache gain factor for each operation in the dataflow as an estimated reduction in the accumulated cost of the dataflow by caching an output dataset of the given operation;
determining, using said at least one processing device, a change in said dataflow state structure by caching an output dataset of a different operation in said dataflow, wherein said change propagates changes in said list of the accumulated costs of the operations in the changed dataflow state structure and said list of the number of executions of said operations in the changed dataflow state structure, and applies the cache cost of the different operation to said list of said cache cost of the operations in the changed dataflow state structure; and
dynamically searching, using said at least one processing device, a plurality of said dataflow state structures to automatically determine a combination of said output datasets of a subset of said operations in said dataflow to cache based on a total execution cost for the dataflow.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
7. The method of
8. The method of
10. The computer program product of
11. The computer program product of
12. The computer program product of
14. The system of
15. The system of
16. The system of
17. The system of
18. The system of
19. The system of
20. The system of
The field relates generally to information processing systems, and more particularly to the placement of cache operations in such information processing systems.
In some dataflows, a given action can have multiple executions during the dataflow, with various dependent transformations. To improve the performance of such dataflows, some dataflow engines provide mechanisms to persist the output of a transformation using a caching operation, thereby avoiding the re-execution of precedent operations. The caching operation indicates that the dataset produced by an operation should be kept in memory for future reuse, without the need for re-computation.
The use of a caching operation (potentially) avoids the increased cost incurred by multiple actions in a dataflow. In complex dataflows, however, comprised of tens to hundreds of operations and control flows, deciding which datasets to cache is not trivial. Thus, the decision to cache a dataset requires considerable effort from the users to estimate a number of metrics.
A need therefore exists for improved techniques for automatic placement of cache operations for such dataflows.
Illustrative embodiments of the present disclosure provide methods and apparatus for automatic placement of cache operations in dataflows. In one embodiment, a method comprises obtaining a representation of a dataflow comprised of a plurality of operations as a directed graph, wherein vertices in the directed graph represent operations and edges in the directed graph represent data dependencies between the operations; determining, using at least one processing device, a number of executions of the operations, a computational cost of the operations and a computational cost of a caching operation to cache a given dataset generated by at least one of the operations based on a size of the given dataset and a cost of the caching operation, wherein the computational cost of a given operation comprises an individual cost of executing the given operation itself and an accumulated cost of additional operations required to generate an input dataset for the given operation, wherein the given operation is represented in a data structure comprising the individual cost of executing the given operation itself, the accumulated cost of additional operations required to generate an input dataset for the given operation and the computational cost of the caching operation to cache the given dataset generated by the given operation; establishing, using the at least one processing device, a dataflow state structure for each of a number of variations of caching one or more operations of the dataflow, wherein a given dataflow state structure records values for a plurality of properties of the operations in the dataflow, given zero or more existing cached operations in a given dataflow state, wherein the given dataflow state structure comprises a list of the accumulated costs of the operations in the given dataflow state, a list of the number of executions of the operations in the given dataflow state, and a list of a cache cost of the operations in the given dataflow state; determining, 
using the at least one processing device, for each of the dataflow states, a cache gain factor for each operation in the dataflow as an estimated reduction in the accumulated cost of the dataflow by caching an output dataset of the given operation; determining, using the at least one processing device, a change in the dataflow state structure by caching an output dataset of a different operation in the dataflow, wherein the change propagates changes in the list of the accumulated costs of the operations in the changed dataflow state structure and the list of the number of executions of the operations in the changed dataflow state structure, and applies the cache cost of the different operation to the list of the cache cost of the operations in the changed dataflow state structure; and dynamically searching, using the at least one processing device, a plurality of the dataflow state structures to automatically determine a combination of the output datasets of a subset of the operations in the dataflow to cache based on a total execution cost for the dataflow.
These and other illustrative embodiments described herein include, without limitation, apparatus, systems, methods and computer program products comprising processor-readable storage media.
Illustrative embodiments of the present disclosure will be described herein with reference to exemplary information processing systems and associated processing devices. It is to be appreciated, however, that embodiments of the disclosure are not restricted for use with the particular illustrative system and device configurations shown. Accordingly, the term “information processing system” as used herein is intended to be broadly construed, so as to encompass, for example, processing systems comprising cloud computing and storage systems, as well as other types of processing systems comprising various combinations of physical and virtual processing resources. An information processing system may therefore comprise, for example, at least one data center that includes one or more clouds hosting multiple tenants that share cloud resources.
One or more embodiments of the disclosure provide a computational approach to automatically find efficient cache placement strategies in dataflows. A formal model is provided for the cache placement problem and methods are provided for (i) domain independent estimation of dataflow performance; (ii) evaluation of cache placement options; and (iii) automatic searching for multiple cache placements.
Introduction
Large-scale data processing frameworks, such as the Spark™ data processing engine and Flink™ stream processing framework, both from the Apache Software Foundation, are currently widely adopted in the industry and academia. These frameworks employ a programming model in which the user defines a dataflow of operations that specify transformations on the input data.
These operations are lazily (or late) evaluated: they define a logical plan that is only enacted when an action operation is executed, e.g., an operation that requires returning the results of the defined computation to the coordinating process (referred to as the driver program, in the Spark™ data processing engine).
A side effect of the lazy execution model is that dataflows with more than one action incur multiple executions of their dependent transformations. This increases the cost of executing the dataflow substantially.
To improve the performance of dataflows running under a lazy execution model, some dataflow engines provide mechanisms to persist the output of a transformation, avoiding the re-execution of precedent operations. This is called a caching operation. In this scenario, a user informs the system about the persistence strategy by altering the persistence mode of a dataset produced by an operation.
The caching operation does not modify the lazy execution model, and merely indicates that the dataset produced by an operation should be kept in memory for future reuse, without the need for re-computation. Therefore, the use of a caching operation (potentially) avoids the increased cost incurred by multiple actions in a dataflow.
However, in complex dataflows, comprised of tens to hundreds of operations and control flows, deciding which datasets to cache is not trivial. The user must consider, among others, the following metrics:
(i) the number of executions of each operation;
(ii) the computational cost of operations; and
(iii) the computational cost incurred to cache a dataset.
Thus, the decision to cache a dataset requires considerable effort from the users to estimate those metrics. Another problem is that the data size and its features (e.g., data types and cardinalities) affect the costs of operations, requiring the user to estimate the datasets' characteristics at design time. Relatedly, when deciding which datasets to cache, the user also needs to keep memory limitations in mind to avoid unnecessary disk access. Finally, multiple datasets can be cached to improve the dataflow performance, so the decision of which datasets to cache becomes a combinatorial problem.
One or more embodiments of the disclosure provide for the automatic placement of cache operations in complex dataflows. The term cache placement refers to the decision of which operation results to cache. To achieve this goal, a formal model is defined in a section entitled “A Formal Model for the Cache Placement Problem” for the representation of dataflows. Mathematical models are defined in a section entitled “Estimation Functions,” for the estimation of properties of the dataflow. A section entitled “Cache Gain Factor Computation” describes a metric for the potential benefits of caching operations. An algorithm is provided in a section entitled “Automatic Search for Multiple Cache Placement,” for the search of the substantially best strategy for cache placement.
In large-scale data analysis, keeping data in memory increases the processing speed significantly. In contrast to the on-disk approach, in-memory processing frameworks eliminate disk operations by maintaining intermediate data in memory. This execution model is adopted for a variety of reasons, and may lead to better computation resource usage while avoiding disk I/O (input/output) overhead. See, for example, Y. Wu et al., “HAMR: A Dataflow-Based Real-Time In-Memory Cluster Computing Engine,” Int. J. High Perform. Comput. Appl. (2016), incorporated by reference herein in its entirety.
Current in-memory frameworks provide a programming model in which the user creates a dataflow defining a set of operations over the input data. Most frameworks classify operations into two categories: transformations and actions. Transformations are lazy operations that produce new datasets, while actions launch a computation that returns a value to the coordinating program. The lazy evaluation of transformations implies that their execution is delayed until demanded by an action.
To improve the performance of a dataflow under the lazy execution model, a user can set the persistence mode of a dataset produced by a transformation through the cache operation. This operation leads to the in-memory storage of the dataset produced by the associated transformation. By doing so, the associated transformation (and all the ones preceding it in an execution path) are executed only once, instead of once for each action following that transformation.
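The effect of the cache operation on the number of executions can be illustrated with a toy model of lazy evaluation. The following is a minimal sketch only: the class LazyDataset and the function run are hypothetical names introduced for illustration and are not part of the Spark™ API or of the disclosed method.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not the Spark API)."""

    def __init__(self, name, compute, parents=()):
        self.name = name
        self.compute = compute      # function applied to the parents' values
        self.parents = parents      # datasets this one depends on
        self.cached = False         # persistence mode set by cache()
        self._has_value = False
        self._value = None
        self.executions = 0         # how many times compute() actually ran

    def cache(self):
        """Mark the dataset to be kept in memory after its first computation."""
        self.cached = True
        return self

    def collect(self):
        """Force evaluation, recomputing all uncached ancestors."""
        if self.cached and self._has_value:
            return self._value
        inputs = [p.collect() for p in self.parents]
        self.executions += 1
        value = self.compute(*inputs)
        if self.cached:
            self._value, self._has_value = value, True
        return value


def run(cache_filtered):
    """Run two actions over a shared transformation; return execution counts."""
    source = LazyDataset("source", lambda: list(range(10)))
    filtered = LazyDataset(
        "filtered", lambda rows: [r for r in rows if r % 2 == 0], (source,)
    )
    if cache_filtered:
        filtered.cache()
    doubled = LazyDataset("doubled", lambda rows: [2 * r for r in rows], (filtered,))
    len(filtered.collect())   # first action
    sum(doubled.collect())    # second action
    return source.executions, filtered.executions
```

In this sketch, the shared transformation and its source each run once per action when nothing is cached, and only once overall when the shared transformation is cached.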
Consider an example that illustrates the cache placement problem. The example assumes a workflow defined using the Apache Spark™ framework. The example may, however, be mapped to other in-memory dataflow frameworks by a person of ordinary skill in the art.
Operation | Description
Transformations:
Project | Returns a dataset with selected columns
Select | Projects a set of expressions and returns a new dataset
Filter | Returns a dataset with rows filtered by predicate
Sort(asc) | Returns a dataset with rows in ascending order by key
Sort(desc) | Returns a dataset with rows in descending order by key
Actions:
Union All | Returns a merged dataset from two or more other datasets
Count | Returns the number of rows of a given dataset
Intuitively, the dataset (or datasets) that are most reused by succeeding transformations and actions are the best candidates to be cached. Considering this criterion, in the dataflow of
To analyze whether such an intuition is correct or not, consider an experiment using as input a dataset containing customer ratings for products in a retail setting. The dataset consists of nearly three million rows with a total size of 1.4 GB. The dataflow was executed on a machine with 96 GB of RAM (random access memory) and a 32-core processor. The selected dataflow execution framework was Apache Spark™ version 1.6.2, in standard configuration. The performance improvement of using caching operations is calculated by computing the difference between the elapsed time needed to run the dataflow without any cached dataset and the elapsed time to run the dataflow with different combinations of cached datasets. The candidate datasets for caching are the ones generated by transformations T1, T2, T3 and T5, since these datasets are requested by more than one action in the dataflow. Note that in a lazy execution model, a dataset participating in more than one path of transformation to an action will be requested multiple times. For example, dataset ds1 will be requested three times by action A3, in the running example, as well as once by action A1 and once by action A2.
The total time to run the dataflow without cache was 113.09 seconds. The following table shows the performance results when caching different datasets, where dsi denotes the dataset ds generated by the transformation Ti.
Spark local configuration: driver memory 48 GB; executor instances 32; executor memory 10 GB; maxResultSize 20 GB
Cache placement | Execution time (s) | Time reduction (%)
cache ds1 | 46.60 | 58.8
cache ds2 | 46.33 | 59.0
cache ds3 | 46.69 | 58.7
cache ds5 | 70.98 | 37.2
cache ds1 and ds2 | 48.54 | 57.1
cache ds1 and ds3 | 47.91 | 57.6
cache ds1 and ds5 | 35.75 | 68.4
cache ds2 and ds3 | 47.89 | 57.7
cache ds2 and ds5 | 36.19 | 68.0
cache ds3 and ds5 | 35.39 | 68.7
The experiment indicates that caching dataset ds5 alongside any one of datasets ds1, ds2 or ds3 is generally a good option. The substantially best result is obtained by caching datasets ds3 and ds5. It is important to point out that caching only dataset ds5 (a promising option according to the aforementioned intuitive notions) yields the smallest time reduction of all tested placements (37.2%, versus 57% or more for every other option). The example illustrates that, even in a simple dataflow, finding the best cache placement is a non-trivial task. Considering complex dataflows, comprised of tens to hundreds of operations and control flows, this task is even more challenging.
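The time reduction figures reported for the experiment follow directly from the baseline of 113.09 seconds. The short sketch below recomputes a few of them; the function name time_reduction_pct and the dictionary keys are illustrative names only.

```python
BASELINE_S = 113.09  # elapsed time without any cached dataset (from the experiment)


def time_reduction_pct(cached_time_s, baseline_s=BASELINE_S):
    """Relative reduction in elapsed time, in percent, rounded to one decimal."""
    return round(100.0 * (baseline_s - cached_time_s) / baseline_s, 1)


# A subset of the measured execution times, in seconds.
times_s = {"ds1": 46.60, "ds5": 70.98, "ds3+ds5": 35.39}

reductions = {placement: time_reduction_pct(t) for placement, t in times_s.items()}
best_placement = min(times_s, key=times_s.get)
```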
Cache Placement for Complex Dataflows
In the current programming model of in-memory dataflow frameworks, the decision to cache datasets produced by transformations is a manual activity performed by the users. However, as described above in the running example, this decision is hard, even more so when the dataflow involves various transformations and actions, and/or includes loops or iterations, in which the operations may access only a fragment of the data, or when each iteration may require data generated by the previous one. In summary, the cache placement must consider:
(i) the number of enactments of each transformation: The estimation of the number of invocations of a transformation involves considering the structure of the dataflow graph, which requires a considerable amount of effort and expertise from users in complex cases.
(ii) the computational cost of transformations: Each transformation has a specific algorithmic cost to produce a dataset and also an accumulated cost considering the execution path required to generate the input datasets. When analyzing the substantially best cache placement, both computational costs (individual and accumulated) must be considered. Inferring such costs at design time is not trivial.
(iii) the computational cost to cache the dataset: The cache operation incurs a computational cost, which is related to the size of the dataset to be cached. The user needs to consider whether caching a dataset is costlier than the transformation that produces it.
Therefore, the decision to cache a dataset requires considerable effort to estimate, at design time, the number of operation executions, the computational cost of transformations (individual and accumulated), and the computational cost to cache the produced datasets.
Data Size and Data Features Impact on Dataflow Performance
To perform analysis on large datasets, a user defines a dataflow that processes input data along a set of operations. The computational cost of each operation is significantly affected by the size of the data, its structure, cardinality and the way it is partitioned on the distributed file system. Consequently, the dataflow execution performance depends on the size and features of both input and output datasets that are manipulated by the intermediate transformations during the dataflow execution.
In complex dataflows, comprised of many transformations and control flows, each one having several input datasets with different sizes and features, a manual analysis at design time of the dataflow performance is not feasible. In this scenario, efficient strategies for estimating the dataflow performance are a relevant problem that needs to be addressed.
Multiple Cache Placements
In a complex dataflow, multiple datasets are candidates for caching. The optimal cache placement strategy may involve the caching of multiple datasets, which makes cache placement a combinatorial problem.
Considering dataflows comprised of a large number of operations, the decision of which datasets to cache becomes infeasible to perform manually. Given that caching heavily influences the execution cost of a dataflow, there is an evident need for an automatic approach to define optimal or near-optimal cache placement.
Memory Constraint
In-memory computing frameworks keep intermediate processed data in memory, by default. However, if there is not enough RAM (random access memory), frameworks spill the data to disk, which reduces the dataflow performance drastically. Therefore, limited RAM is a constraint that must be considered when analyzing cache placement, since a cached dataset occupies memory space. For multiple cache placements, when adding a new cache, the user needs to account for previously cached datasets, so as not to introduce unnecessary disk I/O (input/output) overhead.
Another issue related to this constraint is that the majority of in-memory computing frameworks are written in a garbage-collected language. See, for example, M. Zaharia et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” 9th USENIX Conf. on Networked Systems Design and Implementation, 2-22 (2012), incorporated by reference herein in its entirety. Garbage collection achieves reasonable performance by collecting garbage in large batches; consequently, the memory usage for the temporary storage of garbage must also be taken into account. See, for example, C. Reiss, “Understanding Memory Configurations for In-Memory Analytics,” (University of California, Berkeley, EECS Department; 2016), incorporated by reference herein in its entirety.
A formal model is initially defined for the representation of the dataflow, which lends itself to the description of the disclosed methods for the estimation of the number of executions and costs of operations and the automatic search for the best cache placement strategy.
A Formal Model for the Cache Placement Problem
A dataflow is represented as a graph where the vertices are the operations and the edges represent the data dependencies between them. A dataflow graph G is defined as a tuple G=(O, E), where O={o1, o2, . . . , on} is a collection of operations, and E={e1, e2, . . . , em} is the set of directed edges. Operations are either of type transformation or of the type action. Actions are the operations that kick-start the actual execution in the lazy evaluation model, and are defined to not have any subsequent operations. Referring back to
An operation o∈O receives a set Do of datasets as input, executes an algorithm over it, and produces a dataset do.
Each edge e:(p→f)∈E denotes a dependency between two operations p, f∈O. Operation f is dependent on an operation p when the dataset dp produced by p is an input to f. Thus, f can only be executed after the execution of p.
Each operation o∈O has a set of properties po:
po={type,num_executions,cost_individual,cost_total,cost_cache,cached}.
When a dataflow is defined, the type is the only known property of each operation. A set of all actions is defined as A={a∈O|a.type==action} and a set of all transformations is defined as T={t∈O|t.type==transformation}, with A∩T=Ø.
The property num_executions holds the number of required executions of an operation. This property is directly related to the number of possible paths from the operation to the leaves of the graph (the number of possible operation paths, as described in a section entitled “Number of Executions”).
The cost properties cost_individual, cost_total and cost_cache represent different computation costs associated with the execution of the operation:
The cached property holds the information necessary for the cache placement strategy to be enacted. The cached property can be either true or false, defining whether the dataset produced by the operation is meant to be cached. For ease of explanation, “caching a dataset produced by an operation” is referred to interchangeably as “caching an operation” in the remainder of this document.
For a given operation o∈O, the following is additionally defined:
Po={p∈T|∃e:(p→o)∈E},
as the set of transformations (no actions) directly preceding o, necessary and sufficient for the generation of all d∈Do. Similarly, the following is also defined:
Fo={f∈O|∃e:(o→f)∈E},
as the set of operations (including actions) that directly follow o, i.e., operations that require do as input. For example, in the graph 100 of
Finally, the problem of cache placement is defined for a complex dataflow as how to automatically assign a value for the cached property of each operation typed as a transformation in the dataflow, taking into account (i) the impact of data size and its features on the dataflow performance, (ii) the possibility of multiple cache placements, and (iii) memory constraints. This assignment should substantially minimize the execution cost for the dataflow as a whole. An algorithm that addresses the automatic assignment of cache operations (cache placement) is described in the section entitled “Automatic Search for Multiple Cache Placement.” The algorithm relies on the other properties of the operations as input, which are computed through the estimation functions described in the following section.
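The formal model of this section can be sketched as a small data structure. The following is an illustrative sketch under stated assumptions: the class names Operation and Dataflow, the method names, and the example operations t1, t2, a1 and a2 are hypothetical and chosen only to mirror the definitions of G=(O, E), the property set po, and the sets A, T, Po and Fo.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    """One vertex of the dataflow graph, carrying the properties p_o."""
    name: str
    type: str                     # "transformation" or "action"
    num_executions: int = 0
    cost_individual: float = 0.0
    cost_total: float = 0.0
    cost_cache: float = 0.0
    cached: bool = False          # the value the cache placement must assign


@dataclass
class Dataflow:
    """G = (O, E): operations keyed by name, edges as (p, f) name pairs."""
    operations: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, name, kind):
        self.operations[name] = Operation(name, kind)

    def connect(self, p, f):
        """Record that operation f consumes the dataset produced by p."""
        self.edges.append((p, f))

    def actions(self):
        return {n for n, o in self.operations.items() if o.type == "action"}

    def transformations(self):
        return {n for n, o in self.operations.items() if o.type == "transformation"}

    def preceding(self, o):
        """P_o: transformations whose output is a direct input of o."""
        return {p for p, f in self.edges
                if f == o and self.operations[p].type == "transformation"}

    def following(self, o):
        """F_o: operations (including actions) that take d_o as input."""
        return {f for p, f in self.edges if p == o}


# Hypothetical example dataflow: t1 -> t2 -> a1 and t1 -> a2.
g = Dataflow()
for name, kind in [("t1", "transformation"), ("t2", "transformation"),
                   ("a1", "action"), ("a2", "action")]:
    g.add(name, kind)
for p, f in [("t1", "t2"), ("t2", "a1"), ("t1", "a2")]:
    g.connect(p, f)
```

In this representation, a cache placement is an assignment of the cached flag for each transformation; all other properties are inputs to that decision.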
Estimation Functions
In this section, the functions for the estimation of the number of executions and costs of operations are described. This corresponds to the assignment of values for the num_executions, cost_individual, cost_total and cost_cache properties for each operation in the dataflow. Based on these values, the total dataflow cost metric is defined.
A. Number of Executions
The number of executions of a given operation is determined by the number of execution paths in which the operation appears. Therefore, the number of executions of a transformation oi∈T is defined as the number of paths from the transformation to each action operation. Assuming a function NumPaths(oi, oj) which gives the number of paths between oi and oj:
oi.num_executions=Σ_{oj∈A} NumPaths(oi,oj).
A path from an operation to another operation is a sequence of operations that establishes a transitive dependency relationship between these two operations. All actions are executed just once, and the number of paths from an action to itself is defined as 1.
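The path-counting definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph below is a small hypothetical dataflow (it is not the dataflow 100), and the names FOLLOWERS, ACTIONS, num_paths and num_executions are chosen here for the example. The sketch reads the definition as a sum of NumPaths over all actions.

```python
from functools import lru_cache

# Hypothetical dataflow: each operation maps to the operations that directly follow it.
FOLLOWERS = {
    "T1": ("T2", "T3"),
    "T2": ("T3", "A1"),
    "T3": ("A2",),
    "A1": (),
    "A2": (),
}
ACTIONS = ("A1", "A2")


@lru_cache(maxsize=None)
def num_paths(src, dst):
    """Number of distinct paths from src to dst in the acyclic dataflow graph."""
    if src == dst:
        return 1  # the number of paths from an operation to itself is defined as 1
    return sum(num_paths(nxt, dst) for nxt in FOLLOWERS[src])


def num_executions(op):
    """Sum of the number of paths from op to every action."""
    return sum(num_paths(op, action) for action in ACTIONS)


for name in FOLLOWERS:
    print(name, num_executions(name))
```

Under this reading, an action needs no special case: the only action it can reach is itself, so its count is 1, in line with the statement that all actions are executed just once.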
B. Execution Cost
As mentioned above, the computational cost of each operation oi is affected by the size, structure, and cardinality of its input datasets Do
Assuming that such models provide a function OperationCost(oi, D) which estimates the computational cost of operation oi given the initial input datasets of the dataflow D, the property cost_individual of that operation is determined as follows:
oi.cost_individual=OperationCost(oi,D).
The computational cost is an abstraction of the execution time of an operation in a given context. In this disclosure, brief experiments were performed for provenance collection of the cost of individual operations.
Thus,
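The total cost of an operation oi∈O is its individual cost plus the accumulated cost of all previous operations required to generate its input datasets. Therefore, the total cost of an operation oi is given by:
oi.cost_total=oi.cost_individual+Σ_{oj∈Poi} oj.cost_total,
where Poi is the set of transformations directly preceding oi, as defined above.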
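The accumulated cost described above lends itself to a short recursive sketch. The code assumes that the total cost of an operation is its individual cost plus the total costs of its direct predecessors; the graph, the cost values and the names PRECEDING, COST_INDIVIDUAL and cost_total are hypothetical and serve only as an illustration.

```python
# Hypothetical dataflow: each operation maps to the operations that directly precede it.
PRECEDING = {
    "T1": (),
    "T2": ("T1",),
    "T3": ("T1", "T2"),
    "A1": ("T2",),
    "A2": ("T3",),
}

# Hypothetical individual costs, standing in for OperationCost(oi, D).
COST_INDIVIDUAL = {"T1": 2.0, "T2": 3.0, "T3": 1.0, "A1": 0.5, "A2": 0.5}


def cost_total(op):
    """Individual cost of op plus the total cost of every direct predecessor."""
    return COST_INDIVIDUAL[op] + sum(cost_total(p) for p in PRECEDING[op])


for name in PRECEDING:
    print(name, cost_total(name))
```

Note that T1 is counted twice in the total cost of T3 (once directly and once through T2), which matches the fact that, without caching, T1 would be re-executed along each path.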
C. Cache Cost
The cache operation has a computational cost proportional to the time necessary to perform the caching, which is dependent on the size of the dataset. Cached datasets occupy potentially large chunks of memory, and this occupied memory is no longer available for the computation of following operations. Thus, the cache cost, like the execution cost, represents an abstraction of both execution time and of the relation between the necessary and available resources (in this case, specifically, the required and available memory) for the execution of the dataflow.
In a similar manner as the computation cost of an operation, this size can be estimated by models built from provenance data before the operation is executed. Approaches and techniques which could provide such models are described in the patent applications referenced above.
Formally, caching the results of operation oi incurs a cost that is proportional to the size of its output dataset, as estimated by a function CachingCost(oi, D):
oi.cost_cache=CachingCost(oi,D).
In the running example, experiments for provenance collection with a 1.4 GB dataset yield an assumed cache cost of 1.57 for each transformation, except for transformation T6, whose cache cost is 5.09 because it generates a larger dataset that is more expensive to cache.
D. Dataflow Cost
Operations with type action are dataflow leaf nodes whose total cost represents the cost of all their execution paths. Therefore, the total dataflow cost is defined as the sum of the total costs of all actions, Σ_{a∈A} a.cost_total.
The cost of a dataflow G assumes that the total costs of all actions a∈A have already been estimated. In the running example, the total dataflow cost is 160.79 (approximately 25.6+19.5+115.7, from actions A1, A2 and A3, respectively).
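As a quick check of the dataflow cost, the following sketch sums the total costs of the actions. The names ACTION_COST_TOTAL and dataflow_cost are hypothetical; the three values are the rounded action costs quoted in the running example.

```python
import math

# Rounded cost_total values of the three actions in the running example.
ACTION_COST_TOTAL = {"A1": 25.6, "A2": 19.5, "A3": 115.7}


def dataflow_cost(action_cost_total):
    """Sum of cost_total over all actions (the leaf nodes of the dataflow)."""
    return sum(action_cost_total.values())


total = dataflow_cost(ACTION_COST_TOTAL)
# The rounded addends sum to about 160.8, close to the 160.79 reported in the text.
print(math.isclose(total, 160.8, abs_tol=1e-9))
```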
Cache Gain Factor Computation
The cache gain factor of a given transformation is defined as the potential decrease in the total cost of the dataflow from caching the output dataset of the given transformation. This concept is useful for the automatic computation of the substantially best combination of cache placements, described in the following sections, and also provides a naïve solution for a single cache placement: caching the operation with the highest immediate cache gain.
The following method is provided for the computation of the cache gain factor of a transformation t. The method relies on the estimation functions described above:
CacheGain(t)=((t.num_executions−1)*t.cost_total)−t.cost_cache.
The reasoning for this definition, in one or more embodiments, is as follows. The caching of a given transformation's output dataset dt will spare its re-execution t.num_executions−1 times. Thus, the total cost of the dataflow will potentially decrease by the total cost of the operation t.cost_total multiplied by the number of spared re-executions.
Finally, consider the computational cost of the caching itself, as well as the impact of keeping the resulting dataset in memory for future computations. This is done through the estimation function t.cost_cache. This value is subtracted from the potential gain from caching the operation so that transformations with highly costly caches (either computationally or due to very large outputs) are disfavored.
Overall, the exemplary algorithm identifies the best potential caches in costly transformations whose outputs are reused many times and whose memory footprint is relatively small. This motivates the use of the cache gain factor as a straightforward heuristic in the automatic search for the best combination of cache placements, described in the following section.
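The cache gain factor can be illustrated with a minimal Python sketch. The function name cache_gain and its parameters are hypothetical. For T5, the total cost (16.4) and cache cost (1.57) are the values quoted for the running example elsewhere in this document; the execution count of 3 is an inference from the two executions said to be saved by caching T5. For T6, the cache cost (5.09) is from the text, while its total cost is a placeholder that does not affect the result.

```python
def cache_gain(num_executions, cost_total, cost_cache):
    """Estimated decrease in total dataflow cost from caching one transformation."""
    return (num_executions - 1) * cost_total - cost_cache


# T5: two re-executions are spared, each worth the total cost of T5.
gain_t5 = cache_gain(num_executions=3, cost_total=16.4, cost_cache=1.57)

# T6 is executed only once, so nothing is spared and the gain is just -cost_cache.
# The cost_total of 40.0 is a placeholder value, not a figure from the text.
gain_t6 = cache_gain(num_executions=1, cost_total=40.0, cost_cache=5.09)

print(round(gain_t5, 2), round(gain_t6, 2))
```

A transformation that is executed once always has a negative gain under this formula, which is consistent with excluding such operations from the caching options.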
The following table shows the estimated cache gain factor for each transformation in the running example dataflow 100, ranked from greatest to lowest.
Recall that operations T4 and T6 are executed just once in the dataflow 100, and therefore are not considered as options for caching in the above table.
Automatic Search for Multiple Cache Placement
In this section, the method for the automatic search for the substantially best combination of multiple cache placements in a dataflow is described. This corresponds to the automatic assignment of a value for the cached property of each transformation in the dataflow.
Since the caching of any one operation effectively modifies the properties of the other operations that dictate the cost of the dataflow, this constitutes a combinatorial problem. To address it, a method for the automatic search for multiple cache placements is provided.
The disclosed method relies on the concepts described in previous sections. Initially, the data structures that support the method are defined, which were designed for one or more embodiments with the efficiency of the search process in mind. Algorithms are also disclosed that implement the methods for updating the properties of an operation given one or more cache placements, and for the heuristic search of the best cache placements.
A. Dataflow List
Recall the definition of the dataflow as a graph G=(O, E), with a set of operations O={o1, o2, . . . , on} and directed edges E. As mentioned above, the type of an operation is known a priori. Thus assume, for ease of explanation, an implicit ordering such that the first k operations are transformations, i.e., o1, . . . , ok∈T, and the remaining operations are actions, i.e., ok+1, . . . , on∈A, for 0≤k<n. For the purpose of the disclosed methods, the relevant information of the dataflow is stored in a list, as follows:
[t1,t2, . . . ,tk,tk+1,tk+2, . . . ,tn],
where ti, 1≤i≤n is a tuple representing the corresponding operation oi∈O. Each such tuple is of the following format:
ti=(p,f,ci,ct,cc).
This tuple roughly corresponds to the set of properties of an operation, as described in the section entitled “A Formal Model for the Cache Placement Problem,” except for the type and cached properties. While obtaining a value for the cached property is a goal of the disclosed method, the type property does not need to be defined since it is implicit in the ordering in the dataflow list. The list representation of the dataflow 100 of
Fields ci, ct and cc correspond to the cost_individual, cost_total and cost_cache properties, respectively. These costs of the operations are obtained by the estimation functions described above.
Fields p and f are lists containing the indexes in the dataflow list of elements in Po
Take, in the example in
Formally, this means that if an edge e:(oi→oi+1) exists in E, the list entry for ti+1.p will contain the index (i) (i.e., the operation represented by the i-th element of the dataflow list is a preceding operation to oi+1). Conversely, the list entry ti.f will contain (i+1), meaning that the (i+1)-th element of the dataflow list is a tuple representation of an operation that follows oi.
In the exemplary algorithms described below, the dataflow list structure is a static representation of the dataflow graph and the initial values of the properties of the operations. It is defined once and remains unchanged for the execution of the method.
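A minimal sketch of how such a dataflow list might be built is shown below. The names OpTuple and build_dataflow_list, the edge list and the cost values are hypothetical, and the sketch uses 0-based indexes where the text uses 1-based indexes. Transformations come first (indexes 0 to 2) and actions last (indexes 3 and 4), mirroring the implicit ordering described above.

```python
from typing import NamedTuple, Tuple


class OpTuple(NamedTuple):
    p: Tuple[int, ...]  # indexes of directly preceding operations
    f: Tuple[int, ...]  # indexes of directly following operations
    ci: float           # cost_individual
    ct: float           # cost_total
    cc: float           # cost_cache


def build_dataflow_list(n, edges, cost_individual, cost_cache):
    """Build the static list of tuples (p, f, ci, ct, cc) for n operations."""
    preds = [[] for _ in range(n)]
    folls = [[] for _ in range(n)]
    for src, dst in edges:
        preds[dst].append(src)
        folls[src].append(dst)

    totals = {}

    def total(i):
        # cost_total: individual cost plus the total cost of each direct predecessor.
        if i not in totals:
            totals[i] = cost_individual[i] + sum(total(p) for p in preds[i])
        return totals[i]

    return [
        OpTuple(tuple(preds[i]), tuple(folls[i]), cost_individual[i], total(i), cost_cache[i])
        for i in range(n)
    ]


# Hypothetical dataflow: T1, T2, T3 are indexes 0-2; actions A1, A2 are indexes 3-4.
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 4)]
DATAFLOW_LIST = build_dataflow_list(
    5, EDGES, cost_individual=[2.0, 3.0, 1.0, 0.5, 0.5], cost_cache=[0.5, 0.5, 0.5, 0.0, 0.0]
)
for index, entry in enumerate(DATAFLOW_LIST):
    print(index, entry)
```

Using an immutable tuple per operation reflects the statement that the dataflow list is defined once and remains unchanged during the search.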
B. Dataflow State
For the dynamic handling of the cache placements, the state data structure is defined. A state holds the current values for the properties of the operations given zero or more cached transformations already established. The state where zero transformations are cached is called the initial state of the dataflow, in which the values of all operation properties are identical to their values in the dataflow list. The states that represent the dataflow with one or more cached operations, however, hold different values for properties of one or more operations.
The state structure supports both the update of operation properties and the search methods. A state is represented with a minimal structure, in one or more embodiments, since it may be necessary to hold many such states in memory for the search.
A state S is a tuple of the form:
S=(Lt,Le,Lc),
where Lt, Le and Lc are lists of the same length as the dataflow list and:
As previously mentioned, the initial state is a different representation of the initial values of the properties of the operations in the dataflow list. The following array indicates the initial state for the running example dataflow 100:
It is noted that in the initial state, no caches have been placed and therefore Lc is a list of zeroes. In the discussion herein, cached operations are identified by the fact that they have a non-zero applied caching cost in Lc.
With this at hand, the state cost is defined as the dataflow cost (the sum of the cost_total for the actions, as described above) plus the summation of all values in Lc. This represents that the total cost of the dataflow with a certain number of applied caches is the cost of the actions in that context plus the cost of performing the caching operations themselves.
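The state cost can be sketched as follows. Because the list that defines Lt and Le is not reproduced here, the sketch assumes that Lt holds the current total cost of each operation and Le its current number of executions; Lc holds the applied caching costs, as stated above. The names State and state_cost, the parameter k (the number of transformations) and all numeric values are hypothetical, and indexes are 0-based.

```python
from typing import List, NamedTuple


class State(NamedTuple):
    Lt: List[float]  # assumed: current cost_total of each operation
    Le: List[int]    # assumed: current number of executions of each operation
    Lc: List[float]  # applied caching cost; zero means "not cached"


def state_cost(state, k):
    """Sum of cost_total over the actions (indexes k..n-1) plus all applied cache costs."""
    return sum(state.Lt[k:]) + sum(state.Lc)


K = 3  # three transformations, followed by two actions

# Initial state of a hypothetical dataflow: no caches placed.
initial = State(Lt=[2.0, 5.0, 8.0, 5.5, 8.5], Le=[3, 2, 1, 1, 1], Lc=[0.0] * 5)

# Hypothetical state after caching the second transformation (index 1).
cached = State(Lt=[2.0, 5.0, 3.0, 0.5, 3.5], Le=[2, 1, 1, 1, 1], Lc=[0.0, 5.5, 0.0, 0.0, 0.0])

print(state_cost(initial, K), state_cost(cached, K))
```

Keeping a state as three flat lists, detached from the graph, is in line with the stated goal of holding many candidate states in memory during the search.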
C. Cache Placement and Propagation of Operation Properties
As shown in
First (line 2), the applied cache cost in the new state is updated to reflect the caching of the operation indexed by i.
Assume, in the running example, a call to the algorithm for the caching of operation T5, with i=5, from the initial state, s. Recall that the cache cost for transformation T5 is 1.57; and that the total cost of T5 in the initial state is 16.4. The applied cache cost for T5 in the resulting state s′ is therefore 1.57+16.4=17.97. After this step, state s′ is as follows, with the change in the applied cache cost highlighted with boldface text:
The exemplary GenCachedState process 500 then updates the number of executions and the total costs for other operations in that state (lines 3-6). This process starts by assigning to variable d_execs the number of executions of the transformation to be saved by the caching (line 3). In the running example of the caching of T5, d_execs=2.
The state s′, generated by the exemplary GenCachedState process 500, is then passed (in line 4) as part of the arguments to an auxiliary algorithm, the PropagateDiscountExecs process 530, as discussed further below in conjunction with
Similarly, a difference of the total execution cost for the cached transformation is taken as variable d_cost (line 5) and passed (in line 6) to a similar auxiliary algorithm, the PropagateDiscountCost process 560, as discussed further below in conjunction with
As shown in
Next, the algorithm 530 updates s′ with the result of a recursive call (line 4) for each operation preceding i (line 2) which is not cached (line 3). The resulting state from the chain of recursive calls in the running example is represented below:
The executions of operations T2 (which precedes T5) and T1 (which precedes T2) are updated.
This semantically represents that, for the purposes of calculating the number of executions of operations, T5 is now a leaf node. Recall the execution computations of T1 in
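The cache placement and propagation steps described in this section can be sketched as below. The pseudocode of the GenCachedState, PropagateDiscountExecs and PropagateDiscountCost processes is not reproduced here, so the bodies of these functions are one plausible reading rather than the disclosed algorithms. In particular, it is assumed that d_execs is the current number of executions minus one, that d_cost is the current total cost of the cached operation, and that cost discounts propagate forward through followers that are not cached. All names, the example graph and its values are hypothetical, and indexes are 0-based.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    p: Tuple[int, ...]  # indexes of preceding operations
    f: Tuple[int, ...]  # indexes of following operations
    cc: float           # cost_cache


class State(NamedTuple):
    Lt: List[float]  # assumed: current total costs
    Le: List[int]    # assumed: current numbers of executions
    Lc: List[float]  # applied caching costs (0 means "not cached")


def propagate_discount_execs(ops, state, i, d_execs):
    """Discount executions of i, then of each uncached operation preceding it."""
    state.Le[i] -= d_execs
    for p in ops[i].p:
        if state.Lc[p] == 0:
            propagate_discount_execs(ops, state, p, d_execs)


def propagate_discount_cost(ops, state, i, d_cost):
    """Discount the total cost of each follower of i (assumed forward propagation)."""
    for f in ops[i].f:
        state.Lt[f] -= d_cost
        if state.Lc[f] == 0:
            propagate_discount_cost(ops, state, f, d_cost)


def gen_cached_state(ops, state, i):
    """Return a new state in which the output of operation i is cached."""
    new = State(list(state.Lt), list(state.Le), list(state.Lc))
    new.Lc[i] = ops[i].cc + state.Lt[i]             # applied cache cost
    d_execs = state.Le[i] - 1                       # executions saved by the cache
    propagate_discount_execs(ops, new, i, d_execs)
    d_cost = state.Lt[i]                            # assumed cost saved per use
    propagate_discount_cost(ops, new, i, d_cost)
    return new


def state_cost(state, k):
    """Total cost of the actions (indexes k..n-1) plus all applied cache costs."""
    return sum(state.Lt[k:]) + sum(state.Lc)


# Hypothetical dataflow: T1, T2, T3 at indexes 0-2; actions A1, A2 at indexes 3-4.
OPS = [
    Op(p=(), f=(1, 2), cc=0.5),
    Op(p=(0,), f=(2, 3), cc=0.5),
    Op(p=(0, 1), f=(4,), cc=0.5),
    Op(p=(1,), f=(), cc=0.0),
    Op(p=(2,), f=(), cc=0.0),
]
INITIAL = State(Lt=[2.0, 5.0, 8.0, 5.5, 8.5], Le=[3, 2, 1, 1, 1], Lc=[0.0] * 5)

after_t2 = gen_cached_state(OPS, INITIAL, 1)
print(after_t2, state_cost(after_t2, 3))
```

In this small example, the drop in state cost after caching one transformation from the initial state equals its cache gain factor, (num_executions−1)*cost_total−cost_cache, which is a useful sanity check on the assumed propagation rules.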
D. Search for Multiple Cache Placements
The exemplary LocalBeamCacheSearch process 700 exemplifies the concept of a beam search, in which the search space is pruned to a small number of promising candidate states. See, for example, P. Norvig, “Paradigms of Artificial Intelligence Programming: Case studies in Common LISP,” (Morgan Kaufmann Publishers, Inc.; 1992), incorporated by reference herein in its entirety.
The terminology adopted is as follows. The LocalBeamCacheSearch process 700 starts with an initial state s in the open states list, O. Each state in the open list, O, is expanded, generating a set of new candidate states, which are themselves later expanded in a search-like approach. The LocalBeamCacheSearch process 700 will return the substantially best state among all generated candidates; i.e., the state with the substantially lowest cost (as defined in the section entitled “Dataflow State”).
All expanded states are put into the closed states list, C, used to make sure no states are expanded twice. The search space is pruned by discarding some of the candidate states generated from every expanded state—only a limited number of the best new candidate states that are not yet closed are put into the open list; and also by limiting the maximum number of states in the open states list at any time.
The global substantially best state, best, is initially the initial state, s (line 1). Every expanded state is checked against the global substantially best (line 7), and becomes the global substantially best (line 8) if the expanded state results in a lower total cost for the dataflow. When no more states remain in the open list, the global substantially best is returned as the solution (line 13).
A more formal description follows. The exemplary LocalBeamCacheSearch process 700 works by iteratively generating new candidate states in a list of open states O, adding them to the closed states list C, and holding the best global state found in variable best. The LocalBeamCacheSearch process 700 receives as argument the dataflow , the initial state s and two parameters that prune the search space. The first parameter is beam, a maximum number of candidate states to be generated from any single state. The second parameter is limit, which defines the maximum number of states to be kept in the open states list.
The exemplary LocalBeamCacheSearch process 700 proceeds as follows. The best state is initialized as the initial state s (line 1). The open list O contains s initially (line 2) and the closed list is empty (line 3).
The loop (lines 4-12) forms the main part of the exemplary LocalBeamCacheSearch process 700. While there are still candidates to be considered in the open list O, the first entry is removed and stored in variable s′ (line 5). This is done through a call to an auxiliary algorithm first, which removes and returns the first element of a list structure.
This expanded state s′ is then appended to the closed list (line 6). This means that no equivalent state will be further expanded as the generation of new candidates disregards states that are already closed, as discussed further below in the description of a NewCandidates process 800 in conjunction with
Next, the expanded state s′ is compared to the best state. If the total dataflow execution cost in the current state is lower than the total dataflow cost in the best state, the best state so far is updated to reflect that the current state is the new best (lines 7 and 8). The total dataflow cost in a state is computed through an auxiliary algorithm StateCost, a straightforward implementation of the computation for the state cost described above.
A call to NewCandidates (line 9) returns the best possible states following s′, up to a maximum of beam states. These candidate states are substantially guaranteed to not be closed states (i.e., previously expanded). Further details are given in the description of the NewCandidates process 800 in conjunction with
The new candidates are concatenated with the open states list (line 10), and the open states list is sorted by ascending total dataflow execution cost (line 11). This is done through a call to the auxiliary method SortByCost, which reorders the list so that the states representing the best cache placements (those with the lowest cost) are placed first.
Finally, the loop ends with the remaining open states list pruned to a maximum of limit entries. When the loop terminates, the best state found, i.e., the state with the lowest total dataflow cost, is returned. The operations that are to be cached can be identified by the non-zero applied caching cost values in best.Lc.
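The search loop described above can be sketched generically in Python. This is an illustrative sketch, not the disclosed pseudocode: the function local_beam_cache_search takes the candidate generator and the cost function as hypothetical callables (expand and cost), states are modeled as frozensets of cached transformation names, and the cost table is invented for the example. Skipping candidates that are already in the open list is an addition made here to avoid duplicate entries.

```python
def local_beam_cache_search(initial, expand, cost, beam, limit):
    """Beam-style search for the state with the lowest cost."""
    best = initial
    open_states = [initial]
    closed = []
    while open_states:
        current = open_states.pop(0)        # remove and return the first open state
        closed.append(current)              # never expand this state again
        if cost(current) < cost(best):
            best = current
        # Keep at most `beam` of the best new candidates that are not closed
        # (and, as an addition in this sketch, not already open).
        fresh = [s for s in expand(current) if s not in closed and s not in open_states]
        open_states.extend(sorted(fresh, key=cost)[:beam])
        open_states.sort(key=cost)          # lowest cost (best placements) first
        del open_states[limit:]             # prune the open list to `limit` entries
    return best


# Hypothetical state costs for a dataflow with two cacheable transformations.
COSTS = {
    frozenset(): 14.0,
    frozenset({"T1"}): 10.5,
    frozenset({"T2"}): 9.5,
    frozenset({"T1", "T2"}): 8.0,
}


def expand(state):
    """Candidate states: add one more cache, only if it lowers the cost."""
    return [
        state | {t}
        for t in ("T1", "T2")
        if t not in state and COSTS[state | {t}] < COSTS[state]
    ]


result = local_beam_cache_search(frozenset(), expand, COSTS.__getitem__, beam=2, limit=2)
print(sorted(result), COSTS[result])
```

Smaller values of beam and limit prune the search more aggressively, trading solution quality for fewer expanded states.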
As shown in
Next, the exemplary NewCandidates process 800 collects a list I of the indexes of the operations whose cache gains are positive (line 2). This list comprises the possible caches that are estimated to result in a net decrease in the total dataflow cost.
The list of new candidates N is initialized as empty (line 3) and filled in the loop (lines 4-7). For each possible cache, a state s′ representing that caching is generated (line 5). This is done through a call to the GenCachedState process 500 of
Many service providers and large industries design their business processes as workflows. Many activities of those processes are completely automated as a set of scripts and computer programs, including domain-specific algorithms developed for years. For instance, in the Oil and Gas industry, seismic processing workflows are key in the exploration process to discover new reservoirs. The data-intensive nature of these workflows makes them natural candidates to run on modern dataflow engines such as the Spark™ or Flink™ dataflow engines referenced above.
However, these workflows are complex and incorporate much of the technical domain knowledge, which makes them hard for the general user to optimize manually. The choice of parameters and the input data might influence the behavior of the execution in such a way that it is hard for the user to predict how output data will actually be produced. For instance, many filters can be applied to seismic data to reveal geological faults, and the choice of filters will affect the size of the intermediate data and of the final seismic cube.
In the present disclosure, a method is disclosed that can automatically define the optimal cache placement in such dataflows, as long as accurate cost models for the execution of the operations are provided. The method frees the user from the task of explicitly defining when and where these cache operations should take place in the dataflow, a very time consuming and costly task, in particular when the features of the datasets vary significantly.
Furthermore, this task is error-prone and not trivial even for experienced users. On the one hand, a poor decision of cache placement can actually hinder dataflow performance, incurring higher costs of execution. On the other hand, an optimized dataflow, yielded by the method described in this disclosure, means significant savings in resource allocation and execution time.
One or more embodiments of the disclosure provide methods and apparatus for automatic placement of cache operations. In one or more embodiments, a formal model is employed to define a representation of dataflows for the cache placement problem. The disclosed model defines the representation of operation ordering dependencies and properties. Among other benefits, the disclosed model allows evaluation of cache placement alternatives, including the evaluation of the properties and parameters needed to evaluate the datasets that should be cached.
In some embodiments, an exemplary method estimates the impact on the dataflow performance when caching datasets. The impact is measured using the disclosed formal dataflow model and a set of disclosed estimation functions. The formal foundations substantially define the elements that should be considered when evaluating the cache placement. These estimation functions provide information on the number of executions, the costs of executing and the cost of caching the resulting dataset of an operation.
In this manner, one or more embodiments of the disclosure automatically define multiple cache placements that provide a substantially global optimization of the dataflow. The exemplary method takes into account the fact that caching one operation affects the cost of executing the operations that precede it and follow it in the dataflow.
At least one embodiment substantially guarantees that only the costs of operations affected by a cache placement are recalculated, without the need for the recomputation of an entire dataflow graph. The disclosed exemplary model represents a dataflow cache placement state detached from the dataflow graph, which substantially reduces the size of the state in memory and allows the exemplary method to keep many such states as candidates at any given time. Both of these optional aspects support and enable the tractability of larger problems.
The dataflows often require an extensive use of memory as an intermediate space for storing data between dataflow operations. In particular, data-intensive dataflows that are I/O-bound need a substantially optimized use of memory in order to avoid swapping operations. In one or more embodiments of the disclosure, the automatic evaluation of cache placement allows the memory space set for cache placement to be substantially optimized.
The foregoing applications and associated embodiments should be considered as illustrative only, and numerous other embodiments can be configured using the techniques disclosed herein, in a wide variety of different applications.
It should also be understood that the disclosed techniques for automatic placement of cache operations, as described herein, can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device such as a computer. As mentioned previously, a memory or other storage device having such program code embodied therein is an example of what is more generally referred to herein as a “computer program product.”
The disclosed techniques for automatic placement of cache operations may be implemented using one or more processing platforms. One or more of the processing modules or other components may therefore each run on a computer, storage device or other processing platform element. A given such element may be viewed as an example of what is more generally referred to herein as a “processing device.”
As noted above, illustrative embodiments disclosed herein can provide a number of significant advantages relative to conventional arrangements. It is to be appreciated that the particular advantages described above and elsewhere herein are associated with particular illustrative embodiments and need not be present in other embodiments. Also, the particular types of information processing system features and functionality as illustrated and described herein are exemplary only, and numerous other arrangements may be used in other embodiments.
In these and other embodiments, compute services can be offered to cloud infrastructure tenants or other system users as a PaaS offering, although numerous alternative arrangements are possible.
Some illustrative embodiments of a processing platform that may be used to implement at least a portion of an information processing system comprise cloud infrastructure including virtual machines implemented using a hypervisor that runs on physical infrastructure. The cloud infrastructure further comprises sets of applications running on respective ones of the virtual machines under the control of the hypervisor. It is also possible to use multiple hypervisors each providing a set of virtual machines using at least one underlying physical machine. Different sets of virtual machines provided by one or more hypervisors may be utilized in configuring multiple instances of various components of the system.
These and other types of cloud infrastructure can be used to provide what is also referred to herein as a multi-tenant environment. One or more system components such as an automatic cache operation placement device, or portions thereof, are illustratively implemented for use by tenants of such a multi-tenant environment.
Cloud infrastructure as disclosed herein can include cloud-based systems such as Amazon Web Services, GCP and Microsoft Azure™. Virtual machines provided in such systems can be used to implement at least portions of an automatic cache operation placement platform in illustrative embodiments. The cloud-based systems can include object stores such as Amazon™ S3, GCP Cloud Storage, and Microsoft Azure™ Blob Storage.
In some embodiments, the cloud infrastructure additionally or alternatively comprises a plurality of containers implemented using container host devices. For example, a given container of cloud infrastructure illustratively comprises a Docker container or other type of LXC. The containers may run on virtual machines in a multi-tenant environment, although other arrangements are possible. The containers may be utilized to implement a variety of different types of functionality within the automatic cache placement devices. For example, containers can be used to implement respective processing devices providing compute services of a cloud-based system. Again, containers may be used in combination with other virtualization infrastructure such as virtual machines implemented using a hypervisor.
Illustrative embodiments of processing platforms will now be described in greater detail with reference to
Referring now to
The cloud infrastructure 900 may encompass the entire given system or only portions of that given system, such as one or more of clients, servers, controllers, or computing devices in the system.
Although only a single hypervisor 904 is shown in the embodiment of
An example of a commercially available hypervisor platform that may be used to implement hypervisor 904 and possibly other portions of the system in one or more embodiments of the disclosure is the VMware® vSphere™ which may have an associated virtual infrastructure management system, such as the VMware® vCenter™. As another example, portions of a given processing platform in some embodiments can comprise converged infrastructure such as VxRail™, VxRack™, VxBlock™, or Vblock® converged infrastructure commercially available from VCE, the Virtual Computing Environment Company, now the Converged Platform and Solutions Division of Dell EMC of Hopkinton, Mass. The underlying physical machines may comprise one or more distributed processing platforms that include storage products, such as VNX™ and Symmetrix VMAX™, both commercially available from Dell EMC. A variety of other storage products may be utilized to implement at least a portion of the system.
In some embodiments, the cloud infrastructure additionally or alternatively comprises a plurality of containers implemented using container host devices. For example, a given container of cloud infrastructure illustratively comprises a Docker container or other type of LXC. The containers may be associated with respective tenants of a multi-tenant environment of the system, although in other embodiments a given tenant can have multiple containers. The containers may be utilized to implement a variety of different types of functionality within the system. For example, containers can be used to implement respective compute nodes or cloud storage nodes of a cloud computing and storage system. The compute nodes or storage nodes may be associated with respective cloud tenants of a multi-tenant environment of system. Containers may be used in combination with other virtualization infrastructure such as virtual machines implemented using a hypervisor.
As is apparent from the above, one or more of the processing modules or other components of the disclosed automatic cache placement systems may each run on a computer, server, storage device or other processing platform element. A given such element may be viewed as an example of what is more generally referred to herein as a “processing device.” The cloud infrastructure 900 shown in
Another example of a processing platform is processing platform 1000 shown in
The processing device 1002-1 in the processing platform 1000 comprises a processor 1010 coupled to a memory 1012. The processor 1010 may comprise a microprocessor, a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other type of processing circuitry, as well as portions or combinations of such circuitry elements. The memory 1012 may be viewed as an example of “processor-readable storage media” storing executable program code of one or more software programs.
Articles of manufacture comprising such processor-readable storage media are considered illustrative embodiments. A given such article of manufacture may comprise, for example, a storage array, a storage disk or an integrated circuit containing RAM, ROM or other electronic memory, or any of a wide variety of other types of computer program products. The term “article of manufacture” as used herein should be understood to exclude transitory, propagating signals. Numerous other types of computer program products comprising processor-readable storage media can be used.
Also included in the processing device 1002-1 is network interface circuitry 1014, which is used to interface the processing device with the network 1004 and other system components, and may comprise conventional transceivers.
The other processing devices 1002 of the processing platform 1000 are assumed to be configured in a manner similar to that shown for processing device 1002-1 in the figure.
Again, the particular processing platform 1000 shown in the figure is presented by way of example only, and the given system may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, storage devices or other processing devices.
Multiple elements of system may be collectively implemented on a common processing platform of the type shown in
For example, other processing platforms used to implement illustrative embodiments can comprise different types of virtualization infrastructure, in place of or in addition to virtualization infrastructure comprising virtual machines. Such virtualization infrastructure illustratively includes container-based virtualization infrastructure configured to provide Docker containers or other types of LXCs.
As another example, portions of a given processing platform in some embodiments can comprise converged infrastructure such as VxRail™, VxRack™, VxBlock™, or Vblock® converged infrastructure commercially available from VCE, the Virtual Computing Environment Company, now the Converged Platform and Solutions Division of Dell EMC.
It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. At least a subset of these elements may be collectively implemented on a common processing platform, or each such element may be implemented on a separate processing platform.
Also, numerous other arrangements of computers, servers, storage devices or other components are possible in the information processing system. Such components can communicate with other elements of the information processing system over any type of network or other communication media.
As indicated previously, components of an information processing system as disclosed herein can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device. For example, at least portions of the functionality of pseudo code shown in
It should again be emphasized that the above-described embodiments are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the disclosed techniques are applicable to a wide variety of other types of information processing systems, compute services platforms, and automatic cache placement platforms. Also, the particular configurations of system and device elements and associated processing operations illustratively shown in the drawings can be varied in other embodiments. Moreover, the various assumptions made above in the course of describing the illustrative embodiments should also be viewed as exemplary rather than as requirements or limitations of the disclosure. Numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.
Gottin, Vinicius Michel; Condori, Edward José Pacheco; Dias, Jonas F.; Ciarlini, Angelo E. M.; Porto, Fábio André Machado; Souto, Yania Molina; Pires, Paulo de Figueiredo; da Cunha Costa, Bruno Carlos; Vieira, Wagner dos Santos