In general, this disclosure describes techniques for applying a distributed pipeline model in a distributed computing system to cause processing nodes of the distributed computing system to process data according to a distributed pipeline having an execution topology, specified within a pipeline statement, to perform a task.
|
18. A method comprising:
receiving, by a compiler device in communication with a plurality of servers connected by a network, a command comprising syntax elements defining a plurality of stages, wherein each of the stages comprises a plurality of operations to be processed by corresponding, different processes executing on any one of the plurality of servers, the command further comprising syntax elements defining a pipeline topology for a distributed pipeline, the pipeline topology including an intra-server branch for the plurality of operations of each stage of the plurality of stages and an inter-server branch that binds operations of different stages of the plurality of stages, and wherein a particular syntax element identifies a second stage of the plurality of stages as a recipient for result data of a first stage of the plurality of stages;
distributing, by the compiler device to each of the plurality of servers, configuration data generated from the command to implement the distributed pipeline;
configuring, in each server of the plurality of servers and based on the configuration data, a first sub-pipeline for a first stage of the distributed pipeline, wherein the first sub-pipeline configured in the server binds, using one or more input/output channels, the plurality of operations of the first stage of the plurality of stages, the bound operations of the first stage to be processed by corresponding, different processes executing on the server, and wherein the particular syntax element causes each server of the plurality of servers to output result data of the first sub-pipeline configured in the server to a particular server of the plurality of servers that is configured with a second sub-pipeline of the distributed pipeline; and
configuring, in the particular server of the plurality of servers and based on the configuration data, a second sub-pipeline for a second stage of the distributed pipeline, wherein the second sub-pipeline configured in the particular server binds the plurality of operations of the second stage of the plurality of stages to the plurality of operations of the first stage configured in each server of the plurality of servers.
19. A method comprising: obtaining, by a plurality of computing devices connected by a network, each of the plurality of computing devices representing or executing one or more of a plurality of processing nodes, a pipeline statement defining a distributed pipeline, the pipeline statement comprising a first stage statement (1) defining a first stage to include a first sub-pipeline of the distributed pipeline, the first sub-pipeline comprising an ordered list of a first plurality of operations, each of the first plurality of operations to be performed by a corresponding, different process of a first processing node of the plurality of processing nodes and (2) including a syntax element identifying a second stage as a recipient for result data of the first sub-pipeline of the first stage, the pipeline statement also comprising a second stage statement defining the second stage to include a second sub-pipeline of the distributed pipeline, the second sub-pipeline comprising an ordered list of a second plurality of operations, each of the second plurality of operations to be performed by a corresponding, different process of a second processing node of the plurality of processing nodes,
wherein a first computing device of the plurality of computing devices comprises the first processing node, and
wherein a second computing device of the plurality of computing devices comprises the second processing node, the second computing device different from the first computing device;
configuring, based at least on a first pipeline setup specification generated from the pipeline statement, the first sub-pipeline in the first processing node by instantiating respective, different first processes for the first plurality of operations defined by the first stage statement to perform the first plurality of operations, wherein configuring the first sub-pipeline comprises binding at least one pair of the first processes using one or more input/output channels;
configuring, based on the syntax element, the first processing node to send result data of the first sub-pipeline of the first stage to the second sub-pipeline of the second stage; and
configuring, based at least on a second pipeline setup specification generated from the pipeline statement, the second sub-pipeline in the second processing node by instantiating respective, different second processes for the second plurality of operations defined by the second stage statement to perform the second plurality of operations and to input, to the second sub-pipeline in the second processing node, the result data of the first sub-pipeline of the first stage.
1. A distributed computing system comprising:
a plurality of computing devices connected by a packet-based network, each of the plurality of computing devices representing or executing one or more of a plurality of processing nodes, the plurality of computing devices configured to obtain a pipeline statement defining a distributed pipeline, the pipeline statement comprising a first stage statement (1) defining a first stage to include a first sub-pipeline of the distributed pipeline, the first sub-pipeline comprising an ordered list of a first plurality of operations, each of the first plurality of operations to be performed by a corresponding, different process of a first processing node of the plurality of processing nodes and (2) including a syntax element identifying a second stage as a recipient for result data of the first sub-pipeline of the first stage, the pipeline statement also comprising a second stage statement defining the second stage to include a second sub-pipeline of the distributed pipeline, the second sub-pipeline comprising an ordered list of a second plurality of operations, each of the second plurality of operations to be performed by a corresponding, different process of a second processing node of the plurality of processing nodes,
wherein a first computing device of the plurality of computing devices comprises the first processing node, the first processing node comprising processing circuitry and configured to, based at least on a first pipeline setup specification generated by processing the pipeline statement, configure the first sub-pipeline in the first processing node by instantiating respective, different first processes for the first plurality of operations defined by the first stage statement to perform the first plurality of operations,
wherein to configure the first sub-pipeline, the first processing node is configured to bind at least one pair of the first processes using one or more input/output channels,
wherein the first processing node is further configured to, based on the syntax element, send result data of the first sub-pipeline of the first stage to the second sub-pipeline of the second stage, and
wherein the second computing device comprises the second processing node, the second computing device different from the first computing device, the second processing node comprising processing circuitry and configured to:
based at least on a second pipeline setup specification generated by processing the pipeline statement, configure the second sub-pipeline in the second processing node by instantiating respective, different second processes for the second plurality of operations defined by the second stage statement to perform the second plurality of operations; and
input, to the second sub-pipeline in the second processing node, the result data of the first sub-pipeline of the first stage.
2. The distributed computing system of
wherein the second sub-pipeline in the second processing node comprises an operating system pipe to bind together the second plurality of operations.
3. The distributed computing system of
a compiler computing device configured to process the pipeline statement to generate the first pipeline setup specification and the second pipeline setup specification.
4. The distributed computing system of
5. The distributed computing system of
6. The distributed computing system of
wherein the second stage statement includes a stage identifier for the second stage, and wherein the syntax element identifying the second stage as a recipient for result data of the first sub-pipeline of the first stage comprises the stage identifier for the second stage as an operation of the first sub-pipeline, and wherein the distributed computing system is configured to, based on the syntax element, generate the first pipeline setup specification to configure the first sub-pipeline in the first processing node to send result data of the first sub-pipeline in the first processing node to the second processing node.
7. The distributed computing system of
wherein the plurality of computing devices comprises a compiler computing device configured to obtain the pipeline statement and to process the pipeline statement to generate the first pipeline setup specification and the second pipeline setup specification.
8. The distributed computing system of
9. The distributed computing system of
wherein a final process of the first processes comprises a send operator to send, via a communication channel, the result data for the first sub-pipeline to the second sub-pipeline of the second stage.
10. The distributed computing system of
a third processing node configured to, based at least on a third pipeline setup specification generated from the pipeline statement, configure the first sub-pipeline in the third processing node to send result data of the first sub-pipeline in the third processing node to the second processing node.
11. The distributed computing system of
wherein the second processing node is configured to, based at least on the syntax element and the second pipeline setup specification generated by processing the pipeline statement, configure the first sub-pipeline in the second processing node to provide result data of the first sub-pipeline in the second processing node as input to the second sub-pipeline in the second processing node.
12. The distributed computing system of
wherein the first stage statement defines the first stage to specify a first one or more processing nodes to perform the first sub-pipeline, the first one or more processing nodes including the first processing node, and
wherein the distributed computing system is configured to, based on the first stage statement specifying a first one or more processing nodes, generate the first pipeline setup specification for the first processing node to configure the first sub-pipeline in the first processing node.
13. The distributed computing system of
wherein the first stage statement includes one or more syntax elements that specify the first one or more processing nodes to perform the first sub-pipeline, and
wherein the distributed computing system is configured to, based on the one or more syntax elements, generate a pipeline setup specification for configuring the first sub-pipeline in each of the first one or more processing nodes.
14. The distributed computing system of
wherein the second stage statement defines the second stage to specify a second one or more processing nodes to perform the second sub-pipeline, the second one or more processing nodes including the second processing node, and
wherein the distributed computing system is configured to, based on the second stage statement, generate the second pipeline setup specification for the second processing node to configure the second sub-pipeline in the second processing node.
15. The distributed computing system of
16. The distributed computing system of
17. The distributed computing system of
wherein the first processing node is configured to execute the first sub-pipeline in the first processing node to generate first result data and to send the first result data to the second processing node via a communication channel,
wherein the second processing node is configured to execute the second sub-pipeline in the second processing node to process the first result data to generate second result data and to output the second result data.
|
This application claims the benefit of U.S. Provisional Application No. 63/049,920, filed Jul. 9, 2020, which is incorporated by reference herein in its entirety.
The disclosure relates to a computing system, and in particular, to distributed processing within a computing system.
Nodes executing on computing devices may be interconnected to form a networked distributed computing system to exchange data and share resources. In some examples, a plurality of nodes executing on computing devices are interconnected to collectively execute one or more applications to perform a job. Nodes may include bare metal servers, virtual machines, containers, processes, and/or other execution elements having data processing capabilities for the distributed computing system. Each of the nodes may individually perform various operations for the distributed computing system, such as to collect, process, and export data, and the nodes may communicate with one another to distribute processed data.
In general, this disclosure describes techniques for applying a distributed pipeline model in a distributed computing system to cause processing nodes of the distributed computing system to process data according to a distributed pipeline having an execution topology, specified within a pipeline statement, to perform a task. For example, a computing device may receive a pipeline statement for a task. The pipeline statement includes a plurality of stage statements, each of the stage statements describing a corresponding stage that is to execute a set of one or more operations. One or more of the stage statements also specify topology information for the corresponding stages. For example, a stage statement may specify that the corresponding stage includes a sub-pipeline to be performed by a specified one or more processing nodes of the distributed computing system. The stage statement may specify that the specified one or more processing nodes are to execute one or more operations of the sub-pipeline. In some cases, the stage statement also specifies a next stage that is to receive the output of the corresponding stage for the stage statement. In some cases, the pipeline statement is human-readable text to allow an operator to easily arrange the operations within stages and to arrange the stages within the pipeline statement to specify a distributed pipeline having an overall execution topology for the task being executed by the distributed computing system. The computing device that receives a pipeline statement may be or may execute one of the processing nodes of the distributed computing system. In such cases, this processing node may be referred to as the origin.
The computing device processes the pipeline statement to cause processing nodes of the distributed computing system to process data according to the execution topology specified within the pipeline statement. For example, the computing device may output individual commands, at least some of the pipeline statement, stage information, configuration data, or other control information to the processing nodes to cause the nodes to process data according to the execution topology. Processing nodes thereafter begin executing operators for the operations within a stage and bind the operators together using input/output channels, such as standard streams (stdin, stdout) or queues. Operators may be processes, for instance, and operators bound within a stage may be referred to as a sub-pipeline. If a stage statement specifies a next stage as the final operation for the corresponding stage, one or more processing nodes that perform the corresponding stage may bind the final operation of the stage to one or more processing nodes configured to perform the next stage. Binding operations for multiple stages between processing nodes may include creating a communication channel between the processing nodes that operates over a network, such as a socket. The inter-node bindings from a stage to the next stage may cause the data output by the final operation of the stage to fan-out (i.e., output by one processing node and received by multiple processing nodes) or to fan-in (i.e., output by multiple processing nodes and received by one processing node) to the next stage.
In one example, a pipeline statement may include a first stage specifying one or more operations to be processed by each of a plurality of processing nodes and a second stage specifying one or more operations to be processed by a fan-in node of the processing nodes, such as the origin. The plurality of processing nodes may generate a first sub-pipeline for the first stage that binds operators of the first stage to process the data to perform the corresponding operations and then send the results of the operators to the second stage. As one example, each of the processing nodes may orchestrate queues and interfaces for its local operators and generate a send operation to send the results of its local operators to the fan-in node. The fan-in node may generate a second sub-pipeline for the second stage that binds operators of the second stage and receives the results of the operations of the first stage output by the plurality of processing nodes. As one example, the fan-in node may orchestrate queues and interfaces for its local operators to first receive the results of the operations of the first stage and also to execute its local operators to perform the operations of the second stage.
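The fan-in arrangement described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: threads on one machine stand in for the plurality of processing nodes, a loopback TCP socket stands in for the network communication channel, and the names `first_stage`, `send_results`, and `fan_in`, as well as the even-number filter, are hypothetical.

```python
import socket
import threading


def first_stage(partition):
    # Hypothetical first-stage sub-pipeline: two chained operators
    # (filter even values, then format one value per line).
    evens = (x for x in partition if x % 2 == 0)
    return "\n".join(str(x) for x in evens)


def send_results(partition, address):
    # Hypothetical "send" operation: the final operation of the first stage
    # forwards the local result data to the fan-in node.
    with socket.create_connection(address) as conn:
        conn.sendall(first_stage(partition).encode())


def fan_in(partitions):
    # The fan-in node listens on an ephemeral loopback port (an assumption
    # made so the sketch runs on a single machine).
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(len(partitions))
    address = server.getsockname()

    # One sender thread per simulated processing node.
    senders = [
        threading.Thread(target=send_results, args=(p, address))
        for p in partitions
    ]
    for t in senders:
        t.start()

    # Receive the first-stage results from every sender.
    received = []
    for _ in partitions:
        conn, _addr = server.accept()
        with conn:
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                chunks.append(data)
        received.extend(b"".join(chunks).decode().split())

    for t in senders:
        t.join()
    server.close()

    # Hypothetical second-stage sub-pipeline: parse and sort the merged data.
    return sorted(int(x) for x in received)


if __name__ == "__main__":
    print(fan_in([[1, 2, 3, 4], [5, 6], [7, 8, 10]]))
```

In this sketch each sender closes its connection after sending, which is how the fan-in node detects the end of that sender's result data; a real system might instead frame messages or keep channels open.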
The techniques described herein may provide one or more technical advantages that provide at least one practical application. For example, the techniques may extend the conventional single-device Unix-style pipeline model to realize a distributed pipeline model in which processing nodes of a distributed computing system set up sub-pipelines, at least partially independently, for an overall distributed pipeline and process data according to an execution topology, specified within a pipeline statement, to perform a task. By automatically generating a distributed pipeline to bind operations to be processed according to a topology, the techniques may allow a user to specify, in the form of a comparatively simple pipeline statement, all operations and a topology for a distributed pipeline, thereby avoiding the often complex and cumbersome task of configuring each of the nodes of the distributed computing system with not only the operations to be performed, but also the inter-node communication channels for the topology among the nodes. The specified topology may exploit parallelism both within and among processing nodes of the distributed computing system to at least partially concurrently execute operations of a sub-pipeline defined for a stage in the pipeline statement.
In some examples, a distributed computing system comprises a plurality of computing devices configured to obtain a pipeline statement, the pipeline statement comprising a first stage statement defining a first stage to include a first sub-pipeline and specifying a second stage as a recipient for result data of the first sub-pipeline, the pipeline statement also comprising a second stage statement defining the second stage to include a second sub-pipeline, wherein the plurality of computing devices comprises a first processing node configured to, based at least on a first pipeline setup specification generated by processing the pipeline statement, configure the first sub-pipeline in the first processing node to send result data of the first sub-pipeline in the first processing node to a second processing node, and wherein the plurality of computing devices comprises the second processing node configured to, based at least on a second pipeline setup specification generated by processing the pipeline statement, configure the second sub-pipeline in the second processing node and to input, to the second sub-pipeline in the second processing node, the result data of the first sub-pipeline in the first processing node.
In some examples, a distributed computing system comprises a plurality of devices configured to execute respective collaborative programs to: receive a pipeline statement, wherein the pipeline statement comprises a plurality of stages, wherein each of the stages comprises one or more operations to be processed by one or more of the plurality of devices, and wherein the one or more operations of the stages are to be processed in different topologies, and generate a sub-pipeline of a distributed pipeline to bind the one or more operations of the stages and output result data from a final stage of the stages.
In some examples, a method comprises receiving, by a device of a plurality of devices connected by a network, a command comprising a plurality of stages, wherein each of the stages comprises one or more operations to be processed by one or more of the plurality of devices, and wherein the one or more operations of the stages are to be processed in different topologies; distributing, by the device and to other devices of the plurality of devices, the command such that the other devices of the plurality of devices each generate a first sub-pipeline of a distributed pipeline, wherein the first sub-pipeline binds the one or more operations processed in a first topology of the different topologies; and generating, by the device, a second sub-pipeline of the distributed pipeline, wherein the second sub-pipeline binds the one or more operations processed in a second topology of the different topologies.
In some examples, a method comprises obtaining, by a distributed computing system comprising a first processing node and a second processing node, a pipeline statement comprising a first stage statement defining a first stage to include a first sub-pipeline and specifying a second stage as a recipient for result data of the first sub-pipeline, the pipeline statement also comprising a second stage statement defining the second stage to include a second sub-pipeline; configuring, based at least on a first pipeline setup specification generated from the pipeline statement, the first sub-pipeline in the first processing node to send result data of the first sub-pipeline in the first processing node to a second processing node; and configuring, based at least on a second pipeline setup specification generated from the pipeline statement, the second sub-pipeline in the second processing node to input, to the second sub-pipeline in the second processing node, the result data of the first sub-pipeline in the first processing node.
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Like reference characters refer to like elements throughout the figures and text.
In the example of
Each of devices 6 is a real or virtual processing node and executes or represents at least one processing node 8 within the distributed computing system (sometimes referred to more simply as “distributed system”). Each of devices 6 may be, for example, a workstation, laptop computer, tablet, rack-mounted server, other real or virtual server, container, or virtual machine. In some cases, any of devices 6 may represent “Internet-of-Things” (IoT) devices, such as cameras, sensors, televisions, appliances, etc. As one example implementation, devices 6 may represent devices of a smart-city grid, where each device is configured to collect and/or process data. In this example, devices 6B-6N may represent sensor devices (e.g., smart meters) to collect and/or process data, and device 6A may represent a device including sensor devices and a Central Processing Unit (“CPU”) to append the data, process the data, and/or export the data. In some examples of distributed computing system 2, devices 6 may be heterogeneous in that devices 6 may represent devices having different hardware and/or software. For instance, device 6A may be a desktop workstation, while device 6B may be a rack-mounted server in a cloud computing center, while device 6C may be a wireless sensor, and so forth. Hardware environments of devices 6 may also be heterogeneous. For example, devices 6 may include any one or more of CPUs with varying numbers of processing cores, ASICs, field-programmable devices, Graphics Processing Units (GPUs), or other processing circuitry.
In this example, devices 6A-6N include processing nodes 8A-8N (collectively, “processing nodes 8”), respectively. Processing nodes 8 execute programs 9A-9N (collectively, “programs 9”), respectively, to process commands. Commands may represent partial execution steps/processing for performing an overall task. Commands are included in pipeline statement 20 and may in some cases be text-based. A task for the distributed computing system 2 may include one or more operations corresponding to commands, such as retrieving data, processing data, exporting data, or the like. A command may refer to a particular one of programs 9 executed by each of processing nodes 8. In some examples, each of processing nodes 8 represents an in-memory storage engine that manages data stored in a shared memory segment or partition that can be accessed directly by the storage engine. A single device 6 may in some cases execute multiple instances of processing nodes 8, e.g., as multiple virtual machines. However, in the case of a bare metal server, for instance, one of devices 6 may be one of processing nodes 8 in that the device executes programs 9 directly on hardware.
Programs 9 may include a command-line interface (CLI), graphical user interface (GUI), or other interface to receive and process text-based commands, such as a shell. Programs 9 may alternatively or additionally include a suite of different processes, applications, scripts, or other executables that may be combined in a series to perform a larger task. As an example, programs 9A may be arranged, to realize a larger task specified using a pipeline, within processing node 8A and executed according to a single-device Unix-style pipeline model in which programs 9A are chained together by their standard streams in order that the output data of one of programs 9A is directed as input data to the input of another of programs 9A, which processes the input data. The output data of this program is directed as input data to the input of another of programs 9A, and so on until the final one of programs 9A in the pipeline outputs (e.g., to an interface, another program, or to storage) the final output data. Programs 9 may also refer to different routines, functions, modules, or other sets of one or more operations executed by one or more processes. For example, programs 9A may refer to a single shell, executed by processing node 8A, that can receive one or more operators arranged in a pipeline, execute corresponding operations, and arrange the streams of these operations to realize a pipeline.
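As an illustration of the single-device Unix-style pipeline model, the following Python sketch starts one process per operation and chains the processes by their standard streams. It is a sketch under stated assumptions rather than the disclosed implementation: the function name `run_sub_pipeline` and the two example operators (`UPPER`, `COUNT`) are hypothetical, and the input is assumed small enough to write in full before reading the output.

```python
import subprocess
import sys

# Hypothetical operators, each run as its own process.
UPPER = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
COUNT = [sys.executable, "-c",
         "import sys; sys.stdout.write(str(len(sys.stdin.read().split())))"]


def run_sub_pipeline(commands, input_bytes):
    """Start one process per command and bind stdout of each to stdin of the next."""
    procs = []
    prev_stdout = None
    for cmd in commands:
        # The first process reads from a pipe fed by the caller; every later
        # process reads from the stdout of the process before it.
        stdin = subprocess.PIPE if prev_stdout is None else prev_stdout
        proc = subprocess.Popen(cmd, stdin=stdin, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close the parent's copy so only the downstream process holds it.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)

    # Feed the input data and signal end-of-input to the first operator.
    procs[0].stdin.write(input_bytes)
    procs[0].stdin.close()

    # The output of the final operator is the result data of the sub-pipeline.
    output = procs[-1].stdout.read()
    for proc in procs:
        proc.wait()
    return output


if __name__ == "__main__":
    print(run_sub_pipeline([UPPER, COUNT], b"a b c\n"))  # prints b'3'
```

Because all processes are started before any data is written, the operators run at least partially concurrently, each consuming data as the previous one produces it.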
In accordance with the techniques described herein, processing nodes 8 collaboratively execute programs 9 according to a distributed pipeline generated from a pipeline statement 20. The distributed pipeline includes selected programs from programs 9A-9N arranged in a pipeline topology that may include both intra-processing node branches and inter-processing node branches. In the illustrated example, the pipeline topology includes inter-processing node branches illustrated in
In the example of
Pipeline statement 20 specifies a plurality of stage statements, each of the stage statements describing a corresponding stage that is to execute a set of one or more operations. One or more of the stage statements also specify topology information for the corresponding stages. For example, a stage statement may specify that the corresponding stage includes a specified one or more nodes 8 of the distributed system. The stage statement may specify that the specified one or more nodes 8 are to execute a specified one or more operations. In some cases, the stage statement also specifies a next stage that is to receive the output of the corresponding stage for the stage statement. Pipeline statement 20 may be human-readable text to allow user 4 to easily arrange the operations within stages and to arrange the stages within pipeline statement 20 to specify an overall execution topology for the task being executed by the distributed computing system 2.
The node that receives pipeline statement 20 may be or may execute one of processing nodes 8, as in the example of
Processing node 8A receives pipeline statement 20. In response, processing node 8A processes pipeline statement 20 to determine, based on the syntax of pipeline statement 20, that one or more operations of pipeline statement 20 are to be processed on different processing nodes 8. For example, the syntax of pipeline statement 20 may include a plurality of stage statements, each specifying one or more operations for a corresponding stage. In the example of
Pipeline statement 20 may provide, by its structure, syntax, and semantics, topological information describing configuration of a cluster of processing nodes 8 or other distributed grid, partition cardinality, and operator sequencing abstracted into a pattern. The distributed pipeline described by pipeline statement 20 may be non-linear or circular, may realize a feedback loop, and/or may be based on processing node capabilities (e.g., available hardware and/or software resources).
Each stage statement may specify one or more processing nodes 8 to perform the operations specified in the stage statement. Each stage statement may alternatively or additionally specify one or more data structures with respect to which the corresponding stage is to be executed. Such data structures may include database/table partitions, for instance.
For example, a stage statement may indicate that all processing nodes 8 are to perform the operation in the stage statement. The syntax element to indicate all processing nodes 8 may include “all nodes” or “all partitions”, as in first stage statement 324 of pipeline statement 302 described below, or another syntax element. By indicating “on all partitions” for a stage, a stage statement implicitly indicates that each of the processing nodes managing any of the “all partitions” should execute the stage with respect to the partitions it is managing. A stage statement may include syntax to explicitly specify the one or more nodes that are to perform the stage with respect to an explicitly specified one or more data structures. For example, stage statement “stage stageC on all partitions on origin-node:” specifies that the origin-node is to execute operators of the stage with respect to all partitions, such as partitions of a database table. The set of nodes specified by the previous statement may be distinct from that of the statement “stage stageC on all partitions”, which does not explicitly specify any one or more nodes and therefore implicitly specifies all processing nodes that manage any of the “all partitions”.
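The difference between implicit and explicit node selection can be sketched as a small resolution function. This is an illustrative sketch only: the function name `resolve_nodes`, the partition-to-node mapping, the node names, and the way the target text is split on the last " on " are all assumptions, since the full grammar of a stage statement is not given here.

```python
def resolve_nodes(target, partition_owner, all_nodes, origin):
    """Return the processing nodes that should execute a stage.

    target          -- text following "on" in the stage statement (assumed form)
    partition_owner -- hypothetical mapping of partition name to managing node
    all_nodes       -- hypothetical list of all processing node names
    origin          -- name of the origin node
    """
    target = target.strip()
    # Explicit node, e.g. "all partitions on origin-node".
    if " on " in target:
        _data, node = target.rsplit(" on ", 1)
        node = node.strip()
        return [origin] if node in ("origin", "origin-node") else [node]
    # No explicit node: every node managing any partition runs the stage.
    if target == "all partitions":
        return sorted(set(partition_owner.values()))
    if target == "all nodes":
        return sorted(all_nodes)
    if target in ("origin", "origin-node"):
        return [origin]
    # Otherwise treat the target as the name of a single processing node.
    return [target]


if __name__ == "__main__":
    owners = {"p0": "node-b", "p1": "node-c", "p2": "node-b"}
    nodes = ["node-a", "node-b", "node-c"]
    print(resolve_nodes("all partitions", owners, nodes, "node-a"))
    print(resolve_nodes("all partitions on origin-node", owners, nodes, "node-a"))
```

With the hypothetical mapping shown, "all partitions" resolves to the two nodes that manage partitions, while "all partitions on origin-node" resolves to the origin node alone, which matches the distinction drawn above.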
As another example, a stage statement may indicate a single processing node is to perform the operation in the stage statement. The syntax element to indicate a particular, single processing node 8 may be a reference to or name of a particular processing node 8 or may indicate the origin node, e.g., by designating the origin node using the syntax element “origin” or “origin-node” for instance.
Each stage statement may have a corresponding stage identifier that is usable as a reference to the corresponding stage. The syntax elements for defining a stage may include “stage [stage identifier] on [nodes]”, where [stage identifier] can be used elsewhere in pipeline statement 20 as a reference to the stage and [nodes] specifies the one or more processing nodes 8 to perform the operations specified in the stage statement. That is, a first stage statement may refer to a second stage statement (and corresponding stage) included in pipeline statement 20. For example, the first stage statement may include the stage identifier for the second stage statement as an operation within the first stage statement to which result data to be output by the first stage corresponding to the first stage statement should be sent. In other words, the second stage is the recipient of result data generated by the first stage and the result data generated by the first stage is input data to the second stage. In the example pipeline statement 302, for instance, the stage statement for first stage statement 324 refers to ‘outputCsv’ that is the stage identifier for second stage 326 (stage ‘outputCsv’ in pipeline statement 302). In effect, second stage 326 is a next operation after first stage statement 324 in the overall distributed pipeline.
Because different stage statements within pipeline statement 20 may specify different numbers of processing nodes, pipeline statement 20 may specify a topology that includes fan-in and/or fan-out processing between different stages by processing nodes 8. For example, a first stage statement for a first stage in pipeline statement 20 may specify “all” processing nodes 8 and refer to a second stage defined by a second stage statement, in pipeline statement 20, that specifies a single processing node 8. This pipeline statement 20 specifies a topology that includes fan-in processing from the first stage to the second stage. As another example, a first stage statement for a first stage in pipeline statement 20 may specify a single processing node 8 and refer to a second stage defined by a second stage statement, in pipeline statement 20, that specifies two or more processing nodes 8. This pipeline statement 20 specifies a topology that includes fan-out processing from the first stage to the second stage. Other example topologies, such as one-to-one or many-to-many from stage to stage, are also possible when defined in this way in pipeline statement 20. A fan-in topology may be useful where many of devices 6 are responsible for respective sets of data being processed, such as IoT devices responsible for sensor data, distributed database servers or storage engines responsible for database partitions, and so forth. Specifying a fan-in topology using pipeline statement 20 generates a topology that collects data from these many devices for processing at a reduced number of devices. In some cases, the collected data has already been partially processed with an earlier stage specified in pipeline statement 20.
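As a minimal illustrative sketch, and not as part of the described system, the relationship between the number of processing nodes named in two consecutive stage statements and the resulting stage-to-stage topology may be expressed in Python. The function name classify_topology and its return labels are hypothetical, and the sketch assumes only the node counts are known:

```python
def classify_topology(src_nodes: int, dst_nodes: int) -> str:
    """Label the topology between two stages from their node counts."""
    if src_nodes < 1 or dst_nodes < 1:
        raise ValueError("each stage needs at least one processing node")
    if src_nodes == 1 and dst_nodes == 1:
        return "one-to-one"
    if dst_nodes == 1:
        return "fan-in"       # many senders, a single receiver
    if src_nodes == 1:
        return "fan-out"      # a single sender, many receivers
    return "many-to-many"


# A first stage on four nodes that refers to a second stage on one node.
label = classify_topology(4, 1)
```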
Pipeline statement 20 may also include a syntax element indicating pipeline statement 20 is a pipeline statement, such as “pipeline” or “pipeline statement”. Pipeline statement 20 may also include a syntax element or otherwise be associated with an identifier for pipeline statement 20, such as a name for pipeline statement 20. Pipeline statement 20 may include syntax elements to indicate a beginning and end of pipeline statement 20, such as “begin” and “end”, other text elements, braces, brackets, parentheses, comments, etc. Pipeline statement 20 may be entered into a CLI, stored to a file, executed by a script that references pipeline statement 20, and/or input by other means to distributed system 2. Pipeline statement 20 may be user input. Pipeline statement 20 may include syntax elements to enable parameterizing pipeline statement 20. For example, syntax elements “parameters $table $file” in pipeline statement 20 may enable a user to invoke stored pipeline statement 20 using a CLI or other interface and pass in values for the $table and $file parameters. In such cases, the interface may replace the parameters with the values in each location in a pipeline statement template that includes these parameters and pass the pipeline statement with the replaced values as pipeline statement 20 to a processing node for processing. Alternative schemes for passing values for pipeline statement template parameters to a processing node are contemplated, e.g., using an argument vector or environment variables.
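The parameter replacement described above may be sketched, under stated assumptions, with the Python standard library class string.Template, whose $name placeholders happen to match the $table and $file notation. The template text, the function name instantiate, and the parameter values are hypothetical and chosen only for illustration:

```python
from string import Template

# Hypothetical stored pipeline statement template with two parameters.
TEMPLATE = Template(
    "pipeline export\n"
    "begin\n"
    "stage scanBuffers on all partitions:\n"
    "scanOp $table | outputCsv\n"
    "stage outputCsv on origin:\n"
    'toCsvOp | writeFileOp "$file"\n'
    "end ;\n"
)


def instantiate(template: Template, **values: str) -> str:
    """Replace every parameter; substitute() raises KeyError if one is missing."""
    return template.substitute(**values)


statement = instantiate(TEMPLATE, table="LineItem", file="/tmp/LineItem.csv")
```

Using substitute() rather than safe_substitute() is a design choice in this sketch: a missing value fails early instead of passing an unresolved parameter to a processing node.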
In some examples, a formal grammar for the syntax of a pipeline statement may be as follows:
pipelineStatement :=
    PIPELINE <pipelineName>
    BEGIN
        stageDefinition ...
    END ‘;’
stageDefinition :=
    STAGE <stageName> ON topology ‘:’
        <operator> [ <arg> ... ] [ ‘|’ <operator> [ <arg> ... ] ] ...
        [ <stageName> ]
topology :=
      ALL NODES
    | ALL PARTITIONS
    | ALL PARTITIONS ON ORIGIN
    | ORIGIN
    | SPECIFIC NODE node
    | SPECIFIC PARTITION partition
    | NODE WITH device
The pipeline statement specifies a PIPELINE with a name <pipelineName>. The pipeline statement includes begin and end operators (BEGIN and END in this example). The definition of the stages (stageDefinition) includes one or more stage statements that may conform to stageDefinition in the above grammar. Other grammars are contemplated. As with other example grammars for specifying a pipeline topology described herein, the pipeline statement in the above grammar specifies one or more stages and, for each stage, the topology for the stage (including a next stage in some cases) and the ordered list of operators of the sub-pipeline for the stage.
Example topologies for stages are indicated in the above grammar. These include: ALL NODES (execute the sub-pipeline for the stage on all processing nodes); ALL PARTITIONS (execute the sub-pipeline for the stage with respect to all partitions of the data set); ALL PARTITIONS ON ORIGIN (execute the sub-pipeline for the stage with respect to all of the partitions associated with the origin processing node); ORIGIN (execute the sub-pipeline for the stage on the origin processing node); SPECIFIC NODE node (execute the sub-pipeline for the stage on the identified node); SPECIFIC PARTITION partition (execute the sub-pipeline for the stage with respect to the identified partition); and NODE WITH device (execute the sub-pipeline for the stage on the node with the identified device).
An example statement with the above grammar is as follows and is explained in further detail elsewhere in this description:
pipeline export
begin
    stage scanBuffers on all partitions:
        scanOp LineItem | outputCsv
    stage outputCsv on origin:
        toCsvOp | writeFileOp “/tmp/LineItem.csv”
end ;
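As a hedged sketch of how a statement in this form might be parsed, the following Python code reads the stage statements and resolves a trailing stage identifier into a next-stage reference. It is not the compiler of this disclosure. The names Stage and parse_pipeline are hypothetical, and the sketch assumes straight quotation marks, one stage header line followed by exactly one operator line, and no ‘|’ characters inside quoted arguments:

```python
import re
import shlex
from dataclasses import dataclass
from typing import List, Optional, Tuple

EXAMPLE = '''
pipeline export
begin
    stage scanBuffers on all partitions:
        scanOp LineItem | outputCsv
    stage outputCsv on origin:
        toCsvOp | writeFileOp "/tmp/LineItem.csv"
end ;
'''

STAGE_RE = re.compile(r"^stage\s+(\w+)\s+on\s+(.+?)\s*:$", re.IGNORECASE)


@dataclass
class Stage:
    name: str
    topology: str
    operators: List[List[str]]        # each operator is [name, arg, ...]
    next_stage: Optional[str] = None  # stage that receives the result data


def parse_pipeline(text: str) -> Tuple[str, List[Stage]]:
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    if len(header) != 2 or header[0].lower() != "pipeline":
        raise ValueError("expected 'pipeline <name>'")
    if lines[1].lower() != "begin" or lines[-1].replace(" ", "").lower() != "end;":
        raise ValueError("expected 'begin' ... 'end ;'")
    body = lines[2:-1]
    if len(body) % 2:
        raise ValueError("each stage needs a header line and an operator line")
    stages = []
    for head, ops in zip(body[0::2], body[1::2]):
        match = STAGE_RE.match(head)
        if not match:
            raise ValueError(f"bad stage header: {head!r}")
        operators = [shlex.split(segment) for segment in ops.split("|")]
        if any(not op for op in operators):
            raise ValueError(f"empty operator in: {ops!r}")
        stages.append(Stage(match.group(1), match.group(2).lower(), operators))
    names = {stage.name for stage in stages}
    for stage in stages:
        last = stage.operators[-1]
        if len(last) == 1 and last[0] in names:
            # A trailing stage identifier names the recipient stage.
            stage.next_stage = stage.operators.pop()[0]
    return header[1], stages


pipeline_name, parsed_stages = parse_pipeline(EXAMPLE)
```

For the statement above, the sketch yields two stages, with stage scanBuffers referring to stage outputCsv as its recipient.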
As noted above, other grammars are possible. For example, a shorthand form for a grammar could observe and uphold the original elegance of the Unix pipe/filter syntax and extend that syntax into an arbitrary directed graph of distributed parallelized sub-pipelines. For instance, a shorthand form of the previously-described grammar could be:
pipelineStatementShort :=
    <operator> [ <arg> ... ] [ pipe <operator> [ <arg> ... ] ] ...
pipe :=
      ‘|’   - pipe to next operator in sub-pipeline
    | ‘|*’  - short for fan-out across all partitions
    | ‘*|’  - short for fan-in from all partitions to origin
    | ‘|+’  - short for fan-out across all nodes
    | ‘+|’  - short for fan-in from all nodes to origin
    | ‘|~’  - short for all partitions on origin
    | ‘|?’  - short for node with named device
An example statement with the above grammar is as follows and is a shorthand restatement of the prior example that conformed to the earlier grammar:
In this example, programs 9 that compile the above shorthand pipeline statement detect the semantics such that the operation preceding the ‘*|’ pipe is understood to be executed with respect to all partitions and to fan in to origin, similar to the two stage statements of pipeline export. This shorthand grammar also includes syntax to specify one or more stages and, for each stage, the topology for the stage (including a next stage in some cases) and the ordered list of operators of the sub-pipeline for the stage.
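As an illustrative sketch only, a shorthand statement may be split into operators and pipe symbols with a regular expression. The statement used here is hypothetical and is not necessarily the example to which the preceding text refers. The names PIPES, PIPE_RE, and tokenize_shorthand are likewise assumptions, and the sketch assumes arguments do not themselves contain pipe symbols:

```python
import re

# Pipe symbols and meanings, taken from the shorthand grammar above.
PIPES = {
    "|": "pipe to next operator in sub-pipeline",
    "|*": "fan-out across all partitions",
    "*|": "fan-in from all partitions to origin",
    "|+": "fan-out across all nodes",
    "+|": "fan-in from all nodes to origin",
    "|~": "all partitions on origin",
    "|?": "node with named device",
}

# Two-character symbols that end in '|' are tried before those starting with it.
PIPE_RE = re.compile(r"\s*(\*\||\+\||\|[*+~?]?)\s*")


def tokenize_shorthand(statement: str):
    """Return (operators, pipes); pipes[i] joins operators[i] to operators[i + 1]."""
    parts = PIPE_RE.split(statement.strip())
    operators = [part.split() for part in parts[0::2]]
    pipes = parts[1::2]
    if any(not op for op in operators):
        raise ValueError("empty operator in shorthand statement")
    return operators, pipes


operators, pipes = tokenize_shorthand(
    "scanOp LineItem *| toCsvOp | writeFileOp /tmp/LineItem.csv"
)
```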
Processing node 8A may communicate with other processing nodes 8B-8N to set up a distributed pipeline 12 to realize pipeline statement 20. Processing nodes 8 may collaborate to generate distributed pipeline 12. To generate distributed pipeline 12, processing nodes 8 bind operations not only among programs 9 but also between stages and across processing nodes 8.
To set up distributed pipeline 12, processing node 8A communicates pipeline setup specifications 22B-22N (collectively, “pipeline setup specifications 22” including pipeline setup specification 22A for processing node 8A) to respective processing nodes 8B-8N. In some cases, each of pipeline setup specifications 22 may be a duplicate of pipeline statement 20 or a partial duplicate of text in pipeline statement 20 that is relevant to the corresponding processing node 8. In some cases, each of pipeline setup specifications 22 may include configuration commands generated by processing node 8A. The configuration commands may include, for instance, a sub-pipeline provided as a list of operations to be performed by the one of processing nodes 8 that receives any of the pipeline setup specifications 22 and a list of one or more destination processing nodes for output generated by performing the sub-pipeline. Each of pipeline setup specifications 22 may include locations of data to be processed. As already noted, pipeline setup specification 22A may be processed by processing node 8A to configure, on processing node 8A, one or more sub-pipelines for execution by processing node 8A and at least a partial topology involving processing node 8A for distributed pipeline 12. In this way, processing nodes 8 may exchange information regarding the overall task to perform, the location of data to be processed by operations for completing that task, and an interconnection topology among processing nodes 8 of distributed system 2.
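One possible shape for such configuration commands is sketched below in Python, assuming the stages have already been resolved to lists of node names. The function name build_setup_specs, the dictionary keys, and the node names are hypothetical and do not represent the format of pipeline setup specifications 22:

```python
def build_setup_specs(stages, stage_nodes):
    """Build one list of sub-pipeline entries per node.

    stages: list of (stage_name, operators, next_stage_or_None)
    stage_nodes: dict mapping stage_name to the node names that run it
    """
    specs = {}
    for name, operators, next_stage in stages:
        # Output of this stage goes to every node of the next stage, if any.
        destinations = stage_nodes[next_stage] if next_stage else []
        for node in stage_nodes[name]:
            specs.setdefault(node, []).append({
                "stage": name,
                "operations": list(operators),
                "destinations": list(destinations),
            })
    return specs


stages = [
    ("scanBuffers", ["scanOp LineItem"], "outputCsv"),
    ("outputCsv", ["toCsvOp", "writeFileOp /tmp/LineItem.csv"], None),
]
stage_nodes = {"scanBuffers": ["8A", "8B", "8C"], "outputCsv": ["8A"]}
specs = build_setup_specs(stages, stage_nodes)
```

In this sketch the origin node receives two entries, one per stage, while every other node receives a single first-stage entry whose destination is the origin node.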
Each of processing nodes 8 that obtains any of pipeline setup specifications 22 by receiving and/or generating any of pipeline setup specifications 22 may process the corresponding pipeline setup specification to configure aspects of distributed pipeline 12. In the example of
As used herein, the term “sub-pipeline” refers to a set of one or more operations bound in a pipeline and performed by any one of processing nodes 8 to execute the operations of a single stage. A sub-pipeline is a sub-pipeline of an overall distributed pipeline defined by pipeline statement 20. A sub-pipeline of a stage defined by a stage statement of pipeline statement 20 may receive input data from a previous sub-pipeline of a stage defined by a previous stage statement in pipeline statement 20, or from an input interface or device such as a storage device. A sub-pipeline of a stage defined by a stage statement of pipeline statement 20 may send output data to a subsequent sub-pipeline of another stage defined by a subsequent stage statement in pipeline statement 20, or to an output interface or device such as a CLI or a storage device. While a stage refers to the collective set of sub-pipelines performed by all processing nodes 8 specified in the corresponding stage statement, a sub-pipeline is a set of operations performed by a single processing node 8. In some cases, any single processing node 8 may execute parallel sub-pipelines with respect to multiple data partitions or streams for which the single processing node 8 is responsible.
As part of processing any of pipeline setup specifications 22 and configuring distributed pipeline 12, processing nodes 8 also bind stages to one another with communication channels. A communication channel may include, for instance, a network socket, a messaging system or bus, a remote procedure call, an HTTP channel, or other communication channel for communicating data from one of devices 6 to another of devices 6. A communication channel may also include intra-device communication “channels,” such as an operating system pipe or other message bus, a function call, inter-process communication, and so forth. Each of devices 6 may be configured with network information (e.g., IP address or network name) for devices 6. To set up a communication channel 13 with another device 6, a device 6 may create a socket to the other device 6 using the network information. Communication channel 13 setup may, however, take a variety of forms commensurate with the type of communication channel being set up. Communication channels 13 may be of different types between different pairs of devices 6, particularly in a heterogeneous distributed system 2.
Any of processing nodes 8 may configure a send operation to send results of its local operations for a sub-pipeline (e.g., for a first stage) to another one or more processing nodes 8 (including in some cases, the sending processing node) to be processed by subsequent one or more sub-pipelines (e.g., for a second stage). In the example of
Any of processing nodes 8 may configure a receive operation to receive results of a sub-pipeline (e.g., for a first stage), executed by other processing node 8 or by itself, to be processed by a subsequent sub-pipeline (e.g., for a second stage) of the processing node. In the example of
Processing nodes 8 that share a communication channel for inter-stage communications may coordinate, during communication channel setup, to ensure that the input to a communication channel is directed by the receiving processing node 8 to the correct sub-pipeline for execution for the subsequent stage. The communication channel may be associated with an identifier, such as a port, to associate the communication channel with a sub-pipeline. Alternatively, data sent via the communication channel may be tagged or otherwise associated with an identifier to associate the data with a sub-pipeline or with overall distributed pipeline 12. With this latter technique, different inter-stage bindings may use the same socket, for instance.
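The tagging alternative may be sketched as a simple length-prefixed frame in Python. The header layout (a 16-bit sub-pipeline identifier followed by a 32-bit payload length) and the names frame and demultiplex are assumptions made for illustration and are not a wire format described in this disclosure:

```python
import struct
from collections import defaultdict, deque

# Network byte order: sub-pipeline identifier (uint16), payload length (uint32).
HEADER = struct.Struct("!HI")


def frame(sub_pipeline_id: int, payload: bytes) -> bytes:
    """Tag a payload with the identifier of the receiving sub-pipeline."""
    return HEADER.pack(sub_pipeline_id, len(payload)) + payload


def demultiplex(stream: bytes):
    """Route each tagged payload to the queue of its sub-pipeline."""
    queues = defaultdict(deque)
    offset = 0
    while offset < len(stream):
        sub_id, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        queues[sub_id].append(stream[offset:offset + length])
        offset += length
    return queues


# Two inter-stage bindings share one byte stream, as they might share a socket.
stream = frame(1, b"a") + frame(2, b"bc") + frame(1, b"d")
queues = demultiplex(stream)
```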
When processing node 8A receives results 24 from devices 6B-6N, processing node 8A may aggregate results 24, including the result 24A of operations of the first stage that was locally processed by processing node 8A. In some examples, processing node 8A further processes the aggregated results according to a sub-pipeline for a subsequent stage to generate final results 26. Processing node 8A may export the final results 26 to user 4, such as by displaying final results 26 to a display terminal or writing the aggregated results to a storage device. Alternatively, or additionally, processing node 8A may generate a send operation to send the results to any location, such as to a dynamic query system as described in U.S. Provisional Patent Application No. 62/928,108, entitled “Dynamic Query Optimization,” filed Oct. 30, 2019, the entire contents of which is incorporated by reference herein.
By processing pipeline statement 20 to configure and execute a distributed pipeline 12 in this way, the techniques of this disclosure may extend the conventional single-device Unix-style pipeline model to realize a distributed pipeline model in which processing nodes 8 of distributed system 2 set up sub-pipelines, at least partially independently, for overall distributed pipeline 12 and process data according to an execution topology, specified within pipeline statement 20, to perform a task. By automatically generating a distributed pipeline 12 to bind operations to be processed according to a topology, the techniques may allow user 4 to specify, in the form of a comparatively simple pipeline statement 20, all operations and a topology for a distributed pipeline, thereby avoiding the often complex and cumbersome task of configuring each of the processing nodes 8 of distributed system 2 with not only the operations to be performed, but also the inter-node communication channels 13 for the topology among the processing nodes 8.
In some cases, processing node 204 comprises a storage engine that primarily performs database-related operations, and program 206 may represent a storage engine shell or other shell program. The storage engine shell may offer a high-level programming language that provides a Unix-based shell that can access services of an operating system of device 202. Any of program 206 may also include one or more storage engine utilities.
Program 206 may receive a pipeline statement or a pipeline setup specification, compile it, configure sub-pipelines and communication channels, distribute requests, and move large amounts of data on a grid to process data. Program 206 may obtain, process, and output data in one or more different formats, such as comma-separated value, binary, eXtensible Markup Language (XML), and so forth.
In accordance with techniques of this disclosure, program 206 may provide an elegant and expressive replacement for existing job specification languages or task definitions that require specific instructions from a user to machines to set up a distributed pipeline. Processing node 204 may represent any of processing nodes 8 of
In some examples, to compile a pipeline statement to generate pipeline setup specifications usable by processing nodes to configure states, program 206 may use a descriptor. A descriptor is a nested data structure (e.g., in JavaScript Object Notation or XML format) that includes information for mapping syntax of a pipeline statement to specific objects within the distributed computing system. A processing node 204 may receive a pipeline statement and compile it using the descriptor to generate stages, more specifically, to generate pipeline setup specifications with which processing nodes configure sub-pipelines to implement corresponding stages. The descriptor may specify: (1) the distributed system or grid as a set of named (2) processing nodes specified by a name and/or network address or other reachability information for inter-node communication. Each processing node includes (3) a set of one or more units, where a unit is either a partition of a data structure/data sub-set and the set of available operators for the unit; or a specific capability (e.g., a GPU device) and the set of available operators for the unit. Program 206 may compile the pipeline statement to generate and distribute pipeline setup specifications to cause certain pipeline operators to be executed by processing nodes that have specialized capabilities, such as GPUs, multi-core processors, ASICs or other discrete logic co-processors, particular programs, or other hardware or software resources.
If a stage statement within a pipeline statement indicates that a sub-pipeline is to execute on a particular one or more nodes, e.g., SPECIFIC NODE ‘node’, then program 206 may use the descriptor to map ‘node’ to one of the nodes specified in the descriptor. If a sub-pipeline is to execute with respect to ALL PARTITIONS, then program 206 may use the descriptor to map all units that specify a partition to the nodes that include a partition and distribute the pipeline setup specification to such nodes. If a stage statement specifies a particular operator, then program 206 may use the descriptor to map the operator to the set of operators for the various units and thence to the nodes that include a unit that has an available operator for the unit.
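A descriptor lookup of this kind may be sketched in Python as follows. The descriptor contents, key names, node names, addresses, and the function nodes_for are all hypothetical and serve only to illustrate mapping a topology to nodes:

```python
# Hypothetical descriptor: a grid of named nodes, each with one or more units.
DESCRIPTOR = {
    "grid": "example-grid",
    "nodes": [
        {"name": "node-a", "address": "10.0.0.1", "units": [
            {"partition": "p0", "operators": ["scanOp", "toCsvOp"]},
            {"partition": "p1", "operators": ["scanOp", "toCsvOp"]},
        ]},
        {"name": "node-b", "address": "10.0.0.2", "units": [
            {"capability": "gpu", "operators": ["matMulOp"]},
        ]},
    ],
}


def nodes_for(descriptor, topology, origin=None):
    """Map a topology from a stage statement to a list of node names."""
    words = topology.split()
    key = [word.lower() for word in words[:2]]
    nodes = descriptor["nodes"]
    if key == ["all", "nodes"]:
        return [n["name"] for n in nodes]
    if key == ["all", "partitions"]:
        return [n["name"] for n in nodes
                if any("partition" in u for u in n["units"])]
    if key == ["specific", "node"]:
        return [n["name"] for n in nodes if n["name"] == words[2]]
    if key == ["node", "with"]:
        return [n["name"] for n in nodes
                if any(u.get("capability") == words[2] for u in n["units"])]
    if key == ["origin"]:
        return [origin]
    raise ValueError(f"unsupported topology: {topology!r}")


partition_nodes = nodes_for(DESCRIPTOR, "ALL PARTITIONS")
```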
By using a distributed pipeline model in a potentially heterogeneous grid, with free form structure, processing node 204 may receive a pipeline statement, compile it, and cooperate with the operating system of device 202 to configure aspects of a corresponding distributed pipeline on device 202. For example, program 206 may compile a pipeline statement and automatically instantiate and configure sub-pipelines and communication channels for the distributed pipeline by requesting operations, queues, and sockets, e.g., from the operating system. In other words, program 206 orchestrates the operating system on behalf of the user to accomplish user intents expressed in the pipeline statement. Program 206 may thereby in effect become a high-level programming language for a user to, for instance, move and process large amounts of data with a potentially heterogeneous set of computing devices. This may be particularly advantageous when the devices are heterogeneous because of the difficulty of otherwise manually configuring devices with different operating systems, different configuration and communication mechanisms, and otherwise different ways of being administered.
In some cases, processing node 204 operating as a storage engine provides services to a database platform. Processing node 204 may, for instance, create tables; insert rows; read rows; sort rows; aggregate rows; obtain data from external sources such as pipes, files, streams, and so forth; compress files; perform some calculations; or perform other processing functions that may be associated with storage and database platforms. Example database platforms include Hadoop, Greenplum, MySQL, or any database providing support for external services.
Processing node 204 may perform common database operations, e.g., joins, group by, aggregation, sorting, order by, filter expressions, and other data operations, processing, and/or analysis. Processing node 204 may invoke cloud services, perform artificial intelligence or machine learning operations, and execute custom algorithms or plugins and standard Unix utilities such as GREP or AWK. Program 206, having compiled the pipeline statement, supplies the arguments for an operation to the corresponding operator, where each operator is dynamically programmable. In some cases, when a sub-pipeline is executed, program 206 dynamically instantiates operators from a local or remote library. In this way, program 206 is not a fixed program on the device 202 but extends capabilities of device 202 to offer dynamic programming with a relatively simple pipeline statement.
In the example of
Program 206 of processing node 204 may receive a pipeline statement or a pipeline setup specification and process it to generate sub-pipeline 200 to bind operations 208. For example, program 206 orchestrates interfaces between operations 208. For example, program 206 may configure queues 212A-212N (collectively, “queues 212”) for passing output data from an operation as input data to a subsequent operation in sub-pipeline 200. In some examples, queues 212 may enqueue pointers to memory buffers (e.g., the memory location that stores data output by an operation for input to another operation). Queues 212 may be described herein and considered as the queue for the next operator that performs the next operation 208. In some cases, program 206 may configure input interfaces 216A-216N (collectively, “input interfaces 216”) for some of the operations 208 to obtain data and output interfaces 218A-218N (collectively, “output interfaces 218”) for some of the operations 208 to output results of operations 208. Interfaces 216 and 218 may represent Application Programming Interfaces (APIs) or other interfaces. Interfaces may represent or include read/write commands, queuing or IPC operations, or other operations. In some cases, message and data passing between operators for operations may be via another form of inter-process communication, such as via OS pipes.
As one example, scan operation 208A uses input interface 216A to read or obtain the data from table 210. Scan operation 208A uses output interface 218A to output the results (or a pointer to a memory location storing the results) of scan operation 208A to queue 212A. Search operation 208B uses input interface 216B to obtain data from queue 212A such as to obtain the results of operation 208A stored in queue 212A. In some examples, search operation 208B may use input interface 216B to obtain a pointer stored in queue 212A that resolves to a memory location storing the results of operation 208A. The search operation 208B searches the data and stores the results (or a pointer to a memory location storing the results) of search operation 208B in queue 212B via output interface 218B. Sort operation 208C uses input interface 216C to obtain the results of search operation 208B stored in queue 212B. The sort operation 208C sorts the data obtained from queue 212B and stores the results (or a pointer to a memory location storing the results) of sort operation 208C in queue 212N via output interface 218C. Send operation 208N uses input interface 216N to obtain the results of sort operation 208C stored in queue 212N. Send operation 208N then uses output interface 218N to send the data resulting from sort operation 208C to device 214, which may be a storage device, network interface, or user interface for instance. In this way, the queues and interfaces enable results of an operation to be bound into a sub-pipeline. Device 214 may in some cases be an example of a communication channel 13 for outputting result data from sub-pipeline 200 to another processing node. In this way, program 206 may configure sub-pipeline 200 as part of a distributed pipeline that is distributed among multiple nodes, with multiple stages bound together in a topology.
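The queue-based binding described above may be sketched in Python with one thread per operator and one queue between each pair of operators. This is a simplification under stated assumptions: threads stand in for the separate processes of the described system, in-memory rows stand in for table 210, and the names DONE, drain, run_operator, and run_sub_pipeline are hypothetical:

```python
import queue
import threading

DONE = object()  # sentinel marking the end of an operator's output


def drain(q):
    """Yield items from a queue until the sentinel arrives."""
    while (item := q.get()) is not DONE:
        yield item


def run_operator(func, inbox, outbox):
    """Apply one operator to its input queue and feed the next operator's queue."""
    for item in func(drain(inbox)):
        outbox.put(item)
    outbox.put(DONE)


def run_sub_pipeline(source, operators):
    """Bind operators with queues, feed them from source, and collect the output."""
    queues = [queue.Queue() for _ in range(len(operators) + 1)]
    threads = [
        threading.Thread(target=run_operator, args=(op, queues[i], queues[i + 1]))
        for i, op in enumerate(operators)
    ]
    for thread in threads:
        thread.start()
    for row in source:            # stands in for the scan operation
        queues[0].put(row)
    queues[0].put(DONE)
    results = list(drain(queues[-1]))
    for thread in threads:
        thread.join()
    return results


def search(rows):
    """Keep only rows that contain the letter 'p'."""
    return (row for row in rows if "p" in row)


rows = run_sub_pipeline(["pear", "apple", "fig", "grape"], [search, sorted])
```

The search operator streams rows as they arrive, whereas the sort operator must consume its whole input before emitting anything, which the queue between them accommodates without further coordination.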
Devices 202 include processing nodes 204A-204N (collectively, “processing nodes 204”), respectively, each running at least one program 206 that may collaborate to generate a distributed pipeline 220 to bind the operations processed by each of devices 202 in accordance with a topology defined by a pipeline statement. A processing node 204 may manage one or more partitions.
In the example of
The pipeline statement may include a first stage statement that specifies one or more operations to be processed by each of devices 202A-202N as part of a first stage, and a second stage statement that specifies one or more operations processed by device 202N as part of a second stage. The first stage statement may also reference the second stage statement as a destination for results of the first stage. In this example, the first stage may include a scan operation to read from the distributed table and a search operation to search for specific text from the rows returned from the table, and the second stage of the command may include a convert to comma separated values (CSV) operation and an export operation (e.g., to export data to a CSV file).
Program 206N of processing node 204N may process the pipeline statement and determine, from the syntax of the stage statements therein, that the pipeline statement includes operations to be processed in different stages according to a distributed pipeline 220. In response, device 202N generates and distributes pipeline setup specifications to each of devices 202A-202C, respectively, such that each device may configure aspects of distributed pipeline 220.
In this case, to configure aspects of distributed pipeline 220, each of programs 206A-206N may configure in the corresponding processing node 204 a first sub-pipeline of the distributed pipeline 220 for the first stage. For example, program 206A may orchestrate queues and configure interfaces to bind operations 208A1-208AN. Program 206A may also configure a send operation (e.g., operation 208AN) that, when executed, sends the result of the local operations of the first sub-pipeline of processing node 204A to device 202N. In effect, the final operation of a sub-pipeline may be to output the result of the sub-pipeline via a communication channel as the input to another sub-pipeline or to output the result to a device. The result may be input to a queue or other input interface for the first operator of the receiving sub-pipeline.
Programs 206B and 206C similarly configure first sub-pipelines for processing nodes 204B and 204C, respectively. Program 206N may similarly configure a first sub-pipeline for processing node 204N as also shown in
Program 206N of device 202N may configure a second sub-pipeline of the distributed pipeline 220. For example, program 206N configures a receive operation (e.g., operation 209N1) to receive the results from devices 202A-202C, and orchestrates queues and generates interfaces to bind operations 209N1-209NK. Processing node 204N executing receive operation 209N1 receives result data for the first stage on multiple communication channels (not shown) with multiple processing nodes 204A-204C (and in some cases with processing node 204N). Receive operation 209N1 may include aggregating the result data for processing on the second sub-pipeline for the second stage. For example, receive operation 209N1 may include performing, by an operator, socket recv( ) calls to obtain buffered socket data and adding the buffered socket data (typically as a pointer) to the queue for the next operator for operation 209N2. In this way, distributed pipeline 220 may bind the operations of the first stage (e.g., operations 208A1-208AN, operations 208B1-208BN, operations 208C1-208CN, and in some cases operations 208N1-208NN (not shown)) to the operations of the second stage (e.g., operations 209N1-209NK).
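A receive operation of this kind may be sketched in Python using local socket pairs in place of network connections. The function name receive_all, the buffer size, and the payloads are hypothetical, and the sketch reads the channels one after another rather than concurrently, which suffices only because each sender writes a small payload and closes its end first:

```python
import queue
import socket


def receive_all(channels, next_queue):
    """Read every channel to end-of-stream and enqueue the received data."""
    for channel in channels:
        while True:
            data = channel.recv(4096)
            if not data:          # an empty read means the sender closed
                break
            next_queue.put(data)
        channel.close()


# Three first-stage senders, each with its own channel to the receiver.
pairs = [socket.socketpair() for _ in range(3)]
for index, (sender, _) in enumerate(pairs):
    sender.sendall(f"row-from-node-{index}\n".encode())
    sender.close()

next_operator_queue = queue.Queue()
receive_all([receiver for _, receiver in pairs], next_operator_queue)
received = b"".join(
    next_operator_queue.get() for _ in range(next_operator_queue.qsize())
)
```

A fuller implementation would likely multiplex the channels, for example with the standard selectors module, so that a slow sender does not delay data already available from the others.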
In the example of
Processing nodes 204 collaborate to generate distributed pipeline 304 by configuring sub-pipelines for the stages on the processing nodes and binding first stage 334 to second stage 336 in response to the text “outputCsv” that is a reference to second stage 336 appearing as a final operation in first stage statement 324. For example, when processing pipeline statement 302, program 206A of processing node 204A may determine that there are multiple stages in pipeline statement 302. In response, the program may distribute pipeline setup specifications generated from pipeline statement 302 to the set of collaborative programs 206 executing on devices 202A-202N.
Programs 206A-206N may each generate a sub-pipeline of distributed pipeline 304 for the corresponding processing node 204 to execute for the first stage 334, and program 206A of origin processing node 204A may generate a second sub-pipeline of distributed pipeline 304 for the second stage 336.
For the first sub-pipeline for the first stage 334, each of programs 206 may orchestrate queues and generate interfaces to bind its local operations. For example, program 206A may orchestrate one or more queues to enqueue the results of scan operation 314A with respect to partition 310A for input to the next operator, send 316A. Program 206A may also orchestrate one or more queues to enqueue the results of scan operation 314B with respect to partition 310B for input to the next operator, send 316B. Program 206A may generate an input interface for scan operation 314A to read data stored in buffers 312A associated with partition 310A and an input interface for scan operation 314B to read data stored in buffers 312B associated with partition 310B. Program 206A may generate an output interface for scan operation 314A to send results of the scan operations to the one or more queues for send 316A. Program 206A may generate an output interface for scan operation 314B to send results of the scan operations to the one or more queues for send 316B. Programs 206B and 206N may configure similar queues, operators, and interfaces for corresponding buffers 312, scan operations 314, and send operations 316. Data flowing through buffers may be self-describing, not self-describing, or otherwise (e.g., structured, semi-structured, unstructured).
Send operations 316A-316Z (collectively, “send operations 316”) send the results of scan operations 314 (which are the results of first stage 334 as a whole) to processing node 204A to process the operations in second stage 336. For example, program 206A may configure send operation 316A to send the output of scan operation 314A, and send operation 316B to send the output of scan operation 314B. Program 206B may configure send operation 316C to send the output of scan operation 314C, and send operation 316D to send the output of scan operation 314D. Program 206N may configure send operation 316Y to send the output of scan operation 314Y, and send operation 316Z to send the output of scan operation 314Z. Send operations 316 may be example instances of send operation 208N. Programs 206 may configure input interfaces for queues by which the send operations 316 may obtain the results from corresponding scan operations 314. Programs 206 may configure communication channels between operators for send operations 316 and respective operators for respective receive operations 318.
For the second sub-pipeline for second stage 336, program 206A of origin processing node 204A may configure a receive operation 318 to receive the results from each of devices 202B-202N. Receive operation 318 may represent multiple receive operations for multiple communication channels between processing node 204A and each of processing nodes 204 to receive data sent with send operations 316.
Program 206A may also orchestrate queues and generate interfaces for the operations to be performed by the origin processing node 204A. For example, program 206A may generate an input interface and queues by which receive operation 318 may obtain data from processing nodes 204A-204N. Program 206A may configure one or more buffers 319 to store the result data of the first stage 334 received from each of processing nodes 204 and output interfaces by which receive operation 318 may store the result data to buffer 319.
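The fan-in performed by the receive operation, in which data arriving over several communication channels is stored to a single buffer, might be sketched as follows. This is an illustrative Python sketch under stated assumptions: in-process queues stand in for network channels, and the names `receive_op`, `demo_receive`, and `_END` are hypothetical rather than taken from this disclosure.

```python
import queue
import threading

_END = object()  # hypothetical sentinel: a sending node has finished its stream


def receive_op(channels, buffer):
    """Drain one channel per sending node into a shared buffer."""
    lock = threading.Lock()

    def drain(channel):
        while True:
            item = channel.get()
            if item is _END:
                return
            with lock:  # the buffer is shared by all channel workers
                buffer.append(item)

    workers = [threading.Thread(target=drain, args=(ch,)) for ch in channels]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return buffer


def demo_receive():
    """Simulate three sending nodes, each sending two records."""
    channels = [queue.Queue() for _ in range(3)]
    for node_index, ch in enumerate(channels):
        for seq in range(2):
            ch.put((node_index, seq))
        ch.put(_END)
    return receive_op(channels, [])
```

Records from different channels may interleave in any order in the buffer. Only the order within a single channel is preserved in this sketch.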
Program 206A may generate an input interface by which convert to CSV (“toCsvOp”) operation 320 may read data from buffer 319. For example, toCsvOp 320 may use an input interface to read the data (e.g., in binary format) stored in buffer 319. Program 206A may process toCsvOp 320 and convert the data in binary format to CSV format. Program 206A may output the results from the toCsvOp 320 to a store to file operation (“writeFileOp”) 322 (referred to as “writeFileOp 322” or “export operation 322”). Program 206A may configure an output interface or output device to which the writeFileOp 322 may send the CSV. In this case, the output device is a file “/Tmp/LineItem.csv” indicated in second stage statement 326.
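As a rough illustration of the toCsvOp and writeFileOp steps, the Python sketch below converts packed binary records to CSV text and stores the text to a file. The fixed-width record layout and the function names are assumptions chosen for the example. The disclosure does not specify the binary format of the data in buffer 319.

```python
import csv
import io
import struct

# Hypothetical layout: int32 order key, int32 quantity, float64 price (little-endian).
RECORD = struct.Struct("<iid")


def to_csv_op(binary_buffer):
    """Convert packed binary records into CSV text, one row per record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for fields in RECORD.iter_unpack(binary_buffer):
        writer.writerow(fields)
    return out.getvalue()


def write_file_op(csv_text, path):
    """Store CSV text to the output file named by the stage."""
    with open(path, "w", newline="") as handle:
        handle.write(csv_text)
```

A caller would pass the output of `to_csv_op` to `write_file_op` together with the target path, mirroring how the output of toCsvOp 320 feeds writeFileOp 322.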
In the example of
In the first stage 354 (stage “scanBuffers”) corresponding to first stage statement 344, each of processing nodes 204 that manages a partition 310 performs the operation “scanOp LineItem|toCsvOp” with respect to the “LineItem” table implemented with partitions 310 and outputs the result to the second stage 356 (stage “outputCsv”) corresponding to second stage statement 346 and performed by processing node 204A. Programs 206 of processing nodes 204 collaborate to generate distributed pipeline 340 by configuring the sub-pipelines of first stage 354 and second stage 356 and binding the sub-pipeline of first stage 354 to the sub-pipeline of second stage 356. For example, when processing pipeline statement 332, program 206A may determine that there are multiple stages in pipeline statement 332. In response, program 206A may compile pipeline statement 332 to generate and distribute pipeline setup specifications to the programs 206B-206N of devices 202B-202N.
As with the example in
As noted above, pipeline statement 332 of
This example demonstrates a technical advantage of the techniques of this disclosure whereby a user can rapidly experiment with and intelligently compose pipeline statements according to a high-level specification language. Reordering operators within and among stage statements can lead to significant performance improvements for completing the same overall task, and the user can do such reordering by simply rearranging syntax elements within a pipeline statement, for example, rather than manually configuring individual sub-pipelines within processing nodes and an execution topology among processing nodes.
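The effect of moving an operator between stages can be modeled with a small Python sketch. It treats each stage as an ordered list of functions and compares two arrangements: conversion to CSV on the origin node in the second stage, and conversion on every node in the first stage. The helper names and sample rows are hypothetical. The sketch models only the equivalence of results, not the performance difference.

```python
def scan_op(partition):
    """Return the rows of one partition."""
    return list(partition)


def to_csv_op(rows):
    """Render each row as a comma-separated line."""
    return [",".join(str(value) for value in row) for row in rows]


def run_stage(ops, data):
    """Apply a stage's operators in order."""
    for op in ops:
        data = op(data)
    return data


def run_pipeline(first_stage_ops, second_stage_ops, partitions):
    gathered = []
    for partition in partitions:  # each node runs the first stage on its partition
        gathered.extend(run_stage(first_stage_ops, partition))
    return run_stage(second_stage_ops, gathered)  # origin node runs the second stage


partitions = [[(1, "a")], [(2, "b")], [(3, "c")]]
convert_at_origin = run_pipeline([scan_op], [to_csv_op], partitions)
convert_at_nodes = run_pipeline([scan_op, to_csv_op], [], partitions)
```

Both arrangements yield the same rows. In the second arrangement the conversion work is spread across the nodes that hold the partitions instead of being concentrated on the origin node.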
In the example of
Once distributed pipeline 304 is configured in this way, processing nodes 204 perform distributed pipeline 304 by performing sub-pipelines for the stages, including sending result data for sub-pipelines to other processing nodes to be processed by subsequent sub-pipelines in distributed pipeline 304 (408). Processing nodes 204 output result data from performing distributed pipeline 304, e.g., to storage, interface, or other device (410).
As shown in the specific example of
Processors 502, in one example, are configured to implement functionality and/or process instructions for execution within computing device 500. For example, processors 502 may be capable of processing instructions stored in storage device 508. Examples of processors 502 may include any one or more of a microprocessor, a controller, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or equivalent discrete or integrated logic circuitry.
One or more storage devices 508 may be configured to store information within computing device 500 during operation. Storage device 508, in some examples, is described as a computer-readable storage medium. In some examples, storage device 508 is a temporary memory, meaning that a primary purpose of storage device 508 is not long-term storage. Storage device 508, in some examples, is described as a volatile memory, meaning that storage device 508 does not maintain stored contents when the computer is turned off. Examples of volatile memories include random access memories (RAM), dynamic random-access memories (DRAM), static random-access memories (SRAM), and other forms of volatile memories known in the art. In some examples, storage device 508 is used to store program instructions for execution by processors 502. Storage device 508, in one example, is used by software or applications running on computing device 500 to temporarily store information during program execution.
Storage devices 508, in some examples, also include one or more computer-readable storage media. Storage devices 508 may be configured to store larger amounts of information than volatile memory. Storage devices 508 may further be configured for long-term storage of information. In some examples, storage devices 508 include non-volatile storage elements. Examples of such non-volatile storage elements include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories.
Computing device 500, in some examples, also includes one or more communication units 506. Computing device 500, in one example, utilizes communication units 506 to communicate with external devices via one or more networks, such as one or more wired/wireless/mobile networks. Communication units 506 may include a network interface card, such as an Ethernet card, an optical transceiver, a radio frequency transceiver, or any other type of device that can send and receive information. Other examples of such network interfaces may include 3G, 4G and Wi-Fi radios. In some examples, computing device 500 uses communication unit 506 to communicate with an external device.
Computing device 500, in one example, also includes one or more user interface devices 510. User interface devices 510, in some examples, are configured to receive input from a user through tactile, audio, or video feedback. Examples of user interface device(s) 510 include a presence-sensitive display, a mouse, a keyboard, a voice responsive system, a video camera, a microphone, or any other type of device for detecting a command from a user. In some examples, a presence-sensitive display includes a touch-sensitive screen.
One or more output devices 512 may also be included in computing device 500. Output device 512, in some examples, is configured to provide output to a user using tactile, audio, or video stimuli. Output device 512, in one example, includes a presence-sensitive display, a sound card, a video graphics adapter card, or any other type of device for converting a signal into an appropriate form understandable to humans or machines. Additional examples of output device 512 include a speaker, a cathode ray tube (CRT) monitor, a liquid crystal display (LCD), or any other type of device that can generate intelligible output to a user.
Computing device 500 may include operating system 516. Operating system 516, in some examples, controls the operation of components of computing device 500. For example, operating system 516 may facilitate the communication of one or more applications 522 with processors 502, communication unit 506, storage device 508, input device 504, user interface device 510, and output device 512. Application 522 may also include program instructions and/or data that are executable by computing device 500.
Processing node 524 may include instructions for causing computing device 500 to run one or more programs 526 to perform the techniques described in the present disclosure. Programs 526 may represent an example instance of any of programs 9 of
Program 526 may generate a pipeline setup specification for processing node 524. In some cases, computing device 500 receives a pipeline setup specification, via communication unit(s) 506 or input device(s), generated by another processing node. In any event, program 526 processes the pipeline setup specification to instantiate operators and bind operators together using input/output channels to generate a sub-pipeline of the distributed pipeline, and in some cases to bind the sub-pipeline to another sub-pipeline performed by another computing device.
As one example, program 526 may issue low-level calls to operating system 516 to allocate memory (e.g., queues) for the local operations of a sub-pipeline. Program 526 may also configure interfaces between the operations and the queues. Program 526 may also generate low-level calls to the kernel to configure a receive operator to receive results of operations processed by other devices.
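Processing a pipeline setup specification to instantiate operators and bind them together might look like the following Python sketch. The dictionary layout of the specification, the operator registry, and `build_sub_pipeline` are assumptions for illustration only. The disclosure does not define a concrete format for the specification, and generator chaining is used here in place of operating-system queues.

```python
from functools import reduce

# Hypothetical registry mapping operator names to generator-based operators.
REGISTRY = {
    "scanOp": lambda rows: (row for row in rows),
    "toCsvOp": lambda rows: (",".join(map(str, row)) for row in rows),
}


def build_sub_pipeline(spec, registry=REGISTRY):
    """Instantiate the operators named in a setup specification and chain them."""
    operators = [registry[name] for name in spec["operators"]]

    def sub_pipeline(source):
        # Each operator consumes the output stream of the previous operator.
        return reduce(lambda stream, op: op(stream), operators, source)

    return sub_pipeline
```

For example, a specification such as `{"stage": "scanBuffers", "operators": ["scanOp", "toCsvOp"]}` would produce a callable that scans its input rows and emits one CSV line per row.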
Once any specified sub-pipelines and communication channels are configured, programs 526 may perform the sub-pipelines by executing operators for the corresponding operations of the sub-pipelines. In this way, computing device 500 operates as part of a distributed system to configure and execute a distributed pipeline.
The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware or any combination thereof. For example, various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components. The term “processor” or “processing circuitry” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry. A control unit comprising hardware may also perform one or more of the techniques of this disclosure.
Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various operations and functions described in this disclosure. In addition, any of the described units, modules or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units or engines is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware or software components, or integrated within common or separate hardware or software components.
The techniques described in this disclosure may also be embodied or encoded in a computer-readable medium, such as a computer-readable storage medium, containing instructions. Instructions embedded or encoded in a computer-readable storage medium may cause a programmable processor, or other processor, to perform the method, e.g., when the instructions are executed. Computer-readable storage media may include random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), flash memory, a hard disk, a CD-ROM, a floppy disk, a cassette, magnetic media, optical media, or other computer-readable media.
Yamarti, Alka, Huetter, Raymond John, McIntyre, Craig Alexander
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Feb 15 2021 | HUETTER, RAYMOND JOHN | BORAY DATA TECHNOLOGY CO LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 055324 | /0455 | |
Feb 17 2021 | YAMARTI, ALKA | BORAY DATA TECHNOLOGY CO LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 055324 | /0455 | |
Feb 17 2021 | MCINTYRE, CRAIG ALEXANDER | BORAY DATA TECHNOLOGY CO LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 055324 | /0455 | |
Feb 18 2021 | Boray Data Technology Co. Ltd. | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Feb 18 2021 | BIG: Entity status set to Undiscounted (note the period is included in the code). |
Feb 26 2021 | SMAL: Entity status set to Small. |