Data Processing - Data Flow (ETL | Workflow | Pipeline)

About

A data flow is a workflow specialized for data processing.

Any system where data moves between code units and triggers the execution of that code can be called dataflow.

This page is not about dataflow architecture (wiki/Dataflow_architecture), which is a computer architecture.

A data flow engine has the following features:

There is no program counter to keep track of what should be executed next; the arrival of data triggers the code to execute. There is no need to worry about locks, because the data is local and can only be accessed by the code it was sent to.
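
The two features above can be illustrated with a minimal Python sketch. It is not taken from any particular engine: the `Node` class, its `send` method and the port names are hypothetical, chosen only to show code that fires when its data arrives and that keeps the received data local to itself.

```python
class Node:
    """A code unit that fires once every input port has received a value.

    Hypothetical illustration: the class and its method names are assumptions.
    """

    def __init__(self, func, ports, on_output):
        self.func = func
        self.ports = tuple(ports)
        self.on_output = on_output
        self.received = {}  # data local to this node, never shared

    def send(self, port, value):
        if port not in self.ports:
            raise KeyError(port)
        self.received[port] = value
        # No program counter: the node runs only because its data has arrived.
        if len(self.received) == len(self.ports):
            args = [self.received[p] for p in self.ports]
            self.received = {}
            self.on_output(self.func(*args))


results = []
adder = Node(lambda a, b: a + b, ports=("a", "b"), on_output=results.append)
doubler = Node(lambda x: x * 2, ports=("x",),
               on_output=lambda v: adder.send("a", v))

adder.send("b", 10)   # only one of two inputs is present, nothing runs
doubler.send("x", 4)  # doubler fires, its output reaches adder, adder fires
```

Nothing calls `adder` directly: it runs as a side effect of the last missing input being delivered.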

Characteristics

A data flow program is a directed graph where:

The flow of data is explicit, often visually illustrated as a line or pipe.

Actor

An Actor model applied to a data flow engine can be seen as:

Data-driven

Data-driven:

Parallel

At the lowest level, dataflow is both a programming style and a way to manage parallelism.

As an operation runs as soon as all of its inputs become valid, dataflow engines are inherently parallel and can work well in large, decentralized systems.

Since the operations are only concerned with the availability of data inputs, they have no hidden state to track, and are all “ready” at the same time.
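
As a rough sketch of this idea, the Python code below submits every operation to a thread pool as soon as all of its inputs are available. The function `run_parallel` and the shape of the `ops` mapping are assumptions made for this example; only `concurrent.futures` from the standard library is used.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_parallel(ops, initial):
    """Run each operation as soon as all of its inputs are available.

    ops:     name -> (function, tuple of input names)  (hypothetical layout)
    initial: name -> value for the data available at start
    """
    values = dict(initial)
    pending = dict(ops)
    running = {}  # future -> name of the operation it computes
    with ThreadPoolExecutor() as pool:
        while pending or running:
            ready = [name for name, (_, inputs) in pending.items()
                     if all(i in values for i in inputs)]
            for name in ready:
                func, inputs = pending.pop(name)
                future = pool.submit(func, *(values[i] for i in inputs))
                running[future] = name
            if not running:
                raise ValueError("inputs never become available: "
                                 + ", ".join(sorted(pending)))
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                values[running.pop(future)] = future.result()
    return values


ops = {
    "sum": (lambda a, b: a + b, ("a", "b")),
    "prod": (lambda a, b: a * b, ("a", "b")),
    "total": (lambda s, p: s + p, ("sum", "prod")),
}
out = run_parallel(ops, {"a": 2, "b": 3})
```

Here `sum` and `prod` depend only on the initial data, so both are submitted in the same pass and may run concurrently. `total` is submitted once both of their results exist.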

Loop

Loop: to guarantee that a program executes correctly, it is essential that tokens from different iterations do not overtake one another.

Two implementations guarantee that loops execute correctly:

Engine

Data Flow basic tasks sequence (Feedback interpreter):

A dataflow engine might be implemented as a hash table where:

When any operation completes, the program scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs it. When that operation finishes, it will typically output data, thereby making another operation become valid.

For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of maintaining state is removed from the programmer and given to the language's runtime.
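
The scan-and-run loop described above can be sketched in a few lines of Python. This is a minimal single-threaded illustration under stated assumptions, not the implementation of any specific engine. The function `run` and the `(output, function, inputs)` layout are hypothetical, and letting each operation run at most once is an assumption made here to keep the example finite.

```python
def run(operations, values):
    """Repeatedly run the first operation whose inputs are all valid.

    operations: ordered list of (output name, function, input names)
    values:     hash table (dict) mapping a data name to its value
    """
    values = dict(values)
    done = set()  # assumption: each operation fires at most once
    while True:
        # Scan down the list until the first runnable operation is found.
        for index, (output, func, inputs) in enumerate(operations):
            if index not in done and all(name in values for name in inputs):
                values[output] = func(*(values[name] for name in inputs))
                done.add(index)
                break  # an output was produced, so rescan from the top
        else:
            return values  # no operation has all of its inputs valid


# The list order does not have to follow the dependency order.
operations = [
    ("area", lambda w, h: w * h, ("width", "height")),
    ("width", lambda s: s * 2, ("side",)),
    ("height", lambda s: s + 1, ("side",)),
]
final = run(operations, {"side": 3})
```

Although `area` is listed first, it runs last: its inputs only become valid after `width` and `height` have written their outputs into the table.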

Library / Tool

Visualization

Representing conditions or iterations as a set of nodes can easily result in a complex graph that is nontrivial to understand. Interpreting the visual representation can end up being harder than reading the equivalent textual source code.

Documentation / Reference