Stream - Processing

Event Centric Thinking

About

Stream processing is the reactive design of data processing.

  • Something happened (Event),
  • Subscribe to it (Streams)

Streaming Processing is also known as :

It's called stream processing because the events are based on a continuous arrival of data, logically name data streams.

Model

A stream processing flow is a combination of:

  • an input source
  • an output destination.
  • and a sequence of data (that may be a byte of an object). In the general sense, it's called a message.
  • that may be chained to form a pipeline

A pipeline is a query on a stream.

Pipeline

Streams are a flexible system to implement a pipeline of data transformations.

  • filter/map (stateless)
  • Aggregate
  • Stream / Table join (change log streams)
  • Window join (message buffer, co-partionning)
  • Reprocessing (Batch and streaming convergence)

System Characteristics

It has a low memory and processing overhead.

This performance comes at a cost

  • All content to read/write has to be processed in the exact same order as input comes in (or output is to go out).
  • No random access. For this, you need to build an auxiliary structure such as a map or a tree.

Problems

A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer.

Consider the counting example, above (count page views and update a dashboard).

  • What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover?
  • Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.)
  • What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it’s too much for a single machine to handle?

Stream processing is a higher level of abstraction on top of messaging systems, and it’s meant to address precisely this category of problems.

See Samza is a stream processing framework that aims to help with these problems.

Implementation

As most of asynchronous system, stream processing is implemented with an event loop

Queue:

  • Multiple input processor

Routing:

  • Category variable

Processor selection through:

  • Category variable
  • Data Type

Structure Flow:

  • Flow Content + Json Object for extra attributes (Ex: ResultSet + TestId)

Loop through an explicit processor (ie if and then go to step N)?

  • Lazy instantiation: The operations are performed with an execute() call. Whether the program is executed locally or on a cluster depends on the type of execution environment. The lazy evaluation lets construct execution plan that may differ from the lexical (code) definition.

It is our firm belief that in the future, the majority of data applications, including real-time analytics, continuous analytics, and historical data processing, will treat data as what it really is: unbounded streams of events.

Stream vs Batch

See Stream vs Batch

Software

see Stream - (Software|Library) 1)





Discover More
Diagram Of Lambda Architecture Generic
Data Processing - Lambda Architecture (batch and stream processing)

nathanmarzNathan Marz wrote a blog post describing the Lambda Architecture: How to beat the CAP theorem Lambda architecture is a data-processing architecture designed to handle massive quantities of...
Event Centric Thinking
Incremental Computation

incremental computation is the ability to compute only the state difference from the last execution. stream processing is an incremental computation.
Kafka Commit Log Messaging Process
Kafka (Event Hub)

Apache Kafka is a broker application that stores the message as a distributed commit log. The entire data storage system is just a transaction log. |data feeds Data Systems are exposing data, ...
Event Centric Thinking
Stream - How to capture change of state - Change Data Capture (CDC)

This page is how to capture the changes of state in order to create a stream. The stream processing will then perform an incremental computation Capturing data changes is far from a trivial task, and...
Event Centric Thinking
Stream - Samza

LinkedIn stream processing framework that provides powerful, reliable tools for working with data in Kafka. (LinkedIn created Apache Kafka to be the data exchange backbone of its organisation.) See StreamTask...
Stream Vs Batch
Stream vs Batch

This article talks Stream Processing vs Batch Processing. The most important difference is that: in batch processing the size (cardinality) of the data to process is known whereas in a stream processing,...
Card Puncher Data Processing
What is Data Processing (Data Integration)?

Card puncher Data processing is a more general term for manipulating data whereas data integration is the integration...



Share this page:
Follow us:
Task Runner