Stream - Processing

1 - About

Something happened (Event), subscribe to it (Streams)

Stream processing = transformations on stream data

A pipeline is a query on a stream.

Streaming Processing is also known as Incremental Processing.

It has a low memory and processing overhead.

This performance comes at a cost

  • All content to read/write has to be processed in exact same order as input comes in (or output is to go out).
  • No random access. For this, you need to build an auxiliary structure as a map or a tree.

As a result, Streaming is most commonly used to process a lot of data.

An stream processing flow is a combination of:

  • an input source
  • an output destination.
  • and a sequence of data (that may be a byte of an object). In the general sense, it's called a message.
  • that then may be chained to form a pipeline

Streams are a flexible system to implement a pipeline of data transformations.

3 - Problems

A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer.

Consider the counting example, above (count page views and update a dashboard).

  • What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover?
  • Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.)
  • What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it’s too much for a single machine to handle?

Stream processing is a higher level of abstraction on top of messaging systems, and it’s meant to address precisely this category of problems.

See Samza is a stream processing framework that aims to help with these problems.

4 - Implementation

Queue:

  • Multiple input processor

Routing:

  • Category variable

Processor selection through:

  • Category variable
  • Data Type

Structure Flow:

  • Flow Content + Json Object for extra attributes (Ex: ResultSet + TestId)

Loop through an explicit processor (ie if and then go to step N) ?

  • Lazy instantiation: The operations are performed with an execute() call. Whether the program is executed locally or on a cluster depends on the type of execution environment. The lazy evaluation lets construct execution plan that may differ from the lexical (code) definition.

Data Processing - Event Loop

It is our firm belief that in the future, the majority of data applications, including real-time analytics, continuous analytics, and historical data processing, will treat data as what it really is: unbounded streams of events.

5 - Stream vs Batch

6 - Pattern

  • filter / map (stateless)
  • Aggregate
  • Stream / Table join (change log streams)
  • Window join (message buffer, co-partionning)
  • Reprocessing (Batch and streaming convergence)

7 - Software

8 - Documentation / Reference


Data Science
Data Analysis
Statistics
Data Science
Linear Algebra Mathematics
Trigonometry

Powered by ComboStrap