Stream - Samza



Samza is LinkedIn's stream processing framework. It provides powerful, reliable tools for working with data in Kafka.

(LinkedIn created Apache Kafka to be the data exchange backbone of its organisation.) See StreamTask

Samza runs on a Samza grid that comprises three different systems:

  • YARN,
  • Kafka,
  • and ZooKeeper.

Stream Definition

A stream in Samza is:

  • a partitioned,
  • ordered-per-partition,
  • replayable,
  • multi-subscriber,
  • lossless

sequence of messages.
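
The properties listed above can be illustrated with a small in-memory model. This is only a sketch: `PartitionedStream` and its methods are hypothetical names made up for the example, not part of Samza or Kafka, and a real stream is stored durably rather than in Java lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a stream; not Samza or Kafka code.
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Partitioned: the key picks the partition, so equal keys share a partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Ordered per partition: messages are appended at the end.
    // Returns the offset of the appended message within its partition.
    long append(String key, String message) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(message);
        return partition.size() - 1;
    }

    // Replayable, multi-subscriber, lossless: reading never removes messages,
    // and every reader passes its own starting offset.
    List<String> read(int partition, int fromOffset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList(Math.min(fromOffset, log.size()), log.size()));
    }
}

class StreamDefinitionSketch {
    public static void main(String[] args) {
        PartitionedStream stream = new PartitionedStream(2);
        stream.append("user-1", "login");
        stream.append("user-1", "click");
        int partition = stream.partitionFor("user-1");
        System.out.println(stream.read(partition, 0)); // a reader starting from the beginning
        System.out.println(stream.read(partition, 1)); // another reader starting at offset 1
    }
}
```

Because a read only takes an offset and never deletes anything, any number of subscribers can consume the same partition, and each of them can replay it from an earlier offset.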

Streams are not just inputs and outputs to the system, but also buffers. The input to the next processing stage is simply the stream produced by the earlier stage.

This is the same model as in Hadoop, where the processing stages are MapReduce jobs and the output of a processing stage is a directory of files on HDFS.


The benefits of this model are:

  • Strong isolation of processing stages from each other. Jobs are loosely coupled and no backpressure is needed.
  • All stages are multi-subscriber. Other jobs can consume the output of a stage and build on it.
  • Debugging flows is easy, as you can manually inspect the output of any stage.
  • Recovery (restartability): each job need only be concerned with its own inputs and outputs, and in the case of a fault, each job can be recovered and restarted independently. There is no need for central control over the entire dataflow graph.
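
The decoupling described above can be sketched with two stages that share nothing but an intermediate stream. Everything here is an assumption made for illustration: `Stage`, `run` and the `checkpoint` field are hypothetical, a plain list stands in for a stream partition, and the checkpoint lives in memory, whereas a real job would need to store its position durably to survive a restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical model of one processing stage; an append-only list stands in for a stream.
class Stage {
    private final List<String> input;
    private final List<String> output;
    private final Function<String, String> logic;
    private int checkpoint = 0; // offset of the next unread input message

    Stage(List<String> input, List<String> output, Function<String, String> logic) {
        this.input = input;
        this.output = output;
        this.logic = logic;
    }

    // Processes at most maxMessages unread input messages and returns how many were processed.
    int run(int maxMessages) {
        int processed = 0;
        while (checkpoint < input.size() && processed < maxMessages) {
            output.add(logic.apply(input.get(checkpoint)));
            checkpoint++;
            processed++;
        }
        return processed;
    }
}

class PipelineSketch {
    public static void main(String[] args) {
        List<String> raw = new ArrayList<>(Arrays.asList("a", "b", "c"));
        List<String> upper = new ArrayList<>();  // intermediate stream, acting as a buffer
        List<String> tagged = new ArrayList<>();

        Stage first = new Stage(raw, upper, String::toUpperCase);
        Stage second = new Stage(upper, tagged, s -> "seen:" + s);

        first.run(10);               // upstream finishes without waiting for downstream
        second.run(1);               // downstream stops early, for example after a fault
        System.out.println(upper);   // the intermediate output can be inspected at any time
        second.run(10);              // downstream resumes from its own checkpoint
        System.out.println(tagged);
    }
}
```

The first stage never learns that the second one stopped, which is why no backpressure or central coordinator appears anywhere in the sketch.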


High Level API


An application written using Samza's High Level API implements the StreamApplication interface.

The interface provides a single method named describe(), which allows us to define our inputs, the processing logic and outputs for our application.

StreamApplication { 
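
The single-method shape described above can be sketched without the Samza dependency. The types below (`SketchDescriptor`, `SketchStreamApplication`, `DescribeSketch`) are simplified stand-ins written for this example, not Samza's classes, and the stream names `page-views` and `page-views-clean` are made up. Samza's own descriptor and operator classes are deliberately not used, so the block only shows the idea: `describe()` declares inputs, logic and outputs, and the framework runs them afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Stand-in for the descriptor object handed to describe(); NOT a Samza class.
class SketchDescriptor {
    String inputStream;
    String outputStream;
    final List<Function<String, String>> steps = new ArrayList<>();

    SketchDescriptor from(String streamId) { inputStream = streamId; return this; }
    SketchDescriptor map(Function<String, String> step) { steps.add(step); return this; }
    SketchDescriptor sendTo(String streamId) { outputStream = streamId; return this; }
}

// Stand-in that mirrors the single-method shape of StreamApplication.
interface SketchStreamApplication {
    void describe(SketchDescriptor appDescriptor);
}

class UppercaseApp implements SketchStreamApplication {
    @Override
    public void describe(SketchDescriptor appDescriptor) {
        appDescriptor.from("page-views")        // input (hypothetical stream name)
                .map(String::trim)              // processing logic
                .map(String::toUpperCase)
                .sendTo("page-views-clean");    // output (hypothetical stream name)
    }
}

class DescribeSketch {
    // Plays the role of the framework: calls describe() once, then applies the declared steps.
    static List<String> run(SketchStreamApplication app, List<String> messages) {
        SketchDescriptor descriptor = new SketchDescriptor();
        app.describe(descriptor);
        List<String> out = new ArrayList<>();
        for (String message : messages) {
            String value = message;
            for (Function<String, String> step : descriptor.steps) {
                value = step.apply(value);
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new UppercaseApp(), Arrays.asList(" home ", "about")));
    }
}
```

Note that `describe()` itself processes no messages. It only records what should happen, which is what lets a framework decide later where and how the described steps are executed.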
