Stream - Samza

About

Samza is a stream processing framework created at LinkedIn that provides powerful, reliable tools for working with data in Kafka.

LinkedIn also created Apache Kafka, to be the data exchange backbone of its organisation. See also StreamTask.

A Samza deployment typically comprises three different systems:

  • YARN,
  • Kafka,
  • and ZooKeeper.

Stream Definition

A stream in Samza is:

  • a partitioned,
  • ordered-per-partition,
  • replayable,
  • multi-subscriber,
  • lossless

sequence of messages.
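The properties above can be illustrated with a toy in-memory model. This is a minimal sketch, not how Samza or Kafka store data: the class PartitionedLog and its methods are hypothetical names invented for this example. It assumes messages are routed by the hash of a key and that an offset is simply a position within a partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy in-memory stream (hypothetical): a fixed set of append-only partitions. */
final class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Partitioned: the key alone decides which partition a message lands in. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Ordered per partition: the returned offset is the message's position. */
    long append(String key, String message) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(message);
        return partition.size() - 1;
    }

    /**
     * Replayable, multi-subscriber and lossless: reading never removes anything,
     * so any reader can start from any offset, as often as it likes.
     */
    List<String> read(int partition, long fromOffset) {
        List<String> messages = partitions.get(partition);
        int start = (int) Math.min(Math.max(fromOffset, 0), messages.size());
        return Collections.unmodifiableList(
                new ArrayList<>(messages.subList(start, messages.size())));
    }
}

final class PartitionedLogDemo {
    public static void main(String[] args) {
        PartitionedLog log = new PartitionedLog(4);
        log.append("user-1", "login");
        log.append("user-1", "click");
        int p = log.partitionFor("user-1");
        System.out.println(log.read(p, 0)); // full replay: [login, click]
        System.out.println(log.read(p, 1)); // resume from offset 1: [click]
    }
}
```

Because a read is just a copy starting at an offset, two subscribers never interfere with each other, and ordering is only guaranteed among messages that share a partition.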

Streams are not just inputs and outputs to the system, but also buffers between processing stages. The input to the next processing stage is simply the stream produced by the earlier stage.

This is the same model as in Hadoop, where the processing stages are MapReduce jobs and the output of a processing stage is a directory of files on HDFS.

Benefits

The benefits of this model are:

  • Strong isolation of processing stages from each other. Jobs are loosely coupled and there is no need for backpressure.
  • All stages are multi-subscriber. Other jobs can consume the output of any stage and build on it.
  • Debugging flows is easy, as you can manually inspect the output of any stage.
  • Recovery (restartability). Each job need only be concerned with its own inputs and outputs, and in the case of a fault, each job can be recovered and restarted independently. There is no need for central control over the entire dataflow graph.
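The isolation and independent-recovery points can be sketched with two toy jobs reading the same input. This is an illustration under simplifying assumptions, not Samza's checkpointing mechanism: StageJob is a hypothetical class, a stream is modelled as a plain list, and a "crash" is simulated by discarding a job object and creating a new one from its saved offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy job (hypothetical): reads a shared stream and tracks only its own position. */
final class StageJob {
    private final String name;
    private final List<String> input;   // shared stream; the job never modifies it
    private final List<String> output;  // this job's output stream
    private int nextOffset;             // position known only to this job

    StageJob(String name, List<String> input, List<String> output, int startOffset) {
        this.name = name;
        this.input = input;
        this.output = output;
        this.nextOffset = startOffset;
    }

    /** Processes at most maxMessages messages and returns how many were handled. */
    int poll(int maxMessages) {
        int end = Math.min(input.size(), nextOffset + Math.max(0, maxMessages));
        int processed = end - nextOffset;
        while (nextOffset < end) {
            output.add(name + "(" + input.get(nextOffset) + ")");
            nextOffset++;
        }
        return processed;
    }

    /** The offset a restarted instance should resume from. */
    int checkpoint() {
        return nextOffset;
    }
}

final class StageJobDemo {
    public static void main(String[] args) {
        List<String> events = new ArrayList<>(Arrays.asList("a", "b", "c"));
        List<String> enriched = new ArrayList<>();
        List<String> audited = new ArrayList<>();

        StageJob enrich = new StageJob("enrich", events, enriched, 0);
        StageJob audit = new StageJob("audit", events, audited, 0);

        enrich.poll(2);                  // handles a and b, then "crashes"
        int saved = enrich.checkpoint(); // 2
        audit.poll(10);                  // the other subscriber is unaffected

        // Restart only the failed job, from its own saved position.
        StageJob restarted = new StageJob("enrich", events, enriched, saved);
        restarted.poll(10);              // handles c only

        System.out.println(enriched); // [enrich(a), enrich(b), enrich(c)]
        System.out.println(audited);  // [audit(a), audit(b), audit(c)]
    }
}
```

Neither job knows the other exists: the only shared thing is the input stream, which is why no coordinator is needed to restart one of them. The output lists can also be inspected directly, or used as the input of a further stage.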

API

High Level API

Application

An application written using Samza’s High Level API implements the StreamApplication interface.

The interface provides a single method named describe(), which allows us to define our inputs, the processing logic and outputs for our application.

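// Simplified view of the interface (Java)
public interface StreamApplication {
  // Called to declare the inputs, processing logic and outputs.
  void describe(StreamApplicationDescriptor appDescriptor);
}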
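The shape of this pattern, where the application only declares its wiring in describe() and a runtime executes it later, can be sketched without the Samza dependency. Everything below is a local stand-in written for this example: SketchStreamApplication, SketchDescriptor, SketchRunner and the methods from, map and to are hypothetical and are not Samza's API. The real StreamApplicationDescriptor has a different, richer interface, so refer to the Samza documentation for actual method names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Stand-in for the application interface: a single describe() method. */
interface SketchStreamApplication {
    void describe(SketchDescriptor descriptor);
}

/** Stand-in descriptor (hypothetical): records inputs, logic and outputs. */
final class SketchDescriptor {
    String inputStream;
    String outputStream;
    Function<String, String> logic = Function.identity();

    SketchDescriptor from(String streamId) {
        this.inputStream = streamId;
        return this;
    }

    SketchDescriptor map(Function<String, String> step) {
        this.logic = this.logic.andThen(step);
        return this;
    }

    SketchDescriptor to(String streamId) {
        this.outputStream = streamId;
        return this;
    }
}

/** Example application: declares its wiring and does no processing itself. */
final class UpperCaseApp implements SketchStreamApplication {
    @Override
    public void describe(SketchDescriptor descriptor) {
        descriptor.from("page-views")
                  .map(String::toUpperCase)
                  .to("page-views-upper");
    }
}

/** Stand-in runtime: calls describe() once, then applies the declared logic. */
final class SketchRunner {
    static List<String> run(SketchStreamApplication app, List<String> messages) {
        SketchDescriptor descriptor = new SketchDescriptor();
        app.describe(descriptor);
        List<String> out = new ArrayList<>();
        for (String message : messages) {
            out.add(descriptor.logic.apply(message));
        }
        return out;
    }
}
```

The point of the split is that describe() runs once, before any message flows, so the framework can inspect the declared inputs and outputs before deciding how to deploy and run the job.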




