Software Design - (Fault Tolerance|Resilience)



Fault tolerance (or resilience) is the ability of a system to recover from errors (faults), regardless of whether those faults result from:

  • hardware issues,
  • software issues,
  • general systems issues (network latency, out-of-space errors),
  • or human mistakes.

A system tolerant of every possible kind of fault is not feasible.

See also: Software Design - Recovery (Restartable), a closely related and possibly identical concept.


When talking about fault tolerance, the following terms are often used:

  • At least once: every record is processed one or more times. In a word-counting example, over-counting after a failure is possible.
  • Exactly once: the counts are the same with or without failures.
  • End-to-end exactly once: the counts published to an external system are the same with or without failures.
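
The difference between the first two guarantees can be illustrated with a small word-count simulation. This is only a sketch under stated assumptions: the function `count_words`, its parameters and the commit interval of two records are hypothetical names and values chosen for illustration, and the crash is simulated by rewinding the read position to the last committed offset.

```python
from collections import Counter


def count_words(words, crash_at, restore_state):
    """Count words, simulating one crash just before index `crash_at`.

    Hypothetical sketch: the read offset is committed every 2 records.
    restore_state=False -> only the offset is rolled back (at least once).
    restore_state=True  -> counts and offset are rolled back together
                           (exactly once).
    """
    counts = Counter()
    saved_counts, saved_offset = Counter(), 0
    crashed = False
    i = 0
    while i < len(words):
        if i == crash_at and not crashed:
            crashed = True
            if restore_state:
                # Restore the counts that were saved with the offset.
                counts = Counter(saved_counts)
            # Replay the input from the last committed offset.
            i = saved_offset
            continue
        counts[words[i]] += 1
        i += 1
        if i % 2 == 0:
            # Commit point: remember the counts and the offset.
            saved_counts, saved_offset = Counter(counts), i
    return counts


words = ["a", "b", "a", "c", "a", "b"]
no_failure = count_words(words, crash_at=None, restore_state=False)
at_least_once = count_words(words, crash_at=3, restore_state=False)
exactly_once = count_words(words, crash_at=3, restore_state=True)
```

In this run the word at index 2 is replayed after the crash. With `restore_state=False` it is counted twice, so `at_least_once` over-counts `"a"`. With `restore_state=True` the replay starts from counts that match the committed offset, so `exactly_once` equals `no_failure`.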



Fault tolerance is generally provided via a mechanism called checkpointing: the system periodically takes a consistent snapshot of its state without ever stopping the computation, and after a failure it restarts from the latest snapshot.
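
A minimal sketch of the checkpoint-and-restart cycle is shown below. It assumes a single process summing a list of numbers. The names `save_checkpoint`, `load_checkpoint` and `run`, the JSON file layout and the checkpoint interval are all hypothetical. As a simplification, the snapshot is written synchronously between two records, whereas a real engine would typically take it without pausing the computation.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it into place.

    os.replace is atomic, so a crash during the write never leaves a
    half-written checkpoint at `path`.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last checkpoint, or an initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"offset": 0, "total": 0}


def run(records, path, every=3, fail_at=None):
    """Sum the records, checkpointing every `every` records.

    `fail_at` simulates a crash just before that index is processed.
    """
    state = load_checkpoint(path)
    for i in range(state["offset"], len(records)):
        if i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += records[i]
        state["offset"] = i + 1
        if state["offset"] % every == 0:
            # The snapshot is consistent: offset and total match.
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same `path` after a crash resumes from the last saved offset, so the work done since that checkpoint is repeated, but nothing before it. Because the offset and the total are saved together, the final sum is the same as in a run without failure.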


Savepoints make the checkpointing mechanism available directly to the user: a savepoint is a checkpoint that is triggered externally by the user. Savepoints make it possible to “version” applications by taking consistent snapshots of the state at well-defined points in time, and then rerunning the application (or a different version of the application code) from that point. In practice, savepoints are essential for production use, enabling easy debugging, code upgrades, what-if simulations, and A/B testing.
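
The idea of rerunning a different version of the code from a savepoint can be sketched as follows. The class `CountingJob` and its methods are hypothetical and not the API of any particular engine. The sketch assumes the whole job state fits in a small dictionary that can be copied.

```python
import copy


class CountingJob:
    """Hypothetical job that sums transform(record) over its input."""

    def __init__(self, transform, state=None):
        self.transform = transform
        self.state = state if state is not None else {"offset": 0, "total": 0}

    def process(self, records, upto=None):
        """Process records from the current offset up to `upto`."""
        end = len(records) if upto is None else upto
        for i in range(self.state["offset"], end):
            self.state["total"] += self.transform(records[i])
            self.state["offset"] = i + 1
        return self.state["total"]

    def savepoint(self):
        """User-triggered snapshot: an independent copy of the state."""
        return copy.deepcopy(self.state)

    @classmethod
    def from_savepoint(cls, savepoint, transform):
        """Start a job, possibly with new code, from a saved state."""
        return cls(transform, copy.deepcopy(savepoint))


records = [1, 2, 3, 4]

v1 = CountingJob(lambda x: x)
v1.process(records, upto=2)       # process the first two records
sp = v1.savepoint()               # snapshot taken on demand
v1_total = v1.process(records)    # version 1 continues to the end

# Version 2 of the code resumes from the same savepoint.
v2 = CountingJob.from_savepoint(sp, lambda x: x * 10)
v2_total = v2.process(records)
```

Both versions share the state accumulated before the savepoint and differ only in how they handle the remaining records. This is the pattern behind code upgrades and A/B comparisons from a common starting point.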
