About
Fault tolerance (or resilience) is the ability of a system to recover from errors (faults), regardless of whether those errors result from:
- hardware issues,
- software issues,
- general systems issues (network latency, out-of-space errors),
- or human mistakes.
A system tolerant of every possible kind of fault is not feasible.
See also: Software Design - Recovery (Restartable) (same thing ?)
Term
When talking about fault tolerance, the following terms are often used:
- At least once: in a word-counting example, over-counting is possible after a failure, because some records may be processed more than once.
- Exactly once: the counts are the same with or without failures.
- End-to-end exactly once: the counts published to an external system are the same with or without failures.
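The difference between the first two terms can be sketched with a small simulation of the word-counting example. This is only an illustration under stated assumptions: `count_words`, `fail_at`, `atomic` and `commit_every` are hypothetical names, the "crash" is simulated in-process, and in the non-atomic mode the counts are assumed to live in a store that survives the crash while the read position falls back to the last committed offset.

```python
from collections import Counter

WORDS = ["a", "b", "a", "c", "a", "b"]

def count_words(words, fail_at=None, atomic=False, commit_every=2):
    """Count words, simulating one crash right after the record at index `fail_at`.

    atomic=False: counts survive the crash, the offset does not -> at least once.
    atomic=True:  counts and offset are restored together      -> exactly once.
    """
    counts = Counter()              # live state
    committed_offset = 0            # position to resume from after a crash
    committed_counts = Counter()    # state saved together with the offset
    failed = False
    i = 0
    while i < len(words):
        counts[words[i]] += 1
        i += 1
        if fail_at is not None and not failed and i - 1 == fail_at:
            failed = True
            if atomic:
                # Roll the state back to what was saved with the offset.
                counts = Counter(committed_counts)
            i = committed_offset    # replay everything after the last commit
            continue
        if i % commit_every == 0:
            committed_offset = i
            committed_counts = Counter(counts)
    return dict(counts)

baseline = count_words(WORDS)
at_least_once = count_words(WORDS, fail_at=2, atomic=False)
exactly_once = count_words(WORDS, fail_at=2, atomic=True)
```

With these inputs, `exactly_once` equals `baseline`, while `at_least_once` counts the replayed word "a" one time too many. End-to-end exactly once would additionally require that the external system receiving the counts never sees the duplicate, which this sketch does not model.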
Implementation
Checkpoint
Fault tolerance is generally provided through a mechanism called checkpointing: the system periodically takes a consistent snapshot of its state without ever stopping the computation.
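A minimal sketch of the idea, assuming a single-process word counter whose state fits in a JSON file. The function names, the file layout and the `interval` and `crash_at` parameters are all hypothetical. Unlike a real streaming engine, this sketch takes the snapshot synchronously between two records rather than in the background.

```python
import json
import os
import tempfile

def save_checkpoint(path, offset, state):
    """Persist (offset, state) together so they can never disagree."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset, "state": state}, f)
    # Rename over the old checkpoint; the rename is atomic on POSIX,
    # so a crash leaves either the old or the new file, never half of one.
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return (offset, state) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        data = json.load(f)
    return data["offset"], data["state"]

def run_with_checkpoints(words, path, interval=2, crash_at=None):
    """Count words, checkpointing every `interval` records.

    On start, the function resumes from the checkpoint at `path` if one exists.
    `crash_at` simulates a failure just before the record at that index.
    """
    offset, state = load_checkpoint(path)
    for i in range(offset, len(words)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state[words[i]] = state.get(words[i], 0) + 1
        if (i + 1) % interval == 0:
            save_checkpoint(path, i + 1, state)
    return state
```

If a run crashes, calling `run_with_checkpoints` again with the same `path` restores the last snapshot and reprocesses only the records after it, so the final counts match those of a run without failures.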
Savepoint
Savepoints make the checkpointing mechanism available directly to the user: they are checkpoints that are triggered externally by the user. Savepoints make it possible to “version” applications by taking consistent snapshots of the state at well-defined points in time, and then rerunning the application (or a different version of the application code) from that point. In practice, savepoints are essential for production use, enabling easy debugging, code upgrades, what-if simulations, and A/B testing.
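The "versioning" use case can be sketched as follows, assuming an in-memory savepoint represented as an `(offset, state)` tuple. All names here (`count_v1`, `count_v2`, `run_from_savepoint`, `trigger_at`) are hypothetical and only illustrate the idea of resuming a different version of the code from a user-chosen snapshot.

```python
import copy

def count_v1(state, word):
    """Version 1 of the application logic: case-sensitive counts."""
    state[word] = state.get(word, 0) + 1

def count_v2(state, word):
    """Version 2 (a hypothetical upgrade): case-insensitive counts."""
    key = word.lower()
    state[key] = state.get(key, 0) + 1

def run_from_savepoint(words, update, savepoint=None, trigger_at=None):
    """Process `words` with `update`, optionally resuming from a savepoint.

    `trigger_at` is the offset at which the user asks for a savepoint.
    Returns (final_state, savepoint); savepoint is None if none was requested.
    """
    # Copy the savepoint so that it can be reused for several reruns.
    offset, state = (0, {}) if savepoint is None else copy.deepcopy(savepoint)
    taken = None
    for i in range(offset, len(words)):
        if i == trigger_at:
            taken = (i, copy.deepcopy(state))  # snapshot before record i
        update(state, words[i])
    return state, taken

words = ["a", "B", "b", "A"]
# Run version 1 and trigger a savepoint before the third record.
v1_state, savepoint = run_from_savepoint(words, count_v1, trigger_at=2)
# Rerun the rest of the input from the savepoint with version 2.
v2_state, _ = run_from_savepoint(words, count_v2, savepoint=savepoint)
```

Because the savepoint is copied on every resume, the same snapshot can feed several reruns, for example one per code version in an A/B test or a what-if simulation.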