Software Design - (Fault Tolerance|Resilience)

About

Fault tolerance (or resilience) is the ability to recover from errors (fault), regardless of whether those errors resulted from:

hardware issues,
software issues,
general systems issues (network latency, out-of-space errors),
or human mistakes.

A system tolerant of every possible kind of fault is not feasible.

See also: Software Design - Recovery (Restartable) (same thing ?)

Articles Related

Term

When talking about fault tolerance, the following terms are often used:

At least once: this means, in a word counting example, that over-counting after failures is possible
Exactly once: this means that counts are the same with or without failures
End to end exactly once: this means that counts published to an external system will be the same with or without failures.

Implementation

Checkpoint

fault tolerance is generally provided via a mechanism called checkpoints, essentially taking a consistent snapshot periodically without ever stopping the computation.

Savepoint

Svepoints makes checkpointing mechanism available directly to the user. Savepoints are checkpoints that are triggered externally by the user. Savepoints make it possible to “version” applications by taking consistent snapshots of the state at well-defined time points, and then rerunning the application (or a different version of the application code from that time point). In practice, savepoints are essential for production use, enabling easy debugging, code upgrades, what-if simulations, and A/B testing.

About

Articles Related

Term

Checkpoint

Savepoint

Documentation / Reference