Fault tolerance (or resilience) is the ability to recover from errors (fault), regardless of whether those errors resulted from:
A system tolerant of every possible kind of fault is not feasible.
See also: Software Design - Recovery (Restartable) (same thing ?)
When talking about fault tolerance, the following terms are often used:
fault tolerance is generally provided via a mechanism called checkpoints, essentially taking a consistent snapshot periodically without ever stopping the computation.
Svepoints makes checkpointing mechanism available directly to the user. Savepoints are checkpoints that are triggered externally by the user. Savepoints make it possible to “version” applications by taking consistent snapshots of the state at well-defined time points, and then rerunning the application (or a different version of the application code from that time point). In practice, savepoints are essential for production use, enabling easy debugging, code upgrades, what-if simulations, and A/B testing.