About
Things that can go wrong in a system are called faults. A fault is usually defined as one component of the system deviating from its specification.
It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault tolerance mechanisms that prevent faults from causing failures.
Software that deliberately causes faults — for example, randomly killing individual processes without warning — is known as chaos monkey.
Recovery of fault is called fault tolerance and is generally implemented via a rollback journal that keeps track of the changed sector.