Table of Contents

Application - Fault (Crash)

About

Things that can go wrong in a system are called faults. A fault is usually defined as one component of the system deviating from its specification.

It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault tolerance mechanisms that prevent faults from causing failures.

Software that deliberately causes faults — for example, randomly killing individual processes without warning — is known as chaos monkey.

Recovery of fault is called fault tolerance and is generally implemented via a rollback journal that keeps track of the changed sector.

See Counter - Error Rate