Application - Fault (Crash)



Things that can go wrong in a system are called faults. A fault is usually defined as one component of the system deviating from its specification.

It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault tolerance mechanisms that prevent faults from causing failures.
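The distinction between a fault and a failure can be sketched with a small retry wrapper. This is only an illustration under stated assumptions: `FlakyComponent` and `read_with_retry` are hypothetical names, and the fault is simulated by raising `IOError` a fixed number of times.

```python
class FlakyComponent:
    """Hypothetical component that deviates from its spec on its first `faults` calls."""

    def __init__(self, faults):
        self.remaining_faults = faults
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.remaining_faults > 0:
            self.remaining_faults -= 1
            raise IOError("simulated fault")
        return "data"


def read_with_retry(component, max_attempts=3):
    """Retry so that a transient fault does not surface as a failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return component.read()
        except IOError as exc:
            last_error = exc  # the fault is tolerated; try again
    # Only when every attempt faults does the caller see a failure.
    raise RuntimeError("failure: fault was not tolerated") from last_error
```

With two simulated faults and three attempts, the caller still receives `"data"`; with more faults than attempts, the fault becomes a visible failure. Retrying is only one possible mechanism and suits transient faults.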

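Software that deliberately causes faults (for example, by randomly killing individual processes without warning) is used to check that fault tolerance mechanisms actually work. Netflix's Chaos Monkey is a well-known example, and the practice is known as chaos engineering.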
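As a rough sketch of the idea, the snippet below simulates fault injection in-process rather than killing real operating-system processes. The names `Worker`, `inject_faults`, and `supervise` are hypothetical, and the "kill" is only a flag flip; a real tool would terminate actual processes or machines.

```python
import random


class Worker:
    """Hypothetical stand-in for a process that can be killed and restarted."""

    def __init__(self, name):
        self.name = name
        self.alive = True


def inject_faults(workers, rng, kill_probability=0.5):
    """Kill each live worker with the given probability; return the names killed."""
    killed = []
    for worker in workers:
        if worker.alive and rng.random() < kill_probability:
            worker.alive = False  # simulated crash, without warning
            killed.append(worker.name)
    return killed


def supervise(workers):
    """Fault tolerance mechanism under test: restart every dead worker."""
    restarted = 0
    for worker in workers:
        if not worker.alive:
            worker.alive = True
            restarted += 1
    return restarted
```

A test run would call `inject_faults(workers, random.Random(seed))` and then check that `supervise` brings every worker back; seeding the generator keeps the experiment reproducible.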

Recovering from a fault is called fault tolerance. It is generally implemented via a rollback journal that keeps track of the changed sectors so that their original contents can be restored.
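The rollback journal idea can be sketched in memory as follows. This is a simplified illustration, not how any particular database implements it: `JournaledStore` is a hypothetical class, sectors are modeled as list entries, and the journal lives in a dictionary, whereas a real journal must be persisted to disk before the original sector is overwritten.

```python
class JournaledStore:
    """Hypothetical store that journals original sector contents before changing them."""

    def __init__(self, sectors):
        self.sectors = list(sectors)
        self.journal = {}  # sector index -> original content

    def write(self, index, data):
        # Save the original content once, before the first change to this sector.
        if index not in self.journal:
            self.journal[index] = self.sectors[index]
        self.sectors[index] = data

    def commit(self):
        # The changes are now permanent, so the saved originals are discarded.
        self.journal.clear()

    def rollback(self):
        # Recovery: restore every changed sector from the journal.
        for index, original in self.journal.items():
            self.sectors[index] = original
        self.journal.clear()
```

If a fault occurs between `write` and `commit`, calling `rollback` returns the store to its last committed state. After `commit`, a later `rollback` has nothing to undo.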

See Counter - Error Rate

Discover More
Application - Fault Handling

fault handling
Counter - Error Rate

Error rate is an SLI metric that captures the number of errors (resource errors such as memory or disk, or process errors such as timeouts) that occur. The error rate metric is generally expressed as a rate of errors (per unit...
Software Design - (Fault Tolerance|Resilience)

Fault tolerance (or resilience) is the ability to recover from errors (faults), regardless of whether those errors resulted from hardware issues, software issues, general systems issues (network...
System Property - Reliability

The system should continue to work correctly (performing the correct function at the desired performance) even in the face of adversity (hardware or software faults, and even human error). Since...
Transactions - Write-Ahead Logging (Rollback journal) - WAL

Write-Ahead Logging (WAL) is a rollback journal implementation. This implementation writes changes directly to the rollback journal, whereas the traditional rollback journal writes changes to the original...
