Alerting is the science of detection
Deviations from the predicted values of a time series are a powerful way to detect a problem and to trigger an alert when a threshold is reached.
An alert should communicate in plain language:
Severity / Urgency | Alert Type | Description | Example |
---|---|---|---|
High / Urgent | Page (as in pager) | Immediate human intervention, interrupts a recipient’s work, sleep, or personal time, whatever the hour | Response times exceed a Service Level Agreement (SLA); throughput, latency, or error rates are unacceptable. |
Moderate | Notification | Eventual human intervention, notifies someone who can fix the problem in a non-interrupting way such as email or chat. | Data store is running low on disk space |
Low | Log / Record | Attention needed in the future, does not notify anyone automatically | Transient issues, such as network congestion, that often go away on their own. |
Moderate- and low-severity alerts won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
See also: Priority vs Severity.
Note that depending on severity, a notification may be more appropriate than a page, or vice versa:
GitHub issues have neither priority nor ordering. See Issues 2.0: The Next Generation. It revolves around three major pillars:
A symptom (oftentimes user-facing problems) may have any number of different causes.
Page on a symptom (a user-experience problem, such as slow website responses) and notify on potential causes of the symptom, such as high load on your web servers. Users will not know or care about server load if the website is still responding quickly.
Data | Alert | Trigger |
---|---|---|
Work metric: Throughput | Page | value is much higher or lower than usual, or there is an anomalous rate of change |
Work metric: Success | Page | the percentage of work that is successfully processed drops below a threshold |
Work metric: Errors | Page | the error rate exceeds a threshold |
Work metric: Latency | Page | work takes too long to complete (e.g., performance violates internal SLA) |
Resource metric: Utilization | Notification | approaching critical resource limit (e.g., free disk space drops below a threshold) |
Resource metric: Saturation | Record | number of waiting processes exceeds a threshold |
Resource metric: Errors | Record | number of errors during a fixed period exceeds a threshold |
Resource metric: Availability | Record | the resource is unavailable for a percentage of time that exceeds a threshold |
Event: Work-related | Page | critical work that should have been completed is reported as incomplete or failed |
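
The table above can be read as a routing rule: the kind of data decides the alert type. A minimal sketch of that lookup follows; the names `AlertType`, `ROUTING`, and `alert_for` are hypothetical and do not come from any alerting library.

```python
from enum import Enum


class AlertType(Enum):
    PAGE = "page"                  # interrupts a human immediately
    NOTIFICATION = "notification"  # email or chat, non-interrupting
    RECORD = "record"              # logged only, nobody is notified


# (data kind, metric name) -> alert type, copied from the table above.
ROUTING = {
    ("work", "throughput"): AlertType.PAGE,
    ("work", "success"): AlertType.PAGE,
    ("work", "errors"): AlertType.PAGE,
    ("work", "latency"): AlertType.PAGE,
    ("resource", "utilization"): AlertType.NOTIFICATION,
    ("resource", "saturation"): AlertType.RECORD,
    ("resource", "errors"): AlertType.RECORD,
    ("resource", "availability"): AlertType.RECORD,
    ("event", "work-related"): AlertType.PAGE,
}


def alert_for(kind: str, metric: str) -> AlertType:
    """Return the alert type for a data kind and metric name."""
    return ROUTING[(kind.lower(), metric.lower())]
```

For example, `alert_for("resource", "utilization")` returns `AlertType.NOTIFICATION`, while every work metric maps to a page.
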
A static threshold is a fixed boundary, a floor or a ceiling, that characterizes normal behavior. Crossing it indicates a deviation from normal behavior.
Static thresholds cannot accurately capture deviations in oscillating signals.
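The limitation can be illustrated with a small sketch. The signal, the bounds, and the `breaches` helper below are all made up for the illustration: a daily cycle swings between 20 and 180, and an abnormal value is injected at the hour where the cycle should be at its lowest.

```python
import math


def breaches(value: float, floor: float, ceiling: float) -> bool:
    """True when the value crosses the fixed floor or ceiling."""
    return value < floor or value > ceiling


# Hypothetical signal: a daily cycle swinging between 20 and 180.
signal = [100 + 80 * math.sin(2 * math.pi * hour / 24) for hour in range(24)]

# 150 would be normal near the peak (hour 6) but is abnormal at the trough (hour 18).
signal[18] = 150

# Bounds wide enough to contain the normal cycle miss the anomaly entirely.
wide = [h for h, v in enumerate(signal) if breaches(v, floor=10, ceiling=190)]

# Tighter bounds catch hour 18, but also fire on ordinary peaks and troughs.
tight = [h for h, v in enumerate(signal) if breaches(v, floor=60, ceiling=140)]
```

Here `wide` is empty, while `tight` contains hour 18 along with many perfectly normal hours, so neither choice of fixed bounds separates the anomaly from the regular oscillation.
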
With a time series forecasting method, the bound is no longer static and can “move” with the input signal.
A model that requires only the most recent observation to be kept, such as exponential smoothing, is suitable for real-time alerting.
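A minimal sketch of such a moving bound, using simple exponential smoothing for the forecast. The band itself, built from an exponentially smoothed squared forecast error, is an assumption made for this example and not a method prescribed here; the `SmoothedBand` class and the `alpha` and `width` defaults are likewise hypothetical.

```python
class SmoothedBand:
    """Moving alert band built on simple exponential smoothing.

    Only the latest smoothed level and smoothed squared error are stored,
    so no history of observations has to be kept.
    """

    def __init__(self, alpha: float = 0.3, width: float = 3.0):
        self.alpha = alpha    # smoothing factor, 0 < alpha <= 1
        self.width = width    # band half-width in deviations (assumed value)
        self.level = None     # current forecast for the next observation
        self.variance = 0.0   # smoothed squared forecast error

    def update(self, value: float) -> bool:
        """Feed one observation; return True if it falls outside the band."""
        if self.level is None:  # the first observation seeds the forecast
            self.level = value
            return False
        error = value - self.level
        limit = self.width * self.variance ** 0.5
        outside = self.variance > 0 and abs(error) > limit
        # Same as: level = alpha * value + (1 - alpha) * level
        self.level += self.alpha * error
        self.variance = self.alpha * error ** 2 + (1 - self.alpha) * self.variance
        return outside


band = SmoothedBand()
readings = [10, 12, 9, 11, 10, 12, 9, 11, 40, 10]
flags = [band.update(r) for r in readings]
```

With these made-up readings, only the spike to 40 is flagged. One caveat of this sketch is that a large spike also inflates the smoothed error, so the band stays wide for a while afterwards.
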
Many alerts can fire for the same root cause. Real problems are often lost in a sea of noisy alarms.
They can be:
Alerts are therefore often aggregated to show the real state of the system.
Duplicate detection may have several rules such as:
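One possible rule, assumed here purely for illustration: two alerts are duplicates when they share the same source and check name and arrive within a fixed time window. A minimal sketch follows; the `Deduplicator` class, its key, and the 300-second window are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Deduplicator:
    """Suppress alerts whose key repeats within a time window."""

    window_seconds: float = 300.0
    last_seen: dict = field(default_factory=dict)

    def is_duplicate(self, source: str, check: str, timestamp: float) -> bool:
        key = (source, check)            # assumed identity of an alert
        previous = self.last_seen.get(key)
        self.last_seen[key] = timestamp  # every occurrence restarts the window
        return previous is not None and timestamp - previous <= self.window_seconds
```

A caller would forward an alert only when `is_duplicate` returns `False`. Because every occurrence restarts the window, a continuously firing alert stays suppressed until it has been quiet for longer than the window.
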