About
Alerting is the science of detection
Deviations from a prediction on a time series are a powerful way to tell when there is a problem and to trigger an alert when a threshold is reached.
Format
Subject
An alert should communicate in plain language:
- Service A is down
- 90% of all web requests are taking more than 0.5s to process and respond.
Property
Urgency / Severity
Severity / Urgency | Alert Type | Description | Example |
---|---|---|---|
High / Urgent | Page (as in pager) | Immediate human intervention; interrupts a recipient’s work, sleep, or personal time, whatever the hour | Response times exceed an SLA (Service Level Agreement): unacceptable throughput, latency, or error rates. |
Moderate | Notification | Eventual human intervention, notifies someone who can fix the problem in a non-interrupting way such as email or chat. | Data store is running low on disk space |
Low | Log / Record | Attention needed in the future; does not notify anyone automatically | Transient issues, such as network congestion, that often go away on their own. |
Moderate- and low-severity alerts won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
See also: Priority vs Severity.
Note that depending on severity, a notification may be more appropriate than a page, or vice versa.
Label, Milestone, Assignments
GitHub issues have neither a priority nor an ordering. See Issues 2.0: The Next Generation. The issue tracker revolves around three major pillars:
- Assignments: each issue can be assigned to a collaborator.
- Labels (tags, like Gmail labels): one issue can be tagged with several labels.
- Milestones: several issues can be “packaged” into a milestone.
Symptom vs Cause
A symptom (often a user-facing problem) may have any number of different causes.
Page on a symptom (user experience, such as slow website responses) and notify on potential causes of the symptom, such as high load on your web servers. Users will not know or care about server load if the website is still responding quickly.
Example
Data | Alert | Trigger |
---|---|---|
Work metric: Throughput | Page | value is much higher or lower than usual, or there is an anomalous rate of change |
Work metric: Success | Page | the percentage of work that is successfully processed drops below a threshold |
Work metric: Errors | Page | the error rate exceeds a threshold |
Work metric: Latency | Page | work takes too long to complete (e.g., performance violates internal SLA) |
Resource metric: Utilization | Notification | approaching critical resource limit (e.g., free disk space drops below a threshold) |
Resource metric: Saturation | Record | number of waiting processes exceeds a threshold |
Resource metric: Errors | Record | number of errors during a fixed period exceeds a threshold |
Resource metric: Availability | Record | the resource is unavailable for a percentage of time that exceeds a threshold |
Event: Work-related | Page | critical work that should have been completed is reported as incomplete or failed |
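The table above can be read as a list of rules, each pairing a trigger with an alert type. The following is only a minimal sketch of that idea: the metric names (`error_rate`, `p90_latency_s`, `disk_free_ratio`, `waiting_processes`) and every threshold value are hypothetical and would have to be replaced by values that fit your own system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    alert_type: str  # "page", "notification" or "record"
    triggered: Callable[[Dict[str, float]], bool]


# Hypothetical metric names and thresholds, chosen only for illustration.
RULES: List[Rule] = [
    # Work metrics (symptoms) page someone.
    Rule("errors", "page", lambda m: m["error_rate"] > 0.05),
    Rule("latency", "page", lambda m: m["p90_latency_s"] > 0.5),
    # Resource metrics (potential causes) notify or are only recorded.
    Rule("utilization", "notification", lambda m: m["disk_free_ratio"] < 0.10),
    Rule("saturation", "record", lambda m: m["waiting_processes"] > 20),
]


def evaluate(metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (rule name, alert type) for every rule whose trigger fires."""
    return [(r.name, r.alert_type) for r in RULES if r.triggered(metrics)]


sample = {
    "error_rate": 0.01,
    "p90_latency_s": 0.8,
    "disk_free_ratio": 0.07,
    "waiting_processes": 3,
}
alerts = evaluate(sample)
```

With this sample, the slow responses produce a page and the low disk space produces a notification, while the error rate and the saturation stay below their thresholds.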
Detection
- The Wheeler rules (equivalent to the Western Electric zone tests)
- The Nelson rules
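As an illustration of what such rules look like, here is a minimal sketch of the first two Nelson rules: a single point more than three standard deviations from the mean, and nine points in a row on the same side of the mean. It assumes that the mean and the standard deviation come from a baseline period known to be healthy; the function names are hypothetical.

```python
from typing import List, Sequence


def rule_one(values: Sequence[float], mean: float, sigma: float) -> List[int]:
    """Indexes of points more than 3 standard deviations from the mean."""
    return [i for i, v in enumerate(values) if abs(v - mean) > 3 * sigma]


def rule_two(values: Sequence[float], mean: float, run_length: int = 9) -> List[int]:
    """Indexes at which a run of at least `run_length` consecutive points
    on the same side of the mean is reached or continued."""
    hits: List[int] = []
    run = 0   # length of the current same-side run
    side = 0  # +1 above the mean, -1 below, 0 exactly on the mean
    for i, v in enumerate(values):
        current = (v > mean) - (v < mean)
        if current != 0 and current == side:
            run += 1
        else:
            # A point on the mean breaks the run; any other point starts a new one.
            run = 1 if current != 0 else 0
        side = current
        if run >= run_length:
            hits.append(i)
    return hits
```

The `run_length` parameter is exposed because the zone tests differ in the count they use; nine is the value of the Nelson rule.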
Threshold
Horizontal Line
A fixed boundary as a floor or ceiling that characterizes normal behavior which, if crossed, indicates a deviation from normal behavior.
Static thresholds cannot accurately capture deviations in oscillating signals.
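The limitation can be sketched with a synthetic signal. The example below assumes a hypothetical daily cycle that oscillates between 50 and 150, with a floor and a ceiling placed just outside that range. A jump to 120 at the low point of the cycle is abnormal for that hour, yet it stays inside the fixed band and is never flagged.

```python
import math


def crosses(value: float, floor: float, ceiling: float) -> bool:
    """True when the value leaves the fixed [floor, ceiling] band."""
    return value < floor or value > ceiling


# Hypothetical signal: 24 hourly samples oscillating between 50 and 150.
signal = [100 + 50 * math.sin(2 * math.pi * hour / 24) for hour in range(24)]

# Inject an anomaly at hour 18, where the normal value is about 50.
signal[18] = 120.0

# Static bounds sized for the whole cycle.
flagged = [h for h, v in enumerate(signal) if crosses(v, floor=40, ceiling=160)]
```

Here `flagged` stays empty: the bounds must be wide enough to contain the normal peaks, so they are too wide to notice an anomaly in the trough.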
Time series Model
A time series forecasting method can provide the bound instead. The bound is then no longer static and can “move” with the input signal.
A model that requires only the most recent observation to be kept, such as exponential smoothing (see Time Series - Exponential smoothing), is suitable for real-time alerting.
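A moving bound of this kind can be sketched with simple exponential smoothing. The class below is an assumption made for illustration, not a reference implementation: it keeps only the smoothed level and a smoothed absolute forecast error, and flags a value that deviates from the level by more than `width` times that error. The names `SmoothedBand`, `alpha` and `width` and their default values are hypothetical.

```python
class SmoothedBand:
    """Moving alert band built on simple exponential smoothing."""

    def __init__(self, alpha: float = 0.3, width: float = 3.0) -> None:
        self.alpha = alpha      # smoothing factor, between 0 and 1
        self.width = width      # band half-width, in units of smoothed error
        self.level = None       # smoothed value, used as the forecast
        self.deviation = 0.0    # smoothed absolute forecast error

    def update(self, value: float) -> bool:
        """Return True if `value` falls outside the band, then update the state."""
        if self.level is None:
            # First observation: nothing to compare with yet.
            self.level = value
            return False
        error = abs(value - self.level)
        # While no deviation has been learned yet, nothing is flagged (warm-up).
        anomalous = self.deviation > 0 and error > self.width * self.deviation
        self.level = self.alpha * value + (1 - self.alpha) * self.level
        self.deviation = self.alpha * error + (1 - self.alpha) * self.deviation
        return anomalous


band = SmoothedBand()
flags = [band.update(v) for v in [10, 11, 10, 11, 10, 11, 10, 30]]
```

Only the last value (30) is flagged in this run. Note that the sketch also folds anomalous values into the state; a production setup might prefer to skip or dampen them so that one outlier does not widen the band.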
Control Chart Limit
Tool/Library
- Apache Common Reporting (for Java Object)
- scobal/seyren - Alerting dashboard for Graphite (Java)
Aggregation / Duplicate Detection
Many alerts can be fired for the same root cause. Real problems are often lost in a sea of noisy alarms.
They can be:
- a duplicate of an existing one
- or a new one caused by a chain reaction (correlated / cascade).
Many alerts are therefore often aggregated to show a real state of the system.
Duplicate detection may have several rules such as:
- same environment
- same resource attributes
- same severity
- same timeframe
- or simply an aggregate key property of the alert.
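The rules above can be combined into one aggregate key plus a time window. The following is a minimal sketch under assumptions: the `Alert` fields, the choice of key (environment, resource, severity) and the 300-second window are hypothetical and only illustrate the mechanism.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    environment: str
    resource: str
    severity: str
    timestamp: float  # seconds
    message: str


def aggregate_key(alert: Alert) -> Tuple[str, str, str]:
    """Alerts with the same key are candidates for being duplicates."""
    return (alert.environment, alert.resource, alert.severity)


def deduplicate(alerts: List[Alert], window_s: float = 300.0) -> List[Alert]:
    """Keep an alert only if no alert with the same key was seen
    within the previous `window_s` seconds."""
    last_seen: Dict[Tuple[str, str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = aggregate_key(alert)
        previous = last_seen.get(key)
        if previous is None or alert.timestamp - previous > window_s:
            kept.append(alert)
        # Every occurrence, kept or not, extends the window for its key.
        last_seen[key] = alert.timestamp
    return kept


incoming = [
    Alert("prod", "db-1", "high", 0, "a"),
    Alert("prod", "db-1", "high", 60, "b"),    # duplicate of "a"
    Alert("prod", "web-1", "high", 90, "c"),   # different resource
    Alert("prod", "db-1", "high", 1000, "d"),  # same key, outside the window
]
kept_messages = [a.message for a in deduplicate(incoming)]
```

Because each duplicate extends the window, a continuously firing alert is reported once and stays suppressed until it has been quiet for longer than the window. Correlated alerts caused by a chain reaction have different keys and are not covered by this sketch.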