Alerting is the science of detection
An alert should communicate in plain language:
- Service A is down
- 90% of all web requests are taking more than 0.5s to process and respond.
Urgency / Severity
|Severity / Urgency||Alert Type||Description||Example|
|High / Urgent||Page (as in pager)||Immediate human intervention; interrupts a recipient’s work, sleep, or personal time, whatever the hour||Response times exceed an SLA (Service Level Agreement): throughput, latency, or error rates are no longer acceptable.|
|Moderate||Notification||Eventual human intervention, notifies someone who can fix the problem in a non-interrupting way such as email or chat.||Data store is running low on disk space|
|Low||Log / Record||Attention needed in the future; does not notify anyone automatically||Transient issues, such as network congestion, that often go away on their own.|
Moderate- and low-severity alerts won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
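The mapping in the table above can be sketched in a few lines of Python. This is only an illustration: the `Severity` enum, the `ALERT_TYPE` mapping, and the `route` function are hypothetical names, not part of any particular alerting tool.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"


# Hypothetical mapping from severity to alert type, following the table above.
ALERT_TYPE = {
    Severity.HIGH: "page",
    Severity.MODERATE: "notification",
    Severity.LOW: "record",
}


def route(severity: Severity, message: str) -> dict:
    """Return a routing decision for one alert."""
    alert_type = ALERT_TYPE[severity]
    return {
        "alert_type": alert_type,
        # Only a page is allowed to interrupt a human.
        "interrupts_human": alert_type == "page",
        "message": message,
    }


if __name__ == "__main__":
    print(route(Severity.MODERATE, "Data store is running low on disk space"))
```

Keeping the mapping in one table-like structure makes it easy to review which severities are allowed to interrupt someone.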
See also: Priority vs Severity.
Note that depending on severity, a notification may be more appropriate than a page, or vice versa.
Label, Milestone, Assignments
GitHub issues have no built-in priority or ordering. See Issues 2.0: The Next Generation. It revolves around three major pillars:
- Assignments: each issue can be assigned to a collaborator.
- Labels (tags, similar to Gmail labels): one issue can be tagged with several labels.
- Milestones: several issues can be “packaged” into a milestone.
Symptom vs Cause
A symptom (often a user-facing problem) may have any number of different causes.
Page on a symptom (user experience, such as slow website responses) and notify on potential causes of the symptom, such as high load on your web servers. Users will not know or care about server load if the website is still responding quickly.
|Data||Alert Type||Trigger|
|Work metric: Throughput||Page||value is much higher or lower than usual, or there is an anomalous rate of change|
|Work metric: Success||Page||the percentage of work that is successfully processed drops below a threshold|
|Work metric: Errors||Page||the error rate exceeds a threshold|
|Work metric: Latency||Page||work takes too long to complete (e.g., performance violates internal SLA)|
|Resource metric: Utilization||Notification||approaching critical resource limit (e.g., free disk space drops below a threshold)|
|Resource metric: Saturation||Record||number of waiting processes exceeds a threshold|
|Resource metric: Errors||Record||number of errors during a fixed period exceeds a threshold|
|Resource metric: Availability||Record||the resource is unavailable for a percentage of time that exceeds a threshold|
|Event: Work-related||Page||critical work that should have been completed is reported as incomplete or failed|
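As a rough sketch of the table above, the following Python snippet pages on work metrics (symptoms) and only notifies or records on resource metrics (potential causes). The `Snapshot` fields and all threshold values are assumptions made up for this example; real values depend on the system being monitored.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snapshot:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0 (work metric)
    p90_latency_s: float     # 90th percentile latency in seconds (work metric)
    disk_free_ratio: float   # fraction of disk still free (resource utilization)
    waiting_processes: int   # processes waiting to run (resource saturation)


def evaluate(s: Snapshot) -> List[Tuple[str, str]]:
    """Return (alert_type, reason) pairs. All thresholds are illustrative."""
    alerts = []
    # Symptoms: work metrics page a human.
    if s.error_rate > 0.05:
        alerts.append(("page", "error rate exceeds threshold"))
    if s.p90_latency_s > 0.5:
        alerts.append(("page", "work takes too long to complete"))
    # Potential causes: resource metrics only notify or get recorded.
    if s.disk_free_ratio < 0.10:
        alerts.append(("notification", "approaching disk space limit"))
    if s.waiting_processes > 20:
        alerts.append(("record", "number of waiting processes exceeds threshold"))
    return alerts


if __name__ == "__main__":
    print(evaluate(Snapshot(0.01, 0.2, 0.05, 3)))
```

With these assumed thresholds, a server that is low on disk but still serving requests quickly produces a notification and no page.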
A static threshold is a fixed boundary, a floor or ceiling that characterizes normal behavior; crossing it indicates a deviation from normal behavior.
Static thresholds cannot accurately capture deviations in oscillating signals.
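A minimal sketch of a static threshold check, and of why it can miss a deviation in an oscillating signal. The synthetic signal, the `breaches` helper, and the floor and ceiling values are all made up for illustration.

```python
import math


def breaches(values, floor, ceiling):
    """Return indices where a value crosses the fixed floor or ceiling."""
    return [i for i, v in enumerate(values) if v < floor or v > ceiling]


# Synthetic daily cycle: 24 hourly samples oscillating between 50 and 150.
signal = [100 + 50 * math.sin(2 * math.pi * h / 24) for h in range(24)]

# Inject an anomaly at the trough: hour 18 normally reads 50, here it reads 120.
signal[18] = 120.0

# The anomalous point still sits between the fixed bounds, so nothing is flagged.
print(breaches(signal, floor=40, ceiling=160))  # → []
```

The floor and ceiling have to be wide enough to cover the whole normal cycle, so a value that is abnormal for its time of day can still fall inside them.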
Time series Model
The bound is computed by a time series forecasting method. It is then no longer static and can “move” with the input signal.
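One way to picture a moving bound is sketched below. The text does not name a forecasting method, so this example swaps in the simplest one available: the mean of the last few points is the forecast, and the bound is the forecast plus or minus `k` standard deviations. The `moving_bounds` function, the window size, and `k` are assumptions for illustration only.

```python
from statistics import mean, stdev


def moving_bounds(values, window=5, k=3.0):
    """Return (index, lower, upper, is_anomaly) for each point with a full window of history."""
    results = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        forecast = mean(history)   # naive forecast: average of recent points
        spread = stdev(history)    # recent variability sets the width of the band
        lower, upper = forecast - k * spread, forecast + k * spread
        results.append((i, lower, upper, not (lower <= values[i] <= upper)))
    return results


if __name__ == "__main__":
    series = [10, 11, 10, 12, 11, 10, 30, 11]
    for index, lower, upper, is_anomaly in moving_bounds(series):
        print(index, round(lower, 2), round(upper, 2), is_anomaly)
```

A real deployment would more likely use a seasonal or exponentially weighted model, but the principle is the same: the bound is recomputed from recent data at every step.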
Control Chart Limit
- Apache Common Reporting (for Java Object)
- scobal/seyren - Alerting dashboard for Graphite (Java)
Aggregation / Duplicate Detection
Many alerts can be fired for the same root cause. Real problems are often lost in a sea of noisy alarms.
An incoming alert can be:
- a duplicate of an existing one
- or a new one caused by a chain reaction (correlated / cascade).
Alerts are therefore often aggregated to show the real state of the system.
Duplicate detection may have several rules such as:
- same environment
- same resource attributes
- same severity
- or simply through an aggregate key property of the alert.
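The rules above can be sketched as a small deduplication step in Python. The alert fields (`environment`, `resource`, `severity`) and the `aggregate_key` and `deduplicate` helpers are hypothetical; an actual alerting system may define its key differently.

```python
def aggregate_key(alert):
    """Build a key from the attributes that define a duplicate (an assumed rule set)."""
    return (alert["environment"], alert["resource"], alert["severity"])


def deduplicate(alerts):
    """Collapse alerts sharing an aggregate key into one entry with a count."""
    groups = {}
    for alert in alerts:
        key = aggregate_key(alert)
        if key in groups:
            groups[key]["count"] += 1   # duplicate: bump the counter only
        else:
            groups[key] = {**alert, "count": 1}
    return list(groups.values())


if __name__ == "__main__":
    incoming = [
        {"environment": "prod", "resource": "db-1", "severity": "high"},
        {"environment": "prod", "resource": "db-1", "severity": "high"},
        {"environment": "prod", "resource": "web-3", "severity": "low"},
    ]
    for entry in deduplicate(incoming):
        print(entry)
```

This handles exact duplicates only. Correlated or cascading alerts need extra knowledge about how resources depend on each other, which a key comparison does not provide.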