Monitoring - (Alert|Anomaly) Detection




An alert should communicate in plain language:

  • Service A is down
  • 90% of all web requests are taking more than 0.5s to process and respond.


Urgency / Severity

| Severity / Urgency | Alert Type | Description | Example |
| --- | --- | --- | --- |
| High / Urgent | Page (as in Pager) | Immediate human intervention; interrupts a recipient’s work, sleep, or personal time, whatever the hour | Response times exceeding an SLA (Service Level Agreement); unacceptable throughput, latency, or error rates |
| Moderate | Notification | Eventual human intervention; notifies someone who can fix the problem in a non-interrupting way, such as email or chat | Data store is running low on disk space |
| Low | Log / Record | Attention needed in the future; does not notify anyone automatically | Transient issues that often go away on their own, such as network congestion |

Moderate and low alerts won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
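As a minimal sketch, the severity-to-channel mapping above can be written as a simple lookup. The channel names and the fall-back to logging are illustrative assumptions, not part of any standard:

```python
def route_alert(severity: str) -> str:
    """Map an alert severity to a delivery channel (names are illustrative)."""
    routes = {
        "high": "page",              # interrupt a human immediately
        "moderate": "notification",  # email or chat, no interruption
        "low": "log",                # record only, no automatic notification
    }
    # Unknown severities fall back to logging (an assumption of this sketch).
    return routes.get(severity.lower(), "log")
```

A real routing layer would usually also consider on-call schedules and escalation policies, but the severity lookup is the core decision.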

See also: Priority vs Severity.

Note that depending on severity, a notification may be more appropriate than a page, or vice versa.

Label, Milestone, Assignments

GitHub has no priority field and no ordering. See Issues 2.0: The Next Generation. Issue management revolves around three major pillars:

  • Assignments: each issue can be assigned to a collaborator.
  • Labels (tags, like Gmail labels): one issue can be tagged with several labels.
  • Milestones: “package” several issues into a milestone.

Symptom vs Cause

A symptom (oftentimes user-facing problems) may have any number of different causes.

Page on a symptom (user experience, such as slow website responses) and notify on potential causes of the symptom, such as high load on your web servers. Users will not know or care about server load if the website is still responding quickly.


| Data | Alert | Trigger |
| --- | --- | --- |
| Work metric: Throughput | Page | Value is much higher or lower than usual, or there is an anomalous rate of change |
| Work metric: Success | Page | The percentage of work that is successfully processed drops below a threshold |
| Work metric: Errors | Page | The error rate exceeds a threshold |
| Work metric: Latency | Page | Work takes too long to complete (e.g., performance violates internal SLA) |
| Resource metric: Utilization | Notification | Approaching critical resource limit (e.g., free disk space drops below a threshold) |
| Resource metric: Saturation | Record | Number of waiting processes exceeds a threshold |
| Resource metric: Errors | Record | Number of errors during a fixed period exceeds a threshold |
| Resource metric: Availability | Record | The resource is unavailable for a percentage of time that exceeds a threshold |
| Event: Work-related | Page | Critical work that should have been completed is reported as incomplete or failed |
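Evaluating incoming readings against a trigger table like this can be sketched as a rule lookup. The metric names and thresholds below are illustrative assumptions, not values from the source:

```python
def classify(metric, value):
    """Return the alert type triggered by a metric reading, or None.

    Metric names and thresholds are illustrative only.
    """
    rules = {
        "work.success_rate":     ("page", lambda v: v < 0.99),        # success drops below threshold
        "work.error_rate":       ("page", lambda v: v > 0.01),        # error rate exceeds threshold
        "resource.disk_free_gb": ("notification", lambda v: v < 10),  # approaching resource limit
        "resource.wait_queue":   ("record", lambda v: v > 100),       # saturation
    }
    if metric in rules:
        alert_type, triggered = rules[metric]
        if triggered(value):
            return alert_type
    return None  # reading is within normal bounds
```

The point of the table structure is that work metrics page while resource metrics mostly notify or record, matching the severity guidance above.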


See Rules for detecting alerts


Horizontal Line

A fixed boundary, used as a floor or ceiling, that characterizes normal behavior; crossing it indicates a deviation from normal behavior.

Static thresholds cannot accurately capture deviations in oscillating signals.

Time series Model

A time-series forecasting method. The bound is then no longer static and can “move” with the input signal.

A model that requires keeping only the most recent observation is suitable for real-time alerting, such as Time Series - Exponential smoothing.
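A minimal sketch of such a moving band, using exponential smoothing for both the expected level and the typical deviation; `alpha`, `k`, and `min_spread` are illustrative tuning parameters, not values from the source:

```python
def ewma_alerts(values, alpha=0.3, k=3.0, min_spread=1.0):
    """Flag indices whose value deviates from an exponentially smoothed
    level by more than k times the smoothed absolute deviation."""
    level, spread, alerts = values[0], min_spread, []
    for i, x in enumerate(values[1:], start=1):
        # The band moves with the signal: level +/- k * spread.
        if abs(x - level) > k * max(spread, min_spread):
            alerts.append(i)
        # Update the smoothed deviation and level (exponential smoothing).
        spread = alpha * abs(x - level) + (1 - alpha) * spread
        level = alpha * x + (1 - alpha) * level
    return alerts
```

Only `level` and `spread` are carried between observations, which is why a model of this shape suits real-time alerting: no history buffer is needed.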

Control Chart Limit

Control charts, from statistical process control, set upper and lower control limits, typically a fixed number of standard deviations (e.g., three sigma) around the process mean, and flag points that fall outside them.


Aggregation / Duplicate Detection

Many alerts can be fired for the same root cause. Real problems are often lost in a sea of noisy alarms.

They can be:

  • a duplicate of an existing one
  • or a new one caused by a chain reaction (correlated / cascading).

Alerts are therefore often aggregated to show the real state of the system.

Duplicate detection may have several rules such as:

  • same environment
  • same resource attributes
  • same severity
  • same timeframe
  • or simply through an aggregate-key property of the alert.
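Duplicate detection through an aggregate key can be sketched as grouping alerts on a tuple of shared attributes. The field names used here are assumptions for illustration:

```python
from collections import defaultdict

def aggregate(alerts, key_fields=("environment", "resource", "severity")):
    """Group raw alerts that share the same aggregate key.

    The key fields are illustrative; real systems let you configure them.
    """
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts with identical key tuples are treated as duplicates.
        key = tuple(alert.get(field) for field in key_fields)
        groups[key].append(alert)
    return groups
```

Each group can then be reported as a single aggregated alert, with the member count indicating how noisy the underlying cause was.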
