Monitoring - (Alert|Anomaly) Detection

About

Alerting is the science of detection

Deviations from the predicted values of a time series are a powerful way to tell when there is a problem and to trigger an alert when a threshold is reached.

Format

Subject

An alert should communicate in plain language:

Service A is down
90% of all web requests are taking more than 0.5s to process and respond.

Property

Urgency / Severity

Severity / Urgency	Alert Type	Description	Example
High / Urgent	Page (as in pager)	Immediate human intervention; interrupts a recipient’s work, sleep, or personal time, whatever the hour	Response times exceed an SLA (Service Level Agreement): throughput, latency, or error rates are no longer acceptable.
Moderate	Notification	Eventual human intervention; notifies someone who can fix the problem in a non-interrupting way such as email or chat	Data store is running low on disk space.
Low	Log / Record	Attention needed in the future; does not notify anyone automatically	Transient issues, such as network congestion, that often go away on their own.

Moderate and low severity alerts won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
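The severity table above can be read as a routing rule. The following is a minimal sketch of that mapping; the names Severity, ALERT_TYPE and route are hypothetical and only illustrate the idea, they do not come from any alerting tool.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"


# Mapping taken from the severity table: severity -> alert type.
ALERT_TYPE = {
    Severity.HIGH: "page",
    Severity.MODERATE: "notification",
    Severity.LOW: "record",
}


def route(severity: Severity, message: str) -> str:
    """Describe how an alert of the given severity would be delivered."""
    alert_type = ALERT_TYPE[severity]
    if alert_type == "page":
        # Interrupts a human immediately, whatever the hour.
        return f"PAGE on-call now: {message}"
    if alert_type == "notification":
        # Non-interrupting channel such as email or chat.
        return f"NOTIFY by email/chat: {message}"
    # Low severity: written down for later, nobody is notified.
    return f"RECORD in log only: {message}"
```

For instance, route(Severity.MODERATE, "Data store is running low on disk space") yields a notification rather than a page.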

Label, Milestone, Assignments

GitHub issues have no priority and no ordering. See Issues 2.0: The Next Generation. The issue tracker revolves around three major pillars:

Assignments: each issue can be assigned to a collaborator.
Labels (tags, similar to Gmail labels): one issue can be tagged with several labels.
Milestones: several issues can be “packaged” into a milestone.

Symptom vs Cause

A symptom (often a user-facing problem) may have any number of different causes.

Page on a symptom (user experience, such as slow website responses) and notify on potential causes of the symptom, such as high load on your web servers. The users will not know or care about server load if the website is still responding quickly.

Example

Data	Alert	Trigger
Work metric: Throughput	Page	value is much higher or lower than usual, or there is an anomalous rate of change
Work metric: Success	Page	the percentage of work that is successfully processed drops below a threshold
Work metric: Errors	Page	the error rate exceeds a threshold
Work metric: Latency	Page	work takes too long to complete (e.g., performance violates internal SLA)
Resource metric: Utilization	Notification	approaching critical resource limit (e.g., free disk space drops below a threshold)
Resource metric: Saturation	Record	number of waiting processes exceeds a threshold
Resource metric: Errors	Record	number of errors during a fixed period exceeds a threshold
Resource metric: Availability	Record	the resource is unavailable for a percentage of time that exceeds a threshold
Event: Work-related	Page	critical work that should have been completed is reported as incomplete or failed
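As an illustration of the work-metric rows in the table above, here is a minimal sketch that checks the success percentage and the error rate against thresholds. The function name and the default threshold values (99% success, 1% errors) are assumptions chosen for the example, not recommended values.

```python
def work_metric_alerts(total, succeeded, errors,
                       min_success_pct=99.0, max_error_rate=0.01):
    """Return a list of (alert_type, reason) tuples for one measurement window.

    Thresholds are hypothetical defaults; tune them to your own service.
    """
    alerts = []
    if total == 0:
        # No work was observed, so neither ratio is defined.
        return alerts
    success_pct = 100.0 * succeeded / total
    error_rate = errors / total
    # Work metric: Success -> page when the percentage drops below a threshold.
    if success_pct < min_success_pct:
        alerts.append(("page", "success percentage below threshold"))
    # Work metric: Errors -> page when the error rate exceeds a threshold.
    if error_rate > max_error_rate:
        alerts.append(("page", "error rate above threshold"))
    return alerts
```

Resource metrics would follow the same pattern but map to a notification or a record instead of a page.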

Detection

See the rules for detecting alerts:

The Western Electric rules
The Wheeler rules (equivalent to the Western Electric zone tests)
The Nelson rules
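To make these rule sets concrete, the sketch below applies the four Western Electric zone tests to a series, given a center line (mean) and a standard deviation (sigma) that are assumed to be known from a stable baseline period. The function name is hypothetical, and the choice to flag only the point that completes a pattern is a design decision of this example.

```python
def western_electric_violations(values, mean, sigma):
    """Return (index, rule_number) for each point that completes a violation."""
    violations = []
    # Signed distance from the center line, in units of sigma.
    z = [(v - mean) / sigma for v in values]
    for i in range(len(z)):
        # Rule 1: a single point beyond 3 sigma.
        if abs(z[i]) > 3:
            violations.append((i, 1))
        # Rules 2 to 4 inspect a trailing window of at most 8 points.
        window = z[max(0, i - 7): i + 1]
        for sign in (1, -1):
            # Flip the sign so that "same side" always means positive.
            side = [sign * x for x in window]
            # Rule 2: 2 of 3 consecutive points beyond 2 sigma, same side.
            if len(side) >= 3 and side[-1] > 2 and sum(x > 2 for x in side[-3:]) >= 2:
                violations.append((i, 2))
            # Rule 3: 4 of 5 consecutive points beyond 1 sigma, same side.
            if len(side) >= 5 and side[-1] > 1 and sum(x > 1 for x in side[-5:]) >= 4:
                violations.append((i, 3))
            # Rule 4: 8 consecutive points on the same side of the center line.
            if len(side) == 8 and all(x > 0 for x in side):
                violations.append((i, 4))
    return violations
```

A run of eight points slightly above the mean triggers only rule 4, which is the kind of slow drift that a single 3-sigma limit would miss.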

Threshold

Horizontal Line

A fixed boundary as a floor or ceiling that characterizes normal behavior which, if crossed, indicates a deviation from normal behavior.

Static thresholds cannot accurately capture deviations in oscillating signals: a value that is abnormal for one phase of the cycle may still sit inside the fixed bounds.
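A fixed floor or ceiling takes only a few lines, which is also why it is limited. The sketch below uses a hypothetical helper and made-up sample values to show an oscillating signal whose abnormal reading stays under the ceiling.

```python
def crosses_static_threshold(value, floor=None, ceiling=None):
    """Return True when the value is below the floor or above the ceiling."""
    if floor is not None and value < floor:
        return True
    if ceiling is not None and value > ceiling:
        return True
    return False


# Made-up oscillating signal: lows around 10, highs around 90.
# The fifth reading (60) arrives where a low of about 10 was expected.
signal = [10, 90, 10, 90, 60, 90]

# With a ceiling of 100 no reading crosses the line, so the
# abnormal value 60 goes unnoticed.
alerts = [crosses_static_threshold(v, ceiling=100) for v in signal]
```

Lowering the ceiling to catch the value 60 would make every normal high of 90 fire as well.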

Time series Model

The threshold is derived from a time series forecasting method. The bound is then no longer static and can “move” with the input signal.

A model that only needs to keep the most recent observation (or a small state derived from it), such as Time Series - Exponential smoothing, is suitable for real-time alerting.
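A minimal sketch of such a moving bound, using simple exponential smoothing for the forecast and an exponentially smoothed absolute error for the width of the band. The class name, the smoothing factor alpha=0.3 and the band width of 3 deviations are assumptions made for this example; only two numbers of state are kept between observations.

```python
class SmoothingDetector:
    """Flag observations that fall far from an exponentially smoothed forecast."""

    def __init__(self, alpha=0.3, width=3.0):
        self.alpha = alpha      # smoothing factor, 0 < alpha <= 1
        self.width = width      # band half-width, in smoothed deviations
        self.level = None       # smoothed value, used as one-step-ahead forecast
        self.deviation = 0.0    # smoothed absolute forecast error

    def update(self, x):
        """Consume one observation and return True if it looks anomalous."""
        if self.level is None:
            # First observation: nothing to compare against yet.
            self.level = x
            return False
        error = abs(x - self.level)
        # Compare against the band before the new point updates the state.
        is_anomaly = self.deviation > 0 and error > self.width * self.deviation
        # Simple exponential smoothing of the level and of the error.
        self.level = self.alpha * x + (1 - self.alpha) * self.level
        self.deviation = self.alpha * error + (1 - self.alpha) * self.deviation
        return is_anomaly
```

Fed a series that alternates between 10 and 11 and then jumps to 50, the detector stays quiet on the alternation and flags the jump. Note that this sketch also folds anomalous points into the state; a production detector might skip or dampen them.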

Control Chart Limit

See Control Chart Limit

Tool/Library

Apache Common Reporting (for Java Object)
VisualVM (API Quickstart)

http://alerta.io/
scobal/seyren - Alerting dashboard for Graphite (Java)
https://github.com/prometheus/alertmanager/blob/master/README.md

Aggregation / Duplicate Detection

Many alerts can be fired for the same root cause. Real problems are often lost in a sea of noisy alarms.

They can be:

a duplicate of an existing one
or a new one caused by a chain reaction (correlated / cascade).

Many alerts are therefore often aggregated to show a real state of the system.

Duplicate detection may have several rules such as:

same environment
same resource attributes
same severity.
timeframe
or simply through an aggregate key property of the alert.
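The rules above can be combined into an aggregate key plus a timeframe. The following is a sketch under assumed field names (environment, resource, severity, timestamp) and an assumed 300 second window; real alerting tools define their own attributes and correlation rules.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    environment: str
    resource: str
    severity: str
    timestamp: float  # seconds since an arbitrary origin
    text: str


def aggregate_key(alert):
    """Alerts with the same key are candidates for being duplicates."""
    return (alert.environment, alert.resource, alert.severity)


def deduplicate(alerts, window_seconds=300):
    """Collapse alerts with the same key that arrive within the window.

    Returns one entry per distinct incident, with a duplicate count.
    """
    latest = {}   # aggregate key -> most recent entry for that key
    result = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = aggregate_key(alert)
        entry = latest.get(key)
        if entry is not None and alert.timestamp - entry["last_seen"] <= window_seconds:
            # Same key, still inside the timeframe: count it as a duplicate.
            entry["count"] += 1
            entry["last_seen"] = alert.timestamp
        else:
            # New key, or the previous occurrence is too old: new incident.
            entry = {"alert": alert, "count": 1, "last_seen": alert.timestamp}
            latest[key] = entry
            result.append(entry)
    return result
```

Cascading alerts from different resources do not share a key and are therefore not merged by this sketch; correlating them needs extra information such as a dependency map.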