Monitoring - (Alert|Anomaly) Detection

About

Alerting is the science of detection

Deviations from the prediction on a time series are a powerful way to tell when there is a problem and to trigger an alert when a threshold is reached.

Format

Subject

An alert should communicate in plain language:

  • Service A is down
  • 90% of all web requests are taking more than 0.5s to process and respond.

Property

Urgency / Severity

Severity / Urgency | Alert Type | Description | Example
High / Urgent | Page (as in pager) | Immediate human intervention. Interrupts a recipient’s work, sleep, or personal time, whatever the hour. | Response times exceed an SLA (Service Level Agreement): no acceptable throughput, latency, or error rates.
Moderate | Notification | Eventual human intervention. Notifies someone who can fix the problem in a non-interrupting way such as email or chat. | Data store is running low on disk space.
Low | Log / Record | Attention needed in the future. Does not notify anyone automatically. | Transient issues, such as network congestion, that often go away on their own.

Moderate and low severity alerts won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
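As a minimal sketch of the severity table above, the mapping from severity to alert type can be written as a lookup. The names Severity, ALERT_TYPE and route are hypothetical and chosen for illustration only; a real system would call a pager, a chat tool, or a log store instead of returning a string.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"


# Alert type per severity, following the severity table (hypothetical names).
ALERT_TYPE = {
    Severity.HIGH: "page",              # interrupts a human immediately
    Severity.MODERATE: "notification",  # email or chat, non-interrupting
    Severity.LOW: "record",             # logged only, nobody is notified
}


def route(subject: str, severity: Severity) -> str:
    """Return a plain-language line tagged with the delivery channel."""
    return f"[{ALERT_TYPE[severity]}] {subject}"
```

For example, route("Service A is down", Severity.HIGH) yields a line tagged as a page, while the same subject with Severity.LOW is only recorded.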

See also: Priority vs Severity.

Note that depending on severity, a notification may be more appropriate than a page, or vice versa.

Label, Milestone, Assignments

GitHub issues have no priority and no ordering. See Issues 2.0: The Next Generation. They revolve around three major pillars:

  • Assignments: each issue can be assigned to a collaborator.
  • Labels (tags, similar to Gmail labels): one issue can be tagged with several labels.
  • Milestones: several issues can be “packaged” into a milestone.

Symptom vs Cause

A symptom (often a user-facing problem) may have any number of different causes.

Page on a symptom (user experience, such as slow website responses) and notify on potential causes of the symptom, such as high load on your web servers. The users will not know or care about server load if the website is still responding quickly.

Example

Data | Alert | Trigger
Work metric: Throughput | Page | value is much higher or lower than usual, or there is an anomalous rate of change
Work metric: Success | Page | the percentage of work that is successfully processed drops below a threshold
Work metric: Errors | Page | the error rate exceeds a threshold
Work metric: Latency | Page | work takes too long to complete (e.g., performance violates internal SLA)
Resource metric: Utilization | Notification | approaching critical resource limit (e.g., free disk space drops below a threshold)
Resource metric: Saturation | Record | number of waiting processes exceeds a threshold
Resource metric: Errors | Record | number of errors during a fixed period exceeds a threshold
Resource metric: Availability | Record | the resource is unavailable for a percentage of time that exceeds a threshold
Event: Work-related | Page | critical work that should have been completed is reported as incomplete or failed
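The symptom-versus-cause rule can be sketched as follows: a latency symptom pages, while high server load (a potential cause) only notifies. The function name evaluate, its parameters, and the default limits (0.5 s and 90% load) are assumptions made for illustration, not values prescribed by this page.

```python
from __future__ import annotations


def evaluate(latency_s: float, cpu_load: float,
             latency_limit_s: float = 0.5,
             load_limit: float = 0.9) -> list[tuple[str, str]]:
    """Return (alert_type, subject) pairs for one observation."""
    alerts = []
    # Symptom: users feel slow responses, so this pages someone.
    if latency_s > latency_limit_s:
        alerts.append(("page",
                       f"Responses take {latency_s:.2f}s, "
                       f"above the {latency_limit_s:.2f}s limit"))
    # Potential cause: users do not see server load, so this only notifies.
    if cpu_load > load_limit:
        alerts.append(("notification",
                       f"Web server load is {cpu_load:.0%}, "
                       f"above {load_limit:.0%}"))
    return alerts
```

With a fast website and a loaded server, evaluate(0.2, 0.95) produces a single notification and no page, which matches the idea that users do not care about load while responses stay quick.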

Detection

See Rules for detecting alert

Threshold

Horizontal Line

A fixed boundary (a floor or a ceiling) that characterizes normal behavior. Crossing it indicates a deviation from normal behavior.

Static thresholds cannot accurately capture deviations in oscillating signals.
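A minimal sketch of a static threshold check, assuming a hypothetical helper named static_breaches. It also shows the limitation on an oscillating signal: a value that is abnormal for its position in the cycle goes unnoticed as long as it stays between the floor and the ceiling.

```python
def static_breaches(values, floor, ceiling):
    """Return the indices whose value falls outside [floor, ceiling]."""
    return [i for i, v in enumerate(values) if v < floor or v > ceiling]


# Oscillating signal (illustrative data): 10, 50, 90, 50, 10, 50, 90, 50, ...
# At index 8 the cycle should be back at 10, but the value is 60.
signal = [10, 50, 90, 50, 10, 50, 90, 50, 60, 50, 90]

# The fixed band [0, 100] contains every value, so nothing is reported.
missed = static_breaches(signal, floor=0, ceiling=100)

# A value that leaves the band is reported.
caught = static_breaches([10, 120], floor=0, ceiling=100)
```

Here missed is empty even though index 8 deviates from the cycle, while caught contains index 1.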

Time series Model

A time series forecasting method. The bound is then no longer static and can “move” with the input signal.

A model that requires only the most recent observation to be kept, such as exponential smoothing (see Time Series - Exponential smoothing), is suitable for real-time alerting.
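A minimal sketch of a moving bound built on simple exponential smoothing. The class name SmoothingDetector, the band rule (width times a smoothed absolute forecast error), the warm-up period, and the default parameters are all assumptions made for illustration; the page itself only names exponential smoothing as a suitable model.

```python
class SmoothingDetector:
    """Flags values that stray too far from an exponentially smoothed forecast."""

    def __init__(self, alpha=0.3, width=3.0, warmup=5):
        self.alpha = alpha      # smoothing factor, between 0 and 1
        self.width = width      # band half-width, in units of smoothed error
        self.warmup = warmup    # observations to absorb before flagging
        self.level = None       # current forecast (smoothed level)
        self.dev = 0.0          # smoothed absolute forecast error
        self.seen = 0

    def update(self, x):
        """Return True if x is outside the moving bound, then absorb x."""
        if self.level is None:
            self.level = x
            self.seen = 1
            return False
        error = abs(x - self.level)
        anomalous = self.seen >= self.warmup and error > self.width * self.dev
        # Only the running state is kept, never the history of observations.
        self.dev = self.alpha * error + (1 - self.alpha) * self.dev
        self.level = self.alpha * x + (1 - self.alpha) * self.level
        self.seen += 1
        return anomalous


detector = SmoothingDetector()
flags = [detector.update(x) for x in [10, 11, 10, 9, 10, 11, 10, 9, 10, 30]]
```

In this run only the last value (30) is flagged. One design choice to be aware of: the anomalous value is still absorbed into the forecast, so a sustained shift stops being flagged after a while.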

Control Chart Limit

See Control Chart Limit

Tool/Library

Aggregation / Duplicate Detection

Many alerts can be fired for the same root cause. Real problems are often lost in a sea of noisy alarms.

They can be:

  • a duplicate of an existing one
  • or a new one caused by a chain reaction (correlated / cascade).

Alerts are therefore often aggregated to show the real state of the system.

Duplicate detection may have several rules such as:

  • same environment
  • same resource attributes
  • same severity
  • same timeframe
  • or simply the same aggregate key property of the alert.
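The rules above can be sketched as a grouping on an aggregate key within a timeframe. The Alert fields, the aggregate function, and the 300-second window are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    environment: str
    resource: str
    severity: str
    timestamp: float  # seconds


def aggregate(alerts, window_s=300.0):
    """Group duplicates; return a list of (first_alert, count) pairs.

    Two alerts are duplicates when they share environment, resource and
    severity, and arrive within window_s of the previous duplicate.
    """
    groups = []        # each entry: [first_alert, count, last_timestamp]
    open_groups = {}   # aggregate key -> index of the latest group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.environment, alert.resource, alert.severity)
        idx = open_groups.get(key)
        if idx is not None and alert.timestamp - groups[idx][2] <= window_s:
            groups[idx][1] += 1
            groups[idx][2] = alert.timestamp
        else:
            open_groups[key] = len(groups)
            groups.append([alert, 1, alert.timestamp])
    return [(first, count) for first, count, _ in groups]


incoming = [
    Alert("prod", "db1", "high", 0),
    Alert("prod", "db1", "high", 60),
    Alert("prod", "web1", "high", 30),
    Alert("prod", "db1", "high", 1000),
]
summary = [(first.resource, count) for first, count in aggregate(incoming)]
```

The two early db1 alerts collapse into one entry with a count of 2, while the db1 alert at 1000 seconds falls outside the window and starts a new entry.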
