These resource metrics (counters) are usually expressed in the following terms:
- utilization: as a percentage over a time interval, e.g. “one disk is running at 90% utilization”.
- saturation: as a queue length, e.g. “the CPUs have an average run queue length of four”.
- errors: as scalar counts, e.g. “this network interface has had fifty late collisions”.
They can be used as SLIs (Service Level Indicators).
The data are generally collected as gauges.
|Resource||Utilization||Saturation||Errors||Availability|
|Disk IO||% time that device was busy||wait queue length||# device errors||% time writable|
|Memory||% of total memory capacity in use||swap usage||N/A (not usually observable)||N/A|
|Microservice||average % time each request-servicing thread was busy||# enqueued requests||# internal errors such as caught exceptions||% time service is reachable|
|Database||average % time each connection was busy||# enqueued queries||# internal errors, e.g. replication errors||% time database is reachable|
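
The three terms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` record, its field names, and the `use_metrics` helper are hypothetical and do not come from any particular monitoring tool. It assumes the resource exposes a cumulative busy-time counter, a cumulative error counter, and a current queue length that is read as a gauge.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a resource (all field names are hypothetical)."""
    timestamp_s: float  # when the reading was taken, in seconds
    busy_s: float       # cumulative time the resource has been busy, in seconds
    queue_length: int   # work items waiting at this instant (a gauge)
    error_count: int    # cumulative errors since start (a counter)


def use_metrics(prev: Sample, curr: Sample) -> dict:
    """Derive utilization, saturation and errors from two consecutive samples."""
    interval = curr.timestamp_s - prev.timestamp_s
    if interval <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        # utilization: share of the interval the resource was busy, as a percent
        "utilization_pct": 100.0 * (curr.busy_s - prev.busy_s) / interval,
        # saturation: queue length observed at the latest sample
        "saturation": curr.queue_length,
        # errors: number of new errors during the interval
        "errors": curr.error_count - prev.error_count,
    }


prev = Sample(timestamp_s=0.0, busy_s=100.0, queue_length=2, error_count=50)
curr = Sample(timestamp_s=10.0, busy_s=109.0, queue_length=4, error_count=50)
print(use_metrics(prev, curr))
```

In this example the resource was busy for 9 of the 10 seconds between samples, so the sketch reports 90% utilization, a queue length of 4, and no new errors.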