Metrics

Performance indicators

Host/process/container metrics

Easy to collect with a generic agent

Application metrics

Cloud-native tools usually expose a Prometheus metrics endpoint

Traditional tools require dedicated tooling
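
A minimal sketch of what such a metrics endpoint returns, assuming the Prometheus text exposition format and counters only. The function name render_metrics and the sample metric http_requests_total are hypothetical, and a real service would normally use a client library instead of rendering the text by hand.

```python
def render_metrics(counters):
    """Render counters as Prometheus text exposition lines.

    `counters` maps a metric name to (help text, samples), where samples
    maps a tuple of (label, value) pairs to the current counter value.
    """
    lines = []
    for name, (help_text, samples) in counters.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        for labels, value in samples.items():
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration only.
example_counters = {
    "http_requests_total": (
        "Total HTTP requests.",
        {(("method", "GET"), ("code", "200")): 1027},
    ),
}
example_output = render_metrics(example_counters)
```

Serving example_output over HTTP under a path such as /metrics is what lets a scraper collect application metrics without dedicated tooling.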

Golden Signals

Google's SRE book defines four golden signals to watch

Latency

Latency of requests overall as well as per service

Traffic

Demand placed on a service

Errors

Rate of failed requests

Saturation

Fraction of the resources available to a service that are in use
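
The four signals can be derived from a window of request records plus one resource reading. This is a rough sketch under stated assumptions: the Request type, the golden_signals function, the choice of median latency, and CPU as the saturated resource are all illustrative, not part of any particular tool.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float
    failed: bool


def golden_signals(requests, window_s, cpu_used, cpu_total):
    """Summarize one observation window as the four golden signals."""
    count = len(requests)
    failures = sum(1 for r in requests if r.failed)
    return {
        # Latency: median request duration in the window.
        "latency_ms": median(r.duration_ms for r in requests) if count else 0.0,
        # Traffic: requests per second.
        "traffic_rps": count / window_s,
        # Errors: share of requests that failed.
        "error_rate": failures / count if count else 0.0,
        # Saturation: share of the available resource that is in use.
        "saturation": cpu_used / cpu_total,
    }
```

For example, four requests in a two-second window, one of them failed, on a host using 3 of 4 CPU cores, yield a traffic of 2 requests per second, an error rate of 0.25, and a saturation of 0.75.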

Visualization

Dashboard per service

Graphs for individual metrics

Only include useful metrics

One traffic light for each service

Analysis

Service status depends on multiple metrics

Calculate service status from metrics and thresholds

Collect more metrics than the status calculation requires, to support root cause analysis (RCA)
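
One possible way to calculate a service status from metrics and thresholds is sketched here. It assumes that higher values are worse for every metric and that the worst single metric decides the service status; the names metric_status and service_status and all threshold values are hypothetical.

```python
GREEN, YELLOW, RED = 0, 1, 2
NAMES = {GREEN: "green", YELLOW: "yellow", RED: "red"}


def metric_status(value, warn, crit):
    """Classify one metric, assuming higher values are worse."""
    if value >= crit:
        return RED
    if value >= warn:
        return YELLOW
    return GREEN


def service_status(metrics, thresholds):
    """Return the worst status among all metrics that have thresholds.

    Metrics without thresholds are ignored here; they are still worth
    collecting for later root cause analysis.
    """
    worst = max(
        (metric_status(metrics[name], warn, crit)
         for name, (warn, crit) in thresholds.items()),
        default=GREEN,
    )
    return NAMES[worst]


# Hypothetical thresholds as (warning, critical) pairs.
example_thresholds = {
    "p99_latency_ms": (300, 800),
    "error_rate": (0.01, 0.05),
    "saturation": (0.7, 0.9),
}
```

With these thresholds, a service reporting a p99 latency of 250 ms, an error rate of 0.02, and a saturation of 0.5 would be yellow, because the error rate alone crosses its warning threshold.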

Statistics

Mean is problematic: outliers skew it and it hides the slowest requests

95th/99th percentile

Develop baseline

Watch trends
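
A small worked example of why the mean misleads, assuming the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values). The helper name percentile and the latency figures are made up for illustration.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 90 fast requests and 10 slow ones.
latencies = [100] * 90 + [2000] * 10

mean_ms = mean(latencies)           # 290
p50_ms = percentile(latencies, 50)  # 100
p95_ms = percentile(latencies, 95)  # 2000
p99_ms = percentile(latencies, 99)  # 2000
```

The mean of 290 ms matches no request that was actually served: the typical request took 100 ms, while one in ten took 2000 ms, which only the high percentiles reveal.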

Alerting

Decide between two states (green/red) and three states (green/yellow/red)

Make sure thresholds are reasonable

Only alert when intervention is unavoidable

Excess alerts lead to alert fatigue, which leads to missed outages
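
One common way to cut alert noise is to page only when a service stays red over several consecutive evaluations, so that short spikes do not wake anyone. The sketch below assumes this approach; the function name should_alert and the default of three evaluations are hypothetical choices, not a recommendation from the material above.

```python
def should_alert(statuses, required_consecutive=3):
    """Return True only if the most recent evaluations are all red.

    `statuses` is the history of service states, oldest first.
    """
    if len(statuses) < required_consecutive:
        return False
    return all(s == "red" for s in statuses[-required_consecutive:])
```

With the default setting, the history green, red, green, red does not alert, while green, red, red, red does.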