Performance indicators
Easy to collect with a generic agent
Cloud-native tools expose a Prometheus metrics endpoint
Traditional tools require dedicated tooling
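For illustration, a minimal sketch of what a metrics endpoint returns: plain text in the Prometheus exposition format, with HELP and TYPE comment lines followed by samples. The function name and the example metrics are hypothetical, and a real service would more likely rely on a Prometheus client library than render the text by hand.

```python
def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format.

    `metrics` maps a metric name to a (type, help text, value) tuple.
    """
    lines = []
    for name, (mtype, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics for an example service.
example = {
    "http_requests_total": ("counter", "Total HTTP requests handled.", 1027),
    "queue_depth": ("gauge", "Jobs waiting in the queue.", 3),
}
print(render_metrics(example))
```

A generic agent only has to fetch this text over HTTP and parse it, which is why such metrics are easy to collect.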
–
Google SRE defines four golden signals to watch
Latency: time to serve requests, overall and per service
Traffic: demand placed on a service
Errors: rate of failed requests
Saturation: fraction of its available resources a service is using
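The signals above can be derived from raw request records, as in this rough sketch. The `Request` type, the `golden_signals` function, and the use of CPU as the saturated resource are assumptions made for the example, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    service: str
    duration_ms: float
    failed: bool


def golden_signals(requests, window_s, cpu_used, cpu_limit):
    """Summarise one service's requests observed over a window of `window_s` seconds."""
    count = len(requests)
    failures = sum(1 for r in requests if r.failed)
    durations = sorted(r.duration_ms for r in requests)
    return {
        # Latency: upper median of request durations (None without traffic).
        "latency_median_ms": durations[count // 2] if count else None,
        # Traffic: requests per second.
        "traffic_rps": count / window_s,
        # Errors: share of requests that failed.
        "error_rate": failures / count if count else 0.0,
        # Saturation: share of the CPU limit currently in use.
        "saturation": cpu_used / cpu_limit,
    }


# Hypothetical sample: four requests seen within two seconds.
sample = [
    Request("checkout", 100, False),
    Request("checkout", 120, False),
    Request("checkout", 80, False),
    Request("checkout", 900, True),
]
signals = golden_signals(sample, window_s=2, cpu_used=1.5, cpu_limit=2.0)
print(signals)
```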
–
Dashboard per service
Graphs for individual metrics
Only include useful metrics
One traffic light for each service
–
Service status depends on multiple metrics
Calculate service status from metrics and thresholds
Collect more metrics than the status needs; the extras help root cause analysis (RCA)
Mean is problematic: outliers skew it and it hides the slowest requests
Prefer the 95th/99th percentile
Develop baseline
Watch trends
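A small worked example of why the mean misleads. The latency values are invented, and the nearest-rank method used here is only one of several common percentile definitions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 94 fast requests and 6 very slow ones.
latencies_ms = [20] * 94 + [2000] * 6

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

The mean of 138.8 ms matches no request that actually happened. The median shows the typical request took 20 ms, and the 95th percentile shows that some users waited 2000 ms.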
–
Decide green/red or green/yellow/red
Make sure thresholds are reasonable
Only alert when intervention is unavoidable
Excess alerts lead to alert fatigue, which leads to missed outages
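One possible way to turn several metrics into a single traffic light is to rate each metric against a warning and a critical threshold and report the worst result. All names and threshold values below are hypothetical, and paging only on red is an assumed policy chosen to keep alerts rare.

```python
GREEN, YELLOW, RED = 0, 1, 2
NAMES = {GREEN: "green", YELLOW: "yellow", RED: "red"}

# Hypothetical thresholds: metric -> (warn_at, critical_at); higher values are worse.
THRESHOLDS = {
    "latency_p99_ms": (500, 2000),
    "error_rate": (0.01, 0.05),
    "saturation": (0.80, 0.95),
}


def metric_status(value, warn_at, critical_at):
    """Rate a single metric against its two thresholds."""
    if value >= critical_at:
        return RED
    if value >= warn_at:
        return YELLOW
    return GREEN


def service_status(metrics, thresholds=THRESHOLDS):
    """The service is only as healthy as its worst metric."""
    worst = GREEN
    for name, (warn_at, critical_at) in thresholds.items():
        worst = max(worst, metric_status(metrics[name], warn_at, critical_at))
    return NAMES[worst]


def should_page(status):
    # Assumed policy: yellow is visible on the dashboard, only red pages a human.
    return status == "red"


current = {"latency_p99_ms": 300, "error_rate": 0.02, "saturation": 0.50}
status = service_status(current)
print(status, should_page(status))
```

For a green/red scheme, the same sketch works with the warning and critical thresholds set to the same value.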