SLA Dashboard
To create an SLA dashboard in Grafana for your servers, monitor key metrics related to availability, performance, and reliability. Below are the essential Prometheus (PromQL) queries you can use; the node_* metrics assume node_exporter is running on each server.
Key Metrics for SLA Dashboard
1️⃣ Uptime & Availability
🔹 Uptime of the Server (in %)
Check server uptime over a defined period (e.g., last 30 days).
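avg_over_time(up[30d]) * 100
What it does?
Calculates, per target, the percentage of scrapes over the last 30 days that succeeded, using Prometheus's built-in up series (1 when a scrape succeeds, 0 when it fails). node_time_seconds is not suitable for this: it only reports the host clock, so its rate stays near 1 whenever the exporter answers.
Helps measure SLA compliance (e.g., 99.9% uptime).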
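To make a target such as 99.9% concrete, it helps to convert it into a downtime budget. The sketch below is illustrative arithmetic only; the function name and the 30-day default window are assumptions, not part of any Prometheus or Grafana API.

```python
def allowed_downtime_minutes(sla_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits.

    Hypothetical helper for illustration; window_days defaults to the
    30-day period used as an example in the text.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # 30 days = 43,200 minutes, so 99.9% leaves roughly 43.2 minutes.
    print(round(allowed_downtime_minutes(99.9), 1))
    print(round(allowed_downtime_minutes(99.99), 2))
```

A 99.9% target over 30 days therefore tolerates about 43 minutes of downtime, and 99.99% only a little over 4 minutes.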
2️⃣ Server Health & Load
🔹 CPU Usage (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
What it does?
Monitors CPU load.
High values indicate server stress.
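The CPU query works because node_cpu_seconds_total{mode="idle"} counts idle seconds per core, so its per-second rate is the idle fraction of that core. A minimal Python sketch of the same arithmetic, assuming two counter readings per core taken at the start and end of the window (the function and argument names are illustrative):

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate the CPU usage query from raw idle counters.

    idle_start / idle_end: per-core idle counter values (in seconds) at the
    start and end of the window. This ignores counter resets and the
    extrapolation that Prometheus's rate() applies.
    """
    idle_fractions = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - avg_idle * 100


if __name__ == "__main__":
    # Two cores over a 5-minute (300 s) window:
    # core 0 was idle for 270 s, core 1 for 210 s.
    usage = cpu_usage_percent([1000.0, 2000.0], [1270.0, 2210.0], 300)
    print(round(usage, 1))
```

In this example the cores are idle 90% and 70% of the time, so the average idle share is 80% and the reported usage is about 20%.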
🔹 Memory Usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
What it does?
Tracks memory consumption.
Useful for avoiding OOM (Out of Memory) issues.
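As a worked example of the memory formula, the sketch below plugs in sample values. The function name and the byte figures are made up for illustration. MemAvailable is used rather than MemFree because it includes page cache the kernel can reclaim, so it is a better estimate of the memory applications can still obtain.

```python
GIB = 1024 ** 3


def memory_usage_percent(mem_total_bytes: int, mem_available_bytes: int) -> float:
    """Same arithmetic as the PromQL expression: (total - available) / total * 100."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


if __name__ == "__main__":
    # Hypothetical host: 16 GiB total, 4 GiB still available.
    print(memory_usage_percent(16 * GIB, 4 * GIB))
```

With 4 GiB of 16 GiB available, 12 GiB is in use, which the formula reports as 75%.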
🔹 Disk Space Usage (%)
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)
What it does?
Monitors disk space to prevent downtime due to storage issues.
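3️⃣ Network & Connectivity
🔹 Network Packet Rate
avg by (instance) (rate(node_network_receive_packets_total[5m]))
What it does?
Shows the per-second rate of received packets, averaged across each instance's network interfaces. This is a traffic-volume signal rather than a latency measurement: node_exporter does not expose network latency, so a sudden drop here only hints at connectivity problems.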
🔹 Network Errors
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
What it does?
Monitors network failures impacting service reliability.
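The error counters only ever increase until the host or exporter restarts, at which point they drop back to zero. rate() accounts for such resets. The following is a simplified Python sketch of that idea, not Prometheus's actual implementation (which also extrapolates to the window boundaries); the sample data is invented.

```python
def counter_rate(samples):
    """Per-second rate of a counter over (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset: the counter is assumed
    to have restarted from zero, so the new value counts as the increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


if __name__ == "__main__":
    # Hypothetical receive-error samples with a reset between t=60 and t=120.
    rx_errs = [(0, 10), (60, 16), (120, 2), (180, 5)]
    tx_errs = [(0, 0), (60, 3), (120, 3), (180, 7)]
    total = counter_rate(rx_errs) + counter_rate(tx_errs)
    print(round(total, 3))  # combined errors per second
```

Here the receive counter increases by 6, then 2 after the reset, then 3, and the transmit counter by 7, giving 18 errors over 180 seconds, or about 0.1 errors per second combined.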
4️⃣ Response Time & Performance
🔹 Request Latency (if using Nginx/Apache)
For Nginx (via a Prometheus exporter that exposes request-duration histograms):
histogram_quantile(0.95, sum(rate(nginx_vts_server_request_duration_seconds_bucket[5m])) by (le, instance))
What it does?
Monitors 95th percentile response time (SLO measurement).
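The 95th percentile returned by histogram_quantile is an estimate: it finds the bucket containing the requested rank and interpolates linearly inside it. The sketch below illustrates that interpolation under simplifying assumptions (cumulative bucket counts, sorted by upper bound, ending in a +Inf bucket). The function name and bucket values are hypothetical, and this is not the Prometheus source code.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound being math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical request durations: 50 under 0.1 s, 90 under 0.5 s, all 100 under 1 s.
    buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(round(estimate_quantile(0.95, buckets), 3))
```

With these buckets the 95th request falls halfway through the 0.5 s to 1.0 s bucket, so the estimate is about 0.75 s. The accuracy of the result depends on how finely the exporter's buckets are spaced around the latency you care about.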
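5️⃣ Service Level Objective (SLO) & SLA Compliance
🔹 SLA Compliance (% Availability)
(sum(up) / count(up)) * 100
What it does?
Measures the percentage of monitored targets that are up at the current moment. Both sides of the division are aggregated without labels so that the vectors match; dividing a sum grouped by (instance) by an ungrouped count returns no data.
Helps validate SLA guarantees (e.g., 99.9% uptime).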
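Availability against an SLA target can also be checked per server over a window of scrape results. The sketch below assumes a list of up samples (1 or 0) per instance; the instance names, sample counts, and the 99.9% default target are illustrative assumptions.

```python
def availability_percent(up_samples):
    """Share of scrapes that succeeded, as a percentage."""
    return 100 * sum(up_samples) / len(up_samples)


def sla_report(up_by_instance, target_percent=99.9):
    """Map each instance to (availability %, whether it meets the target)."""
    report = {}
    for instance, samples in up_by_instance.items():
        pct = availability_percent(samples)
        report[instance] = (pct, pct >= target_percent)
    return report


if __name__ == "__main__":
    # Hypothetical data: web-1 never failed a scrape, web-2 failed 2 of 1000.
    data = {
        "web-1": [1] * 1000,
        "web-2": [1] * 998 + [0] * 2,
    }
    for instance, (pct, ok) in sla_report(data).items():
        print(instance, round(pct, 2), "meets SLA" if ok else "violates SLA")
```

In this example web-2 reaches 99.8% and so misses a 99.9% target even though only two scrapes failed, which shows how little room a three-nines SLA leaves.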
Recommended Grafana Panels
Stat Panels (Singlestat in older Grafana versions): Show Uptime %, CPU Usage, Memory Usage.
Time Series Graphs: Display Latency Trends, Network Performance.
Gauge Panels: Track Disk & Memory Usage.
Heatmaps: Analyze CPU & Load Distribution.
Final Thoughts
This setup lets you track SLA compliance, detect issues, and visualize service health in Grafana.