Dashboard

SLA Dashboard

To build an SLA dashboard in Grafana for your servers, monitor key metrics for availability, performance, and reliability. Below are the essential Prometheus (PromQL) queries you can use; the node_* metrics assume that Node Exporter is running on each server and is scraped by Prometheus.
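Outside Grafana, the same queries can be checked against the Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The following is a minimal sketch rather than a definitive client: the server address `http://localhost:9090`, the helper names, and the sample values are all assumptions made for illustration. It builds the request URL and parses a response body in the instant-vector format without making a network call.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your own server.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict:
    """Map each series' instance label to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return {
        series["metric"].get("instance", "unknown"): float(series["value"][1])
        for series in payload["data"]["result"]
    }


# Sample body in the instant-vector response shape (values are made up).
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "web-1:9100"}, "value": [1700000000, "99.95"]},
            {"metric": {"instance": "web-2:9100"}, "value": [1700000000, "98.7"]},
        ],
    },
})

print(build_query_url("up"))
print(parse_instant_vector(sample_body))
```

Fetching the URL with any HTTP client and passing the response text to `parse_instant_vector` gives one value per instance, which is handy for spot-checking a panel.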


🚀 Key Metrics for SLA Dashboard

1️⃣ Uptime & Availability

🔹 Uptime of the Server (in %)

Check server uptime over a defined period (e.g., last 30 days).

avg_over_time(up[30d]) * 100

✅ What it does?

  • Calculates, for each target, the percentage of scrapes in the last 30 days in which Prometheus could reach the server (up = 1).

  • Helps measure SLA compliance (e.g., 99.9% uptime).
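To put an uptime target such as 99.9% into concrete terms, it helps to convert it into the amount of downtime it allows over the measurement window. The sketch below does that arithmetic; the function name and the 30-day default window are illustrative assumptions, not part of any Prometheus or Grafana API.

```python
def allowed_downtime_minutes(sla_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an SLA target allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is a useful reference line to draw on the uptime panel.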


2️⃣ Server Health & Load

🔹 CPU Usage (%)

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

✅ What it does?

  • Reports the share of CPU time spent in non-idle modes, averaged across all cores of each instance.

  • Sustained high values indicate server stress.
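The CPU query works on a counter of idle seconds per core, so the arithmetic may be easier to follow with concrete numbers. The sketch below reproduces the idea under simplifying assumptions: it takes two hypothetical counter readings per core, ignores counter resets, and skips the extrapolation that PromQL's `rate()` performs.

```python
def cpu_usage_percent(idle_start, idle_end, interval_seconds):
    """Estimate CPU usage (%) from per-core idle counters read at two points in time."""
    # Fraction of the interval each core spent idle (a simplified rate()).
    idle_rates = [
        (end - start) / interval_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # Average across cores, mirroring avg by (instance).
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over a 5-minute (300 s) window: idle for 270 s and 210 s respectively.
usage = cpu_usage_percent([1000.0, 1000.0], [1270.0, 1210.0], 300)
print(f"CPU usage: {usage:.1f}%")
```

In this example the cores were idle 90% and 70% of the time, so the instance averages about 20% CPU usage.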

🔹 Memory Usage (%)

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

✅ What it does?

  • Tracks memory consumption.

  • Useful for avoiding OOM (Out of Memory) issues.
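As a quick sanity check of the memory formula, the sketch below applies the same arithmetic to hypothetical values. The query uses MemAvailable rather than MemFree because MemAvailable also counts cache that the kernel can reclaim, so it is closer to what applications can actually obtain.

```python
GIB = 1024 ** 3


def memory_usage_percent(mem_total_bytes: int, mem_available_bytes: int) -> float:
    """Return used memory as a percentage of total memory."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


# Hypothetical server: 16 GiB total, 4 GiB still available.
print(f"Memory usage: {memory_usage_percent(16 * GIB, 4 * GIB):.1f}%")
```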

🔹 Disk Space Usage (%)

100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

✅ What it does?

  • Reports used space on the root filesystem (/) so you can act before a full disk causes downtime.


3️⃣ Network & Connectivity

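🔹 Network Traffic (Packets Received)

sum by (instance) (rate(node_network_receive_packets_total{device!="lo"}[5m]))

✅ What it does?

  • Reports packets received per second on each instance, excluding the loopback interface. This is a traffic indicator, not a latency measurement; measuring latency requires active probing (for example, probe_duration_seconds from the Blackbox Exporter).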

🔹 Network Errors

rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])

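✅ What it does?

  • Reports receive plus transmit errors per second for each network interface; sustained non-zero values point to network problems that affect service reliability.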


4️⃣ Response Time & Performance

🔹 Request Latency (if using Nginx/Apache)

For Nginx logs (via Prometheus Exporter):

histogram_quantile(0.95, sum(rate(nginx_vts_server_request_duration_seconds_bucket[5m])) by (le, instance))

✅ What it does?

  • Reports the 95th percentile response time per instance over the last 5 minutes, a common basis for latency SLOs.
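Because the percentile is estimated from histogram buckets rather than from individual requests, the result is an interpolation. The sketch below is a simplified illustration of that idea, not Prometheus's actual implementation: it assumes cumulative buckets sorted by upper bound, at least one finite bucket, a final +Inf bucket, and linear interpolation inside the bucket that contains the target rank. The bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Interpolate linearly within the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical request-duration buckets (seconds, cumulative request counts).
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(f"p95 is approximately {histogram_quantile(0.95, buckets):.2f} s")
```

With these buckets the 95th request falls halfway through the 0.5 s to 1.0 s bucket, so the estimate is about 0.75 s. The accuracy of the real query depends in the same way on how finely the exporter's buckets are spaced around the latencies you care about.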


5️⃣ Service Level Objective (SLO) & SLA Compliance

🔹 SLA Compliance (% Availability)

avg(avg_over_time(up[30d])) * 100

✅ What it does?

  • Measures average availability across all monitored servers over the last 30 days.

  • Helps validate SLA guarantees (e.g., 99.9% uptime).
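Availability derived from the `up` metric is simply the fraction of scrapes that succeeded. The sketch below shows that calculation and a comparison against a target, using made-up sample lists; the function names and the 99.9% default are assumptions for illustration.

```python
def availability_percent(up_samples):
    """Return the percentage of samples in which the target was up (value 1)."""
    return sum(up_samples) / len(up_samples) * 100


def meets_sla(up_samples, target_percent=99.9):
    """Check whether measured availability reaches the SLA target."""
    return availability_percent(up_samples) >= target_percent


# Hypothetical scrape history: 10,000 scrapes with 5 and 20 failures respectively.
healthy = [1] * 9995 + [0] * 5
degraded = [1] * 9980 + [0] * 20

for name, samples in (("healthy", healthy), ("degraded", degraded)):
    print(f"{name}: {availability_percent(samples):.2f}% -> meets 99.9%: {meets_sla(samples)}")
```

Note that this counts scrapes, not wall-clock seconds, so the resolution of the result is limited by the scrape interval.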


Suggested Grafana Panels

  • Stat Panels → Show Uptime %, CPU Usage, Memory Usage.

  • Time Series Graphs → Display Latency Trends, Network Performance.

  • Gauge Panels → Track Disk & Memory Usage.

  • Heatmaps → Analyze CPU & Load Distribution.


🔗 Final Thoughts

These queries let you track SLA compliance, detect issues, and visualize service health in Grafana.
