Observability
Monitoring vs. observability
Things to monitor
This is a core SRE interview + real-world ops question. Below is a practical, real-time view of how SREs actually calculate and monitor latency, throughput, availability, error budget, status codes, and other SRE signals — with formulas + live system examples.
1️⃣ Latency (Response Time)
📌 What it is
Time taken to serve a request.
📐 How to calculate (real time)
Latency = response_time - request_start_time
📊 Percentiles (VERY IMPORTANT)
p50 → median user experience
p95 / p99 → worst-case users
Example (Prometheus)
histogram_quantile(
0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
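The same histogram gives p50 and p99 by changing only the quantile argument (a quick sketch, assuming the same http_request_duration_seconds metric):
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))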
Real-world sources
Nginx access logs
Envoy
Application instrumentation
OpenTelemetry
2️⃣ Throughput (Traffic / Load)
📌 What it is
Number of requests processed per unit time.
📐 Formula
Throughput = total_requests / time
Example
6000 requests / 60 seconds = 100 RPS
Prometheus (HTTP RPS)
sum(rate(http_requests_total[1m]))
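To connect this to the formula above, increase() returns the raw request count over a window; dividing by the window length gives the same average RPS as rate(). A sketch, assuming the same http_requests_total counter:
# total requests in the last hour (the total_requests term)
sum(increase(http_requests_total[1h]))
# divided by the window length in seconds → average RPS
sum(increase(http_requests_total[1h])) / 3600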
3️⃣ Availability (Uptime %)
📌 What it is
Percentage of successful requests.
📐 Formula
Availability = (successful_requests / total_requests) × 100
Example
99,950 successful / 100,000 total
= 99.95%
Prometheus
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
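The same ratio can be expressed as a percentage over the full SLO window. A sketch assuming a 30-day window (in practice this usually runs on recording rules rather than a raw 30d range query):
sum(rate(http_requests_total{status=~"2.."}[30d]))
/
sum(rate(http_requests_total[30d]))
* 100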
4️⃣ Error Rate
📌 What it is
Percentage of failed requests.
📐 Formula
Error Rate = (error_requests / total_requests) × 100
Prometheus
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
5️⃣ Error Budget (Most Important SRE Concept)
📌 What it is
How much failure is acceptable.
📐 Formula
Error Budget = 100% - SLO
Example
99.9% SLO → 0.1% error budget
99.99% SLO → 0.01% error budget
Monthly example
Total seconds in 30 days = 2,592,000
Allowed downtime (99.9%) = 2,592 seconds (~43 min)
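Same arithmetic for a tighter SLO:
Allowed downtime (99.99%) = 2,592,000 × 0.0001 ≈ 259 seconds (~4.3 min)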
Error Budget Burn Rate
Burn Rate = current_error_rate / allowed_error_rate
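A minimal burn-rate sketch for a 99.9% availability SLO, reusing the 5xx error-rate query from above; a result above 1 means the budget is being spent faster than the SLO allows:
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
)
/ (1 - 0.999)  # allowed error rate for a 99.9% SLO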
6️⃣ Count HTTP Status Codes (Real Time)
Nginx / App metrics
sum(rate(http_requests_total{status="200"}[1m]))
sum(rate(http_requests_total{status=~"4.."}[1m]))
sum(rate(http_requests_total{status=~"5.."}[1m]))Log-based (Loki)
Log-based (Loki)
count_over_time({job="nginx"} |= " 500 "[1m])
7️⃣ Latency SLO (Golden Signal)
Example SLO
95% of requests must be < 300ms
Prometheus
histogram_quantile(
0.95,
rate(http_request_duration_seconds_bucket[5m])
) < 0.3
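An alternative, alert-friendlier form of the same SLO is the fraction of requests under the threshold, assuming the histogram has a 0.3s bucket boundary:
# share of requests faster than 300ms; target ≥ 0.95
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))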
8️⃣ Saturation (Resource Pressure)
📌 What it is
How close the system is to its limits.
Examples
CPU usage %
Memory usage %
Queue depth
Thread pool usage
Prometheus
avg(rate(container_cpu_usage_seconds_total[1m])) * 100
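A matching memory-saturation sketch, assuming node_exporter metrics are available:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100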
9️⃣ Apdex (User Satisfaction)
📌 What it is
Measures user happiness.
📐 Formula
Apdex = (Satisfied + Tolerating/2) / Total
Thresholds
Satisfied ≤ 300ms
Tolerating ≤ 1.2s
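If those thresholds exist as histogram bucket boundaries (le="0.3" and le="1.2" assumed here), Apdex can be computed directly in Prometheus; dividing by 2 corrects for satisfied requests appearing in both cumulative buckets:
(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))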
🔟 Real-Time Monitoring Stack (Industry Standard)
App → OpenTelemetry → Prometheus → Grafana → Alertmanager
Logs → Loki
Traces → Tempo / Jaeger
1️⃣1️⃣ Alerts (Actionable Only)
Bad Alert ❌
CPU > 80%
Good Alert ✅
p99 latency > 500ms for 5 mins AND error rate > 2%
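A PromQL sketch of that condition (the 5-minute hold would live in the alert rule's for: clause, not in the query):
# both sides reduce to a single series, so "and" requires both conditions to hold
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.02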
1️⃣2️⃣ Sample SRE Dashboard (What Companies Expect)
Latency → p50, p95, p99
Traffic → RPS
Errors → 4xx / 5xx
Availability → % success
Saturation → CPU, memory
Reliability → Error budget burn
🔑 Interview-Ready One-Liner Summary
“In real time, we calculate latency using percentiles from request histograms, throughput using request rate, availability as successful requests over total requests, and manage reliability using SLO-driven error budgets and burn rates.”
If you want next:
Real Grafana dashboard layout
SRE interview answers (FAANG-style)
Production Prometheus configs
Alerting rules examples
Explain this with Kubernetes + ALB + Nginx
Just say 👍
3 telemetry pillars: logs, metrics, traces
Availability, latency, total users, orders served per period, throughput, error budget, status code counts