Monitoring vs Observability

Things to Monitor

This is a core SRE interview and real-world ops question. Below is a practical, real-time view of how SREs actually calculate and monitor latency, throughput, availability, error budgets, status codes, and other SRE signals, with formulas and live-system examples.


1️⃣ Latency (Response Time)

📌 What it is

Time taken to serve a request.

📐 How to calculate (real time)

Latency = response_time - request_start_time

📊 Percentiles (VERY IMPORTANT)

  • p50 → median user experience

  • p95 / p99 → worst-case users

Example (Prometheus)

histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
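The same percentile idea can be sketched outside Prometheus. A minimal Python sketch using the nearest-rank method (the `percentile` helper and the sample latencies are hypothetical, not from any monitoring library):

```python
# Nearest-rank percentile over raw latency samples (in seconds).
# This approximates what histogram_quantile estimates from buckets.
def percentile(samples, p):
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)  # nearest-rank index
    return ordered[k]

latencies = [0.12, 0.15, 0.18, 0.21, 0.25, 0.30, 0.45, 0.60, 0.90, 1.40]
p50 = percentile(latencies, 50)  # median user experience
p95 = percentile(latencies, 95)  # worst-case tail
```

Note the trade-off: Prometheus interpolates inside histogram buckets, so its quantiles are estimates; raw-sample percentiles like this are exact but expensive to compute at scale.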

Real-world source

  • Nginx access logs

  • Envoy

  • Application instrumentation

  • OpenTelemetry


2️⃣ Throughput (Traffic / Load)

📌 What it is

Number of requests processed per unit time.

📐 Formula

Throughput = total_requests / time

Example

6000 requests / 60 seconds = 100 RPS

Prometheus (HTTP RPS)

sum(rate(http_requests_total[1m]))
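The counter-plus-rate pattern behind that query can be sketched in Python (the function name is hypothetical; Prometheus performs the same subtraction over scraped counter samples):

```python
# RPS from two samples of a monotonically increasing request counter,
# mirroring what rate(http_requests_total[1m]) computes.
def requests_per_second(count_start, count_end, window_seconds):
    return (count_end - count_start) / window_seconds

rps = requests_per_second(0, 6000, 60)  # the worked example: 100.0 RPS
```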

3️⃣ Availability (Uptime %)

📌 What it is

Percentage of successful requests.

📐 Formula

Availability = (successful_requests / total_requests) × 100

Example

99,950 successful / 100,000 total
= 99.95%

Prometheus

sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

4️⃣ Error Rate

📌 What it is

Percentage of failed requests.

📐 Formula

Error Rate = (error_requests / total_requests) × 100

Prometheus

sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
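Availability and error rate are two views of the same ratio. A small Python sketch of both formulas (hypothetical helper names; the counts mirror the 100,000-request example above):

```python
# Availability and error rate from per-status request counts, the same
# ratios as the two Prometheus queries above, expressed as percentages.
def availability_pct(successful, total):
    return successful / total * 100

def error_rate_pct(errors, total):
    return errors / total * 100

avail = availability_pct(99_950, 100_000)  # ≈ 99.95
err = error_rate_pct(50, 100_000)          # ≈ 0.05
```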

5️⃣ Error Budget (Most Important SRE Concept)

📌 What it is

How much failure is acceptable.

📐 Formula

Error Budget = 100% - SLO

Example

SLO       Error Budget
99.9%     0.1%
99.99%    0.01%

Monthly example

Total seconds in 30 days = 2,592,000
Allowed downtime (99.9%) = 2,592 seconds (~43 min)

Error Budget Burn Rate

Burn Rate = current_error_rate / allowed_error_rate
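The budget, allowed downtime, and burn-rate formulas above can be sketched together in Python (hypothetical function names; the numbers mirror the 30-day / 99.9% example):

```python
# Error budget, allowed downtime, and burn rate derived from an SLO.
def error_budget_pct(slo_pct):
    return 100.0 - slo_pct

def allowed_downtime_seconds(slo_pct, period_seconds):
    return period_seconds * error_budget_pct(slo_pct) / 100.0

def burn_rate(current_error_rate_pct, slo_pct):
    return current_error_rate_pct / error_budget_pct(slo_pct)

downtime = allowed_downtime_seconds(99.9, 30 * 24 * 3600)  # ≈ 2592 s
br = burn_rate(0.5, 99.9)  # ≈ 5: burning budget 5x faster than allowed
```

A burn rate above 1 means the budget will be exhausted before the window ends; multi-window burn-rate alerting is built on exactly this ratio.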

6️⃣ Count HTTP Status Codes (Real Time)

Nginx / App metrics

sum(rate(http_requests_total{status="200"}[1m]))
sum(rate(http_requests_total{status=~"4.."}[1m]))
sum(rate(http_requests_total{status=~"5.."}[1m]))

Log-based (Loki)

count_over_time({job="nginx"} |= " 500 "[1m])
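A log-based tally can also be sketched directly in Python: count status classes (2xx/4xx/5xx) from nginx combined-format access-log lines, a rough analogue of the Loki query above (the sample log lines are hypothetical):

```python
from collections import Counter

def count_status_classes(log_lines):
    counts = Counter()
    for line in log_lines:
        # in combined format, the field after the quoted request
        # line ("GET /path HTTP/1.1") is the status code
        status = line.split('"')[2].split()[0]
        counts[status[0] + "xx"] += 1
    return counts

logs = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /api HTTP/1.1" 200 512 "-" "curl"',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /api HTTP/1.1" 500 42 "-" "curl"',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "POST /api HTTP/1.1" 404 0 "-" "curl"',
]
counts = count_status_classes(logs)
```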

7️⃣ Latency SLO (Golden Signal)

Example SLO

95% of requests must be < 300ms

Prometheus

histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < 0.3

8️⃣ Saturation (Resource Pressure)

📌 What it is

How close the system is to its limits.

Examples

  • CPU usage %

  • Memory usage %

  • Queue depth

  • Thread pool usage

Prometheus

avg(rate(container_cpu_usage_seconds_total[1m])) * 100

(rate of CPU-seconds per second is cores in use; ×100 reads as a percentage of one core)

9️⃣ Apdex (User Satisfaction)

📌 What it is

Measures user happiness.

📐 Formula

Apdex = (Satisfied + Tolerating/2) / Total

Thresholds

  • Satisfied ≤ 300ms

  • Tolerating ≤ 1.2s
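The Apdex formula with those thresholds can be sketched in Python (hypothetical function name; T = 300 ms, tolerating up to 4T = 1.2 s, frustrated beyond that):

```python
# Apdex score from raw latencies in milliseconds with threshold T.
def apdex(latencies_ms, t_ms=300):
    satisfied = sum(1 for l in latencies_ms if l <= t_ms)
    tolerating = sum(1 for l in latencies_ms if t_ms < l <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

score = apdex([100, 200, 250, 400, 900, 2000])  # (3 + 2/2) / 6 ≈ 0.67
```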


🔟 Real-Time Monitoring Stack (Industry Standard)

App → OpenTelemetry → Prometheus → Grafana → Alertmanager
Logs → Loki
Traces → Tempo / Jaeger

1️⃣1️⃣ Alerts (Actionable Only)

Bad Alert ❌

CPU > 80%

Good Alert ✅

p99 latency > 500ms for 5 mins AND error rate > 2%


1️⃣2️⃣ Sample SRE Dashboard (What Companies Expect)

Signal         Metric
Latency        p50, p95, p99
Traffic        RPS
Errors         4xx / 5xx
Availability   % success
Saturation     CPU, memory
Reliability    Error budget burn


🔑 Interview-Ready One-Liner Summary

“In real time, we calculate latency using percentiles from request histograms, throughput using request rate, availability as successful requests over total requests, and manage reliability using SLO-driven error budgets and burn rates.”



3 Telemetry Pillars (Logs, Metrics, Traces)


Availability, latency, total users, orders served per period, throughput, error budget, status code count.
