Monitoring vs Observability

Things to Monitor

This is a core SRE interview and real-world ops question. Below is a practical, real-time view of how SREs actually calculate and monitor latency, throughput, availability, error budgets, status codes, and other SRE signals, with formulas and live-system examples.


1️⃣ Latency (Response Time)

📌 What it is

Time taken to serve a request.

📐 How to calculate (real time)

Latency = response_time - request_start_time

📊 Percentiles (VERY IMPORTANT)

  • p50 → median user experience

  • p95 / p99 → worst-case users

Example (Prometheus)

histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
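The same percentile idea can be sketched outside Prometheus. A minimal Python sketch using the nearest-rank method (the `percentile` helper and the sample latencies are hypothetical, not from any monitoring library):

```python
# Nearest-rank percentile over raw latency samples (in seconds).
# This approximates what histogram_quantile estimates from buckets.
def percentile(samples, p):
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)  # nearest-rank index
    return ordered[k]

latencies = [0.12, 0.15, 0.18, 0.21, 0.25, 0.30, 0.45, 0.60, 0.90, 1.40]
p50 = percentile(latencies, 50)  # median user experience
p95 = percentile(latencies, 95)  # worst-case tail
```

Note the trade-off: Prometheus interpolates inside histogram buckets, so its quantiles are estimates; raw-sample percentiles like this are exact but expensive to compute at scale.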

Real-world source

  • Nginx access logs

  • Envoy

  • Application instrumentation

  • OpenTelemetry


2️⃣ Throughput (Traffic / Load)

📌 What it is

Number of requests processed per unit time.

📐 Formula

Throughput = total_requests / time

Example

6000 requests / 60 seconds = 100 RPS

Prometheus (HTTP RPS)

sum(rate(http_requests_total[1m]))
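The counter-plus-rate pattern behind that query can be sketched in Python (the function name is hypothetical; Prometheus performs the same subtraction over scraped counter samples):

```python
# RPS from two samples of a monotonically increasing request counter,
# mirroring what rate(http_requests_total[1m]) computes.
def requests_per_second(count_start, count_end, window_seconds):
    return (count_end - count_start) / window_seconds

rps = requests_per_second(0, 6000, 60)  # the worked example: 100.0 RPS
```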

3️⃣ Availability (Uptime %)

📌 What it is

Percentage of successful requests.

📐 Formula

Availability = (successful_requests / total_requests) × 100

Example

99,950 successful / 100,000 total
= 99.95%

Prometheus

sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

4️⃣ Error Rate

📌 What it is

Percentage of failed requests.

📐 Formula

Error Rate = (error_requests / total_requests) × 100

Prometheus

sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
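Availability and error rate are two views of the same ratio. A small Python sketch of both formulas (hypothetical helper names; the counts mirror the 100,000-request example above):

```python
# Availability and error rate from per-status request counts, the same
# ratios as the two Prometheus queries above, expressed as percentages.
def availability_pct(successful, total):
    return successful / total * 100

def error_rate_pct(errors, total):
    return errors / total * 100

avail = availability_pct(99_950, 100_000)  # ≈ 99.95
err = error_rate_pct(50, 100_000)          # ≈ 0.05
```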

5️⃣ Error Budget (Most Important SRE Concept)

📌 What it is

How much failure is acceptable.

📐 Formula

Error Budget = 100% - SLO

Example

SLO       Error Budget
99.9%     0.1%
99.99%    0.01%

Monthly example

Total seconds in 30 days = 2,592,000
Allowed downtime (99.9%) = 2,592 seconds (~43 min)

Error Budget Burn Rate

Burn Rate = current_error_rate / allowed_error_rate
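The budget, allowed downtime, and burn-rate formulas above can be sketched together in Python (hypothetical function names; the numbers mirror the 30-day / 99.9% example):

```python
# Error budget, allowed downtime, and burn rate derived from an SLO.
def error_budget_pct(slo_pct):
    return 100.0 - slo_pct

def allowed_downtime_seconds(slo_pct, period_seconds):
    return period_seconds * error_budget_pct(slo_pct) / 100.0

def burn_rate(current_error_rate_pct, slo_pct):
    return current_error_rate_pct / error_budget_pct(slo_pct)

downtime = allowed_downtime_seconds(99.9, 30 * 24 * 3600)  # ≈ 2592 s
br = burn_rate(0.5, 99.9)  # ≈ 5: burning budget 5x faster than allowed
```

A burn rate above 1 means the budget will be exhausted before the window ends; multi-window burn-rate alerting is built on exactly this ratio.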

6️⃣ Count HTTP Status Codes (Real Time)

Nginx / App metrics

sum(rate(http_requests_total{status="200"}[1m]))
sum(rate(http_requests_total{status=~"4.."}[1m]))
sum(rate(http_requests_total{status=~"5.."}[1m]))

Log-based (Loki)

count_over_time({job="nginx"} |= " 500 "[1m])
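A log-based tally can also be sketched directly in Python: count status classes (2xx/4xx/5xx) from nginx combined-format access-log lines, a rough analogue of the Loki query above (the sample log lines are hypothetical):

```python
from collections import Counter

def count_status_classes(log_lines):
    counts = Counter()
    for line in log_lines:
        # in combined format, the field after the quoted request
        # line ("GET /path HTTP/1.1") is the status code
        status = line.split('"')[2].split()[0]
        counts[status[0] + "xx"] += 1
    return counts

logs = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /api HTTP/1.1" 200 512 "-" "curl"',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /api HTTP/1.1" 500 42 "-" "curl"',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "POST /api HTTP/1.1" 404 0 "-" "curl"',
]
counts = count_status_classes(logs)
```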

7️⃣ Latency SLO (Golden Signal)

Example SLO

95% of requests must be < 300ms

Prometheus

histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < 0.3

8️⃣ Saturation (Resource Pressure)

📌 What it is

How close the system is to its limits.

Examples

  • CPU usage %

  • Memory usage %

  • Queue depth

  • Thread pool usage

Prometheus

avg(rate(container_cpu_usage_seconds_total[1m])) * 100

(rate of CPU-seconds per second is cores in use; ×100 reads as a percentage of one core)

9️⃣ Apdex (User Satisfaction)

📌 What it is

Measures user happiness.

📐 Formula

Apdex = (Satisfied + Tolerating/2) / Total

Thresholds

  • Satisfied ≤ 300ms

  • Tolerating ≤ 1.2s
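The Apdex formula with those thresholds can be sketched in Python (hypothetical function name; T = 300 ms, tolerating up to 4T = 1.2 s, frustrated beyond that):

```python
# Apdex score from raw latencies in milliseconds with threshold T.
def apdex(latencies_ms, t_ms=300):
    satisfied = sum(1 for l in latencies_ms if l <= t_ms)
    tolerating = sum(1 for l in latencies_ms if t_ms < l <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

score = apdex([100, 200, 250, 400, 900, 2000])  # (3 + 2/2) / 6 ≈ 0.67
```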


🔟 Real-Time Monitoring Stack (Industry Standard)

App → OpenTelemetry → Prometheus → Grafana → Alertmanager
Logs → Loki
Traces → Tempo / Jaeger

1️⃣1️⃣ Alerts (Actionable Only)

Bad Alert ❌

CPU > 80%

Good Alert ✅

p99 latency > 500ms for 5 mins AND error rate > 2%


1️⃣2️⃣ Sample SRE Dashboard (What Companies Expect)

Signal         Metric
Latency        p50, p95, p99
Traffic        RPS
Errors         4xx / 5xx
Availability   % success
Saturation     CPU, memory
Reliability    Error budget burn


🔑 Interview-Ready One-Liner Summary

“In real time, we calculate latency using percentiles from request histograms, throughput using request rate, availability as successful requests over total requests, and manage reliability using SLO-driven error budgets and burn rates.”



3 Telemetry Pillars (Logs, Metrics, Traces)


Availability, latency, total users, orders served per period, throughput, error budget, status code count.
