SLO

STUDY MATERIAL

Google SRE - Embracing risk and reliability engineering book

SLOs that matter to them

SLOs set by them

Suppose we are building and maintaining a logistics platform for a client. The platform has a list of orders fetched from various e-commerce sites (e.g., Shopify). A seller chooses an order, assigns it to a specific courier partner from the logistics app, and monitors the delivery status.

So here, who will set the SLOs, and what SLOs would they set?

Great scenario — you're building and maintaining a logistics platform for a client, where the flow involves:

- Fetching orders from e-commerce sites (e.g., Shopify)
- Allowing sellers to assign couriers
- Tracking delivery status

Now let’s break it down:

🎯 Who Sets the SLOs?

| Stakeholder | What They Care About | Do They Set SLOs? |
| --- | --- | --- |
| Your client (the platform owner) | Wants reliable service for their sellers and integrations | ✅ Yes |
| You (DevOps/SRE provider) | Ensures backend APIs/services meet those expectations | ✅ Yes (internal targets) |
| E-commerce sites (Shopify) | Provide data but usually don't set SLOs for your platform | ❌ No |
| Sellers (using the platform) | Expect responsive and reliable UI/API | ❌ Indirectly (through client expectations) |

🧩 What SLOs Could Be Set and By Whom?

1. Order Fetching API

Who sets it: Your client
SLO example:
- 99.9% of /fetch-orders responses succeed within 2 seconds
- Availability: 99.9%, i.e., < 43.2 minutes of downtime per 30-day month

Why: If order data isn't fetched reliably, sellers can’t fulfill or track orders.
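The downtime figure follows directly from the availability target. Here is a minimal sketch of that arithmetic, assuming a 30-day window; the function name `allowed_downtime_minutes` is made up for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # The two availability targets used in these notes.
    for target in (99.9, 99.95):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes per 30 days")
```

A shorter or longer window (say 28 or 31 days) changes the budget slightly, so it helps to agree on the window length with the client up front.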

2. Courier Assignment API

Who sets it: Your client
SLO example:
- 99.95% success rate for /assign-courier
- p95 latency < 300ms

Why: This is mission-critical. A failed courier assignment breaks fulfillment.
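To make the two targets concrete, here is a rough sketch of how the success-rate and p95-latency SLIs could be computed from request records. The `Request` type is hypothetical, and counting only 5xx responses as failures is an assumption you would want to agree with the client (some teams also count certain 4xx codes).

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned by /assign-courier
    latency_ms: float  # time taken to serve the request


def success_rate(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx (assumed definition)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(requests: list[Request]) -> bool:
    """Check the example targets: 99.95% success and p95 latency under 300 ms."""
    p95 = percentile([r.latency_ms for r in requests], 95)
    return success_rate(requests) >= 0.9995 and p95 < 300
```

In practice Prometheus would usually derive these from counters and histograms rather than raw request lists, but the definitions are the same.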

3. Delivery Tracking / Status Updates

Who sets it: Your client
SLO example:
- 99.9% of /track-delivery responses return latest status within 1 second
- Updates should not lag more than 5 minutes in >99.5% of cases (if async)

Why: Sellers need real-time visibility for customer service & notifications.
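The "updates should not lag more than 5 minutes" target is a freshness SLI. Here is a small sketch of how it could be measured, assuming you can pair the time the courier reported an event with the time the platform showed it; both timestamps and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta


def freshness_ratio(
    events: list[tuple[datetime, datetime]],
    max_lag: timedelta = timedelta(minutes=5),
) -> float:
    """Fraction of status updates visible within max_lag of the courier event.

    Each item is (courier_event_time, platform_visible_time).
    """
    fresh = sum(1 for happened, visible in events if visible - happened <= max_lag)
    return fresh / len(events)


def meets_freshness_slo(events: list[tuple[datetime, datetime]]) -> bool:
    """Example target from above: more than 99.5% of updates lag at most 5 minutes."""
    return freshness_ratio(events) > 0.995
```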

4. Overall Platform Availability (Frontend/API Gateway)

Who sets it: Your client (or jointly with your SRE team)
SLO example:
- 99.95% availability of dashboard or core APIs
- i.e., the platform is available 24x7 with < 21.6 minutes of downtime per 30-day month

📈 Bonus: SLOs Based on User Journeys

These combine multiple API calls into a business-critical flow.

Example: "Assign and Track Order" Flow

SLO:
99.9% of sellers should be able to fetch order → assign courier → get tracking info within 60 seconds total

Why: This reflects real user experience, not just API uptime.
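A journey SLI counts a flow as good only if every step succeeds and the whole flow fits in the time budget. Here is a minimal sketch under those assumptions; the `Journey` record and its field names are hypothetical, and in a real setup the timings would come from tracing or correlated logs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Journey:
    # Seconds each step took; None means the step failed or never happened.
    fetch_s: Optional[float]
    assign_s: Optional[float]
    track_s: Optional[float]


def journey_ok(journey: Journey, budget_s: float = 60.0) -> bool:
    """True if all three steps succeeded and together took at most budget_s."""
    steps = (journey.fetch_s, journey.assign_s, journey.track_s)
    if any(step is None for step in steps):
        return False
    return sum(steps) <= budget_s


def journey_sli(journeys: list[Journey]) -> float:
    """Fraction of journeys that completed end to end within the budget."""
    return sum(journey_ok(j) for j in journeys) / len(journeys)
```

The result is compared against the 99.9% target the same way as a per-API SLI.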

🧠 Practical Setup

As the SRE/DevOps team, you'd:

- Define SLIs (availability, latency, error rate) per API
- Set SLOs with the client based on:
  - Criticality
  - Business needs
  - Technical feasibility
- Monitor error budgets and burn rates, and alert on them, using Prometheus + Grafana
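Error budget and burn rate are both simple ratios. The sketch below shows the usual definitions, assuming the SLI is counted as good vs. failed requests; the function names are made up for illustration, and in Prometheus the same ratios would be written as recording rules.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget would be used up exactly at the end of the SLO window;
    higher values mean it runs out sooner.
    """
    return (failed / total) / (1.0 - slo)
```

For example, with a 99.9% SLO, 1,000,000 requests allow about 1,000 failures, so 250 failures leave roughly 75% of the budget.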

✅ Final Summary

| Component | Who Sets the SLO | Example SLO |
| --- | --- | --- |
| /fetch-orders | Client | 99.9% success, <2s latency |
| /assign-courier | Client | 99.95% success, <300ms p95 latency |
| /track-delivery | Client | 99.9% success, <1s latency |
| Platform frontend | Client & You | 99.95% uptime |
| User journey (end-to-end) | You suggest → client approves | 99.9% success end-to-end flow in <60s |

Would you like help building a sample SLO document to present to your client, or define SLIs + Prometheus rules for this?

Exactly ✅ — you’ve got it right.

In real-world SRE practice, including your logistics platform scenario:

🎯 Only Critical APIs Get Explicit SLOs

✔️ Yes, you set SLOs only for APIs that matter most to:

- The business
- Client success
- User experience

🧩 Why?

Monitoring and enforcing SLOs for every single API is:

- Overkill for low-impact endpoints
- Noise-heavy (you’ll get alerts for things that don’t affect users)
- Resource-intensive (observability cost & effort)

✅ So Yes, Clients Typically Want SLOs for:

🔸 High-priority, user-facing APIs, like:

| API Endpoint | Reason it's Critical |
| --- | --- |
| /fetch-orders | Without this, sellers can’t fulfill |
| /assign-courier | Directly affects order shipping |
| /track-delivery | Affects customer communication |
| /login, /place-order | Blockers for platform usage |

⚠️ APIs That Usually Don’t Need SLOs:

| API Endpoint | Why not critical |
| --- | --- |
| /get-banner, /faq | Cosmetic, doesn't affect order flow |
| /user-feedback | Low frequency usage |
| /help-center-content | Non-functional/auxiliary |

✅ What the Client Wants

The client cares mainly about high-value business operations, so they’ll say something like:
“Ensure 99.95% uptime for /assign-courier and order processing. I don’t care about the help page.”

🔧 What You Do as SRE

| Task | Action |
| --- | --- |
| Identify critical APIs | Talk to product/client, study user flow |
| Define SLIs/SLOs | Only for APIs that affect revenue/UX |
| Monitor others lightly | Use general metrics/logging, no alerts |
| Tune alerting | Alert only on SLO-breach-worthy APIs |

🧠 Real-World Tip

Start with 3–5 SLOs max for the platform. More than that becomes hard to manage.

Would you like help designing a simple SLO proposal doc to send to your client listing only the APIs that should be tracked?

So the client only sets these SLOs, and the other monitoring, like system metrics and logs, is set up by the service provider without the client interfering?

Exactly ✅ — you've nailed it.

In a client-service provider relationship like yours, here's how SLOs, monitoring, and observability responsibilities are typically split:

🧭 Who Owns What?

| Responsibility | Owned by Client? | Owned by You (Service Provider)? |
| --- | --- | --- |
| Setting SLOs for critical APIs | ✅ Yes | ✅ (with your guidance) |
| System-level monitoring (CPU, memory, disk) | ❌ No | ✅ Yes |
| Internal metrics (pod health, queues, retries) | ❌ No | ✅ Yes |
| Application logs, infra logs | ❌ No | ✅ Yes |
| Uptime/downtime reporting for APIs | ✅ Interested | ✅ You provide it |
| Alerting and incident response | ❌ No | ✅ You handle it |

🔍 What Clients Usually Care About

Clients usually want outcomes, not implementation details:

✅ What they want:

- “Is the /assign-courier API up 99.95% of the time?”
- “Can you alert us if a delivery status isn't updated for 5 minutes?”
- SLA/SLO adherence reports

❌ What they don’t care about:

- “CPU load on Node 3”
- “Prometheus scrape errors”
- “Pod crashloop count”

⚙️ What You Set Up Behind the Scenes

As the service provider / SRE / DevOps, you maintain:

System Metrics:

- Node resource usage (CPU, memory, disk)
- Pod/container status
- DB/query performance

Application Metrics:

- API error rates (4xx/5xx)
- Latency and throughput
- Queue size, retry rates, etc.

Logs:

- App logs (e.g., failed courier assignment reasons)
- Infra logs (Nginx, DB logs, etc.)

Alerting:

- Infra failure alerts (e.g., disk full, OOMKilled)
- App crash/slowdown alerts
- Burn rate alerts for SLO violations
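For the burn rate alerts, one common pattern is the multi-window check described in the Google SRE Workbook: page only when both a long window (for example 1 hour) and a short window (for example 5 minutes) are burning fast. The sketch below shows the logic only; the function names are hypothetical, the "2% of the budget in 1 hour" figure is the Workbook's example starting point rather than something this platform requires, and the real alert would live in Prometheus rules.

```python
def burn_threshold(
    budget_fraction: float, slo_window_hours: float, alert_window_hours: float
) -> float:
    """Burn rate at which budget_fraction of the budget is spent within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows it is
    still happening, so the alert clears soon after recovery.
    """
    return long_burn > threshold and short_burn > threshold


# 2% of a 30-day (720 h) budget spent within 1 hour -> burn rate of about 14.4
PAGE_THRESHOLD = burn_threshold(0.02, 30 * 24, 1)
```

Slower burns (a lower threshold over a longer window) are usually routed to a ticket instead of a page.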

You might share summary dashboards with the client, but not raw logs or system metrics (unless they ask).

🛠️ SRE Best Practice

🎯 Clients define business-level SLOs
🛡️ You enforce them using low-level metrics/logs
🧾 You give reporting or dashboards so they can verify it

✅ Real-World Summary

| Thing | Set/Owned by |
| --- | --- |
| 99.9% SLO for /fetch-orders | ✅ Client defines |
| Pod restart count alert | ✅ You own it |
| Disk usage > 90% on node | ✅ You monitor |
| p95 latency alert on API | ✅ You configure |
| Incident RCA & resolution | ✅ You handle |

Let me know if you want:

- A checklist for what to monitor behind the scenes
- A sample monthly SLO report to send to clients
- A dashboard template to separate internal vs client-facing metrics

How should I structure Grafana? Give me tree diagrams, e.g., folders and subfolders for specific SLOs and metrics, with one dashboard for the client and a separate one for the metrics the client isn't interested in.

What matters to them?

In most real-world scenarios:

🔍 Clients don’t actively monitor the SLOs — but they do expect visibility into them.

Let’s break that down clearly:

✅ Who Monitors vs Who Cares About the SLO?

| Action | SRE/Service Provider | Client (Platform Owner) |
| --- | --- | --- |
| Actively monitors 24/7 | ✅ Yes | ❌ No |
| Sets SLO targets | 🔄 Collaboratively | ✅ Yes |
| Gets alerts | ✅ Internal only | ❌ Rarely |
| Receives reports/dashboards | ✅ You provide | ✅ Yes |
| Investigates burn rate | ✅ Yes | ❌ No |

📊 What Clients Actually Want

Clients don’t sit in Grafana all day. They expect:

- SLO dashboards (read-only access)
- Monthly SLO/SLA compliance reports
- Summary notifications when something goes wrong (e.g. incident report or “SLO violation occurred on /assign-courier”)
- Confidence that you're monitoring it on their behalf

🧭 So What Should You Do?

✅ Your responsibility:

- Proactively track, monitor, and alert on all SLOs
- Share clean Grafana dashboards under 📁 Client Dashboards
- Provide monthly SLO reports (PDF/export/email)
- If there's an incident or burn rate spike → notify client
- Act fast when error budget starts burning

🧠 Real-World Examples

| Situation | What Client Does | What You Do |
| --- | --- | --- |
| SLO burn rate spikes for /assign-courier | ❌ Doesn’t notice | ✅ You get alerted & mitigate it |
| Monthly report shows 99.85% (SLO was 99.9%) | ✅ Client asks “why?” | ✅ You provide RCA or context |
| Tracking API p95 latency degrading gradually | ❌ Client unaware | ✅ You create alert, investigate |
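For the monthly-report row, it helps to tell the client how far over budget the month was, not just the raw percentage. Here is a small sketch of that calculation; the function name is made up for illustration.

```python
def budget_consumed(slo_percent: float, achieved_percent: float) -> float:
    """Share of the error budget used, where 1.0 means exactly exhausted."""
    return (100.0 - achieved_percent) / (100.0 - slo_percent)


if __name__ == "__main__":
    used = budget_consumed(slo_percent=99.9, achieved_percent=99.85)
    print(f"Error budget used: {used:.0%}")
```

With a 99.9% target and 99.85% achieved, the unavailability was 0.15% against an allowed 0.10%, so the month used about one and a half times its error budget.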

✅ Summary

| Question | Answer |
| --- | --- |
| Does the client monitor SLOs? | ❌ No, but they expect visibility |
| Who handles real-time monitoring? | ✅ You (SRE/dev team) |
| Should the client get dashboards? | ✅ Yes (read-only, clean view) |
| Should the client get reports? | ✅ Yes (monthly or on-demand) |

Would you like a monthly SLO report template (in Markdown or Google Docs format) that you can fill out and send to clients?


Last updated 7 months ago
