SLO
STUDY MATERIAL
Who the SLOs matter to
Who sets the SLOs
Suppose we are building and maintaining a logistics platform for a client. The platform lists orders fetched from various e-commerce sites (e.g., Shopify). A seller chooses an order, assigns it to a specific courier partner from the logistics app, and monitors the delivery status.
So who sets the SLOs here, and what SLOs would they set?
Great scenario — you're building and maintaining a logistics platform for a client, where the flow involves:
Fetching orders from e-commerce sites (e.g., Shopify)
Allowing sellers to assign couriers
Tracking delivery status
Now let’s break it down:
🎯 Who Sets the SLOs?
| Stakeholder | Role | Sets SLOs? |
| --- | --- | --- |
| Your client (the platform owner) | Wants reliable service for their sellers and integrations | ✅ Yes |
| You (DevOps/SRE provider) | Ensures backend APIs/services meet those expectations | ✅ Yes (internal targets) |
| E-commerce sites (Shopify) | Provide data but usually don't set SLOs for your platform | ❌ No |
| Sellers (using the platform) | Expect responsive and reliable UI/API | ❌ Indirectly (through client expectations) |
🧩 What SLOs Could Be Set and By Whom?
1. Order Fetching API
Who sets it: Your client
SLO example:
99.9% of /fetch-orders responses succeed within 2 seconds
Uptime: 99.9%, i.e. < 43.2 minutes of downtime per 30-day month
Why: If order data isn't fetched reliably, sellers can’t fulfill or track orders.
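The downtime figures quoted for each availability target follow from simple arithmetic. Here is a minimal sketch, assuming a 30-day month; the function name `allowed_downtime_minutes` is made up for illustration:

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - slo_percent) / 100.0


# 99.9% leaves 0.1% of the month as downtime budget; 99.95% leaves half of that.
print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
print(round(allowed_downtime_minutes(99.95), 1))  # 21.6
```

A 31-day month gives slightly larger numbers, so it helps to agree on the measurement window with the client up front.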
2. Courier Assignment API
Who sets it: Your client
SLO example:
99.95% success rate for /assign-courier
p95 latency < 300ms
Why: This is mission-critical. A failed courier assignment breaks fulfillment.
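To make the two targets concrete, here is a rough sketch of how a success rate and a p95 latency could be computed from raw request samples. The `RequestSample` type is hypothetical, and treating only 5xx responses as failures is an assumption; in practice these numbers usually come from your metrics system rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:  # hypothetical record of one /assign-courier call
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # request duration in milliseconds


def success_rate(samples: list[RequestSample]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    ok = sum(1 for s in samples if s.status < 500)
    return ok / len(samples)


def p95_latency_ms(samples: list[RequestSample]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    latencies = sorted(s.latency_ms for s in samples)
    rank = (95 * len(latencies) + 99) // 100  # ceil(0.95 * n) in integer math
    return latencies[rank - 1]


def meets_assign_courier_slo(samples: list[RequestSample]) -> bool:
    """Check the example targets: 99.95% success and p95 under 300 ms."""
    return success_rate(samples) >= 0.9995 and p95_latency_ms(samples) < 300
```

Whether client errors (4xx) should count against the SLO is a decision to make with the client, since a rejected request may or may not be the platform's fault.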
3. Delivery Tracking / Status Updates
Who sets it: Your client
SLO example:
99.9% of /track-delivery responses return the latest status within 1 second
Updates should lag by no more than 5 minutes in >99.5% of cases (if async)
Why: Sellers need real-time visibility for customer service & notifications.
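The "no more than 5 minutes of lag" part is a freshness SLI rather than a latency SLI. A minimal sketch, assuming you can pair each courier event time with the time the platform recorded it (the function name and input shape are illustrative):

```python
from datetime import datetime, timedelta


def fresh_update_ratio(updates, max_lag=timedelta(minutes=5)):
    """Fraction of status updates that reached the platform within `max_lag`.

    `updates` is a list of (courier_event_time, platform_update_time) pairs;
    this input shape is an assumption made for the example.
    """
    fresh = sum(
        1 for event_at, updated_at in updates
        if updated_at - event_at <= max_lag
    )
    return fresh / len(updates)


t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    (t0, t0 + timedelta(seconds=30)),
    (t0, t0 + timedelta(minutes=1)),
    (t0, t0 + timedelta(minutes=5)),
    (t0, t0 + timedelta(minutes=9)),  # too stale, counts against the SLI
]
print(fresh_update_ratio(sample))  # 0.75
```

The example target would then be met when this ratio stays above 0.995 over the measurement window.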
4. Overall Platform Availability (Frontend/API Gateway)
Who sets it: Your client (or jointly with your SRE team)
SLO example:
99.95% availability of the dashboard and core APIs, i.e. the platform is available 24x7 with < 21.6 minutes of downtime per 30-day month
📈 Bonus: SLOs Based on User Journeys
These combine multiple API calls into a business-critical flow.
Example: "Assign and Track Order" Flow
SLO:
99.9% of sellers should be able to fetch order → assign courier → get tracking info within 60 seconds total
Why: This reflects real user experience, not just API uptime.
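A journey SLI counts whole flows instead of individual requests. Here is a small sketch, assuming each journey is recorded as the durations of its three steps (fetch, assign, track) with `None` marking a failed step; that recording format is hypothetical:

```python
from typing import Optional


def journey_is_good(step_seconds: list[Optional[float]],
                    budget_seconds: float = 60.0) -> bool:
    """A journey is good if every step succeeded and the total fits the budget."""
    if any(step is None for step in step_seconds):
        return False  # one failed step fails the whole journey
    return sum(step_seconds) <= budget_seconds


def journey_sli(journeys: list[list[Optional[float]]]) -> float:
    """Fraction of journeys that completed successfully within the time budget."""
    good = sum(1 for journey in journeys if journey_is_good(journey))
    return good / len(journeys)


journeys = [
    [1.5, 0.2, 0.8],     # good
    [2.0, None, 0.5],    # courier assignment failed
    [40.0, 15.0, 10.0],  # succeeded, but took 65 s in total
    [1.0, 0.3, 0.4],     # good
]
print(journey_sli(journeys))  # 0.5
```

Note that a journey can fail this SLI even when every individual API is inside its own SLO, which is why it reflects user experience more closely.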
🧠 Practical Setup
As the SRE/DevOps team, you'd:
Define SLIs (availability, latency, error rate) per API
Set SLOs with the client based on:
Criticality
Business needs
Technical feasibility
Monitor error budgets, burn rates, alerting using Prometheus + Grafana
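Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below shows that calculation and a two-window paging check. The default threshold of 14.4 is a commonly cited value for a 1-hour window against a 30-day budget (it corresponds to spending 2% of the budget in one hour); treat it, and the function names, as assumptions to tune for your own setup.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    Requiring both windows avoids paging on a spike that has already ended.
    """
    return (burn_rate(long_window_errors, slo_target) > threshold
            and burn_rate(short_window_errors, slo_target) > threshold)


# 2% errors over the last hour and 3% over the last 5 minutes on a 99.9% SLO:
print(should_page(0.02, 0.03, slo_target=0.999))    # True
# The hour looks bad, but the last 5 minutes have recovered:
print(should_page(0.02, 0.0005, slo_target=0.999))  # False
```

In a Prometheus + Grafana setup the same comparison would normally live in an alerting rule; the Python here only illustrates the logic.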
✅ Final Summary
| API / flow | Set by | SLO |
| --- | --- | --- |
| /fetch-orders | Client | 99.9% success, <2s latency |
| /assign-courier | Client | 99.95% success, <300ms p95 latency |
| /track-delivery | Client | 99.9% success, <1s latency |
| Platform frontend | Client & You | 99.95% uptime |
| User journey (end-to-end) | You suggest → client approves | 99.9% of end-to-end flows succeed in <60s |
Would you like help building a sample SLO document to present to your client, or define SLIs + Prometheus rules for this?
Exactly ✅ — you’ve got it right.
In real-world SRE practice, including your logistics platform scenario:
🎯 Only Critical APIs Get Explicit SLOs
✔️ Yes, you set SLOs only for APIs that matter most to:
The business
Client success
User experience
🧩 Why?
Monitoring and enforcing SLOs for every single API is:
Overkill for low-impact endpoints
Noise-heavy (you’ll get alerts for things that don’t affect users)
Resource-intensive (observability cost & effort)
✅ So Yes, Clients Typically Want SLOs for:
🔸 High-priority, user-facing APIs, like:
| API | Why it matters |
| --- | --- |
| /fetch-orders | Without this, sellers can’t fulfill |
| /assign-courier | Directly affects order shipping |
| /track-delivery | Affects customer communication |
| /login, /place-order | Blockers for platform usage |
⚠️ APIs That Usually Don’t Need SLOs:
| API | Why it doesn't need an SLO |
| --- | --- |
| /get-banner, /faq | Cosmetic, doesn't affect order flow |
| /user-feedback | Low-frequency usage |
| /help-center-content | Non-functional/auxiliary |
✅ What the Client Wants
The client only cares about high-value business operations
So they’ll say:
“Ensure 99.95% uptime for /assign-courier and order processing. I don’t care about the help page.”
🔧 What You Do as SRE
| Step | How |
| --- | --- |
| Identify critical APIs | Talk to product/client, study user flow |
| Define SLIs/SLOs | Only for APIs that affect revenue/UX |
| Monitor others lightly | Use general metrics/logging, no alerts |
| Tune alerting | Alert only on SLO-breach-worthy APIs |
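One way to keep that split explicit is a small lookup that says which endpoints are SLO-tracked and which are only lightly monitored. This is an illustrative sketch; the endpoint list comes from the examples in this conversation, while the policy fields are made-up names.

```python
# Endpoints named above as high-priority; everything else is monitored lightly.
SLO_TRACKED = {
    "/fetch-orders",
    "/assign-courier",
    "/track-delivery",
    "/login",
    "/place-order",
}


def monitoring_policy(endpoint: str) -> dict:
    """Return the monitoring treatment for an endpoint (hypothetical fields)."""
    critical = endpoint in SLO_TRACKED
    return {
        "metrics_and_logs": True,  # general metrics/logging for every endpoint
        "has_slo": critical,       # explicit SLO only for critical endpoints
        "alerting": critical,      # alerts only where an SLO can be breached
    }


print(monitoring_policy("/assign-courier"))
print(monitoring_policy("/faq"))
```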
🧠 Real-World Tip
Start with 3–5 SLOs max for the platform. More than that becomes hard to manage.
Would you like help designing a simple SLO proposal doc to send to your client listing only the APIs that should be tracked?
So the client only sets these SLOs, and the other monitoring, like system metrics and logs, is set up by the service provider without the client interfering?
Exactly ✅ — you've nailed it.
In a client-service provider relationship like yours, here's how SLOs, monitoring, and observability responsibilities are typically split:
🧭 Who Owns What?
| Responsibility | Client | You (service provider) |
| --- | --- | --- |
| Setting SLOs for critical APIs | ✅ Yes | ✅ (with your guidance) |
| System-level monitoring (CPU, memory, disk) | ❌ No | ✅ Yes |
| Internal metrics (pod health, queues, retries) | ❌ No | ✅ Yes |
| Application logs, infra logs | ❌ No | ✅ Yes |
| Uptime/downtime reporting for APIs | ✅ Interested | ✅ You provide it |
| Alerting and incident response | ❌ No | ✅ You handle it |
🔍 What Clients Usually Care About
Clients usually want outcomes, not implementation details:
✅ What they want:
“Is the /assign-courier API up 99.95% of the time?”
“Can you alert us if a delivery status isn't updated for 5 minutes?”
SLA/SLO adherence reports
❌ What they don’t care about:
“CPU load on Node 3”
“Prometheus scrape errors”
“Pod crashloop count”
⚙️ What You Set Up Behind the Scenes
As the service provider / SRE / DevOps, you maintain:
System Metrics:
Node resource usage (CPU, memory, disk)
Pod/container status
DB/query performance
Application Metrics:
API error rates (4xx/5xx)
Latency and throughput
Queue size, retry rates, etc.
Logs:
App logs (e.g., failed courier assignment reasons)
Infra logs (Nginx, DB logs, etc.)
Alerting:
Infra failure alerts (e.g., disk full, OOMKilled)
App crash/slowdown alerts
Burn rate alerts for SLO violations
You might share summary dashboards with the client, but not raw logs or system metrics (unless they ask).
🛠️ SRE Best Practice
🎯 Clients define business-level SLOs
🛡️ You enforce them using low-level metrics/logs
🧾 You give reporting or dashboards so they can verify it
✅ Real-World Summary
| Item | Owner |
| --- | --- |
| 99.9% SLO for /fetch-orders | ✅ Client defines |
| Pod restart count alert | ✅ You own it |
| Disk usage > 90% on node | ✅ You monitor |
| p95 latency alert on API | ✅ You configure |
| Incident RCA & resolution | ✅ You handle |
Let me know if you want:
A checklist for what to monitor behind the scenes
A sample monthly SLO report to send to clients
A dashboard template to separate internal vs client-facing metrics
How should I structure Grafana? Give me tree diagrams, e.g. folders and subfolders for specific SLOs and metrics, with a separate dashboard for the client and a separate one for the metrics the client isn't interested in.
What matters to the client
In most real-world scenarios:
🔍 Clients don’t actively monitor the SLOs — but they do expect visibility into them.
Let’s break that down clearly:
✅ Who Monitors vs Who Cares About the SLO?
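| Activity | You (SRE/provider) | Client |
| --- | --- | --- |
| Actively monitors 24/7 | ✅ Yes | ❌ No |
| Sets SLO targets | 🔄 Collaboratively | ✅ Yes |
| Gets alerts | ✅ Internal only | ❌ Rarely |
| Receives reports/dashboards | ✅ You provide | ✅ Yes |
| Investigates burn rate | ✅ Yes | ❌ No |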
📊 What Clients Actually Want
Clients don’t sit in Grafana all day. They expect:
SLO dashboards (read-only access)
Monthly SLO/SLA compliance reports
Summary notifications when something goes wrong (e.g. an incident report or “SLO violation occurred on /assign-courier”)
Confidence that you're monitoring it on their behalf
🧭 So What Should You Do?
✅ Your responsibility:
Proactively track, monitor, and alert on all SLOs
Share clean Grafana dashboards under 📁 Client Dashboards
Provide monthly SLO reports (PDF/export/email)
If there's an incident or burn rate spike → notify client
Act fast when error budget starts burning
🧠 Real-World Examples
| Scenario | Client | You |
| --- | --- | --- |
| SLO burn rate spikes for /assign-courier | ❌ Doesn’t notice | ✅ You get alerted & mitigate it |
| Monthly report shows 99.85% (SLO was 99.9%) | ✅ Client asks “why?” | ✅ You provide RCA or context |
| Tracking API p95 latency degrading gradually | ❌ Client unaware | ✅ You create alert, investigate |
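For the monthly report case, it helps to express the miss as a share of the error budget rather than as a raw percentage. A minimal sketch (the function name is illustrative):

```python
def budget_consumed(actual_percent: float, target_percent: float) -> float:
    """Share of the error budget used; values above 1.0 mean the SLO was missed."""
    allowed_failure = 100.0 - target_percent  # e.g. 0.1 for a 99.9% SLO
    actual_failure = 100.0 - actual_percent   # e.g. 0.15 for 99.85% measured
    return actual_failure / allowed_failure


# 99.85% measured against a 99.9% target: about 1.5x the budget was spent.
print(round(budget_consumed(99.85, 99.9), 2))  # 1.5
```

Reporting "150% of the error budget used" alongside the RCA gives the client a clearer sense of how large the miss was than "0.05 percentage points below target".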
✅ Summary
| Question | Answer |
| --- | --- |
| Does the client monitor SLOs? | ❌ No, but they expect visibility |
| Who handles real-time monitoring? | ✅ You (SRE/dev team) |
| Should the client get dashboards? | ✅ Yes (read-only, clean view) |
| Should the client get reports? | ✅ Yes (monthly or on-demand) |
Would you like a monthly SLO report template (in Markdown or Google Docs format) that you can fill out and send to clients?