SLO matter for them

SLOs set by them

Suppose we are building and maintaining a logistics platform for a client. The platform lists orders fetched from various e-commerce sites (e.g., Shopify). The seller chooses an order, assigns it to a specific courier partner from the logistics app, and monitors the delivery status.

So here, who will set the SLOs, and which SLOs would they set?

Great scenario — you're building and maintaining a logistics platform for a client, where the flow involves:

  1. Fetching orders from e-commerce sites (e.g., Shopify)

  2. Allowing sellers to assign couriers

  3. Tracking delivery status

Now let’s break it down:


🎯 Who Sets the SLOs?

| Stakeholder | What They Care About | Do They Set SLOs? |
|---|---|---|
| Your client (the platform owner) | Wants reliable service for their sellers and integrations | ✅ Yes |
| You (DevOps/SRE provider) | Ensures backend APIs/services meet those expectations | ✅ Yes (internal targets) |
| E-commerce sites (Shopify) | Provide data but usually don't set SLOs for your platform | ❌ No |
| Sellers (using the platform) | Expect responsive and reliable UI/API | ❌ Indirectly (through client expectations) |


🧩 What SLOs Could Be Set and By Whom?

1. Order Fetching API

  • Who sets it: Your client

  • SLO example:

    • 99.9% of /fetch-orders responses succeed within 2 seconds

    • Uptime: 99.9%, i.e. < 43.2 minutes of downtime per 30-day month

Why: If order data isn't fetched reliably, sellers can’t fulfill or track orders.
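The downtime figure above follows directly from the availability target. Here is a minimal sketch of that arithmetic, assuming a 30-day month; the function name `allowed_downtime_minutes` is made up for illustration and is not part of any tool.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability SLO.

    Assumes the window is a fixed number of days (30 by default).
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
print(round(allowed_downtime_minutes(99.95), 1))  # 21.6
```

The same helper reproduces the 21.6-minute figure for a 99.95% target, so the numbers can be re-derived if the client picks a different window or target.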


2. Courier Assignment API

  • Who sets it: Your client

  • SLO example:

    • 99.95% success rate for /assign-courier

    • p95 latency < 300ms

Why: This is mission-critical. A failed courier assignment breaks fulfillment.
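To make the two targets concrete, here is a rough sketch of how a batch of /assign-courier requests could be checked against them. It assumes each request is recorded as a `(succeeded, latency_ms)` tuple and uses a nearest-rank percentile. The function names and the "no traffic counts as compliant" rule are illustrative assumptions, not something the client or a monitoring tool dictates.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_assign_courier_slo(requests, success_target=0.9995, p95_target_ms=300):
    """requests: list of (succeeded: bool, latency_ms: float) tuples."""
    if not requests:
        return True  # assumption: no traffic means nothing was violated
    success_ratio = sum(1 for ok, _ in requests if ok) / len(requests)
    latency_ok = p95([ms for _, ms in requests]) < p95_target_ms
    return success_ratio >= success_target and latency_ok
```

In practice these numbers would usually come from your metrics backend rather than raw request lists; the sketch only shows what the SLO statement means numerically.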


3. Delivery Tracking / Status Updates

  • Who sets it: Your client

  • SLO example:

    • 99.9% of /track-delivery responses return latest status within 1 second

    • Updates should not lag more than 5 minutes in >99.5% of cases (if async)

Why: Sellers need real-time visibility for customer service & notifications.
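The "lag" target is a freshness SLI: the share of status updates that arrive within the allowed delay. A small sketch of that calculation follows, assuming you can measure the lag of each update in seconds; the names `freshness_sli` and `meets_freshness_slo` are hypothetical.

```python
def freshness_sli(lags_seconds, max_lag_seconds=300):
    """Fraction of status updates that arrived within the allowed lag."""
    if not lags_seconds:
        return 1.0  # assumption: no updates means nothing was stale
    fresh = sum(1 for lag in lags_seconds if lag <= max_lag_seconds)
    return fresh / len(lags_seconds)


def meets_freshness_slo(lags_seconds, target=0.995):
    """True when more than `target` of updates were within 5 minutes."""
    return freshness_sli(lags_seconds) > target
```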


4. Overall Platform Availability (Frontend/API Gateway)

  • Who sets it: Your client (or jointly with your SRE team)

  • SLO example:

    • 99.95% availability of dashboard or core APIs

    • i.e., platform available 24x7 with <21.6 minutes of downtime/month


📈 Bonus: SLOs Based on User Journeys

These combine multiple API calls into a business-critical flow.

Example: "Assign and Track Order" Flow

  • SLO:

    99.9% of sellers should be able to fetch order → assign courier → get tracking info within 60 seconds total

Why: This reflects real user experience, not just API uptime.
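A journey SLI counts a whole flow as good only if every step succeeded and the total time stayed under the limit. The sketch below assumes you can stitch the three calls together per seller session (for example, via a trace or correlation ID). The `Journey` record and its field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Journey:
    fetch_s: float       # time spent fetching the order
    assign_s: float      # time spent assigning the courier
    track_s: float       # time until tracking info was returned
    all_steps_ok: bool   # True only if none of the three calls failed


def journey_is_good(journey, limit_s=60.0):
    """A journey is good if it succeeded and took at most `limit_s` in total."""
    total = journey.fetch_s + journey.assign_s + journey.track_s
    return journey.all_steps_ok and total <= limit_s


def journey_sli(journeys):
    """Fraction of good journeys; compare against the 99.9% target."""
    if not journeys:
        return 1.0
    return sum(1 for j in journeys if journey_is_good(j)) / len(journeys)
```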


🧠 Practical Setup

As the SRE/DevOps team, you'd:

  1. Define SLIs (availability, latency, error rate) per API

  2. Set SLOs with the client based on:

    • Criticality

    • Business needs

    • Technical feasibility

  3. Monitor error budgets and burn rates, and alert on them, using Prometheus + Grafana
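Burn rate is the observed error rate divided by the error rate the SLO allows, so 1.0 means the budget would last exactly the full window. Here is a minimal sketch of that arithmetic in Python. In a real setup this would normally be expressed as Prometheus rules, and the function name is just an illustration.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    slo_target is a fraction, e.g. 0.999 for 99.9%.
    A result of 1.0 means errors arrive exactly at the budgeted rate.
    """
    if total_events == 0:
        return 0.0  # assumption: no traffic burns no budget
    error_rate = bad_events / total_events
    budget = 1.0 - slo_target
    return error_rate / budget


# 5 failures in 1000 requests against a 99.9% SLO burns budget about 5x too fast.
print(round(burn_rate(5, 1000, 0.999), 2))  # 5.0
```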


✅ Final Summary

| Component | Who Sets the SLO | Example SLO |
|---|---|---|
| /fetch-orders | Client | 99.9% success, <2s latency |
| /assign-courier | Client | 99.95% success, <300ms p95 latency |
| /track-delivery | Client | 99.9% success, <1s latency |
| Platform frontend | Client & You | 99.95% uptime |
| User journey (end-to-end) | You suggest → client approves | 99.9% success end-to-end flow in <60s |


Would you like help building a sample SLO document to present to your client, or define SLIs + Prometheus rules for this?

Exactly ✅ — you’ve got it right.

In real-world SRE practice, including your logistics platform scenario:


🎯 Only Critical APIs Get Explicit SLOs

✔️ Yes, you set SLOs only for APIs that matter most to:

  • The business

  • Client success

  • User experience


🧩 Why?

Monitoring and enforcing SLOs for every single API is:

  • Overkill for low-impact endpoints

  • Noise-heavy (you’ll get alerts for things that don’t affect users)

  • Resource-intensive (observability cost & effort)


✅ So Yes, Clients Typically Want SLOs for:

🔸 High-priority, user-facing APIs, like:

| API Endpoint | Reason it's Critical |
|---|---|
| /fetch-orders | Without this, sellers can’t fulfill |
| /assign-courier | Directly affects order shipping |
| /track-delivery | Affects customer communication |
| /login, /place-order | Blockers for platform usage |


⚠️ APIs That Usually Don’t Need SLOs:

| API Endpoint | Why not critical |
|---|---|
| /get-banner, /faq | Cosmetic, doesn't affect order flow |
| /user-feedback | Low frequency usage |
| /help-center-content | Non-functional/auxiliary |


✅ What the Client Wants

  • The client only cares about high-value business operations

  • So they’ll say:

    “Ensure 99.95% uptime for /assign-courier and order processing. I don’t care about the help page.”


🔧 What You Do as SRE

| Task | Action |
|---|---|
| Identify critical APIs | Talk to product/client, study user flow |
| Define SLIs/SLOs | Only for APIs that affect revenue/UX |
| Monitor others lightly | Use general metrics/logging, no alerts |
| Tune alerting | Alert only on SLO-breach-worthy APIs |


🧠 Real-World Tip

Start with 3–5 SLOs max for the platform. More than that becomes hard to manage.
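One simple way to encode the split between SLO-tracked and lightly monitored endpoints is a small lookup that your alert routing consults. This is only a sketch: the tier names `"slo"` and `"light"` and the helper functions are assumptions, and the endpoint set is just the critical APIs listed earlier.

```python
# Endpoints that get explicit SLOs (taken from the critical-API list above).
CRITICAL_ENDPOINTS = {
    "/fetch-orders",
    "/assign-courier",
    "/track-delivery",
    "/login",
    "/place-order",
}


def monitoring_tier(endpoint: str) -> str:
    """Return "slo" for critical endpoints, "light" for everything else."""
    return "slo" if endpoint in CRITICAL_ENDPOINTS else "light"


def should_page(endpoint: str, slo_breached: bool) -> bool:
    """Page on-call only when a critical endpoint is breaching its SLO."""
    return monitoring_tier(endpoint) == "slo" and slo_breached
```

Keeping the critical list short and explicit makes it easy to review with the client, and everything outside it still gets general metrics and logs without generating alerts.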


Would you like help designing a simple SLO proposal doc to send to your client listing only the APIs that should be tracked?

So the client only sets these SLOs, and the other monitoring, like system metrics and logs, is set up by the service provider without the client interfering?

Exactly ✅ — you've nailed it.

In a client-service provider relationship like yours, here's how SLOs, monitoring, and observability responsibilities are typically split:


🧭 Who Owns What?

| Responsibility | Owned by Client? | Owned by You (Service Provider)? |
|---|---|---|
| Setting SLOs for critical APIs | ✅ Yes | ✅ (with your guidance) |
| System-level monitoring (CPU, memory, disk) | ❌ No | ✅ Yes |
| Internal metrics (pod health, queues, retries) | ❌ No | ✅ Yes |
| Application logs, infra logs | ❌ No | ✅ Yes |
| Uptime/downtime reporting for APIs | ✅ Interested | ✅ You provide it |
| Alerting and incident response | ❌ No | ✅ You handle it |


🔍 What Clients Usually Care About

Clients usually want outcomes, not implementation details:

✅ What they want:

  • “Is the /assign-courier API up 99.95% of the time?”

  • “Can you alert us if a delivery status isn't updated for 5 minutes?”

  • SLA/SLO adherence reports

❌ What they don’t care about:

  • “CPU load on Node 3”

  • “Prometheus scrape errors”

  • “Pod crashloop count”


⚙️ What You Set Up Behind the Scenes

As the service provider / SRE / DevOps, you maintain:

System Metrics:

  • Node resource usage (CPU, memory, disk)

  • Pod/container status

  • DB/query performance

Application Metrics:

  • API error rates (4xx/5xx)

  • Latency and throughput

  • Queue size, retry rates, etc.

Logs:

  • App logs (e.g., failed courier assignment reasons)

  • Infra logs (Nginx, DB logs, etc.)

Alerting:

  • Infra failure alerts (e.g., disk full, OOMKilled)

  • App crash/slowdown alerts

  • Burn rate alerts for SLO violations

You might share summary dashboards with the client, but not raw logs or system metrics (unless they ask).


🛠️ SRE Best Practice

  • 🎯 Clients define business-level SLOs

  • 🛡️ You enforce them using low-level metrics/logs

  • 🧾 You give reporting or dashboards so they can verify it


✅ Real-World Summary

| Thing | Set/Owned by |
|---|---|
| 99.9% SLO for /fetch-orders | ✅ Client defines |
| Pod restart count alert | ✅ You own it |
| Disk usage > 90% on node | ✅ You monitor |
| p95 latency alert on API | ✅ You configure |
| Incident RCA & resolution | ✅ You handle |


Let me know if you want:

  • A checklist for what to monitor behind the scenes

  • A sample monthly SLO report to send to clients

  • A dashboard template to separate internal vs client-facing metrics

How should I structure Grafana? Give me tree diagrams, e.g. folders and subfolders for specific SLOs and metrics, with a separate dashboard for the client and separate ones for the things the client isn't interested in.

What matters for them

In most real-world scenarios:

🔍 Clients don’t actively monitor the SLOs — but they do expect visibility into them.

Let’s break that down clearly:


✅ Who Monitors vs Who Cares About the SLO?

| Action | SRE/Service Provider | Client (Platform Owner) |
|---|---|---|
| Actively monitors 24/7 | ✅ Yes | ❌ No |
| Sets SLO targets | 🔄 Collaboratively | ✅ Yes |
| Gets alerts | ✅ Internal only | ❌ Rarely |
| Receives reports/dashboards | ✅ You provide | ✅ Yes |
| Investigates burn rate | ✅ Yes | ❌ No |


📊 What Clients Actually Want

Clients don’t sit in Grafana all day. They expect:

  1. SLO dashboards (read-only access)

  2. Monthly SLO/SLA compliance reports

  3. Summary notifications when something goes wrong (e.g. incident report or “SLO violation occurred on /assign-courier”)

  4. Confidence that you're monitoring it on their behalf


🧭 So What Should You Do?

✅ Your responsibility:

  • Proactively track, monitor, and alert on all SLOs

  • Share clean Grafana dashboards under 📁 Client Dashboards

  • Provide monthly SLO reports (PDF/export/email)

  • If there's an incident or burn rate spike → notify client

  • Act fast when error budget starts burning


🧠 Real-World Examples

| Situation | What Client Does | What You Do |
|---|---|---|
| SLO burn rate spikes for /assign-courier | ❌ Doesn’t notice | ✅ You get alerted & mitigate it |
| Monthly report shows 99.85% (SLO was 99.9%) | ✅ Client asks “why?” | ✅ You provide RCA or context |
| Tracking API p95 latency degrading gradually | ❌ Client unaware | ✅ You create alert, investigate |
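For the monthly-report case, it helps to state how much of the error budget was used, not just the raw percentage. Here is a small sketch of that calculation; the function name `budget_consumed` is hypothetical, and both inputs are percentages over the same reporting window.

```python
def budget_consumed(measured_percent: float, slo_percent: float) -> float:
    """Fraction of the error budget used; values above 1.0 mean the SLO was missed."""
    budget = 100.0 - slo_percent      # allowed failure percentage
    used = 100.0 - measured_percent   # observed failure percentage
    return used / budget


# 99.85% measured against a 99.9% SLO used about 1.5x the budget.
print(round(budget_consumed(99.85, 99.9), 2))  # 1.5
```

Putting "150% of error budget consumed" next to the 99.85% figure gives the client a clearer sense of how far off the month was and frames the RCA discussion.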


✅ Summary

| Question | Answer |
|---|---|
| Does the client monitor SLOs? | No, but they expect visibility |
| Who handles real-time monitoring? | ✅ You (SRE/dev team) |
| Should the client get dashboards? | ✅ Yes (read-only, clean view) |
| Should the client get reports? | ✅ Yes (monthly or on-demand) |


Would you like a monthly SLO report template (in Markdown or Google Docs format) that you can fill out and send to clients?
