Grafana

Tree structure

That's a great approach. Structuring Grafana dashboards like a tree with folders and subfolders helps you:

  • Keep things clean ✅

  • Separate internal vs client-facing views ✅

  • Make ownership & access control easier ✅


🎯 Goal: Separate Grafana Structure for SLOs & Metrics

You want to:

  • Create dedicated dashboards for client-facing SLOs

  • Keep infra/internal dashboards for your SRE/dev team

  • Organize everything into folders & subfolders


Here’s a clean structure to follow:

📁 Logistics Platform Monitoring
├── 📁 Client Dashboards
│   ├── 📊 SLO - Order API
│   ├── 📊 SLO - Assign Courier API
│   ├── 📊 SLO - Track Delivery API
│   └── 📊 Summary SLO Monthly Report

├── 📁 Internal Dashboards
│   ├── 📁 API Metrics
│   │   ├── 📊 All API Latency
│   │   ├── 📊 All API Error Rate
│   │   └── 📊 HTTP Status Overview
│   ├── 📁 System Metrics
│   │   ├── 📊 Kubernetes Nodes & Pods
│   │   ├── 📊 CPU/Memory/Disk Usage
│   │   └── 📊 DB Performance
│   ├── 📁 Logs & Alerts
│   │   ├── 📊 Loki Log Search
│   │   └── 📊 Alerting Burn Rate
│   └── 📁 Dev Only / Experiments
│       ├── 📊 Dev API Testing
│       └── 📊 Feature Flags & AB Tests
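
If you want this layout to be reproducible rather than clicked together by hand, one option is to describe the folder tree as data and derive the folder-creation requests from it. The sketch below is only an illustration under stated assumptions: it builds request bodies and sends nothing, it assumes Grafana's folder HTTP API (`POST /api/folders` with `uid`, `title`, and `parentUid` for nested folders), and the `slugify` helper and the title-based uid scheme are hypothetical choices, not something Grafana prescribes.

```python
import re

# Folder tree from the layout above; dashboards are omitted, only folders are listed.
FOLDER_TREE = {
    "Logistics Platform Monitoring": {
        "Client Dashboards": {},
        "Internal Dashboards": {
            "API Metrics": {},
            "System Metrics": {},
            "Logs & Alerts": {},
            "Dev Only / Experiments": {},
        },
    }
}

MAX_UID_LENGTH = 40  # Grafana limits uids to 40 characters


def slugify(title):
    """Hypothetical uid scheme: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def folder_payloads(tree, parent_uid=None):
    """Yield one request body per folder, parents before their children."""
    for title, children in tree.items():
        uid = slugify(title)  # assumes folder titles are unique across the tree
        if len(uid) > MAX_UID_LENGTH:
            raise ValueError(f"uid too long for folder {title!r}")
        body = {"uid": uid, "title": title}
        if parent_uid is not None:
            # parentUid only has an effect when nested folders are available
            body["parentUid"] = parent_uid
        yield body
        yield from folder_payloads(children, parent_uid=uid)


payloads = list(folder_payloads(FOLDER_TREE))
for body in payloads:
    print(body)
```

Because parents are emitted before children, the bodies can be posted in order. On a Grafana version without nested folders you would drop `parentUid` and encode the hierarchy in the folder titles instead.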

✅ How This Helps

| Folder | Purpose | Shared With Client? |
| --- | --- | --- |
| Client Dashboards | Shows only SLOs and SLIs relevant to them | ✅ Yes |
| API Metrics | All internal API metrics grouped by route | ❌ No |
| System Metrics | Node, pod, infra-level metrics | ❌ No |
| Logs & Alerts | Used for incident debugging/log search | ❌ No |
| Dev Only | Dashboards used by dev/test engineers | ❌ No |


🔐 Bonus: Grafana Permissions Setup

Grafana supports folder-level access control.

| Folder | Who Has Access |
| --- | --- |
| Client Dashboards | ✅ Clients, ✅ SRE team |
| Internal Dashboards | ✅ SRE only |
| Dev Only / Experiments | ✅ Devs, ✅ SRE |
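
The access matrix above can also be kept as data and turned into permission requests. This is a minimal sketch, assuming Grafana's folder permissions API (`POST /api/folders/<uid>/permissions`, which replaces the folder's existing entries) and its numeric levels (1 = View, 2 = Edit, 4 = Admin). The team ids are hypothetical, the developer row is mapped to the "Dev Only / Experiments" folder from the tree, and the choice of View for clients, Edit for devs, and Admin for SRE is an assumption beyond "who has access".

```python
# Grafana's numeric permission levels for folders and dashboards.
VIEW, EDIT, ADMIN = 1, 2, 4

# Hypothetical team ids; look up the real ones in your Grafana instance.
TEAM_IDS = {"clients": 10, "sre": 20, "devs": 30}

# Access matrix from the table above: folder -> {team: permission level}.
ACCESS = {
    "Client Dashboards": {"clients": VIEW, "sre": ADMIN},
    "Internal Dashboards": {"sre": ADMIN},
    "Dev Only / Experiments": {"devs": EDIT, "sre": ADMIN},
}


def permission_body(folder):
    """Build the request body that sets the permissions of one folder."""
    return {
        "items": [
            {"teamId": TEAM_IDS[team], "permission": level}
            for team, level in ACCESS[folder].items()
        ]
    }


for folder in ACCESS:
    print(folder, permission_body(folder))
```

Granting access per team rather than per user keeps the matrix short and means onboarding a new client contact is a team-membership change, not a dashboard change.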


🛠️ How to Create This in Grafana

  1. Go to "Dashboards" > "New Folder"

  2. Create top-level folders:

    • Client Dashboards

    • Internal Dashboards

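  3. Inside each, create dashboards and organize them with subfolders or tags. Nested folders are built into recent Grafana releases, including the open-source edition; on older versions, fall back to name prefixes and tags.

  4. Apply permissions on each folder from that folder's own permissions settings (the exact menu location varies by Grafana version).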

  5. Use template variables like $api, $status, $client to reuse panels across dashboards.


Add dashboard tags to help you search and filter:

  • SLO, critical, non-critical, infra, client-visible, latency, error-rate
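
To make the template variables and tags concrete, here is a rough sketch of how they appear in a dashboard's JSON model. It is trimmed on purpose: a real export carries more fields (datasource, refresh settings, panels). The label names `route` and `status` and the metric `http_requests_total` are assumptions about your instrumentation, so swap in your own.

```python
import json


def query_variable(name, label, metric="http_requests_total"):
    """A query-type template variable populated from a Prometheus label."""
    return {
        "name": name,
        "type": "query",
        "query": f"label_values({metric}, {label})",
        "includeAll": True,  # adds an "All" option to the dropdown
        "multi": True,       # allows selecting several values at once
    }


dashboard = {
    "title": "All API Latency",
    "tags": ["latency", "infra"],  # taken from the tag list above
    "templating": {
        "list": [
            query_variable("api", "route"),
            query_variable("status", "status"),
        ]
    },
    "panels": [],  # panels would reference $api and $status in their queries
}

print(json.dumps(dashboard, indent=2))
```

A panel query such as `rate(http_requests_total{route=~"$api"}[5m])` then works unchanged on every dashboard that defines the same variable.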


✅ Summary Diagram (Text Tree)

📁 Logistics Platform Monitoring
├── 📁 Client Dashboards
│   ├── SLO - Order API
│   ├── SLO - Assign Courier API
│   ├── SLO - Track Delivery API
│   └── Summary SLO Monthly Report

├── 📁 Internal Dashboards
│   ├── 📁 API Metrics
│   ├── 📁 System Metrics
│   ├── 📁 Logs & Alerts
│   └── 📁 Dev Only

Would you like me to generate a real Grafana dashboard JSON for one of these (e.g., SLO for /assign-courier) or a permission setup guide?

Great! For your 📁 Client Dashboards structure in Grafana, each dashboard (e.g., SLO - Order API, SLO - Assign Courier API, etc.) should show clear, client-relevant charts that reflect the health, performance, and reliability of each API. Here’s a detailed list of suggested charts for each:


📊 SLO - Order API Dashboard

🔸 Purpose: Show reliability of /fetch-orders and similar endpoints

| Chart Title | Type | PromQL / Notes |
| --- | --- | --- |
| ✅ SLI: Success Rate (2xx Responses) | Stat or Line | success / total for /fetch-orders |
| ⏱️ Latency - p50, p90, p95 | Line chart | histogram_quantile() from duration bucket |
| ❌ Error Rate (4xx/5xx) | Bar or Area | Grouped by status code |
| 📊 Total Requests | Line chart | Shows usage volume over time |
| 📉 Availability Trend (Daily % Success) | Line chart | Over time view of SLO vs actual |
| 🔥 Burn Rate | Single stat or graph | Error budget burn rate (short and long window) |
| 📆 Error Budget Remaining | Single stat | How much budget left (in time or %) |
| 📍 Request Breakdown by Status Code | Pie chart | e.g., 90% 2xx, 8% 4xx, 2% 5xx |
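
The first few rows of the table translate fairly directly into PromQL. The helper below is a sketch that just assembles query strings, so you can see the shape of each expression. It assumes the metric names `http_requests_total` and `request_duration_seconds_bucket` and the labels `route` and `status`; your actual names may differ.

```python
WINDOW = "5m"  # rate() window; an assumption, tune it to your scrape interval


def success_rate(route, window=WINDOW):
    """SLI: share of 2xx responses among all requests for one route."""
    good = f'sum(rate(http_requests_total{{route="{route}",status=~"2.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{route="{route}"}}[{window}]))'
    return f"{good} / {total}"


def latency_quantile(route, q, window=WINDOW):
    """Latency percentile estimated from histogram buckets."""
    buckets = f'sum by (le) (rate(request_duration_seconds_bucket{{route="{route}"}}[{window}]))'
    return f"histogram_quantile({q}, {buckets})"


def requests_by_status(route, window=WINDOW):
    """Request rate split by status code, for the error-rate and breakdown panels."""
    return f'sum by (status) (rate(http_requests_total{{route="{route}"}}[{window}]))'


for query in (
    success_rate("/fetch-orders"),
    latency_quantile("/fetch-orders", 0.95),
    requests_by_status("/fetch-orders"),
):
    print(query)
```

In a real dashboard you would usually replace the hard-coded route with a template variable such as `$api` so one panel serves every endpoint.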


📊 SLO - Assign Courier API Dashboard

🔸 Purpose: Monitor /assign-courier reliability and speed

| Chart Title | Type | Notes |
| --- | --- | --- |
| ✅ Success Rate | Stat | Track 2xx or 2xx+3xx / total |
| ❌ 5xx Error Spike Detection | Line/Area | Alerting threshold here |
| ⏱️ p95 Latency | Line chart | Critical for courier assignment responsiveness |
| 🔄 Retry Rate | Line chart | Only if client/app retries failed calls |
| 🧭 Total Assignments | Stat | Count of successful courier assignments |
| 🔥 Burn Rate (Short + Long window) | Line or Stat | For SLO enforcement |
| 📅 Assignment Failures Over Time | Bar chart | Count of failed attempts per day/hour |
| 🔁 Time to First Success (if async logic) | Line chart | Optional if applicable |
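
Since the burn-rate panel uses a short and a long window, it may help to see the arithmetic behind it. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a value of 1 means the budget runs out exactly at the end of the SLO period. The sketch below assumes a 99.9% target and a threshold of 14.4, which corresponds to spending 2% of a 30-day budget in one hour; both numbers and the sample counts are illustrative, not taken from your system.

```python
def burn_rate(errors, total, slo):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic means no budget was spent
    return (errors / total) / (1.0 - slo)


def should_alert(short_window, long_window, slo, threshold=14.4):
    """Fire only when both windows burn faster than the threshold.

    Each window is an (errors, total) pair. The long window shows the
    problem is significant; the short window shows it is still happening.
    """
    return (
        burn_rate(*short_window, slo) > threshold
        and burn_rate(*long_window, slo) > threshold
    )


SLO = 0.999
last_5m = (30, 1_000)    # 3% errors in the short window
last_1h = (200, 12_000)  # about 1.67% errors in the long window

print("short burn:", round(burn_rate(*last_5m, SLO), 2))
print("long burn:", round(burn_rate(*last_1h, SLO), 2))
print("alert:", should_alert(last_5m, last_1h, SLO))
```

Requiring both windows avoids paging on a brief spike that has already recovered, and it also lets the alert clear quickly once the short window is healthy again.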


📊 SLO - Track Delivery API Dashboard

🔸 Purpose: Visibility into /track-delivery updates

| Chart Title | Type | Notes |
| --- | --- | --- |
| ✅ SLI: Successful Tracking Updates | Stat | Based on 2xx success |
| 🕐 Latency for Tracking Response | Line chart | p95 or p99 |
| 🔁 Update Delay | Line chart | If async, track lag between shipment & status update |
| 📉 Availability Over Time | Line | Hourly/daily SLI trend |
| 🚨 5xx Error Rate | Line/Bar | Should be nearly 0 |
| 🧭 Track Requests per Courier Partner | Bar chart | Usage split, if needed by client |
| 📍 Track Status by Region (Optional) | Table/Map | For location-based delivery monitoring |
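
When you show p95 or p99 latency to clients, it is worth knowing that percentiles computed from Prometheus histograms are estimates: the value is interpolated linearly inside whichever bucket contains the target rank. The function below is a simplified sketch of that idea, not Prometheus's actual implementation, and the bucket bounds and counts are made-up sample data.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    bound, with +Inf as the last upper bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Made-up sample: 100 requests spread across five buckets.
sample = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]

for q in (0.5, 0.9, 0.95):
    print(f"p{round(q * 100)}: {estimate_quantile(q, sample):.3f}s")
```

The practical takeaway is that bucket boundaries bound the accuracy: if the latency target is 300 ms, having a bucket edge at or near 0.3 makes the reported percentile far more trustworthy around that value.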


📊 Summary SLO Monthly Report Dashboard

🔸 Purpose: Give clients a high-level overview of all APIs

| Chart Title | Type | Notes |
| --- | --- | --- |
| ✅ Overall Platform Availability | Stat | Average across critical APIs |
| 📊 API-Wise Success Rate | Table/Bar | /fetch-orders, /assign-courier, /track-delivery |
| 📆 Error Budget Used This Month | Stat | % of budget consumed for each API |
| 🔥 Burn Rate Summary | Line/Table | Highlights any risk of SLO violation |
| ⏱️ Monthly p95 Latency by API | Table | Performance snapshot |
| ❌ Incident Count (if tracked) | Stat | Optional based on integrated alert system |
| 📍 SLA Breach Timeline (if any) | Line chart | When and how the error budget was exceeded |
| 🗓️ Monthly API Usage | Line/Bar | For visibility on volume trends |
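
For the "Error Budget Used This Month" stat, the underlying arithmetic is simple enough to sanity-check by hand. Here is a small worked example, assuming a request-based 99.9% SLO; the request and failure counts are invented for illustration.

```python
def error_budget_report(slo, total_requests, failed_requests):
    """Summarize a month of traffic against a request-based SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, e.g. 0.999")
    allowed_failures = (1.0 - slo) * total_requests
    used_pct = 100.0 * failed_requests / allowed_failures if allowed_failures else 0.0
    return {
        "availability_pct": 100.0 * (1 - failed_requests / total_requests) if total_requests else 100.0,
        "allowed_failures": allowed_failures,
        "budget_used_pct": used_pct,
        "budget_remaining_pct": 100.0 - used_pct,
    }


def downtime_budget_minutes(slo, days=30):
    """The same budget expressed as minutes of full outage per period."""
    return (1.0 - slo) * days * 24 * 60


report = error_budget_report(slo=0.999, total_requests=2_000_000, failed_requests=1_200)
for key, value in report.items():
    print(f"{key}: {value:.2f}")
print(f"downtime budget: {downtime_budget_minutes(0.999):.1f} minutes per 30 days")
```

With these sample numbers, 2,000,000 requests at 99.9% allow about 2,000 failures, so 1,200 failures means roughly 60% of the budget is spent. Showing the budget in both percent and minutes tends to be easier for clients to relate to than the raw ratio.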


✅ Key Notes for Client Dashboards

  • Keep dashboards read-only for clients.

  • Use clear, plain-language panel titles and keep PromQL out of them.

  • Include tooltips/notes to explain what each chart means.

  • Highlight SLO target lines visually (e.g., 99.9%) in graphs.

  • Use variables like $api, $status_code, $region for filtering if needed.
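
As an example of the target-line and tooltip points, this is roughly what a time series panel with a visible SLO line looks like in dashboard JSON. Treat it as a sketch: the panel is trimmed to the relevant fields, and the query, metric name, and label names are assumptions about your setup.

```python
import json


def slo_panel(title, expr, slo_target_pct, description):
    """A time series panel that draws the SLO target as a threshold line."""
    return {
        "type": "timeseries",
        "title": title,
        "description": description,  # shown as the panel's info tooltip
        "targets": [{"expr": expr, "refId": "A"}],
        "fieldConfig": {
            "defaults": {
                "unit": "percent",
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "red", "value": None},             # base step: below target
                        {"color": "green", "value": slo_target_pct}, # at or above target
                    ],
                },
                # Render the thresholds as a horizontal line on the graph.
                "custom": {"thresholdsStyle": {"mode": "line"}},
            }
        },
    }


panel = slo_panel(
    title="Order API availability",
    expr='100 * sum(rate(http_requests_total{route="$api",status=~"2.."}[5m]))'
         ' / sum(rate(http_requests_total{route="$api"}[5m]))',
    slo_target_pct=99.9,
    description="Percentage of successful requests. The line marks the 99.9% target.",
)

print(json.dumps(panel, indent=2))
```

The title and description stay in plain language while the PromQL lives only in the query, which matches the naming advice above.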


Would you like:

  • A dashboard JSON template for one of these?

  • Or PromQL queries tailored to your actual metric names like http_requests_total, request_duration_seconds_bucket, etc.?
