Grafana
Tree structure
That's a great approach: structuring Grafana dashboards as a tree of folders and subfolders helps you:
Keep things clean ✅
Separate internal vs client-facing views ✅
Make ownership & access control easier ✅
🎯 Goal: Separate Grafana Structure for SLOs & Metrics
You want to:
Create dedicated dashboards for client-facing SLOs
Keep infra/internal dashboards for your SRE/dev team
Organize everything into folders & subfolders
🌳 Recommended Grafana Folder Tree
Here’s a clean structure to follow:
📁 Logistics Platform Monitoring
├── 📁 Client Dashboards
│   ├── 📊 SLO - Order API
│   ├── 📊 SLO - Courier Assignment API
│   ├── 📊 SLO - Tracking API
│   └── 📊 Summary - Monthly SLO Report
│
└── 📁 Internal Dashboards
    ├── 📁 API Metrics
    │   ├── 📊 All API Latency
    │   ├── 📊 All API Error Rate
    │   └── 📊 HTTP Status Overview
    ├── 📁 System Metrics
    │   ├── 📊 Kubernetes Nodes & Pods
    │   ├── 📊 CPU/Memory/Disk Usage
    │   └── 📊 DB Performance
    ├── 📁 Logs & Alerts
    │   ├── 📊 Loki Log Search
    │   └── 📊 Alerting Burn Rate
    └── 📁 Dev Only / Experiments
        ├── 📊 Dev API Testing
        └── 📊 Feature Flags & A/B Tests
✅ How This Helps
| Folder | Purpose | Client-visible? |
|---|---|---|
| Client Dashboards | Shows only the SLOs and SLIs relevant to clients | ✅ Yes |
| API Metrics | All internal API metrics, grouped by route | ❌ No |
| System Metrics | Node, pod, and infra-level metrics | ❌ No |
| Logs & Alerts | Used for incident debugging and log search | ❌ No |
| Dev Only | Dashboards used by dev/test engineers | ❌ No |
🔐 Bonus: Grafana Permissions Setup
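Grafana supports folder-level access control: dashboards inherit the permissions of the folder they live in.
| Folder | Who can access |
|---|---|
| Client Dashboards | ✅ Clients, ✅ SRE team |
| Internal Dashboards | ✅ SRE only |
| Dev Dashboards | ✅ Devs, ✅ SRE |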
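A minimal sketch of how the folder access list above could be turned into request bodies for Grafana's folder permissions HTTP API (`POST /api/folders/<uid>/permissions`, where 1 = View, 2 = Edit, 4 = Admin). The team IDs, and the choice of Admin for SRE and Edit for devs, are assumptions; look up your real team IDs and adjust the levels.

```python
import json

# Permission levels used by Grafana's folder permissions API.
VIEW, EDIT, ADMIN = 1, 2, 4

# Hypothetical team IDs; replace with the IDs of your real Grafana teams.
TEAMS = {"clients": 3, "devs": 2, "sre": 1}

# Folder -> {team: level}. Levels are assumptions: clients stay read-only,
# devs can edit their own dashboards, SRE administers everything.
ACCESS = {
    "Client Dashboards": {"clients": VIEW, "sre": ADMIN},
    "Internal Dashboards": {"sre": ADMIN},
    "Dev Dashboards": {"devs": EDIT, "sre": ADMIN},
}


def permission_payload(folder: str) -> dict:
    """Build the JSON body for POST /api/folders/<uid>/permissions.

    The endpoint replaces the folder's existing permission list, so every
    team that should keep access must be included.
    """
    items = [
        {"teamId": TEAMS[team], "permission": level}
        for team, level in sorted(ACCESS[folder].items())
    ]
    return {"items": items}


if __name__ == "__main__":
    for name in ACCESS:
        print(name, json.dumps(permission_payload(name)))
```

Because each body lists only team entries, any role-based grants not included (for example a default Viewer role) are dropped when the request is applied, which is what keeps Internal Dashboards SRE-only.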
🛠️ How to Create This in Grafana
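Go to Dashboards > New > New folder.
Create the top-level folders:
Client Dashboards
Internal Dashboards
Inside each, create dashboards and organize them with tags or subfolders. Subfolders (nested folders) are only available in newer Grafana releases (introduced in Grafana 10); on older versions, rely on tags instead.
Apply folder permissions from each folder's permission settings (in recent versions, Folder actions > Manage permissions).
Use template variables such as $api, $status, and $client to reuse panels across dashboards.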
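The folder steps above can also be scripted against Grafana's HTTP API (`POST /api/folders`; the `parentUid` field places a folder inside another one when nested folders are enabled). This is a sketch: the UID naming scheme, the base URL and the service account token are assumptions, and `create_folders` is defined but not called.

```python
import json
import re
import urllib.request
from typing import Optional

# Folder part of the recommended tree (dashboards are created inside these later).
TREE = {
    "Logistics Platform Monitoring": {
        "Client Dashboards": {},
        "Internal Dashboards": {
            "API Metrics": {},
            "System Metrics": {},
            "Logs & Alerts": {},
            "Dev Only / Experiments": {},
        },
    },
}


def slug(title: str) -> str:
    """Derive a folder UID from its title (a hypothetical naming convention)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def folder_payloads(tree: dict, parent_uid: Optional[str] = None) -> list:
    """Flatten the tree into POST /api/folders bodies, parents before children."""
    payloads = []
    for title, children in tree.items():
        body = {"uid": slug(title), "title": title}
        if parent_uid is not None:
            # Only honored when nested folders are available and enabled.
            body["parentUid"] = parent_uid
        payloads.append(body)
        payloads.extend(folder_payloads(children, body["uid"]))
    return payloads


def create_folders(base_url: str, token: str) -> None:
    """POST each folder to Grafana using a service account token."""
    for body in folder_payloads(TREE):
        req = urllib.request.Request(
            f"{base_url}/api/folders",
            data=json.dumps(body).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            print(body["title"], resp.status)
```

Order matters here: a child can only reference a parent UID that already exists, so the flattening emits each parent before recursing into its children.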
📦 Optional Tags for Search
Add dashboard tags to help you search and filter:
SLO, critical, non-critical, infra, client-visible, latency, error-rate
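Tags are stored in each dashboard's JSON `tags` array, and the same tags can be passed to Grafana's search API (`GET /api/search?tag=...`). A small sketch, assuming you want to enforce the tag vocabulary above before tagging or searching; the base URL is a placeholder.

```python
from urllib.parse import urlencode

# Tag vocabulary agreed above; rejecting anything else keeps search filters reliable.
ALLOWED_TAGS = {
    "SLO", "critical", "non-critical", "infra",
    "client-visible", "latency", "error-rate",
}


def validate_tags(tags: list) -> list:
    """Return the tags de-duplicated and sorted, or raise on unknown tags."""
    unknown = sorted(set(tags) - ALLOWED_TAGS)
    if unknown:
        raise ValueError(f"unknown tags: {unknown}")
    return sorted(set(tags))


def search_url(base_url: str, tags: list) -> str:
    """Build a Grafana search URL that filters dashboards by tag."""
    query = urlencode({"type": "dash-db", "tag": validate_tags(tags)}, doseq=True)
    return f"{base_url}/api/search?{query}"
```

Validating against a fixed set catches near-duplicates such as `slo` vs `SLO`, which would otherwise split search results.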
✅ Summary Diagram (Text Tree)
📁 Logistics Monitoring
├── 📁 Client Dashboards
│   ├── SLO - Order API
│   ├── SLO - Assign Courier API
│   ├── SLO - Track Delivery API
│   └── Summary SLO Monthly Report
│
└── 📁 Internal Dashboards
    ├── 📁 API Metrics
    ├── 📁 System Metrics
    ├── 📁 Logs & Alerts
    └── 📁 Dev Only
Would you like me to generate a real Grafana dashboard JSON for one of these (e.g., the SLO dashboard for /assign-courier) or a permission setup guide?
Great! For your 📁 Client Dashboards structure in Grafana, each dashboard (e.g., SLO - Order API, SLO - Assign Courier API, etc.) should show clear, client-relevant charts that reflect the health, performance, and reliability of each API. Here’s a detailed list of suggested charts for each:
📊 SLO - Order API Dashboard
🔸 Purpose: Show reliability of /fetch-orders and similar endpoints
| Chart | Panel type | Notes |
|---|---|---|
| ✅ SLI: Success Rate (2xx Responses) | Stat or Line | success / total for /fetch-orders |
| ⏱️ Latency (p50, p90, p95) | Line chart | histogram_quantile() over the duration histogram buckets |
| ❌ Error Rate (4xx/5xx) | Bar or Area | Grouped by status code |
| 📊 Total Requests | Line chart | Shows usage volume over time |
| 📉 Availability Trend (Daily % Success) | Line chart | Actual success rate over time vs. the SLO target |
| 🔥 Burn Rate | Single stat or graph | Error budget burn rate (short and long window) |
| 📆 Error Budget Remaining | Single stat | How much budget is left (in time or %) |
| 📍 Request Breakdown by Status Code | Pie chart | e.g., 90% 2xx, 8% 4xx, 2% 5xx |
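The panel queries in this table could be generated with small helpers like these. This is a sketch that only builds PromQL strings; it assumes the metric names mentioned at the end of this conversation (`http_requests_total`, `request_duration_seconds_bucket`) and hypothetical `route` and `status` labels.

```python
# Assumed metrics: a counter http_requests_total and a histogram
# request_duration_seconds, both labelled with a hypothetical `route` label
# (and `status` on the counter). Rename labels to match your instrumentation.


def success_rate(route: str, window: str = "5m") -> str:
    """SLI: share of requests that returned 2xx over the window."""
    sel = f'route="{route}"'
    return (
        f'sum(rate(http_requests_total{{{sel},status=~"2.."}}[{window}]))'
        f" / sum(rate(http_requests_total{{{sel}}}[{window}]))"
    )


def latency_quantile(route: str, q: float, window: str = "5m") -> str:
    """Latency percentile (e.g. 0.95) estimated from histogram buckets."""
    return (
        f"histogram_quantile({q}, sum by (le) "
        f'(rate(request_duration_seconds_bucket{{route="{route}"}}[{window}])))'
    )


def error_share_by_status(route: str, window: str = "5m") -> str:
    """Fraction of all requests per 4xx/5xx status code (bar, area or pie panels)."""
    return (
        f'sum by (status) (rate(http_requests_total{{route="{route}",status=~"[45].."}}[{window}]))'
        f' / ignoring(status) group_left sum(rate(http_requests_total{{route="{route}"}}[{window}]))'
    )


if __name__ == "__main__":
    for q in (0.5, 0.9, 0.95):
        print(latency_quantile("/fetch-orders", q))
    print(success_rate("/fetch-orders"))
```

For the daily availability trend, the same `success_rate` helper can be called with a `"1d"` window. The `ignoring(status) group_left` clause is needed because the numerator keeps the `status` label while the total does not.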
📊 SLO - Assign Courier API Dashboard
🔸 Purpose: Monitor /assign-courier reliability and speed
| Chart | Panel type | Notes |
|---|---|---|
| ✅ Success Rate | Stat | (2xx, or 2xx + 3xx) / total |
| ❌ 5xx Error Spike Detection | Line/Area | Set the alerting threshold here |
| ⏱️ p95 Latency | Line chart | Critical for courier assignment responsiveness |
| 🔄 Retry Rate | Line chart | Only if the client/app retries failed calls |
| 🧭 Total Assignments | Stat | Count of successful courier assignments |
| 🔥 Burn Rate (Short + Long window) | Line or Stat | For SLO enforcement |
| 📅 Assignment Failures Over Time | Bar chart | Count of failed attempts per day/hour |
| 🔁 Time to First Success (if async logic) | Line chart | Optional, if applicable |
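A sketch of the burn-rate math behind the short- and long-window panels. The 99.9% target, the request counts and the 14.4 threshold are illustrative assumptions; 14.4 is a value commonly paired with a 1h/5m window on a 30-day SLO, because burning at 14.4× for one hour spends 2% of the monthly budget (14.4 × 1 h / 720 h = 0.02).

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Error-budget burn rate: observed error ratio divided by the allowed ratio.

    1.0 means the budget would be used up exactly at the end of the SLO period.
    """
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def should_page(long_burn: float, short_burn: float, threshold: float = 14.4) -> bool:
    """Multiwindow check: both the long and the short window must exceed the threshold.

    The long window (e.g. 1h) filters out brief blips; the short window (e.g. 5m)
    lets the alert clear quickly once the problem stops.
    """
    return long_burn > threshold and short_burn > threshold


if __name__ == "__main__":
    slo = 0.999  # hypothetical 99.9% target for /assign-courier
    long_burn = burn_rate(errors=500, total=20_000, slo=slo)  # last 1h
    short_burn = burn_rate(errors=40, total=1_600, slo=slo)   # last 5m
    print(round(long_burn, 1), round(short_burn, 1), should_page(long_burn, short_burn))
```

Requiring both windows is the design choice that keeps a single noisy 5-minute spike from paging anyone while still catching a sustained budget burn.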
📊 SLO - Track Delivery API Dashboard
🔸 Purpose: Visibility into /track-delivery updates
| Chart | Panel type | Notes |
|---|---|---|
| ✅ SLI: Successful Tracking Updates | Stat | Based on 2xx success |
| 🕐 Latency for Tracking Response | Line chart | p95 or p99 |
| 🔁 Update Delay | Line chart | If updates are async, track the lag between the shipment event and the status update |
| 📉 Availability Over Time | Line | Hourly/daily SLI trend |
| 🚨 5xx Error Rate | Line/Bar | Should be close to 0 |
| 🧭 Track Requests per Courier Partner | Bar chart | Usage split, if the client needs it |
| 📍 Track Status by Region (Optional) | Table/Map | For location-based delivery monitoring |
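If tracking updates are asynchronous, the Update Delay panel needs a lag per shipment. One option is to export that lag as a histogram metric and chart it with histogram_quantile(); the sketch below only shows the underlying calculation on raw timestamps, using a nearest-rank p95. The `(shipped_at, status_updated_at)` event shape is an assumption.

```python
import math
from datetime import datetime, timedelta


def update_lags(events: list) -> list:
    """Seconds between each shipment event and its tracking status update.

    `events` is a list of (shipped_at, status_updated_at) datetime pairs; this
    shape is hypothetical and depends on how your pipeline records updates.
    """
    return [(updated - shipped).total_seconds() for shipped, updated in events]


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile, p in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0, 0)
    events = [(base, base + timedelta(seconds=s)) for s in (4, 7, 5, 30, 6)]
    lags = update_lags(events)
    print("p95 update delay (s):", percentile(lags, 95))
```

Nearest-rank always returns an observed value, which keeps the number easy to explain to clients; with very few samples the p95 is simply the slowest update.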
📊 Summary SLO Monthly Report Dashboard
🔸 Purpose: Give clients a high-level overview of all APIs
| Chart | Panel type | Notes |
|---|---|---|
| ✅ Overall Platform Availability | Stat | Average across critical APIs |
| 📊 API-Wise Success Rate | Table/Bar | /fetch-orders, /assign-courier, /track-delivery |
| 📆 Error Budget Used This Month | Stat | % of budget consumed for each API |
| 🔥 Burn Rate Summary | Line/Table | Highlights any risk of SLO violation |
| ⏱️ Monthly p95 Latency by API | Table | Performance snapshot |
| ❌ Incident Count (if tracked) | Stat | Optional, depends on an integrated alerting system |
| 📍 SLA Breach Timeline (if any) | Line chart | When and how the error budget was exceeded |
| 🗓️ Monthly API Usage | Line/Bar | For visibility on volume trends |
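The monthly figures reduce to a little arithmetic. For example, a 99.9% target over a 30-day month allows 30 × 24 × 60 × 0.001 = 43.2 minutes of full downtime. The sketch below assumes one shared SLO for all APIs and uses a simple average for overall availability, as the table describes.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Total error budget for the period, expressed as minutes of full downtime."""
    return days * 24 * 60 * (1.0 - slo)


def budget_consumed_pct(actual: float, slo: float) -> float:
    """Share of the error budget already spent, given the measured availability."""
    return (1.0 - actual) / (1.0 - slo) * 100.0


def monthly_report(actuals: dict, slo: float) -> dict:
    """Per-API budget usage (rounded %) plus the simple-average overall availability."""
    return {
        "budget_used_pct": {
            api: round(budget_consumed_pct(a, slo), 1) for api, a in actuals.items()
        },
        "overall_availability": sum(actuals.values()) / len(actuals),
    }


if __name__ == "__main__":
    # Hypothetical month-to-date availability per API.
    actuals = {"/fetch-orders": 0.9995, "/assign-courier": 0.9990, "/track-delivery": 0.9985}
    print(allowed_downtime_minutes(0.999))
    print(monthly_report(actuals, slo=0.999))
```

Values above 100% mean the API has already exceeded its budget for the month. If request volumes differ a lot between APIs, a request-weighted average reflects client experience better than the simple mean.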
✅ Key Notes for Client Dashboards
Keep dashboards read-only for clients.
Use clear naming, avoid overly technical PromQL in panel titles.
Include tooltips/notes to explain what each chart means.
Highlight SLO target lines visually (e.g., 99.9%) in graphs.
Use variables such as $api, $status_code, and $region for filtering if needed.
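Several of these notes map directly onto dashboard JSON. A hedged sketch that builds a client dashboard with an $api variable, a panel description (shown as the panel's info tooltip) and the SLO target drawn as a threshold line. The metric and label names, the 99.9% target and the layout are assumptions, and no data source is set, so Grafana would use the default one.

```python
import json

# PromQL driven by the $api variable; metric and label names are assumptions.
SUCCESS_EXPR = (
    'sum(rate(http_requests_total{route="$api",status=~"2.."}[5m]))'
    ' / sum(rate(http_requests_total{route="$api"}[5m]))'
)


def api_variable() -> dict:
    """Dashboard variable listing routes from Prometheus label values."""
    return {
        "name": "api",
        "label": "API",
        "type": "query",
        "query": "label_values(http_requests_total, route)",
        "refresh": 1,  # refresh the option list on dashboard load
    }


def slo_panel(title: str, slo: float) -> dict:
    """Time series panel with the SLO target drawn as a threshold line."""
    return {
        "type": "timeseries",
        "title": title,
        "description": "Share of requests that succeeded (2xx). The line marks the SLO target.",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{"refId": "A", "expr": SUCCESS_EXPR}],
        "fieldConfig": {
            "defaults": {
                "unit": "percentunit",  # the query returns a 0-1 ratio
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "red", "value": None},
                        {"color": "green", "value": slo},
                    ],
                },
                "custom": {"thresholdsStyle": {"mode": "line"}},
            },
            "overrides": [],
        },
    }


def client_dashboard(slo: float = 0.999) -> dict:
    """Read-only client dashboard combining the variable and the SLO panel."""
    return {
        "title": "SLO - Order API",
        "tags": ["SLO", "client-visible"],
        "editable": False,  # UI hint only; real read-only access comes from folder permissions
        "templating": {"list": [api_variable()]},
        "panels": [slo_panel("Success rate vs SLO", slo)],
    }


if __name__ == "__main__":
    print(json.dumps(client_dashboard(), indent=2))
```

Using `percentunit` keeps the threshold in the same 0-1 scale as the query, so the target is written as 0.999 rather than 99.9.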
Would you like:
A dashboard JSON template for one of these?
Or PromQL queries tailored to your actual metric names, such as http_requests_total, request_duration_seconds_bucket, etc.?