Monitoring
What we can monitor
Great question 🚀 – you’re basically designing a Kubernetes monitoring dashboard in Grafana. Let’s break this down into core metrics, node/pod insights, autoscaling (Karpenter), and health checks.
🔹 1. Cluster Overview
Total Nodes (count(kube_node_info))
Ready vs NotReady nodes (kube_node_status_condition{condition="Ready"})
Total Pods in Cluster (count(kube_pod_info))
Running vs Pending vs Failed Pods (kube_pod_status_phase)
Cluster Pod Capacity vs Used Pods (sum(kube_node_status_capacity_pods), sum(kubelet_running_pod_count))
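As a concrete example, the pod-phase and capacity panels each reduce to a single PromQL query; a minimal sketch (the phase label values assume the kube-state-metrics defaults):
sum(kube_pod_status_phase{phase="Running"})
sum(kube_pod_status_phase{phase="Pending"})
sum(kube_pod_status_phase{phase="Failed"})
sum(kubelet_running_pod_count) / sum(kube_node_status_capacity_pods)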
🔹 2. Node-Level Metrics
Node CPU usage → (instance:node_cpu_utilisation:rate1m)
Node Memory usage → (node_memory_Active_bytes / node_memory_MemTotal_bytes)
Node Disk usage → (node_filesystem_size_bytes - node_filesystem_free_bytes)
Node Allocatable Resources vs Used
CPU allocatable: kube_node_status_allocatable_cpu_cores
Memory allocatable: kube_node_status_allocatable_memory_bytes
Pods allocatable: kube_node_status_allocatable_pods
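As panel queries, node memory and disk utilisation usually read best as percentages; a sketch built from the node-exporter series above (the fstype filter is an assumption to skip pseudo-filesystems):
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
100 * (1 - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})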
🔹 3. Pod & Workload Metrics
Top 5 namespaces by pod count
Top 5 deployments by CPU/Memory usage
Pod restarts per namespace (rate(kube_pod_container_status_restarts_total[5m]))
OOMKilled / CrashLoopBackOff count
Pending pods (unschedulable) (kube_pod_status_scheduled{condition="false"})
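The top-N panels map directly onto topk(); a sketch (the 5m rate window is an arbitrary choice):
topk(5, count by (namespace) (kube_pod_info))
sum by (namespace) (rate(kube_pod_container_status_restarts_total[5m]))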
🔹 4. Karpenter-Specific (Autoscaling)
Karpenter provisioner events (from karpenter controller logs → Prometheus)
Karpenter node lifecycle
karpenter_provisioner_nodes_created_total
karpenter_provisioner_nodes_terminated_total
Time to provision new nodes
Scaling decisions over time (new nodes vs terminating)
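A rough node-churn panel can be built from the two counters above; a sketch (exact metric names vary across Karpenter versions, so confirm them against the Karpenter pod's /metrics endpoint):
sum(increase(karpenter_provisioner_nodes_created_total[1h]))
sum(increase(karpenter_provisioner_nodes_terminated_total[1h]))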
🔹 5. Kubelet & Control Plane
Kubelet status (up{job="kubelet"})
Kubelet resource pressure
Disk pressure: kube_node_status_condition{condition="DiskPressure"}
Memory pressure: kube_node_status_condition{condition="MemoryPressure"}
PID pressure: kube_node_status_condition{condition="PIDPressure"}
Kubelet running pods (kubelet_running_pod_count per node)
API Server latency (apiserver_request_duration_seconds_bucket)
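API server latency is exposed as a histogram, so a percentile panel goes through histogram_quantile(); a sketch for p99 latency by verb (WATCH/CONNECT excluded because they are long-running):
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])))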
🔹 6. Cluster Reliability
Etcd health (etcd_server_has_leader, etcd_server_health_failures_total)
API server errors (apiserver_request_total{code!~"2.."})
Scheduler latency & failures
Controller manager errors
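For API server errors, an error ratio is usually more useful than a raw count; a sketch:
sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))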
🔹 7. Network & Ingress
Ingress request rate, latency, error %
Service endpoint availability (kube_endpoint_address_available)
DNS resolution errors (via CoreDNS metrics)
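A simple availability panel can flag Services whose Endpoints currently have no ready addresses; a sketch using the kube-state-metrics series above:
kube_endpoint_address_available == 0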
🔹 8. Extra (Good to Have)
Events trend (with kube-eventer / event-exporter → Prometheus)
Jobs/cronjobs success vs failure
PVC usage (kubelet_volume_stats_used_bytes)
DaemonSet health
Horizontal Pod Autoscaler (HPA) status
Pod distribution across nodes (binpacking efficiency)
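PVC usage also reads best as a percentage; a sketch pairing the used-bytes series above with its capacity counterpart (kubelet_volume_stats_capacity_bytes, exposed by the same kubelet endpoint):
100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes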
✅ Summary – Your dashboard should have these panels:
Cluster Summary (nodes, pods, allocatable vs used)
Node Utilization (CPU, memory, disk per node)
Pod Overview (running, pending, failed, restarts)
Karpenter Autoscaling (nodes provisioned/terminated, scaling events)
Kubelet Status (ready, pressures, running pods)
Control Plane Health (API server, etcd, scheduler)
Network/Ingress (latency, errors, endpoints)
Storage (PVC usage, disk pressure)
👉 Do you want me to generate a ready-to-import Grafana dashboard JSON (with Prometheus as datasource) containing all these panels so you can use it directly?
How to monitor above metrics
Perfect 👌 — you’re essentially asking how to stitch all the tools together so you can monitor Kubernetes, Nodes, Pods, Karpenter, and Control Plane with Prometheus + Grafana.
Here’s the full toolkit & setup flow:
🔹 1. Metrics Collection
✅ Prometheus
Core metrics database.
Installed via Prometheus Operator or kube-prometheus-stack (Helm).
Scrapes metrics from Kubernetes components, nodes, and apps.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
This deploys:
Prometheus
Alertmanager
Grafana
Node Exporter
Kube State Metrics
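Once the chart is installed, a quick sanity check in the Prometheus UI (or a Grafana stat panel) is that all of these components report up; a sketch (the job label values are assumptions based on the chart's default scrape jobs):
up{job=~"node-exporter|kube-state-metrics|kubelet|apiserver"}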
🔹 2. Metrics Sources
To cover everything in your list:
📌 Node-level metrics
node-exporter (DaemonSet, deployed with kube-prometheus-stack)
Exposes CPU, memory, disk, filesystem, network
Endpoint: /metrics
📌 Kubernetes object metrics
kube-state-metrics (deployed with kube-prometheus-stack)
Exposes CRD/object states: deployments, pods, nodes, HPA, etc.
Needed for: pod count, pod status, node allocatable vs capacity
📌 Kubelet metrics
Exposed on each node at https://<node-ip>:10250/metrics
Operator sets up scraping for kubelet
Metrics include: kubelet_running_pod_count, pressure (Disk/Memory/PID)
📌 Control Plane
API Server (/metrics on master/control plane endpoint)
Scheduler and Controller Manager metrics (if accessible)
Etcd metrics (if self-managed, EKS hides this)
⚠️ On EKS, you won’t have direct access to etcd/controller-manager/scheduler, but API Server + kube-state-metrics is usually enough.
📌 Karpenter
Karpenter exposes Prometheus metrics at:
http://<karpenter-pod>:8080/metrics
Example metrics:
karpenter_provisioner_nodes_created_total
karpenter_provisioner_nodes_terminated_total
You need to configure Prometheus to scrape the Karpenter service.
Example scrape config (in the kube-prometheus-stack values.yaml, under prometheus.prometheusSpec):
additionalScrapeConfigs:
  - job_name: 'karpenter'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: karpenter
        action: keep
      - source_labels: [__address__]
        target_label: __address__
        regex: (.*):.*
        replacement: ${1}:8080
🔹 3. Visualization
✅ Grafana
Comes with kube-prometheus-stack
Default login: admin / prom-operator
Datasource: Prometheus (auto-configured)
Recommended Dashboards (Grafana.com IDs)
Kubernetes / Compute Resources / Cluster → 315 (official kube-prometheus)
Node Exporter Full → 1860 (per node CPU/mem/disk)
Karpenter Dashboard → 18616 (community)
Kubernetes API Server → 12006
Kubernetes / Compute Resources / Namespace (Pods) → 9614
Import dashboards:
Grafana → Dashboards → Import → Enter ID (e.g., 1860)
Select Prometheus datasource → Import
🔹 4. Alerts
Use PrometheusRule CRDs with Alertmanager for notifications.
Example alerts:
Node not ready
Pod crashloop > X times
Disk > 80%
API Server errors > threshold
Karpenter node launch failures
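The Disk > 80% alert, for instance, can reuse the node-exporter filesystem series; a sketch of the expr (the fstype filter is an assumption):
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 80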
Example (Node Not Ready):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeDown
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            description: "Node {{ $labels.node }} is down"
            summary: "Node is NotReady"
🔹 5. Logs (Optional but Powerful)
Use Loki (Grafana Loki) for logs.
Deploy via Helm:
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n monitoring
Collect logs from pods/nodes → view inside Grafana alongside metrics.
🔹 6. Events (Optional but Recommended)
Events don’t persist long in Kubernetes.
Use kubernetes-event-exporter to push events into Prometheus/Elasticsearch/Loki/CloudWatch.
Then you can see pod failures, Karpenter scaling events, etc. in Grafana.
✅ Final Monitoring Flow
Prometheus Operator / kube-prometheus-stack → deploys Prometheus, Grafana, node-exporter, kube-state-metrics.
Add Karpenter scrape config in Prometheus.
Import Grafana dashboards (cluster, node, Karpenter, API server).
Set alerts in Prometheus → Alertmanager → Slack/Email/Webhook.
(Optional) Add Loki for logs + event-exporter for events.
👉 Do you want me to create a ready-to-use Helm values.yaml for kube-prometheus-stack with Karpenter scraping + alert rules + Grafana dashboards preconfigured so you just helm install and get the full setup?