Monitoring

What we can monitor

This is essentially the design of a Kubernetes monitoring dashboard in Grafana. It breaks down into core metrics, node/pod insights, autoscaling (Karpenter), and health checks.


🔹 1. Cluster Overview

  • Total Nodes (count(kube_node_info))

  • Ready vs NotReady nodes (kube_node_status_condition{condition="Ready"})

  • Total Pods in Cluster (count(kube_pod_info))

  • Running vs Pending vs Failed Pods (kube_pod_status_phase)

  • Cluster Pod Capacity vs Used Pods (sum(kube_node_status_capacity{resource="pods"}) vs sum(kubelet_running_pods); kube-state-metrics v1 exposed this as kube_node_status_capacity_pods, and older kubelets as kubelet_running_pod_count)
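
As a worked example, a single "pod saturation" stat panel can divide the two series above (a sketch, assuming the kube-state-metrics v2 and recent-kubelet metric names):

sum(kubelet_running_pods) / sum(kube_node_status_capacity{resource="pods"})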


🔹 2. Node-Level Metrics

  • Node CPU usage → (instance:node_cpu_utilisation:rate1m, a recording rule shipped by kube-prometheus; equivalently 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

  • Node Memory usage → (node_memory_Active_bytes / node_memory_MemTotal_bytes)

  • Node Disk usage → (node_filesystem_size_bytes - node_filesystem_free_bytes)

  • Node Allocatable Resources vs Used

    • CPU allocatable: kube_node_status_allocatable{resource="cpu"} (kube-state-metrics v1: kube_node_status_allocatable_cpu_cores)

    • Memory allocatable: kube_node_status_allocatable{resource="memory"} (v1: kube_node_status_allocatable_memory_bytes)

    • Pods allocatable: kube_node_status_allocatable{resource="pods"} (v1: kube_node_status_allocatable_pods)
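
Example node panels in PromQL (a sketch; node_memory_MemAvailable_bytes is a common, more accurate alternative to the Active_bytes formula above):

Memory utilization % per node:

(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

Disk utilization % per mount (excluding tmpfs/overlay):

(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100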


🔹 3. Pod & Workload Metrics

  • Top 5 namespaces by pod count

  • Top 5 deployments by CPU/Memory usage

  • Pod restarts per namespace (rate(kube_pod_container_status_restarts_total[5m]))

  • OOMKilled / CrashLoopBackOff count (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}, kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"})

  • Pending pods (unschedulable) (kube_pod_status_scheduled{condition="false"})
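
Sample PromQL for the workload panels above, using standard kube-state-metrics series:

Top 5 namespaces by pod count:

topk(5, count by (namespace) (kube_pod_info))

Restarts per namespace (5m rate):

sum by (namespace) (rate(kube_pod_container_status_restarts_total[5m]))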


🔹 4. Karpenter-Specific (Autoscaling)

  • Karpenter provisioner events (from karpenter controller logs → Prometheus)

  • Karpenter node lifecycle (metric names vary by Karpenter release; newer releases expose karpenter_nodes_created / karpenter_nodes_terminated)

    • karpenter_provisioner_nodes_created_total

    • karpenter_provisioner_nodes_terminated_total

  • Time to provision new nodes

  • Scaling decisions over time (new nodes vs terminating)
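
Example churn panels (a sketch; assumes the karpenter_nodes_created / karpenter_nodes_terminated counters of recent Karpenter releases — adjust to whatever names your version actually exports on /metrics):

Nodes created vs terminated per hour:

sum(increase(karpenter_nodes_created[1h]))
sum(increase(karpenter_nodes_terminated[1h]))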


🔹 5. Kubelet & Control Plane

  • Kubelet status (up{job="kubelet"})

  • Kubelet resource pressure

    • Disk pressure: kube_node_status_condition{condition="DiskPressure"}

    • Memory pressure: kube_node_status_condition{condition="MemoryPressure"}

    • PID pressure: kube_node_status_condition{condition="PIDPressure"}

  • Kubelet running pods (kubelet_running_pod_count per node; renamed kubelet_running_pods in newer Kubernetes releases)

  • API Server latency (apiserver_request_duration_seconds_bucket)
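
Example control-plane panels in PromQL:

API server p99 latency by verb:

histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))

Nodes currently under memory pressure:

sum(kube_node_status_condition{condition="MemoryPressure",status="true"})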


🔹 6. Cluster Reliability

  • Etcd health (etcd_server_has_leader, etcd_server_leader_changes_seen_total)

  • API server errors (apiserver_request_total{code!~"2.."})

  • Scheduler latency & failures

  • Controller manager errors
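
A sample error-ratio panel for the API server (counts 5xx responses against all requests):

sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))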


🔹 7. Network & Ingress

  • Ingress request rate, latency, error %

  • Service endpoint availability (kube_endpoint_address_available)

  • DNS resolution errors (via CoreDNS metrics)
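
Sample network panels (a sketch; the ingress query assumes ingress-nginx — other controllers expose different series):

CoreDNS SERVFAIL rate:

sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))

Ingress 5xx percentage:

100 * sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) / sum(rate(nginx_ingress_controller_requests[5m]))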


🔹 8. Extra (Good to Have)

  • Events trend (with kube-eventer / event-exporter → Prometheus)

  • Jobs/cronjobs success vs failure

  • PVC usage (kubelet_volume_stats_used_bytes)

  • DaemonSet health

  • Horizontal Pod Autoscaler (HPA) status

  • Pod distribution across nodes (binpacking efficiency)
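
Sample PromQL for the extras:

PVC usage % per volume:

100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

Failed Jobs per namespace:

sum by (namespace) (kube_job_status_failed)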


Summary – Your dashboard should have these panels:

  1. Cluster Summary (nodes, pods, allocatable vs used)

  2. Node Utilization (CPU, memory, disk per node)

  3. Pod Overview (running, pending, failed, restarts)

  4. Karpenter Autoscaling (nodes provisioned/terminated, scaling events)

  5. Kubelet Status (ready, pressures, running pods)

  6. Control Plane Health (API server, etcd, scheduler)

  7. Network/Ingress (latency, errors, endpoints)

  8. Storage (PVC usage, disk pressure)



How to monitor the above metrics

This section covers how to stitch the tools together so you can monitor Kubernetes, nodes, pods, Karpenter, and the control plane with Prometheus + Grafana.

Here’s the full toolkit & setup flow:


🔹 1. Metrics Collection

Prometheus

  • Core metrics database.

  • Installed via Prometheus Operator or kube-prometheus-stack (Helm).

  • Scrapes metrics from Kubernetes components, nodes, and apps.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

This deploys:

  • Prometheus

  • Alertmanager

  • Grafana

  • Node Exporter

  • Kube State Metrics


🔹 2. Metrics Sources

To cover everything in the list above:

📌 Node-level metrics

  • node-exporter (DaemonSet, deployed with kube-prometheus-stack)

    • Exposes CPU, memory, disk, filesystem, network

    • Endpoint: /metrics

📌 Kubernetes object metrics

  • kube-state-metrics (deployed with kube-prometheus-stack)

    • Exposes Kubernetes object state: deployments, pods, nodes, HPA, etc.

    • Needed for: pod count, pod status, node allocatable vs capacity

📌 Kubelet metrics

  • Exposed on each node at https://<node-ip>:10250/metrics

  • Operator sets up scraping for kubelet

  • Metrics include: kubelet_running_pod_count, pressure (Disk/Memory/PID)

📌 Control Plane

  • API Server (/metrics on master/control plane endpoint)

  • Scheduler and Controller Manager metrics (if accessible)

  • Etcd metrics (if self-managed, EKS hides this)

⚠️ On EKS, you won’t have direct access to etcd/controller-manager/scheduler, but API Server + kube-state-metrics is usually enough.

📌 Karpenter

  • Karpenter exposes Prometheus metrics at: http://<karpenter-pod>:8080/metrics

  • Example metrics:

    • karpenter_provisioner_nodes_created_total

    • karpenter_provisioner_nodes_terminated_total

  • You need to configure Prometheus to scrape the Karpenter service.

Example scrape config (values.yaml for Prometheus):

additionalScrapeConfigs:
  - job_name: 'karpenter'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: karpenter
        action: keep
      - source_labels: [__address__]
        target_label: __address__
        regex: (.*):.* 
        replacement: ${1}:8080
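
With kube-prometheus-stack it is often cleaner to use a ServiceMonitor instead of additionalScrapeConfigs. A sketch — the namespace, labels, and port name here are assumptions and must match your Karpenter install:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: monitoring
  labels:
    release: kube-prometheus   # must match the operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: ["karpenter"]  # assumed install namespace
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics       # assumed metrics port name on the Karpenter service
      interval: 30s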

🔹 3. Visualization

Grafana

  • Comes with kube-prometheus-stack

  • Default login: admin/prom-operator

  • Datasource: Prometheus (auto-configured)

Recommended dashboards (import by ID):

  • Kubernetes / Compute Resources / Cluster (ID 315, official kube-prometheus)

  • Node Exporter Full (ID 1860, per-node CPU/mem/disk)

  • Karpenter Dashboard (ID 18616, community)

  • Kubernetes API Server (ID 12006)

  • Kubernetes / Compute Resources / Namespace (Pods) (ID 9614)

Import dashboards:

  1. Grafana → Dashboards → Import → Enter ID (e.g., 1860)

  2. Select Prometheus datasource → Import


🔹 4. Alerts

  • Use PrometheusRule CRDs with Alertmanager for notifications.

  • Example alerts:

    • Node not ready

    • Pod crashloop > X times

    • Disk > 80%

    • API Server errors > threshold

    • Karpenter node launch failures

Example (Node Not Ready):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
spec:
  groups:
  - name: node.rules
    rules:
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        description: "Node {{ $labels.node }} is NotReady"
        summary: "Node is NotReady"
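
A second rule following the same pattern, for crash-looping pods (a sketch; the threshold and duration are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-alerts
  namespace: monitoring
spec:
  groups:
  - name: pod.rules
    rules:
    - alert: PodCrashLooping
      # fires if the container reported CrashLoopBackOff at any point in the window
      expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
      for: 10m
      labels:
        severity: warning
      annotations:
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        summary: "Pod in CrashLoopBackOff"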

🔹 5. Logs (Optional but Powerful)

  • Use Loki (Grafana Loki) for logs.

  • Deploy via Helm (add the grafana chart repo first):

    helm repo add grafana https://grafana.github.io/helm-charts
    helm install loki grafana/loki-stack -n monitoring
  • Collect logs from pods/nodes → view inside Grafana alongside metrics.
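
Once logs are flowing, query them from Grafana's Explore view with LogQL, e.g. (the namespace label and filter string are illustrative):

{namespace="karpenter"} |= "launched"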


🔹 6. Events

  • Events don’t persist long in Kubernetes.

  • Use kubernetes-event-exporter to push events into sinks such as Loki, Elasticsearch, or CloudWatch.

  • Then you can see pod failures, Karpenter scaling events, etc. in Grafana.
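
A minimal kubernetes-event-exporter config to start from (this stdout receiver is the upstream default example; swap it for your Loki/Elasticsearch sink):

logLevel: error
logFormat: json
route:
  routes:
    - match:
        - receiver: dump   # route all events to the receiver below
receivers:
  - name: dump
    stdout: {}             # dumps events as JSON logs; replace with your real sink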


✅ Final Monitoring Flow

  1. Prometheus Operator / kube-prometheus-stack → deploys Prometheus, Grafana, node-exporter, kube-state-metrics.

  2. Add Karpenter scrape config in Prometheus.

  3. Import Grafana dashboards (cluster, node, Karpenter, API server).

  4. Set alerts in Prometheus → Alertmanager → Slack/Email/Webhook.

  5. (Optional) Add Loki for logs + event-exporter for events.

