metrics

alert when node is in not-ready state

kube-state-metrics + Prometheus + Alertmanager

how

To alert when an EKS node goes into the NotReady state, you generally rely on a monitoring and alerting pipeline.

Here are common approaches:


🔹 1. Using kube-state-metrics + Prometheus + Alertmanager

  • Step 1: Enable kube-state-metrics (if not already).

    • This exposes node conditions as Prometheus metrics.

  • Step 2: Metric to Watch

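    • kube_node_status_condition{condition="Ready",status="true"} has the value 1 while the node is healthy.

    • When that series drops to 0, the node is NotReady: the status="false" or status="unknown" series for the same node becomes 1 instead.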

  • Step 3: Alert Rule (Prometheus)

- alert: K8sNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is NotReady"
    description: "EKS node {{ $labels.node }} has been in NotReady state for more than 2 minutes."
  • Step 4: Route via Alertmanager → Slack, Email, PagerDuty, OpsGenie, etc.
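
Before wiring the rule into Alertmanager, it can help to run the same expression against the Prometheus HTTP API (`/api/v1/query`) and see which nodes it returns. The following is a minimal sketch, not a finished tool: the helper names and the base URL are assumptions, and the sample response is hand-written rather than captured from a cluster.

```python
import json
import urllib.parse
import urllib.request

# Same expression as the alert rule above.
EXPR = 'kube_node_status_condition{condition="Ready",status="true"} == 0'


def not_ready_nodes(response: dict) -> list[str]:
    """Return node names from an instant-query response (resultType "vector")."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    results = response["data"]["result"]
    return sorted(r["metric"].get("node", "<unknown>") for r in results)


def query_prometheus(base_url: str, expr: str = EXPR) -> dict:
    """Run an instant query. base_url is an assumption, e.g. a port-forwarded Prometheus."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Hand-written sample shaped like a Prometheus instant-query response.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "node": "ip-10-0-1-23.ec2.internal",
                    "condition": "Ready",
                    "status": "true",
                },
                "value": [1700000000, "0"],
            }
        ],
    },
}

if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE))
```

Against a live server, `not_ready_nodes(query_prometheus("http://localhost:9090"))` would return the nodes the alert would currently match. The URL here is only an example of a locally port-forwarded Prometheus.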


🔹 2. Using CloudWatch + CloudWatch Alarms (AWS Native)

EKS worker nodes are EC2 instances, so you can use AWS monitoring:

  • Step 1: Install CloudWatch Container Insights for EKS.

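  • Step 2: The CloudWatch agent, deployed as a DaemonSet, collects node-level metrics.

  • Step 3: Create a CloudWatch Alarm on a node-health metric published by Container Insights (for example, cluster_failed_node_count).

  • Step 4: SNS Notification → Email/SMS, or Slack via AWS Chatbot or a Lambda subscriber.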
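
As a rough illustration of Step 3, the sketch below builds the parameters for CloudWatch's PutMetricAlarm call. It assumes Container Insights publishes `cluster_failed_node_count` in the `ContainerInsights` namespace with a `ClusterName` dimension, which should be checked against the metric list for your agent version. The cluster name, the SNS topic ARN, and the function name are placeholders.

```python
def node_alarm_params(cluster_name: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"{cluster_name}-failed-nodes",
        "Namespace": "ContainerInsights",
        "MetricName": "cluster_failed_node_count",  # assumed metric name, verify first
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds per datapoint
        "EvaluationPeriods": 2,      # 2 x 60s, roughly the 2m window used above
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    params = node_alarm_params(
        "my-eks-cluster",  # placeholder cluster name
        "arn:aws:sns:us-east-1:123456789012:node-alerts",  # placeholder topic ARN
    )
    # Creating the alarm needs boto3 and AWS credentials, so it is left commented out:
    # import boto3
    # boto3.client("cloudwatch").put_metric_alarm(**params)
    print(params["AlarmName"])
```

Keeping the parameters in a plain dict makes them easy to review before anything is created in the account.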


🔹 3. Using Kubernetes Native Tooling (kubectl + CronJob)

  • Run a small script every 2-5 minutes:

kubectl get nodes | grep NotReady
  • If match found → send Slack/email via webhook.

  • Can be run as a Kubernetes CronJob.
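
Grepping the table output works, but parsing `kubectl get nodes -o json` is less fragile. The sketch below shows one possible shape for such a script. The function names and the Slack-style webhook payload are assumptions, and the script is not tied to any particular notification service.

```python
import json
import subprocess
import urllib.request


def unhealthy_nodes(node_list: dict) -> list[str]:
    """Names of nodes whose Ready condition is not "True" (False, Unknown or missing)."""
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            bad.append(node["metadata"]["name"])
    return bad


def notify(webhook_url: str, nodes: list[str]) -> None:
    """POST a Slack-style {"text": ...} payload; webhook_url is a placeholder."""
    body = json.dumps({"text": "NotReady nodes: " + ", ".join(nodes)}).encode()
    req = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)


def main(webhook_url: str) -> None:
    """Fetch nodes with kubectl and notify only when something is unhealthy."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    nodes = unhealthy_nodes(json.loads(out))
    if nodes:
        notify(webhook_url, nodes)
```

When this runs as a CronJob, the pod's ServiceAccount needs permission to list nodes. Unlike the Prometheus rule, a single check has no equivalent of `for:`, so a briefly restarting node can trigger a notification.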


🔹 4. Using Datadog / NewRelic / Grafana Cloud

  • Most 3rd-party monitoring platforms already provide EKS node health dashboards + alerts.

  • Just enable node monitoring and configure a “Node NotReady” alert rule.


Best Practice (what most SRE teams do):

  • Use Prometheus + Alertmanager (fine-grained alerts).

  • Or use CloudWatch Container Insights if you want a fully AWS-managed setup.

  • Set for: 2–5m in alerts → avoids false positives due to short node restarts.
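
To see why a `for:` window suppresses short blips, here is a simplified model of the pending-to-firing transition. It is a sketch of the idea only, not Prometheus's actual implementation, and the function name and the fixed evaluation interval are assumptions.

```python
def alert_fires(condition_samples: list[bool], interval_s: int, for_s: int) -> bool:
    """True if the condition stays true long enough for a `for:` duration to elapse."""
    pending_since = None  # time at which the condition first became true
    for i, active in enumerate(condition_samples):
        now = i * interval_s
        if not active:
            pending_since = None      # condition cleared: reset the pending timer
        elif pending_since is None:
            pending_since = now       # condition just became true: start pending
        if pending_since is not None and now - pending_since >= for_s:
            return True               # held for the full window: alert fires
    return False


if __name__ == "__main__":
    # Node NotReady for five consecutive 30s evaluations with for: 2m.
    print(alert_fires([True] * 5, interval_s=30, for_s=120))
    # Node NotReady for only three evaluations (a short restart), then healthy again.
    print(alert_fires([False, True, True, True, False, False], interval_s=30, for_s=120))
```

In this model the first case fires and the second does not, which is the behaviour the 2–5m recommendation relies on.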



metrics available by Kube-state metrics

Kube-State-Metrics (KSM) is a service that listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects — such as Deployments, Nodes, Pods, etc. — in Prometheus format. It does not focus on the health of the individual Kubernetes components, but rather on the state of the Kubernetes objects themselves.

As of kube-state-metrics v2.x, here’s a comprehensive list of the families of metrics exposed, grouped by Kubernetes resource type:


📊 1. Node Metrics

  • kube_node_info

  • kube_node_status_capacity

  • kube_node_status_allocatable

  • kube_node_status_condition

  • kube_node_spec_unschedulable

  • kube_node_status_phase

  • kube_node_labels

  • kube_node_role

  • kube_node_created

  • kube_node_spec_taint

Tracks node info, capacity, conditions (Ready, MemoryPressure, etc.), labels, taints, roles, and creation time.


🐳 2. Pod Metrics

  • kube_pod_info

  • kube_pod_owner

  • kube_pod_status_phase

  • kube_pod_container_info

  • kube_pod_container_status_waiting

  • kube_pod_container_status_running

  • kube_pod_container_status_terminated

  • kube_pod_container_status_last_terminated_reason

  • kube_pod_container_status_ready

  • kube_pod_container_status_restarts_total

  • kube_pod_container_resource_requests

  • kube_pod_container_resource_limits

  • kube_pod_start_time

  • kube_pod_completion_time

  • kube_pod_created

  • kube_pod_deletion_timestamp

  • kube_pod_status_scheduled

  • kube_pod_status_ready

  • kube_pod_status_unschedulable

  • kube_pod_status_scheduled_time

  • kube_pod_overhead

  • kube_pod_restart_policy

  • kube_pod_runtime_class_name

Includes pod lifecycle, container states, resource requests/limits, restarts, ownership, scheduling status, and more.


🧩 3. Deployment Metrics

  • kube_deployment_spec_replicas

  • kube_deployment_status_replicas

  • kube_deployment_status_replicas_available

  • kube_deployment_status_replicas_unavailable

  • kube_deployment_status_replicas_updated

  • kube_deployment_status_observed_generation

  • kube_deployment_metadata_generation

  • kube_deployment_created

  • kube_deployment_labels

  • kube_deployment_annotations

  • kube_deployment_spec_paused

  • kube_deployment_spec_strategy_rollingupdate_max_unavailable

  • kube_deployment_spec_strategy_rollingupdate_max_surge

Tracks desired vs current replicas, availability, update status, generation, labels, annotations, and rollout strategy.


🔄 4. ReplicaSet Metrics

  • kube_replicaset_owner

  • kube_replicaset_status_replicas

  • kube_replicaset_status_fully_labeled_replicas

  • kube_replicaset_status_ready_replicas

  • kube_replicaset_status_observed_generation

  • kube_replicaset_metadata_generation

  • kube_replicaset_spec_replicas

  • kube_replicaset_created

  • kube_replicaset_labels

Useful for tracking ReplicaSets owned by Deployments or other controllers.


🧵 5. StatefulSet Metrics

  • kube_statefulset_replicas

  • kube_statefulset_status_replicas

  • kube_statefulset_status_replicas_current

  • kube_statefulset_status_replicas_ready

  • kube_statefulset_status_replicas_updated

  • kube_statefulset_status_observed_generation

  • kube_statefulset_metadata_generation

  • kube_statefulset_created

  • kube_statefulset_labels

  • kube_statefulset_update_strategy

Similar to Deployment metrics but specific to StatefulSets.


🎯 6. DaemonSet Metrics

  • kube_daemonset_status_current_number_scheduled

  • kube_daemonset_status_desired_number_scheduled

  • kube_daemonset_status_number_misscheduled

  • kube_daemonset_status_number_ready

  • kube_daemonset_status_number_available

  • kube_daemonset_status_number_unavailable

  • kube_daemonset_status_updated_number_scheduled

  • kube_daemonset_metadata_generation

  • kube_daemonset_status_observed_generation

  • kube_daemonset_created

  • kube_daemonset_labels

  • kube_daemonset_update_strategy

Tracks scheduling status across nodes, readiness, updates, and generations.


🚀 7. Job & CronJob Metrics

Job:

  • kube_job_info

  • kube_job_owner

  • kube_job_status_active

  • kube_job_status_failed

  • kube_job_status_succeeded

  • kube_job_spec_completions

  • kube_job_spec_parallelism

  • kube_job_created

  • kube_job_labels

  • kube_job_annotations

  • kube_job_completion_time

  • kube_job_start_time

CronJob:

  • kube_cronjob_info

  • kube_cronjob_created

  • kube_cronjob_labels

  • kube_cronjob_annotations

  • kube_cronjob_status_active

  • kube_cronjob_spec_suspend

  • kube_cronjob_spec_concurrency_policy

  • kube_cronjob_spec_starting_deadline_seconds

  • kube_cronjob_next_schedule_time

Tracks active/failed/succeeded pods, scheduling, concurrency, suspend state, next run time.


🧭 8. Service Metrics

  • kube_service_info

  • kube_service_labels

  • kube_service_created

  • kube_service_spec_type

  • kube_service_spec_cluster_ip

  • kube_service_spec_external_ips

  • kube_service_status_load_balancer_ingress

Exposes service type (ClusterIP, NodePort, LoadBalancer), IPs, load balancer ingress, labels.


🔗 9. Endpoints / EndpointSlice Metrics

  • kube_endpoint_address_available

  • kube_endpoint_address_not_ready

  • kube_endpoint_labels

  • kube_endpoint_created

For EndpointSlices (v1.19+):

  • kube_endpointslice_address_available

  • kube_endpointslice_address_not_ready

  • kube_endpointslice_labels

Tracks ready vs not-ready backend addresses per service.


📦 10. PersistentVolume (PV) & PersistentVolumeClaim (PVC) Metrics

PV:

  • kube_persistentvolume_info

  • kube_persistentvolume_status_phase

  • kube_persistentvolume_capacity_bytes

  • kube_persistentvolume_created

  • kube_persistentvolume_labels

  • kube_persistentvolume_annotations

  • kube_persistentvolume_access_mode

PVC:

  • kube_persistentvolumeclaim_info

  • kube_persistentvolumeclaim_status_phase

  • kube_persistentvolumeclaim_resource_requests_storage_bytes

  • kube_persistentvolumeclaim_labels

  • kube_persistentvolumeclaim_annotations

  • kube_persistentvolumeclaim_access_mode

  • kube_persistentvolumeclaim_created

Tracks storage capacity, binding status, access modes, phase (Bound/Pending), and labels.


🏷️ 11. ConfigMap & Secret Metrics

ConfigMap:

  • kube_configmap_info

  • kube_configmap_created

  • kube_configmap_metadata_resource_version

Secret:

  • kube_secret_info

  • kube_secret_type

  • kube_secret_created

  • kube_secret_metadata_resource_version

Mostly metadata: creation time, type, resource version, labels.


🧑‍💼 12. Namespace Metrics

  • kube_namespace_status_phase

  • kube_namespace_created

  • kube_namespace_labels

  • kube_namespace_annotations

Tracks namespace lifecycle: Active/Terminating.


👥 13. HorizontalPodAutoscaler (HPA) Metrics

  • kube_horizontalpodautoscaler_spec_min_replicas

  • kube_horizontalpodautoscaler_spec_max_replicas

  • kube_horizontalpodautoscaler_status_desired_replicas

  • kube_horizontalpodautoscaler_status_current_replicas

  • kube_horizontalpodautoscaler_status_condition

  • kube_horizontalpodautoscaler_metadata_generation

  • kube_horizontalpodautoscaler_labels

  • kube_horizontalpodautoscaler_annotations

  • kube_horizontalpodautoscaler_created

Useful for autoscaling monitoring: desired/current replicas, min/max bounds, condition status.


🧩 14. LimitRange Metrics

  • kube_limitrange_created

  • kube_limitrange_metadata_resource_version

Very minimal — mostly creation timestamp and resource version.


🧮 15. ResourceQuota Metrics

  • kube_resourcequota_created

  • kube_resourcequota_labels

  • kube_resourcequota_annotations

  • kube_resourcequota_spec_hard

  • kube_resourcequota_status_used

Tracks hard limits and current usage for CPU, memory, pods, etc., per namespace.


🛡️ 16. NetworkPolicy Metrics

  • kube_networkpolicy_info

  • kube_networkpolicy_created

  • kube_networkpolicy_labels

Basic metadata and creation time.


🧱 17. Ingress & IngressClass Metrics

Ingress:

  • kube_ingress_info

  • kube_ingress_created

  • kube_ingress_labels

  • kube_ingress_metadata_resource_version

  • kube_ingress_path

IngressClass:

  • kube_ingressclass_info

  • kube_ingressclass_created

  • kube_ingressclass_labels

Tracks host/path rules, class, controller, TLS, and metadata.


⚙️ 18. CustomResourceDefinition (CRD) Metrics

  • kube_customresourcedefinition_info

  • kube_customresourcedefinition_created

  • kube_customresourcedefinition_metadata_generation

  • kube_customresourcedefinition_status_condition

Monitors CRD status, conditions, versions, and group/kind info.


🧬 19. MutatingWebhookConfiguration & ValidatingWebhookConfiguration

  • kube_mutatingwebhookconfiguration_info

  • kube_mutatingwebhookconfiguration_created

  • kube_validatingwebhookconfiguration_info

  • kube_validatingwebhookconfiguration_created

Metadata and creation timestamps for admission webhooks.


🧑‍🔧 20. Role, RoleBinding, ClusterRole, ClusterRoleBinding

  • kube_role_created, kube_rolebinding_created

  • kube_clusterrole_created, kube_clusterrolebinding_created

  • kube_role_labels, etc.

Creation timestamps and labels — useful for auditing RBAC changes.


📈 21. PriorityClass Metrics

  • kube_priorityclass_info

  • kube_priorityclass_preemption_policy

  • kube_priorityclass_value

Exposes priority values and preemption policies.


🧪 22. RuntimeClass Metrics (if enabled)

  • kube_runtimeclass_info

  • kube_runtimeclass_created

  • kube_runtimeclass_handler

Info about configured runtime handlers (e.g., gvisor, kata-containers).


✅ 23. CertificateSigningRequest (CSR)

  • kube_certificate_signing_request_info

  • kube_certificate_signing_request_conditions

  • kube_certificate_signing_request_created

Tracks CSR approval/denial status and signer name.


🌐 24. Lease Metrics (for HA control plane monitoring)

  • kube_lease_owner

  • kube_lease_holder_identity

  • kube_lease_acquire_time

  • kube_lease_renew_time

  • kube_lease_leader_transitions_total

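Useful for monitoring leader election and liveness of control-plane components that coordinate through Lease objects (e.g., kube-controller-manager, kube-scheduler). etcd's own leader election does not use Leases.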


🧾 25. Event Metrics (optional — disabled by default due to high cardinality)

  • kube_event_info

  • kube_event_created

  • kube_event_type

  • kube_event_reason

  • kube_event_count

Must be explicitly enabled via flags (--collectors=events). Useful for auditing but can generate massive metric volume.


🧰 Optional Collectors (May Require Explicit Enablement)

Some collectors are disabled by default or require feature gates:

  • events → High-cardinality, off by default

  • certificatesigningrequests → Available since v1.9+

  • leases → Available since v1.14+

  • runtimeclasses → Available since v1.19+

In v2.x, resources are enabled or disabled with the --resources CLI flag (it was named --collectors in v1.x):

--resources=<comma-separated-list>

Example:

--resources=deployments,pods,nodes,jobs,cronjobs,horizontalpodautoscalers

📌 Notes

  • All metrics are prefixed with kube_.

  • Most metrics include labels like namespace, name, node, container, phase, condition, etc.

  • KSM exposes gauges and counters — no histograms or summaries.

  • Metrics reflect desired state and observed state — useful for detecting drift or misconfigurations.

  • Version compatibility: Always check the kube-state-metrics releases page for Kubernetes version support.


📚 Official Documentation

  • GitHub: https://github.com/kubernetes/kube-state-metrics

  • Metrics Documentation: https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md

  • Each resource has its own doc (e.g., pod-metrics.md)


Summary: Kube-State-Metrics exposes rich, object-level state metrics across nearly all core Kubernetes resources — ideal for alerting, dashboards, and auditing object lifecycle and configuration drift.


install

Prerequisites

  1. EKS Cluster – Running and accessible via kubectl.

  2. kubectl – Configured to talk to your EKS cluster.

  3. Helm 3+ – Installed locally (see the Install Helm guide).

  4. (Optional) Prometheus & Grafana – If you want to scrape and visualize KSM metrics.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-state-metrics prometheus-community/kube-state-metrics \
  --namespace monitoring \
  --create-namespace \
  -f values.yaml

# values.yaml (passed with -f above)
# The namespace comes from --namespace on the helm command, so it is not set here.

# Resource limits
resources:
  requests:
    memory: 256Mi
    cpu: 100m
  limits:
    memory: 512Mi
    cpu: 200m

# Enable only required collectors to reduce metric volume
collectors:
  - namespaces
  - pods
  - deployments
  - daemonsets
  - statefulsets
  - jobs
  - cronjobs
  - replicasets
  - nodes
  - horizontalpodautoscalers
  - persistentvolumeclaims
  - services
  # Add others as needed

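# ServiceMonitor for Prometheus Operator (if you use it).
# In this chart the ServiceMonitor settings sit under prometheus.monitor;
# confirm key names with: helm show values prometheus-community/kube-state-metrics
prometheus:
  monitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    additionalLabels:
      release: prometheus-stack  # adjust to match your Prometheus selector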

# RBAC (usually auto-created, but explicit is safe)
rbac:
  create: true

# Tolerations if you have dedicated monitoring node group
# tolerations:
#   - key: "dedicated"
#     operator: "Equal"
#     value: "monitoring"
#     effect: "NoSchedule"

# NodeSelector if you want to pin to specific nodes
# nodeSelector:
#   role: monitoring

# Security context (optional)
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
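
After the install, one way to confirm that node conditions are exported is to read the `/metrics` endpoint and pull out the Ready status per node. The parser below is a minimal sketch for that check. It handles only simple label values (no escaped quotes or braces), the function name is made up for illustration, and the sample text is hand-written.

```python
import re

# Matches lines such as:
# kube_node_status_condition{node="n1",condition="Ready",status="true"} 1
LINE = re.compile(r'^kube_node_status_condition\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def ready_status(metrics_text: str) -> dict[str, str]:
    """Map node name -> current Ready status ("true", "false" or "unknown")."""
    result = {}
    for line in metrics_text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL.findall(m.group("labels")))
        # Exactly one status series per node and condition carries the value 1.
        if labels.get("condition") == "Ready" and float(m.group("value")) == 1:
            result[labels["node"]] = labels["status"]
    return result


# Hand-written sample in the Prometheus text exposition format.
SAMPLE = """\
# TYPE kube_node_status_condition gauge
kube_node_status_condition{node="node-a",condition="Ready",status="true"} 1
kube_node_status_condition{node="node-a",condition="Ready",status="false"} 0
kube_node_status_condition{node="node-a",condition="Ready",status="unknown"} 0
kube_node_status_condition{node="node-b",condition="Ready",status="true"} 0
kube_node_status_condition{node="node-b",condition="Ready",status="false"} 1
kube_node_status_condition{node="node-b",condition="MemoryPressure",status="false"} 1
"""

if __name__ == "__main__":
    print(ready_status(SAMPLE))
```

To feed it real data, port-forward the kube-state-metrics Service and fetch `/metrics` from the forwarded port. The Service name and port depend on the release, so check them with `kubectl -n monitoring get svc` first.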
