metrics
alert when node is in not-ready state
Kubernetes Events + Prometheus + Alertmanager
how
Good question 👍 — for alerting when an EKS node goes into NotReady state, you generally rely on monitoring + alerting pipelines.
Here are common approaches:
🔹 1. Using Kubernetes Events + Prometheus + Alertmanager
Step 1: Enable kube-state-metrics (if not already).
This exposes node conditions as Prometheus metrics.
Step 2: Metric to Watch
kube_node_status_condition{condition="Ready",status="true"} has the value 1 while the node is healthy.
When the node becomes NotReady, this series drops to 0 and the status="false" (or status="unknown") series becomes 1.
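To make the metric semantics concrete, here is a minimal Python sketch that reads a few lines in the Prometheus text format and lists nodes whose Ready/status="true" series is 0. The node names and sample values are made up for illustration, and the tiny regex parser only handles simple lines like these; it is not a full exposition-format parser.

```python
import re

# Sample /metrics lines as kube-state-metrics might serve them.
# Node names and values are hypothetical.
SAMPLE = """\
kube_node_status_condition{node="ip-10-0-1-10",condition="Ready",status="true"} 1
kube_node_status_condition{node="ip-10-0-1-10",condition="Ready",status="false"} 0
kube_node_status_condition{node="ip-10-0-1-10",condition="Ready",status="unknown"} 0
kube_node_status_condition{node="ip-10-0-2-20",condition="Ready",status="true"} 0
kube_node_status_condition{node="ip-10-0-2-20",condition="Ready",status="false"} 0
kube_node_status_condition{node="ip-10-0-2-20",condition="Ready",status="unknown"} 1
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs


def not_ready_nodes(text):
    """Return nodes whose Ready/status="true" series has the value 0."""
    nodes = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group(1) != "kube_node_status_condition":
            continue
        labels = dict(LABEL.findall(match.group(2)))
        value = float(match.group(3))
        if (labels.get("condition") == "Ready"
                and labels.get("status") == "true"
                and value == 0):
            nodes.append(labels["node"])
    return nodes


print(not_ready_nodes(SAMPLE))  # ['ip-10-0-2-20']
```

Checking status="true" == 0 rather than status="false" == 1 also catches nodes whose Ready condition is "unknown", which is what happens when the kubelet stops reporting.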
Step 3: Alert Rule (Prometheus)
- alert: K8sNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is NotReady"
    description: "EKS node {{ $labels.node }} has been in NotReady state for more than 2 minutes."
Step 4: Route via Alertmanager → Slack, Email, PagerDuty, OpsGenie, etc.
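If you route to a generic webhook receiver instead of a built-in integration, Alertmanager posts a JSON document with an "alerts" array. The sketch below shows one way a receiver could turn that payload into readable lines. The sample payload is hand-written and trimmed to the fields used here, and format_alerts is a hypothetical helper, not part of Alertmanager.

```python
def format_alerts(payload):
    """Turn an Alertmanager-style webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        status = alert.get("status", "unknown").upper()
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Fall back to the alert name if the rule has no summary annotation.
        summary = annotations.get("summary", labels.get("alertname", "alert"))
        severity = labels.get("severity", "none")
        lines.append(f"[{status}] ({severity}) {summary}")
    return lines


# Hand-written example payload (assumption: trimmed to the fields used above).
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "K8sNodeNotReady",
                "node": "ip-10-0-2-20",
                "severity": "critical",
            },
            "annotations": {"summary": "Node ip-10-0-2-20 is NotReady"},
        }
    ],
}

for line in format_alerts(SAMPLE_PAYLOAD):
    print(line)  # [FIRING] (critical) Node ip-10-0-2-20 is NotReady
```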
🔹 2. Using CloudWatch + CloudWatch Alarms (AWS Native)
EKS worker nodes are EC2 instances, so you can use AWS monitoring:
Step 1: Install CloudWatch Container Insights for EKS.
Step 2: The CloudWatch agent collects node health metrics.
Step 3: Create a CloudWatch Alarm on the node Ready status (via Container Insights).
Step 4: SNS Notification → Email/SMS/Slack.
🔹 3. Using Kubernetes Native Tooling (kubectl + CronJob)
Run a small script every 2-5 minutes:
kubectl get nodes | grep NotReady
If match found → send Slack/email via webhook.
Can be run as a Kubernetes CronJob.
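As a rough sketch of what that script could look like, the Python below parses the text printed by kubectl get nodes and builds a Slack-style {"text": ...} webhook body. The sample output, node names, and the notify helper are illustrative assumptions; in a real CronJob you would feed in the live command output and a webhook URL of your own.

```python
import json
import urllib.request

# Hypothetical `kubectl get nodes` output; names and versions are made up.
SAMPLE_OUTPUT = """\
NAME                        STATUS                     ROLES    AGE   VERSION
ip-10-0-1-10.ec2.internal   Ready                      <none>   12d   v1.29.0
ip-10-0-2-20.ec2.internal   NotReady                   <none>   12d   v1.29.0
ip-10-0-3-30.ec2.internal   Ready,SchedulingDisabled   <none>   3d    v1.29.0
"""


def find_not_ready(kubectl_output):
    """Return node names whose STATUS column contains the NotReady token."""
    not_ready = []
    for line in kubectl_output.splitlines():
        cols = line.split()
        if len(cols) < 2 or cols[0] == "NAME":
            continue  # skip blank lines and the header row
        # STATUS can be a comma-separated list, e.g. Ready,SchedulingDisabled.
        if "NotReady" in cols[1].split(","):
            not_ready.append(cols[0])
    return not_ready


def build_payload(nodes):
    """Build a JSON body in the {"text": ...} shape used by Slack webhooks."""
    return json.dumps({"text": "NotReady nodes: " + ", ".join(nodes)}).encode("utf-8")


def notify(webhook_url, nodes):
    """POST the payload to webhook_url (hypothetical; not called here)."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(nodes),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


print(find_not_ready(SAMPLE_OUTPUT))  # ['ip-10-0-2-20.ec2.internal']
```

Matching on the exact status token avoids accidental hits if a node name happens to contain the string "NotReady", which a plain grep would also match.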
🔹 4. Using Datadog / NewRelic / Grafana Cloud
Most 3rd-party monitoring platforms already provide EKS node health dashboards + alerts.
Just enable node monitoring and configure “Node NotReady” alert rule.
✅ Best Practice (what most SRE teams do):
Use Prometheus + Alertmanager (fine-grained alerts).
OR CloudWatch Container Insights if you want fully AWS-managed.
Set for: 2–5m in alerts → avoids false positives due to short node restarts.
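To see why the for: window filters out short restarts, here is a small simulation of the pending → firing behaviour. It is a simplified model of how a for: duration works, not Prometheus code; the 30-second evaluation interval and the sample timeline are assumptions.

```python
def alert_states(samples, for_seconds):
    """Return the alert state after each evaluation.

    samples: list of (timestamp_seconds, condition_is_true) in time order.
    The alert is "pending" while the condition holds but has not yet held
    for for_seconds, "firing" once it has, and "inactive" otherwise.
    """
    states = []
    active_since = None
    for timestamp, condition in samples:
        if not condition:
            active_since = None  # any recovery resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# One evaluation every 30s: a 60s blip, then a sustained outage.
SAMPLES = [
    (0, False), (30, True), (60, True), (90, False),
    (120, True), (150, True), (180, True), (210, True), (240, True),
]

print(alert_states(SAMPLES, for_seconds=120))
```

With for: 2m, the blip at 30-60s never leaves the pending state, while the outage that starts at 120s fires at 240s.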
👉 Pavan, do you want me to write a step-by-step Prometheus + Alertmanager setup for EKS NodeNotReady alert (with kube-state-metrics YAML + alert rules), so you can directly apply it to your cluster?
metrics available by Kube-state metrics
Kube-State-Metrics (KSM) is a service that listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects — such as Deployments, Nodes, Pods, etc. — in Prometheus format. It does not focus on the health of the individual Kubernetes components, but rather on the state of the Kubernetes objects themselves.
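As of kube-state-metrics v2.x, here's a list of the metric families exposed, grouped by Kubernetes resource type. Exact names and availability differ between releases, so confirm a metric against the per-resource docs before building alerts on it: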
📊 1. Node Metrics
kube_node_info
kube_node_status_capacity
kube_node_status_allocatable
kube_node_status_condition
kube_node_spec_unschedulable
kube_node_status_phase
kube_node_labels
kube_node_role
kube_node_created
kube_node_spec_taint
Tracks node info, capacity, conditions (Ready, MemoryPressure, etc.), labels, taints, roles, and creation time.
🐳 2. Pod Metrics
kube_pod_info
kube_pod_owner
kube_pod_status_phase
kube_pod_container_info
kube_pod_container_status_waiting
kube_pod_container_status_running
kube_pod_container_status_terminated
kube_pod_container_status_last_terminated_reason
kube_pod_container_status_ready
kube_pod_container_status_restarts_total
kube_pod_container_resource_requests
kube_pod_container_resource_limits
kube_pod_start_time
kube_pod_completion_time
kube_pod_created
kube_pod_deletion_timestamp
kube_pod_status_scheduled
kube_pod_status_ready
kube_pod_status_unschedulable
kube_pod_status_scheduled_time
kube_pod_overhead
kube_pod_restart_policy
kube_pod_runtime_class_name
Includes pod lifecycle, container states, resource requests/limits, restarts, ownership, scheduling status, and more.
🧩 3. Deployment Metrics
kube_deployment_spec_replicas
kube_deployment_status_replicas
kube_deployment_status_replicas_available
kube_deployment_status_replicas_unavailable
kube_deployment_status_replicas_updated
kube_deployment_status_observed_generation
kube_deployment_metadata_generation
kube_deployment_created
kube_deployment_labels
kube_deployment_annotations
kube_deployment_spec_paused
kube_deployment_spec_strategy_rollingupdate_max_unavailable
kube_deployment_spec_strategy_rollingupdate_max_surge
Tracks desired vs current replicas, availability, update status, generation, labels, annotations, and rollout strategy.
🔄 4. ReplicaSet Metrics
kube_replicaset_owner
kube_replicaset_status_replicas
kube_replicaset_status_fully_labeled_replicas
kube_replicaset_status_ready_replicas
kube_replicaset_status_observed_generation
kube_replicaset_metadata_generation
kube_replicaset_spec_replicas
kube_replicaset_created
kube_replicaset_labels
Useful for tracking ReplicaSets owned by Deployments or other controllers.
🧵 5. StatefulSet Metrics
kube_statefulset_replicas
kube_statefulset_status_replicas
kube_statefulset_status_replicas_current
kube_statefulset_status_replicas_ready
kube_statefulset_status_replicas_updated
kube_statefulset_status_observed_generation
kube_statefulset_metadata_generation
kube_statefulset_created
kube_statefulset_labels
kube_statefulset_update_strategy
Similar to Deployment metrics but specific to StatefulSets.
🎯 6. DaemonSet Metrics
kube_daemonset_status_current_number_scheduled
kube_daemonset_status_desired_number_scheduled
kube_daemonset_status_number_misscheduled
kube_daemonset_status_number_ready
kube_daemonset_status_number_available
kube_daemonset_status_number_unavailable
kube_daemonset_status_updated_number_scheduled
kube_daemonset_metadata_generation
kube_daemonset_status_observed_generation
kube_daemonset_created
kube_daemonset_labels
kube_daemonset_update_strategy
Tracks scheduling status across nodes, readiness, updates, and generations.
🚀 7. Job & CronJob Metrics
Job:
kube_job_info
kube_job_owner
kube_job_status_active
kube_job_status_failed
kube_job_status_succeeded
kube_job_spec_completions
kube_job_spec_parallelism
kube_job_created
kube_job_labels
kube_job_annotations
kube_job_completion_time
kube_job_start_time
CronJob:
kube_cronjob_info
kube_cronjob_created
kube_cronjob_labels
kube_cronjob_annotations
kube_cronjob_status_active
kube_cronjob_spec_suspend
kube_cronjob_spec_concurrency_policy
kube_cronjob_spec_starting_deadline_seconds
kube_cronjob_next_schedule_time
Tracks active/failed/succeeded pods, scheduling, concurrency, suspend state, next run time.
🧭 8. Service Metrics
kube_service_info
kube_service_labels
kube_service_created
kube_service_spec_type
kube_service_spec_cluster_ip
kube_service_spec_external_ips
kube_service_status_load_balancer_ingress
Exposes service type (ClusterIP, NodePort, LoadBalancer), IPs, load balancer ingress, labels.
🔗 9. Endpoints / EndpointSlice Metrics
kube_endpoint_address_available
kube_endpoint_address_not_ready
kube_endpoint_labels
kube_endpoint_created
For EndpointSlices (v1.19+):
kube_endpointslice_address_available
kube_endpointslice_address_not_ready
kube_endpointslice_labels
Tracks ready vs not-ready backend addresses per service.
📦 10. PersistentVolume (PV) & PersistentVolumeClaim (PVC) Metrics
PV:
kube_persistentvolume_info
kube_persistentvolume_status_phase
kube_persistentvolume_capacity_bytes
kube_persistentvolume_created
kube_persistentvolume_labels
kube_persistentvolume_annotations
kube_persistentvolume_access_mode
PVC:
kube_persistentvolumeclaim_info
kube_persistentvolumeclaim_status_phase
kube_persistentvolumeclaim_resource_requests_storage_bytes
kube_persistentvolumeclaim_labels
kube_persistentvolumeclaim_annotations
kube_persistentvolumeclaim_access_mode
kube_persistentvolumeclaim_created
Tracks storage capacity, binding status, access modes, phase (Bound/Pending), and labels.
🏷️ 11. ConfigMap & Secret Metrics
ConfigMap:
kube_configmap_info
kube_configmap_created
kube_configmap_metadata_resource_version
Secret:
kube_secret_info
kube_secret_type
kube_secret_created
kube_secret_metadata_resource_version
Mostly metadata: creation time, type, resource version, labels.
🧑💼 12. Namespace Metrics
kube_namespace_status_phase
kube_namespace_created
kube_namespace_labels
kube_namespace_annotations
Tracks namespace lifecycle: Active/Terminating.
👥 13. HorizontalPodAutoscaler (HPA) Metrics
kube_horizontalpodautoscaler_spec_min_replicas
kube_horizontalpodautoscaler_spec_max_replicas
kube_horizontalpodautoscaler_status_desired_replicas
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_status_conditions
kube_horizontalpodautoscaler_metadata_generation
kube_horizontalpodautoscaler_labels
kube_horizontalpodautoscaler_annotations
kube_horizontalpodautoscaler_created
Useful for autoscaling monitoring: desired/current replicas, min/max bounds, condition status.
🧩 14. LimitRange Metrics
kube_limitrange_created
kube_limitrange_metadata_resource_version
Very minimal — mostly creation timestamp and resource version.
🧮 15. ResourceQuota Metrics
kube_resourcequota_created
kube_resourcequota_labels
kube_resourcequota_annotations
kube_resourcequota_spec_hard
kube_resourcequota_status_used
Tracks hard limits and current usage for CPU, memory, pods, etc., per namespace.
🛡️ 16. NetworkPolicy Metrics
kube_networkpolicy_info
kube_networkpolicy_created
kube_networkpolicy_labels
Basic metadata and creation time.
🧱 17. Ingress & IngressClass Metrics
Ingress:
kube_ingress_info
kube_ingress_created
kube_ingress_labels
kube_ingress_metadata_resource_version
kube_ingress_path
IngressClass:
kube_ingressclass_info
kube_ingressclass_created
kube_ingressclass_labels
Tracks host/path rules, class, controller, TLS, and metadata.
⚙️ 18. CustomResourceDefinition (CRD) Metrics
kube_customresourcedefinition_info
kube_customresourcedefinition_created
kube_customresourcedefinition_metadata_generation
kube_customresourcedefinition_status_condition
Monitors CRD status, conditions, versions, and group/kind info.
🧬 19. MutatingWebhookConfiguration & ValidatingWebhookConfiguration
kube_mutatingwebhookconfiguration_info
kube_mutatingwebhookconfiguration_created
kube_validatingwebhookconfiguration_info
kube_validatingwebhookconfiguration_created
Metadata and creation timestamps for admission webhooks.
🧑🔧 20. Role, RoleBinding, ClusterRole, ClusterRoleBinding
kube_role_created, kube_rolebinding_created
kube_clusterrole_created, kube_clusterrolebinding_created
kube_role_labels, etc.
Creation timestamps and labels — useful for auditing RBAC changes.
📈 21. PriorityClass Metrics
kube_priorityclass_info
kube_priorityclass_preemption_policy
kube_priorityclass_value
Exposes priority values and preemption policies.
🧪 22. RuntimeClass Metrics (if enabled)
kube_runtimeclass_info
kube_runtimeclass_created
kube_runtimeclass_handler
Info about configured runtime handlers (e.g., gvisor, kata-containers).
✅ 23. CertificateSigningRequest (CSR)
kube_certificate_signing_request_info
kube_certificate_signing_request_conditions
kube_certificate_signing_request_created
Tracks CSR approval/denial status and signer name.
🌐 24. Lease Metrics (for HA control plane monitoring)
kube_lease_owner
kube_lease_holder_identity
kube_lease_acquire_time
kube_lease_renew_time
kube_lease_leader_transitions_total
Useful for monitoring leader election and liveness of control-plane components (e.g., kube-controller-manager, kube-scheduler). etcd elects its own leader through Raft and does not use Lease objects.
🧾 25. Event Metrics (optional — disabled by default due to high cardinality)
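kube_event_info
kube_event_created
kube_event_type
kube_event_reason
kube_event_count
Must be explicitly enabled via flags (--collectors=events, per this list; confirm the resource exists in your kube-state-metrics release before relying on it). Useful for auditing but can generate massive metric volume.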
🧰 Optional Collectors (May Require Explicit Enablement)
Some collectors are disabled by default or require feature gates:
events → High-cardinality, off by default
certificatesigningrequests → Available since v1.9+
leases → Available since v1.14+
runtimeclasses → Available since v1.19+
You can enable/disable collectors via CLI flags:
--resources=<comma-separated-list> (the flag was named --collectors in kube-state-metrics v1.x)
Example:
--resources=deployments,pods,nodes,jobs,cronjobs,horizontalpodautoscalers
📌 Notes
All metrics are prefixed with kube_.
Most metrics include labels like namespace, name, node, container, phase, condition, etc.
KSM exposes gauges and counters; no histograms or summaries.
Metrics reflect desired state and observed state — useful for detecting drift or misconfigurations.
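As an illustration of drift detection, the sketch below compares desired replicas against available replicas per Deployment. The two dictionaries stand in for values you would read from kube_deployment_spec_replicas and kube_deployment_status_replicas_available; the namespaces, names, and numbers are invented.

```python
def deployments_with_drift(desired, available):
    """Return {(namespace, deployment): (desired, available)} where they differ.

    A Deployment missing from `available` is treated as having 0 replicas.
    """
    drift = {}
    for key, want in desired.items():
        have = available.get(key, 0)
        if have != want:
            drift[key] = (want, have)
    return drift


# Hypothetical samples keyed by (namespace, deployment).
DESIRED = {("default", "web"): 3, ("default", "api"): 2}
AVAILABLE = {("default", "web"): 3, ("default", "api"): 1}

print(deployments_with_drift(DESIRED, AVAILABLE))  # {('default', 'api'): (2, 1)}
```

In PromQL the same comparison is usually written as one expression over the two metrics, combined with a for: window so that normal rollouts do not trigger it.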
Version compatibility: Always check kube-state-metrics releases for Kubernetes version support.
📚 Official Documentation
GitHub: https://github.com/kubernetes/kube-state-metrics
Metrics Documentation: https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md
Each resource has its own doc (e.g., pod-metrics.md)
✅ Summary: Kube-State-Metrics exposes rich, object-level state metrics across nearly all core Kubernetes resources — ideal for alerting, dashboards, and auditing object lifecycle and configuration drift.
Let me know if you want sample PromQL queries or Grafana dashboard recommendations for any of these!
install
Prerequisites
EKS Cluster – Running and accessible via kubectl.
kubectl – Configured to talk to your EKS cluster.
Helm 3+ – Installed locally.
(Optional) Prometheus & Grafana – If you want to scrape and visualize KSM metrics.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-state-metrics prometheus-community/kube-state-metrics \
  --namespace monitoring \
  --create-namespace \
  -f values.yaml
values.yaml:
# Deploy into monitoring namespace (recommended)
namespaceOverride: monitoring
# Resource limits
resources:
  requests:
    memory: 256Mi
    cpu: 100m
  limits:
    memory: 512Mi
    cpu: 200m
# Enable only required collectors to reduce metric volume
collectors:
  - namespaces
  - pods
  - deployments
  - daemonsets
  - statefulsets
  - jobs
  - cronjobs
  - replicasets
  - nodes
  - horizontalpodautoscalers
  - persistentvolumeclaims
  - services
  # Add others as needed
# ServiceMonitor for Prometheus Operator (if you use it).
# Recent prometheus-community chart versions keep this under prometheus.monitor;
# confirm key names with: helm show values prometheus-community/kube-state-metrics
prometheus:
  monitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    additionalLabels:
      release: prometheus-stack # adjust to match your Prometheus selector
# RBAC (usually auto-created, but explicit is safe)
rbac:
  create: true
# Tolerations if you have dedicated monitoring node group
# tolerations:
#   - key: "dedicated"
#     operator: "Equal"
#     value: "monitoring"
#     effect: "NoSchedule"
# NodeSelector if you want to pin to specific nodes
# nodeSelector:
#   role: monitoring
# Security context (optional)
securityContext:
  runAsNonRoot: true
  runAsUser: 65534