metrics

alert when node is in not-ready state

Kubernetes Events + Prometheus + Alertmanager

how

To alert when an EKS node goes into the NotReady state, you generally rely on a monitoring + alerting pipeline.

Here are common approaches:

🔹 1. Using Kubernetes Events + Prometheus + Alertmanager

Step 1: Enable kube-state-metrics (if not already).
- This exposes node conditions as Prometheus metrics.
Step 2: Metric to Watch
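- kube_node_status_condition{condition="Ready",status="true"} has value 1 while the node is healthy.
- When the node is NotReady, that series drops to 0 and either the status="false" or the status="unknown" series (kubelet stopped reporting) becomes 1. Alerting on status="true" == 0 covers both cases.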
Step 3: Alert Rule (Prometheus)

- alert: K8sNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is NotReady"
    description: "EKS node {{ $labels.node }} has been in NotReady state for more than 2 minutes."

Step 4: Route via Alertmanager → Slack, Email, PagerDuty, OpsGenie, etc.
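
To sanity-check the rule before wiring up Alertmanager, you can run the same expression against the Prometheus HTTP API (GET /api/v1/query) and look at which series have value 0. The sketch below only parses an already-fetched JSON response; the node names in the sample are made up for illustration.

```python
def not_ready_nodes(query_response):
    """Return node names whose Ready/status="true" series has value 0.

    Expects the JSON body of an instant query for
    kube_node_status_condition{condition="Ready",status="true"}.
    """
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    nodes = []
    for series in query_response["data"]["result"]:
        _timestamp, value = series["value"]  # the API returns the value as a string
        if float(value) == 0:
            nodes.append(series["metric"].get("node", "<unknown>"))
    return sorted(nodes)


# Hypothetical response with one healthy and one NotReady node.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"node": "ip-10-0-1-23", "condition": "Ready", "status": "true"},
             "value": [1700000000.0, "1"]},
            {"metric": {"node": "ip-10-0-2-45", "condition": "Ready", "status": "true"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

print(not_ready_nodes(sample_response))
```

Any node printed here is one the alert rule would put into pending state on its next evaluation.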

🔹 2. Using CloudWatch + CloudWatch Alarms (AWS Native)

EKS worker nodes are EC2 instances, so you can use AWS monitoring:

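Step 1: Install CloudWatch Container Insights for EKS.
Step 2: The CloudWatch agent (deployed as a DaemonSet) collects node and cluster metrics.
Step 3: Create a CloudWatch Alarm on a node health metric published by Container Insights (for example, cluster_failed_node_count).
Step 4: SNS Notification → Email/SMS. Slack usually needs a Lambda function or AWS Chatbot subscribed to the SNS topic.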
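
A rough sketch of the alarm definition, assuming the ContainerInsights namespace, the cluster_failed_node_count metric, and the ClusterName dimension are available in your account (check the metric list in the CloudWatch console first). The cluster name and SNS topic ARN below are placeholders. The function only builds the parameters, so it runs without AWS credentials.

```python
def node_alarm_params(cluster_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Fires when at least one failed node is reported for two
    consecutive 60-second periods.
    """
    return {
        "AlarmName": f"{cluster_name}-failed-nodes",
        "Namespace": "ContainerInsights",          # assumed namespace
        "MetricName": "cluster_failed_node_count",  # assumed metric name
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder values for illustration only.
params = node_alarm_params(
    "my-eks-cluster",
    "arn:aws:sns:us-east-1:123456789012:node-alerts",
)
print(params["AlarmName"])
```

With boto3 installed and credentials configured, the dictionary can be passed as boto3.client("cloudwatch").put_metric_alarm(**params).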

🔹 3. Using Kubernetes Native Tooling (kubectl + CronJob)

Run a small script every 2-5 minutes:

kubectl get nodes | grep NotReady

If match found → send Slack/email via webhook.
Can be run as a Kubernetes CronJob (its ServiceAccount needs RBAC permission to list nodes).
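
Grepping the table output is fragile, so here is a minimal sketch of the same check against kubectl get nodes -o json. It assumes kubectl is on the PATH and that the webhook is a Slack incoming webhook accepting a {"text": ...} payload; the function names and sample node names are made up for illustration.

```python
import json
import subprocess
import urllib.request


def not_ready(node_list):
    """Return names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, "False", and "Unknown" all count as NotReady.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def slack_payload(names):
    """Encode a simple Slack incoming-webhook message body."""
    return json.dumps({"text": "NotReady nodes: " + ", ".join(names)}).encode()


def check_and_notify(webhook_url):
    """Query the cluster and post to the webhook only if something is NotReady."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    names = not_ready(json.loads(out))
    if names:
        req = urllib.request.Request(
            webhook_url,
            data=slack_payload(names),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)
    return names


# Hypothetical node list, trimmed to the fields the check reads.
sample_nodes = {"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]}
print(not_ready(sample_nodes))
```

In a CronJob, the container would call check_and_notify with the webhook URL read from a Secret-backed environment variable.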

🔹 4. Using Datadog / NewRelic / Grafana Cloud

Most 3rd-party monitoring platforms already provide EKS node health dashboards + alerts.
Just enable node monitoring and configure “Node NotReady” alert rule.

✅ Best Practice (what most SRE teams do):

Use Prometheus + Alertmanager (fine-grained alerts).
OR CloudWatch Container Insights if you want fully AWS-managed.
Set for: 2–5m in alerts → avoids false positives due to short node restarts.
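
To see why the for: window filters out short restarts, here is a simplified model of how a pending alert turns into a firing one. It is a sketch of the idea, not Prometheus's actual implementation: it assumes evenly spaced rule evaluations and that any healthy evaluation resets the timer.

```python
def first_firing_index(samples, interval_s, for_s):
    """Return the index of the evaluation at which the alert fires, or None.

    samples    -- one boolean per rule evaluation (True = node NotReady)
    interval_s -- seconds between evaluations
    for_s      -- the alert's `for:` duration in seconds
    """
    active_since = None
    for i, breaching in enumerate(samples):
        now = i * interval_s
        if not breaching:
            active_since = None  # condition cleared, pending state resets
            continue
        if active_since is None:
            active_since = now   # alert becomes pending
        if now - active_since >= for_s:
            return i             # condition held for the full window
    return None


# A 60-second blip never fires with for: 2m at a 30s evaluation interval.
print(first_firing_index([True, True, False, False], 30, 120))
# A sustained outage fires once it has lasted 120 seconds.
print(first_firing_index([True] * 6, 30, 120))
```

The trade-off is detection delay: a longer window means fewer false pages but a later alert for a real outage.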


metrics available by Kube-state metrics

Kube-State-Metrics (KSM) is a service that listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects — such as Deployments, Nodes, Pods, etc. — in Prometheus format. It does not focus on the health of the individual Kubernetes components, but rather on the state of the Kubernetes objects themselves.
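
As a rough illustration of what "Prometheus format" means here, the sketch below parses a few exposition lines of the kind KSM serves on its /metrics endpoint. The sample text is hand-written (node names and HELP wording are illustrative), and the parser only handles labelled samples without timestamps, so treat it as a reading aid rather than a full parser.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # sketch: ignore lines this simple pattern cannot read
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def ready_status(samples, node):
    """Return which status ("true", "false", "unknown") is set for a node's Ready condition."""
    for name, labels, value in samples:
        if (name == "kube_node_status_condition" and labels.get("node") == node
                and labels.get("condition") == "Ready" and value == 1):
            return labels["status"]
    return None


exposition = """
# HELP kube_node_status_condition Illustrative help text.
# TYPE kube_node_status_condition gauge
kube_node_status_condition{node="node-a",condition="Ready",status="true"} 1
kube_node_status_condition{node="node-a",condition="Ready",status="false"} 0
kube_node_status_condition{node="node-a",condition="Ready",status="unknown"} 0
kube_node_status_condition{node="node-b",condition="Ready",status="true"} 0
kube_node_status_condition{node="node-b",condition="Ready",status="false"} 0
kube_node_status_condition{node="node-b",condition="Ready",status="unknown"} 1
"""

parsed = parse_samples(exposition)
print(ready_status(parsed, "node-a"), ready_status(parsed, "node-b"))
```

Each condition is exported as three series (one per status), and exactly one of them is 1 at a time, which is why alert rules select a specific status label.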

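As of kube-state-metrics v2.x, here are the main families of metrics exposed, grouped by Kubernetes resource type. Metric names change between releases, so verify any name against the docs for your installed version before alerting on it: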

📊 1. Node Metrics

kube_node_info
kube_node_status_capacity
kube_node_status_allocatable
kube_node_status_condition
kube_node_spec_unschedulable
kube_node_status_phase (v1.x only; no longer listed in the v2.x node metrics docs)
kube_node_labels
kube_node_role
kube_node_created
kube_node_spec_taint

Tracks node info, capacity, conditions (Ready, MemoryPressure, etc.), labels, taints, roles, and creation time.

🐳 2. Pod Metrics

kube_pod_info
kube_pod_owner
kube_pod_status_phase
kube_pod_container_info
kube_pod_container_status_waiting
kube_pod_container_status_running
kube_pod_container_status_terminated
kube_pod_container_status_last_terminated_reason
kube_pod_container_status_ready
kube_pod_container_status_restarts_total
kube_pod_container_resource_requests
kube_pod_container_resource_limits
kube_pod_start_time
kube_pod_completion_time
kube_pod_created
kube_pod_deletion_timestamp
kube_pod_status_scheduled
kube_pod_status_ready
kube_pod_status_unschedulable
kube_pod_status_scheduled_time
kube_pod_overhead_cpu_cores
kube_pod_overhead_memory_bytes
kube_pod_restart_policy
kube_pod_runtimeclass_name_info

Includes pod lifecycle, container states, resource requests/limits, restarts, ownership, scheduling status, and more.

🧩 3. Deployment Metrics

kube_deployment_spec_replicas
kube_deployment_status_replicas
kube_deployment_status_replicas_available
kube_deployment_status_replicas_unavailable
kube_deployment_status_replicas_updated
kube_deployment_status_observed_generation
kube_deployment_metadata_generation
kube_deployment_created
kube_deployment_labels
kube_deployment_annotations
kube_deployment_spec_paused
kube_deployment_spec_strategy_rollingupdate_max_unavailable
kube_deployment_spec_strategy_rollingupdate_max_surge

Tracks desired vs current replicas, availability, update status, generation, labels, annotations, and rollout strategy.
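
A small sketch of how these values combine into a rollout health check. It assumes the five gauge values for one Deployment have already been read from Prometheus; the DeploymentState class and the issue strings are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class DeploymentState:
    spec_replicas: int         # kube_deployment_spec_replicas
    replicas_available: int    # kube_deployment_status_replicas_available
    replicas_updated: int      # kube_deployment_status_replicas_updated
    metadata_generation: int   # kube_deployment_metadata_generation
    observed_generation: int   # kube_deployment_status_observed_generation


def rollout_issues(d):
    """Return human-readable problems; an empty list means the rollout looks settled."""
    issues = []
    if d.observed_generation < d.metadata_generation:
        issues.append("controller has not observed the latest spec")
    if d.replicas_updated < d.spec_replicas:
        issues.append("rollout incomplete")
    if d.replicas_available < d.spec_replicas:
        issues.append("replicas unavailable")
    return issues


healthy = DeploymentState(3, 3, 3, 7, 7)
mid_rollout = DeploymentState(3, 2, 1, 8, 8)
print(rollout_issues(healthy))
print(rollout_issues(mid_rollout))
```

The same comparisons can be written as PromQL alert expressions; the Python form is only meant to make the logic explicit.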

🔄 4. ReplicaSet Metrics

kube_replicaset_owner
kube_replicaset_status_replicas
kube_replicaset_status_fully_labeled_replicas
kube_replicaset_status_ready_replicas
kube_replicaset_status_observed_generation
kube_replicaset_metadata_generation
kube_replicaset_spec_replicas
kube_replicaset_created
kube_replicaset_labels

Useful for tracking ReplicaSets owned by Deployments or other controllers.

🧵 5. StatefulSet Metrics

kube_statefulset_replicas
kube_statefulset_status_replicas
kube_statefulset_status_replicas_current
kube_statefulset_status_replicas_ready
kube_statefulset_status_replicas_updated
kube_statefulset_status_observed_generation
kube_statefulset_metadata_generation
kube_statefulset_created
kube_statefulset_labels
kube_statefulset_update_strategy

Similar to Deployment metrics but specific to StatefulSets.

🎯 6. DaemonSet Metrics

kube_daemonset_status_current_number_scheduled
kube_daemonset_status_desired_number_scheduled
kube_daemonset_status_number_misscheduled
kube_daemonset_status_number_ready
kube_daemonset_status_number_available
kube_daemonset_status_number_unavailable
kube_daemonset_status_updated_number_scheduled
kube_daemonset_metadata_generation
kube_daemonset_status_observed_generation
kube_daemonset_created
kube_daemonset_labels
kube_daemonset_update_strategy

Tracks scheduling status across nodes, readiness, updates, and generations.

🚀 7. Job & CronJob Metrics

Job:

kube_job_info
kube_job_owner
kube_job_status_active
kube_job_status_failed
kube_job_status_succeeded
kube_job_spec_completions
kube_job_spec_parallelism
kube_job_created
kube_job_labels
kube_job_annotations
kube_job_status_completion_time
kube_job_status_start_time

CronJob:

kube_cronjob_info
kube_cronjob_created
kube_cronjob_labels
kube_cronjob_annotations
kube_cronjob_status_active
kube_cronjob_spec_suspend
kube_cronjob_spec_concurrency_policy
kube_cronjob_spec_starting_deadline_seconds
kube_cronjob_next_schedule_time

Tracks active/failed/succeeded pods, scheduling, concurrency, suspend state, next run time.

🧭 8. Service Metrics

kube_service_info
kube_service_labels
kube_service_created
kube_service_spec_type
kube_service_spec_cluster_ip
kube_service_spec_external_ips
kube_service_status_load_balancer_ingress

Exposes service type (ClusterIP, NodePort, LoadBalancer), IPs, load balancer ingress, labels.

🔗 9. Endpoints / EndpointSlice Metrics

kube_endpoint_address_available
kube_endpoint_address_not_ready
kube_endpoint_labels
kube_endpoint_created

For EndpointSlices (v1.19+):

kube_endpointslice_address_available
kube_endpointslice_address_not_ready
kube_endpointslice_labels

Tracks ready vs not-ready backend addresses per service.

📦 10. PersistentVolume (PV) & PersistentVolumeClaim (PVC) Metrics

PV:

kube_persistentvolume_info
kube_persistentvolume_status_phase
kube_persistentvolume_capacity_bytes
kube_persistentvolume_created
kube_persistentvolume_labels
kube_persistentvolume_annotations
kube_persistentvolume_access_mode

PVC:

kube_persistentvolumeclaim_info
kube_persistentvolumeclaim_status_phase
kube_persistentvolumeclaim_resource_requests_storage_bytes
kube_persistentvolumeclaim_labels
kube_persistentvolumeclaim_annotations
kube_persistentvolumeclaim_access_mode
kube_persistentvolumeclaim_created

Tracks storage capacity, binding status, access modes, phase (Bound/Pending), and labels.

🏷️ 11. ConfigMap & Secret Metrics

ConfigMap:

kube_configmap_info
kube_configmap_created
kube_configmap_metadata_resource_version

Secret:

kube_secret_info
kube_secret_type
kube_secret_created
kube_secret_metadata_resource_version

Mostly metadata: creation time, type, resource version, labels.

🧑‍💼 12. Namespace Metrics

kube_namespace_status_phase
kube_namespace_created
kube_namespace_labels
kube_namespace_annotations

Tracks namespace lifecycle: Active/Terminating.

👥 13. HorizontalPodAutoscaler (HPA) Metrics

kube_horizontalpodautoscaler_spec_min_replicas
kube_horizontalpodautoscaler_spec_max_replicas
kube_horizontalpodautoscaler_status_desired_replicas
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_status_condition
kube_horizontalpodautoscaler_metadata_generation
kube_horizontalpodautoscaler_labels
kube_horizontalpodautoscaler_annotations
kube_horizontalpodautoscaler_created

Useful for autoscaling monitoring: desired/current replicas, min/max bounds, condition status.

🧩 14. LimitRange Metrics

kube_limitrange_created
kube_limitrange_metadata_resource_version

Very minimal — mostly creation timestamp and resource version.

🧮 15. ResourceQuota Metrics

kube_resourcequota_created
kube_resourcequota_labels
kube_resourcequota_annotations
kube_resourcequota (labels: resource, type="hard" or type="used")

Tracks hard limits and current usage for CPU, memory, pods, etc., per namespace.
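
To turn hard limits and current usage into something alertable, a usage ratio per namespace and resource is usually enough. The sketch assumes the values have already been collected as (namespace, resource, kind, value) tuples, where kind is "hard" or "used"; the namespaces and numbers are made up.

```python
def quota_usage(samples):
    """Return {(namespace, resource): used / hard} for every quota with a non-zero limit."""
    hard, used = {}, {}
    for namespace, resource, kind, value in samples:
        target = hard if kind == "hard" else used
        target[(namespace, resource)] = value
    return {
        key: used.get(key, 0.0) / limit
        for key, limit in hard.items()
        if limit > 0
    }


# Hypothetical quota readings.
quota_samples = [
    ("team-a", "pods", "hard", 10.0),
    ("team-a", "pods", "used", 5.0),
    ("team-b", "requests.cpu", "hard", 8.0),
    ("team-b", "requests.cpu", "used", 8.0),
]

ratios = quota_usage(quota_samples)
nearly_full = sorted(key for key, ratio in ratios.items() if ratio >= 0.9)
print(nearly_full)
```

A threshold such as 0.9 leaves some room to react before new pods start being rejected by the quota.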

🛡️ 16. NetworkPolicy Metrics

kube_networkpolicy_info
kube_networkpolicy_created
kube_networkpolicy_labels

Basic metadata and creation time.

🧱 17. Ingress & IngressClass Metrics

Ingress:

kube_ingress_info
kube_ingress_created
kube_ingress_labels
kube_ingress_metadata_resource_version
kube_ingress_path

IngressClass:

kube_ingressclass_info
kube_ingressclass_created
kube_ingressclass_labels

Tracks host/path rules, class, controller, TLS, and metadata.

⚙️ 18. CustomResourceDefinition (CRD) Metrics

kube_customresourcedefinition_info
kube_customresourcedefinition_created
kube_customresourcedefinition_metadata_generation
kube_customresourcedefinition_status_condition

Monitors CRD status, conditions, versions, and group/kind info.

🧬 19. MutatingWebhookConfiguration & ValidatingWebhookConfiguration

kube_mutatingwebhookconfiguration_info
kube_mutatingwebhookconfiguration_created
kube_validatingwebhookconfiguration_info
kube_validatingwebhookconfiguration_created

Metadata and creation timestamps for admission webhooks.

🧑‍🔧 20. Role, RoleBinding, ClusterRole, ClusterRoleBinding

kube_role_created, kube_rolebinding_created
kube_clusterrole_created, kube_clusterrolebinding_created
kube_role_labels, etc.

Creation timestamps and labels — useful for auditing RBAC changes.

📈 21. PriorityClass Metrics

kube_priorityclass_info
kube_priorityclass_preemption_policy
kube_priorityclass_value

Exposes priority values and preemption policies.

🧪 22. RuntimeClass Metrics (if enabled)

kube_runtimeclass_info
kube_runtimeclass_created
kube_runtimeclass_handler

Info about configured runtime handlers (e.g., gvisor, kata-containers).

✅ 23. CertificateSigningRequest (CSR)

kube_certificatesigningrequest_condition
kube_certificatesigningrequest_created
kube_certificatesigningrequest_labels
kube_certificatesigningrequest_cert_length

Tracks CSR approval/denial status and signer name.

🌐 24. Lease Metrics (for HA control plane monitoring)

kube_lease_owner
kube_lease_holder_identity
kube_lease_acquire_time
kube_lease_renew_time
kube_lease_leader_transitions_total

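Useful for monitoring leader election and liveness of components that use Lease objects (e.g., kube-controller-manager, scheduler). etcd elects its leader through Raft rather than Kubernetes Leases, so it is not covered here.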

🧾 25. Event Metrics (optional — disabled by default due to high cardinality)

kube_event_info
kube_event_created
kube_event_type
kube_event_reason
kube_event_count

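Treat this group as unverified: the upstream v2.x resource list does not appear to include an events collector, so confirm these names against the kube-state-metrics docs for your version before relying on them. Event-derived metrics can be useful for auditing but can generate massive metric volume.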

🧰 Optional Collectors (May Require Explicit Enablement)

Some collectors are disabled by default or require feature gates:

events → High-cardinality, off by default
certificatesigningrequests → Available since v1.9+
leases → Available since v1.14+
runtimeclasses → Available since v1.19+

You can enable/disable collectors via CLI flags. In v2.x the flag is --resources (it replaced the v1.x --collectors flag):

--resources=<comma-separated-list>

Example:

--resources=deployments,pods,nodes,jobs,cronjobs,horizontalpodautoscalers

📌 Notes

All metrics are prefixed with kube_.
Most metrics include labels like namespace, name, node, container, phase, condition, etc.
KSM exposes gauges and counters — no histograms or summaries.
Metrics reflect desired state and observed state — useful for detecting drift or misconfigurations.
Version compatibility: Always check kube-state-metrics releases for Kubernetes version support.

📚 Official Documentation

GitHub: https://github.com/kubernetes/kube-state-metrics
Metrics Documentation: https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md
Each resource has its own doc (e.g., pod-metrics.md)

✅ Summary: Kube-State-Metrics exposes rich, object-level state metrics across nearly all core Kubernetes resources — ideal for alerting, dashboards, and auditing object lifecycle and configuration drift.


install

Prerequisites

EKS Cluster – Running and accessible via kubectl.
kubectl – Configured to talk to your EKS cluster.
Helm 3+ – Installed locally.
(Optional) Prometheus & Grafana – If you want to scrape and visualize KSM metrics.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-state-metrics prometheus-community/kube-state-metrics \
  --namespace monitoring \
  --create-namespace \
  -f values.yaml

# values.yaml
# The release namespace (monitoring, recommended) is set by --namespace on the helm command above.

# Resource limits
resources:
  requests:
    memory: 256Mi
    cpu: 100m
  limits:
    memory: 512Mi
    cpu: 200m

# Enable only required collectors to reduce metric volume
collectors:
  - namespaces
  - pods
  - deployments
  - daemonsets
  - statefulsets
  - jobs
  - cronjobs
  - replicasets
  - nodes
  - horizontalpodautoscalers
  - persistentvolumeclaims
  - services
  # Add others as needed

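# ServiceMonitor for Prometheus Operator (if you use it).
# In the prometheus-community chart this lives under prometheus.monitor; sub-key
# names (for example where interval goes) vary by chart version, so confirm with:
#   helm show values prometheus-community/kube-state-metrics
prometheus:
  monitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    additionalLabels:
      release: prometheus-stack  # adjust to match your Prometheus selector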

# RBAC (usually auto-created, but explicit is safe)
rbac:
  create: true

# Tolerations if you have dedicated monitoring node group
# tolerations:
#   - key: "dedicated"
#     operator: "Equal"
#     value: "monitoring"
#     effect: "NoSchedule"

# NodeSelector if you want to pin to specific nodes
# nodeSelector:
#   role: monitoring

# Security context (optional)
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
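
After the install, a quick way to confirm that the enabled collectors are actually producing metrics is to port-forward the Service and summarize the /metrics output. The sketch below assumes the text has already been fetched (for example from http://localhost:8080/metrics after a kubectl port-forward; the Service name and port depend on your release and chart values) and counts metric families by their declared type.

```python
from collections import Counter


def families_by_type(metrics_text):
    """Count metric families per type using the '# TYPE <name> <type>' lines."""
    counts = Counter()
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            counts[parts[3]] += 1
    return dict(counts)


def family_names(metrics_text, prefix):
    """List metric family names starting with a prefix, e.g. 'kube_node_'."""
    names = []
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[:2] == ["#", "TYPE"] and parts[2].startswith(prefix):
            names.append(parts[2])
    return sorted(names)


# Hand-written excerpt standing in for a real /metrics response.
metrics_excerpt = """
# TYPE kube_node_info gauge
kube_node_info{node="node-a"} 1
# TYPE kube_node_status_condition gauge
kube_node_status_condition{node="node-a",condition="Ready",status="true"} 1
# TYPE kube_pod_container_status_restarts_total counter
kube_pod_container_status_restarts_total{namespace="default",pod="web-0",container="web"} 2
"""

print(families_by_type(metrics_excerpt))
print(family_names(metrics_excerpt, "kube_node_"))
```

If a collector you listed in values.yaml shows no families with its prefix, check the deployment's arguments and RBAC rules for that resource.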
