Files and Flow

How Alerting Works

📌 End-to-End Example

Prometheus Alerting Rule (`rules.yml`)

groups:
- name: node_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "The instance {{ $labels.instance }} has been unreachable for more than 5 minutes."

Prometheus Config (`prometheus.yml`)

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]  # Prometheus forwards alerts to Alertmanager

Alertmanager Config (`alertmanager.yml`)

route:
  receiver: 'slack'
  group_by: ['alertname', 'instance']

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        api_url: 'https://hooks.slack.com/services/XXXXX/YYYYY/ZZZZZ'

🔹 Summary

Prometheus evaluates the rule, detects an issue, and marks the alert as "firing" once the condition has held for the `for` duration.
Alertmanager receives the alert, groups it, and determines the receiver.
The alert is sent to the configured channels (Slack, email, PagerDuty, etc.).
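
To make the "determines the receiver" step concrete, here is a minimal sketch of an `alertmanager.yml` routing tree that extends the config above. The `slack-critical` receiver, the `#alerts-critical` channel, and the timing values are assumptions chosen for illustration, not part of the original setup.

```yaml
route:
  receiver: 'slack'                    # default receiver, as in the config above
  group_by: ['alertname', 'instance']
  group_wait: 30s                      # assumed: wait before sending the first notification for a new group
  group_interval: 5m                   # assumed: wait before notifying about new alerts in an existing group
  repeat_interval: 4h                  # assumed: re-send interval while an alert keeps firing
  routes:
    - matchers:
        - 'severity="critical"'        # matches the severity label set in rules.yml
      receiver: 'slack-critical'       # hypothetical receiver for critical alerts

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        api_url: 'https://hooks.slack.com/services/XXXXX/YYYYY/ZZZZZ'
  - name: 'slack-critical'             # hypothetical
    slack_configs:
      - channel: '#alerts-critical'    # hypothetical channel
        send_resolved: true
        api_url: 'https://hooks.slack.com/services/XXXXX/YYYYY/ZZZZZ'
```

Alerts whose labels match no child route fall back to the top-level `receiver`, so non-critical alerts would still land in `#alerts`.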

alert-rules

Prometheus URL for viewing active alerts: http://122.248.225.19:9090/alerts

alertrule.yml is added in the Prometheus directory. File structure: groups (each group has a name and a list of rules).

groups: (start from here)
- name: Disk-usage (group name)
  rules: (one or more rules)
  - alert: 'Low data disk space' (alert name)
    expr: ... (PromQL expression that defines the condition that triggers the alert)
    for: 5m (the alert fires only if the expression stays true continuously for 5 minutes)
    labels: ... (adds metadata to the alert, e.g. severity: normal/warning/critical; used for filtering and grouping in Alertmanager)
    annotations: ... (additional information for the alert message: title, description, summary)

alertrule.yml file

groups:

- name: Disk-usage
  rules:
  - alert: 'Low data disk space'
    expr: ceil(((node_filesystem_size_bytes{mountpoint!="/boot"} - node_filesystem_free_bytes{mountpoint!="/boot"}) / node_filesystem_size_bytes{mountpoint!="/boot"} * 100)) > 85
    labels:
      severity: 'critical'
    annotations:
      title: "Disk Usage"
      description: 'Partition : {{$labels.mountpoint}}'
      summary: "Disk usage is `{{humanize $value}}%`"
      host: "{{$labels.instance}}"


- name: Memory-usage
  rules:
  - alert: 'High memory usage'
    expr: ceil((((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100)) > 80
    labels:
      severity: 'critical'
    annotations:
      title: "Memory Usage"
      description: 'Memory usage threshold set to `80%`.'
      summary: "Memory usage is `{{humanize $value}}%`"
      host: "{{$labels.instance}}"

- name: ssl_expiry_alerts
  rules:
  - alert: SSLCertificateExpiringSoon
    expr: (probe_ssl_earliest_cert_expiry - time()) < 86400 * 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate for {{ $labels.instance }} expires soon"
      description: "The SSL cert for {{ $labels.instance }} will expire in less than 2 days ({{ $value | humanizeDuration }})."

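Prometheus evaluates alerting rules at the interval defined by `evaluation_interval` in `prometheus.yml`.

If `evaluation_interval: 30s` is set, Prometheus checks each rule's condition every 30 seconds.
With `for: 5m`, the alert stays "pending" while the condition holds and fires only after the instance has been down continuously for the full 5 minutes. If the condition stops holding before then, the pending alert is dropped and the timer restarts.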
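
As a rough sketch of where these settings live, the fragment below shows a `prometheus.yml` that sets the evaluation interval and loads the rule file. The `scrape_interval` value and the relative path to `rules.yml` are assumptions; adjust them to the actual deployment.

```yaml
global:
  scrape_interval: 30s        # assumed value: how often targets are scraped
  evaluation_interval: 30s    # how often alerting rules are evaluated

rule_files:
  - "rules.yml"               # assumed path, resolved relative to prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```

With this configuration, a rule with `for: 5m` needs the condition to hold across roughly ten consecutive evaluations before it moves from pending to firing.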

rules.yml

groups:
- name: node_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0           # Condition: if "up" is 0, the instance is down
    for: 5m                 # Condition must persist for 5 minutes before the alert fires
    labels:
      severity: critical    # Label to categorize alert severity
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "The instance {{ $labels.instance }} has been unreachable for more than 5 minutes."

  - alert: HighCPUUsage
    expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 85
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage has been above 85% for the last 2 minutes on {{ $labels.instance }}."

  - alert: HighMemoryUsage
    expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High Memory Usage on {{ $labels.instance }}"
      description: "Memory usage on {{ $labels.instance }} has exceeded 90%."

  - alert: HighDiskUsage
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Low Disk Space on {{ $labels.instance }}"
      description: "Disk space on {{ $labels.instance }} has dropped below 10%."

  - alert: HighRequestLatency
    expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) > 2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High Request Latency on {{ $labels.instance }}"
      description: "The average request latency has exceeded 2 seconds on {{ $labels.instance }}."

  - alert: TooMany500Errors
    expr: rate(http_requests_total{status_code=~"5.."}[5m]) > 5
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "High 5xx Error Rate on {{ $labels.instance }}"
      description: "More than 5 HTTP 5xx errors per second detected on {{ $labels.instance }}."

  - alert: KubernetesPodCrashLoop
    # increase() counts restarts within the window; the raw counter would stay above 5 forever
    expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Pod CrashLoop detected on {{ $labels.pod }}"
      description: "Pod {{ $labels.pod }} has restarted more than 5 times in the last 10 minutes."

  - alert: KubernetesNodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="false"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes Node Not Ready"
      description: "Node {{ $labels.node }} is in NotReady state for more than 5 minutes."
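
One way to check that a rule such as InstanceDown behaves as described is a promtool unit test. The sketch below assumes the rules above are saved as `rules.yml` next to a test file (hypothetically named `rules_test.yml`); the `job` and `instance` label values are made up for the test.

```yaml
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic target that is down (up == 0) for the whole test window
      - series: 'up{job="node", instance="host1:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # After 6 minutes the 5m "for" clause has elapsed, so the alert should be firing
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance host1:9100 is down"
              description: "The instance host1:9100 has been unreachable for more than 5 minutes."
```

It can be run with `promtool test rules rules_test.yml`.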

Last updated 8 months ago