Files and Flow

How Alerting Works

📌 End-to-End Example

Prometheus Alerting Rule (rules.yml)

groups:
- name: node_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "The instance {{ $labels.instance }} has been unreachable for more than 5 minutes."

Prometheus Config (prometheus.yml)

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]  # Prometheus forwards alerts to Alertmanager
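The alerting block on its own does not load rules.yml or set how often the rules are evaluated. A fuller prometheus.yml might look like the sketch below. The scrape_interval value and the location of rules.yml (assumed to sit next to prometheus.yml) are assumptions, not taken from the original setup.

```yaml
global:
  scrape_interval: 15s       # assumed value; how often targets are scraped
  evaluation_interval: 30s   # how often alerting rules are evaluated

rule_files:
  - rules.yml                # assumed to be in the same directory as prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]  # Prometheus forwards alerts to Alertmanager
```

After editing, the file can be validated with promtool check config prometheus.yml before reloading Prometheus.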

Alertmanager Config (alertmanager.yml)

route:
  receiver: 'slack'
  group_by: ['alertname', 'instance']

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        api_url: 'https://hooks.slack.com/services/XXXXX/YYYYY/ZZZZZ'

🔹 Summary

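  1. Prometheus detects an issue and, once the condition has held for the configured "for" duration, marks the alert as "firing" and forwards it to Alertmanager.

  2. Alertmanager receives the alert, groups it, and determines the receiver from its routing configuration.

  3. The alert is sent to the configured channels (Slack, Email, PagerDuty, etc.).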
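Step 2, where Alertmanager determines the receiver, is driven by the routing tree. The sketch below extends the Slack example with one child route keyed on the severity label. The receiver name pagerduty-oncall and the routing_key placeholder are hypothetical, and the timing values shown are simply Alertmanager's defaults written out.

```yaml
route:
  receiver: 'slack'                    # default receiver when no child route matches
  group_by: ['alertname', 'instance']
  group_wait: 30s                      # wait before the first notification for a new group
  group_interval: 5m                   # wait before notifying about new alerts in an existing group
  repeat_interval: 4h                  # re-send interval while an alert keeps firing
  routes:
    - matchers:
        - severity="critical"          # matches the severity label set in the alert rule
      receiver: 'pagerduty-oncall'     # hypothetical receiver name

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        api_url: 'https://hooks.slack.com/services/XXXXX/YYYYY/ZZZZZ'
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: '<your-integration-key>'   # placeholder, not a real key
```

With this layout, alerts labelled severity: critical go to the child route's receiver and everything else falls through to Slack.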

alert-rules

Active alerts can be viewed in the Prometheus UI at http://122.248.225.19:9090/alerts

alertrule.yml is added in the Prometheus directory. File structure: a top-level groups key, where each group has a name and a list of rules.

groups: (the file starts here)

  • name: Disk-usage (group name)

  • rules: (one or more rules)

    • alert: 'Low data disk space' (alert name)

    • expr: ... (PromQL expression that defines the condition that triggers the alert)

    • for: 5m (the alert fires only if the expression stays true continuously for 5 minutes)

    • labels: ... (metadata added to the alert, e.g. severity: normal/warning/critical; used for filtering and grouping in Alertmanager)

    • annotations: ... (additional information for the alert message: title, description, summary)

alertrule.yml file

groups:

- name: Disk-usage
  rules:
  - alert: 'Low data disk space'
    expr: ceil(((node_filesystem_size_bytes{mountpoint!="/boot"} - node_filesystem_free_bytes{mountpoint!="/boot"}) / node_filesystem_size_bytes{mountpoint!="/boot"} * 100)) > 85
    labels:
      severity: 'critical'
    annotations:
      title: "Disk Usage"
      description: 'Partition : {{$labels.mountpoint}}'
      summary: "Disk usage is `{{humanize $value}}%`"
      host: "{{$labels.instance}}"


- name: Memory-usage
  rules:
  - alert: 'High memory usage'
    expr: ceil((((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100)) > 80
    labels:
      severity: 'critical'
    annotations:
      title: "Memory Usage"
      description: 'Memory usage threshold set to `80%`.'
      summary: "Memory usage is `{{humanize $value}}%`"
      host: "{{$labels.instance}}"

- name: ssl_expiry_alerts
  rules:
  - alert: SSLCertificateExpiringSoon
    expr: (probe_ssl_earliest_cert_expiry - time()) < 86400 * 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate for {{ $labels.instance }} expires soon"
      description: "The SSL cert for {{ $labels.instance }} will expire in less than 2 days ({{ $value | humanizeDuration }})."

Prometheus evaluates alerting rules at the interval defined in prometheus.yml (evaluation_interval).

  • If evaluation_interval: 30s, then Prometheus checks the condition every 30 seconds.

  • With for: 5m, the alert stays in the pending state while the condition holds and fires only once the instance has been down continuously for a full 5 minutes.
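The pending-then-firing behaviour can be checked offline with a promtool rule unit test. The following is a minimal sketch for the InstanceDown rule from the end-to-end example. It assumes that rule is saved as rules.yml next to the test file; the test file name and the job and instance label values are hypothetical.

```yaml
# test.yml (hypothetical name); run with: promtool test rules test.yml
rule_files:
  - rules.yml                # assumed to contain the InstanceDown rule

evaluation_interval: 30s

tests:
  - interval: 30s            # spacing between the samples below
    input_series:
      - series: 'up{job="node", instance="host1:9100"}'
        values: '0x20'       # 0 repeated for 21 samples, covering 0 to 10m
    alert_rule_test:
      - eval_time: 4m        # condition has held for less than 5m: nothing firing yet
        alertname: InstanceDown
      - eval_time: 6m        # condition has held for more than 5m: alert is firing
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance host1:9100 is down"
              description: "The instance host1:9100 has been unreachable for more than 5 minutes."
```

promtool compares only firing alerts, so the 4m check passes while the alert is still pending.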
