EFK Architecture Design
Prompt
I want to set up a real-time, production-grade EFK (Elasticsearch, Fluentd, Kibana) stack on Amazon EKS to monitor logs from all microservices deployed in the cluster, specifically in the banking and order processing domain. The system should support:
High volume log ingestion
Persistent storage
Real-time alerting
Secure access
Compliance-ready log retention
Please provide a complete, end-to-end setup and configuration guide covering the following:
🔧 1. Infrastructure Setup on EKS
Deploy Elasticsearch, Fluentd, and Kibana on EKS using Helm charts or Kubernetes manifests.
Use EBS-backed PersistentVolumeClaims for Elasticsearch storage.
Namespace separation (e.g., logging namespace).
Resource limits and requests (CPU, memory) for each component.
Node affinity and taints for dedicated logging nodes if needed.
Use Fargate or dedicated EC2 worker nodes based on log volume.
Enable IAM roles for service accounts for Fluentd to access S3 (if archival is used).
🧩 2. Component Configuration
✅ Elasticsearch:
elasticsearch.yml configuration with cluster settings.
Define index lifecycle policies to retain logs for 6–12 months, with rollover and deletion.
Enable replication and shard balancing for HA.
Expose the service internally in the cluster via ClusterIP.
✅ Fluentd:
Deploy Fluentd as a DaemonSet on all worker nodes.
Fluentd config to:
Collect logs from /var/log/containers/*.log
Enrich logs with Kubernetes metadata (namespace, pod, labels)
Parse logs (JSON, regex for banking logs, custom formats)
Forward logs to Elasticsearch securely
Include retry/backoff, buffering settings, and deduplication filters.
📁 Example Fluentd Config (brief snippet):
```
<source>
  @type tail
  path /var/log/containers/*.log
  format json
  tag kube.*
  read_from_head true
</source>

<filter **>
  @type kubernetes_metadata
</filter>

<match **>
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  logstash_format true
  include_tag_key true
  flush_interval 5s
  retry_forever true
</match>
```
✅ Kibana:
Connect to Elasticsearch endpoint
Expose via Ingress with TLS termination or ALB (HTTPS)
Enable saved objects and dashboards for error, latency, and transaction logs
📁 kibana.yml
```yaml
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://elasticsearch.logging.svc.cluster.local:9200"]
xpack.security.enabled: true
```
🔐 3. Production-Grade Considerations
Enable TLS encryption between all components (Fluentd ↔ ES, Kibana ↔ ES).
RBAC and fine-grained access control using Kibana Spaces.
IAM integration with OpenID Connect if needed.
Setup log archival to S3 for long-term compliance storage.
Enable Index Lifecycle Management (ILM) policies in Elasticsearch.
Use Elasticsearch Curator or ILM for automatic deletion and rollover.
🔍 4. Testing, Access, and Troubleshooting
Commands to:
Port-forward to Kibana and Elasticsearch for testing
kubectl logs Fluentd pods to verify collection
Use _cat/indices API to check index health
Troubleshooting tips for:
Fluentd not forwarding
Index not created
Parsing errors in log formats
Use test pods emitting logs to verify collection & filtering end to end.
📊 5. Real-Time Log Aggregation & Filtering
Setup index patterns in Kibana (e.g., kubernetes-*, banking-logs-*)
Create dashboards:
Failed logins, transaction timeouts
Payment gateway logs
Order placement flows
Use filters for:
Pod/namespace/container
Log level (error, warn)
Correlation IDs or transaction IDs
Enable live log tailing or near real-time log updates.
🔔 6. Alerting and Notifications
Use Kibana Alerting (or Watcher) to create:
Error rate alerts
Transaction failure count thresholds
API timeouts exceeding a defined threshold
Integrate alerts with:
Slack
PagerDuty
Email
Create throttling or deduplication rules for noisy logs.
➕ 7. Extendibility and Additional Features
Add Fluent Bit as log forwarder (lightweight) + Fluentd aggregator
Route logs to multiple outputs (e.g., Elasticsearch + S3)
Add Metricbeat and Heartbeat for infra metrics and service availability monitoring
Add audit log collection for API Gateway, RDS, etc.
📁 Please provide:
Sample config files: fluentd.conf, elasticsearch.yml, kibana.yml
Helm values.yaml (or manifest YAMLs) for deploying EFK stack
Required RBAC policies and ServiceAccounts
EBS storage class config
Kubernetes manifests for:
Fluentd DaemonSet
Elasticsearch StatefulSet
Kibana Deployment
PVCs and Ingress
Assume all services are running on EKS already. Focus on making this real-time, secure, scalable, compliant, and resilient, suitable for high-throughput, sensitive domains like banking and order processing.
Result
Here's a comprehensive, production-grade EFK (Elasticsearch, Fluentd, Kibana) stack setup for Amazon EKS, tailored for high throughput, compliance, and security in banking and order processing. This guide covers infrastructure, configuration, security, alerting, and extendibility, with sample configs and manifests.
1. Infrastructure Setup on EKS
1.1. Namespaces
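A minimal manifest, assuming a dedicated logging namespace as requested:

```yaml
# Dedicated namespace isolating the EFK stack from application workloads
apiVersion: v1
kind: Namespace
metadata:
  name: logging
```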
1.2. EBS StorageClass
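A sketch assuming the EBS CSI driver is installed; the gp3 volume type and Retain policy are illustrative choices:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"            # encrypt log data at rest for compliance
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain          # keep volumes if a claim is deleted accidentally
```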
1.3. RBAC & ServiceAccounts
Fluentd ServiceAccount (with IAM Role for S3)
RBAC for Fluentd
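A sketch of the ServiceAccount and the read-only cluster permissions the kubernetes_metadata filter needs; the IAM role ARN is a placeholder for an IRSA role with S3 write access:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
  annotations:
    # IRSA: placeholder role granting s3:PutObject on the archive bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/fluentd-s3-archival
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: logging
```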
1.4. Node Affinity & Taints (for dedicated logging nodes)
Example NodeGroup label: logging=true
Taint: logging=true:NoSchedule
Pod Affinity:
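A pod spec fragment pinning Elasticsearch to the labeled nodes and tolerating the taint above:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: logging
              operator: In
              values: ["true"]
tolerations:
  - key: logging
    operator: Equal
    value: "true"
    effect: NoSchedule
```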
2. Component Configuration
2.1. Elasticsearch
Helm values.yaml (elasticsearch)
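An illustrative values.yaml for the official elastic/elasticsearch chart; replica counts, heap, and storage sizes are assumptions to tune against real log volume:

```yaml
replicas: 3
minimumMasterNodes: 2
esJavaOpts: "-Xms2g -Xmx2g"      # heap at ~50% of the memory limit
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
volumeClaimTemplate:
  storageClassName: ebs-gp3
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi
nodeSelector:
  logging: "true"
tolerations:
  - key: logging
    operator: Equal
    value: "true"
    effect: NoSchedule
```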
elasticsearch.yml
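A minimal sketch, assuming X-Pack basic security with TLS on the transport layer; certificate paths are placeholders:

```yaml
cluster.name: efk-logging
network.host: 0.0.0.0
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12    # placeholder
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12  # placeholder
```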
Index Lifecycle Policy (ILM)
Assign policy to index template:
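A sketch covering both steps; the banking-logs naming, rollover sizes, and the 12-month deletion window are assumptions within the 6–12 month requirement:

```bash
# 1. ILM policy: roll over hot indices daily or at 50 GB, delete after 365 days
curl -sk -u elastic:<PASSWORD> -X PUT \
  "https://localhost:9200/_ilm/policy/banking-logs-policy" \
  -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}'

# 2. Index template attaching the policy to banking-logs-* indices
curl -sk -u elastic:<PASSWORD> -X PUT \
  "https://localhost:9200/_index_template/banking-logs" \
  -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["banking-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "banking-logs-policy",
      "index.lifecycle.rollover_alias": "banking-logs"
    }
  }
}'
```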
2.2. Fluentd
DaemonSet Manifest (fluentd-daemonset.yaml)
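A condensed DaemonSet sketch; the image tag and resource numbers are assumptions:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
        - operator: Exists            # collect from every node, tainted or not
      containers:
        - name: fluentd
          # assumed image tag; pin to a specific release in production
          image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch7-1
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluentd/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluentd-config
```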
ConfigMap (fluentd-config.yaml)
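A fluent.conf sketch extending the prompt's snippet with position tracking, file buffering, and exponential backoff; the credentials and TLS settings are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kube.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>

    <filter kube.**>
      @type kubernetes_metadata
    </filter>

    <match kube.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      scheme https
      ssl_verify true
      user fluentd                                        # placeholder user
      password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"  # injected from a Secret
      logstash_format true
      logstash_prefix banking-logs
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kube.buffer
        flush_interval 5s
        retry_type exponential_backoff
        retry_max_interval 30
        chunk_limit_size 8MB
        overflow_action block
      </buffer>
    </match>
```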
2.3. Kibana
Deployment (kibana-deployment.yaml)
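A trimmed Deployment sketch; the image version is an assumption and should match the Elasticsearch version:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:7.17.3   # assumed version
          ports:
            - containerPort: 5601
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 2Gi
          volumeMounts:
            - name: config
              mountPath: /usr/share/kibana/config/kibana.yml
              subPath: kibana.yml
      volumes:
        - name: config
          configMap:
            name: kibana-config
```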
ConfigMap (kibana-config.yaml)
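The matching ConfigMap, reusing the kibana.yml from the prompt (https assumed once TLS is in place):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kibana-config
  namespace: logging
data:
  kibana.yml: |
    server.host: "0.0.0.0"
    elasticsearch.hosts: ["https://elasticsearch.logging.svc.cluster.local:9200"]
    xpack.security.enabled: true
```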
Ingress (with TLS)
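A sketch assuming the AWS Load Balancer Controller; the hostname and ACM certificate ARN are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kibana
  namespace: logging
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<REGION>:<ACCOUNT_ID>:certificate/<CERT_ID>
spec:
  rules:
    - host: kibana.example.internal     # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kibana
                port:
                  number: 5601
```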
3. Production-Grade Considerations
TLS everywhere: Use cert-manager or Helm chart options for self-signed or ACM-provided certs (see the Certificate sketch at the end of this list).
RBAC: Use Kibana Spaces for multi-tenancy and restrict access.
IAM OIDC: Integrate with AWS IAM for SSO.
Archival: Fluentd output plugin for S3, or use Elasticsearch snapshot to S3.
ILM: As above, for compliance retention.
Curator: Optional, but ILM is preferred for automation.
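If cert-manager issues the in-cluster certificates, a minimal sketch (the CA Issuer is assumed to exist):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: elasticsearch-tls
  namespace: logging
spec:
  secretName: elasticsearch-tls       # mounted by ES, Fluentd, and Kibana
  dnsNames:
    - elasticsearch.logging.svc.cluster.local
  issuerRef:
    name: logging-ca-issuer           # assumed pre-created CA Issuer
    kind: Issuer
```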
4. Testing, Access, and Troubleshooting
Port-forward for testing
Check Fluentd logs
Check Elasticsearch indices
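Example commands for the three checks above, assuming default service names in the logging namespace:

```bash
# Port-forward for testing
kubectl -n logging port-forward svc/kibana 5601:5601 &
kubectl -n logging port-forward svc/elasticsearch 9200:9200 &

# Check Fluentd logs on one of the DaemonSet pods
kubectl -n logging logs daemonset/fluentd --tail=100

# Check Elasticsearch indices (credentials assumed)
curl -sk -u elastic:<PASSWORD> "https://localhost:9200/_cat/indices?v"
```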
Troubleshooting
Fluentd not forwarding: Check buffer status, Elasticsearch connectivity, RBAC.
Index not created: Check ILM policy, index template, Fluentd output config.
Parsing errors: Validate log format, update Fluentd parser.
Test log emission
Deploy a test pod:
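A sketch of a busybox pod emitting one JSON error line per second; search for transaction_id t-123 in Kibana to confirm the end-to-end flow:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-emitter
  labels:
    app: log-emitter
spec:
  containers:
    - name: emitter
      image: busybox:1.36
      command: ["/bin/sh", "-c"]
      args:
        - >
          while true; do
          echo '{"level":"error","msg":"test transaction failed","transaction_id":"t-123"}';
          sleep 1; done
```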
5. Real-Time Log Aggregation & Filtering
Kibana index patterns: banking-logs-*, kubernetes-*
Dashboards: Create visualizations for failed logins, transaction timeouts, payment gateway errors, and order flows.
Filters: Use fields like kubernetes.namespace_name, log_level, and transaction_id.
Live tailing: Use Kibana's Discover with auto-refresh.
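Example KQL filters for the Discover search bar, assuming the field names produced by the metadata filter and JSON parser above:

```
kubernetes.namespace_name : "payments" and log_level : "error"
transaction_id : "t-123"
```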
6. Alerting and Notifications
Kibana Alerting: Create rules for error rates, transaction failures, and API timeouts (a Watcher example follows this list).
Integrations: Use Kibana connectors for Slack, PagerDuty, Email.
Throttling: Set alert frequency and deduplication in rule settings.
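A hedged Watcher example for an error-rate alert (Watcher needs an appropriate Elastic license; Kibana Alerting rules are the alternative). The index, threshold, and webhook target are assumptions:

```bash
# Alert when banking-logs-* sees >50 error-level docs in the last 5 minutes
curl -sk -u elastic:<PASSWORD> -X PUT \
  "https://localhost:9200/_watcher/watch/banking-error-rate" \
  -H 'Content-Type: application/json' -d'
{
  "trigger": { "schedule": { "interval": "1m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["banking-logs-*"],
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "term": { "log_level": "error" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": { "compare": { "ctx.payload.hits.total": { "gt": 50 } } },
  "actions": {
    "notify": {
      "throttle_period": "10m",
      "webhook": {
        "scheme": "https",
        "host": "hooks.slack.com",
        "port": 443,
        "method": "post",
        "path": "/services/<WEBHOOK_PATH>",
        "body": "{\"text\": \"High error rate in banking-logs-*\"}"
      }
    }
  }
}'
```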
7. Extendibility
Fluent Bit as forwarder: Deploy Fluent Bit as a lightweight DaemonSet and forward to a Fluentd aggregator.
Multiple outputs: A Fluentd copy block can fan out to both Elasticsearch and S3 (see the aggregator sketch after this list).
Metricbeat/Heartbeat: Deploy for infra and uptime monitoring.
Audit logs: Use Fluentd or Filebeat to collect from API Gateway, RDS, etc.
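A sketch for the first two bullets: Fluent Bit pods ship over the forward protocol to a Fluentd aggregator, which fans out with copy; the S3 bucket and region are placeholders:

```
# Aggregator: receive from Fluent Bit forwarders on port 24224
<source>
  @type forward
  port 24224
</source>

# Fan out every record to Elasticsearch and S3
<match kube.**>
  @type copy
  <store>
    @type elasticsearch
    host elasticsearch.logging.svc.cluster.local
    port 9200
    scheme https
    logstash_format true
  </store>
  <store>
    @type s3
    s3_bucket my-banking-log-archive   # placeholder bucket
    s3_region us-east-1                # placeholder region
    path logs/
    <buffer time>
      @type file
      path /var/log/fluentd-buffers/s3.buffer
      timekey 3600                     # one S3 object per hour
      timekey_wait 10m
    </buffer>
  </store>
</match>
```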
📁 Sample Config Files
fluentd.conf (see above in ConfigMap)
elasticsearch.yml (see above)
kibana.yml (see above)
📁 Helm values.yaml (Elasticsearch)
See above under Elasticsearch section.
📁 Kubernetes Manifests
Namespace: See above.
EBS StorageClass: See above.
Fluentd DaemonSet: See above.
Elasticsearch StatefulSet: Use official Helm chart or adapt above values.
Kibana Deployment: See above.
PVCs: Handled by Helm/StatefulSet.
Ingress: See above.
📁 RBAC Policies and ServiceAccounts
See above.
📁 EBS StorageClass
See above.
📁 Example: Fluentd DaemonSet, Elasticsearch StatefulSet, Kibana Deployment
See above for YAMLs.
📁 PVCs and Ingress
See above.
🏁 Summary
This setup provides a secure, scalable, real-time EFK stack on EKS, with:
Persistent, compliant log storage
TLS, RBAC, and IAM integration
Real-time alerting and dashboards
Extendibility for metrics, audit, and multi-output
Production best practices for banking/order processing
For further hardening:
Use AWS Secrets Manager for credentials
Enable audit logging on EKS and AWS resources
Regularly test backup/restore and failover
Let me know if you need a full Helm chart example, S3 archival config, or more advanced ILM/Curator policies!