RAG Example

Nice — below is a detailed, runnable roadmap to build a production RAG system that connects your self-hosted LLM on EKS (Karpenter) to company data (Postgres/ClickHouse, PDFs, Confluence/Jira, ticket systems), plus code examples you can copy-paste and adapt.


TL;DR

  1. Deploy LLM inference service (TGI / vLLM / NIM) on EKS (GPU nodes provisioned by Karpenter). (AWS docs)

  2. Build an ingestion pipeline: fetch docs (DB rows, PDFs, Confluence, Jira), split & clean, generate embeddings, upsert into a vector DB (Milvus/Qdrant/Weaviate). (Milvus docs)

  3. Implement a retriever → composer → LLM pipeline (prompt + context assembly + call model endpoint). Use LangChain / LlamaIndex for scaffolding. (LangChain docs)

  4. Secure, monitor, and autoscale with Karpenter, HPA/custom metrics, IAM, TLS and logging. (AWS docs)


1 — High-level architecture (components)

  • Model serving — containerized LLM inference (TGI / vLLM / NVIDIA NIM) behind a Kubernetes Service (ClusterIP) + Ingress / NLB. (Hugging Face docs)

  • Vector DB — Milvus / Qdrant / Weaviate (self-hosted on EKS or separate VMs) to store embeddings + metadata. (Milvus docs)

  • Ingestion pipeline — workers (K8s Jobs/CronJobs) that read source systems (Postgres / ClickHouse / PDFs / Confluence / Jira / Zendesk), chunk text, create embeddings, and upsert to the vector DB. (LangChain docs)

  • API / App layer — FastAPI/Flask service that: (a) accepts the user query, (b) retrieves the top K chunks from the vector DB, (c) assembles prompt + context and calls the model endpoint, (d) returns the answer + source attribution.

  • Ops & infra — EKS + Karpenter (GPU node provisioning), monitoring (Prometheus/Grafana), logs (ELK/CloudWatch), CI/CD (ArgoCD / GitHub Actions). (Karpenter docs)


2 — Step-by-step implementation (detailed)

Phase A — Plan & requirements (do this first)

  • Identify key use-cases and SLAs: latency (P50/P95), throughput (QPS), acceptable data-leak risk, retention policies.

  • Choose model size by the latency/quality tradeoff (7–8B for low cost, 13–70B for higher quality). Note GPU memory requirements: at fp16 the weights alone take roughly 2 bytes per parameter, so a 7B model needs about 14 GB plus KV-cache headroom.

  • Decide on a vector DB (Milvus/Qdrant/Weaviate) and an embedding model (local HF model or hosted embedding service). (Milvus docs)


Phase B — EKS + Karpenter baseline

  1. Provision EKS cluster (private subnets, control plane).

  2. Install Karpenter (follow the official guide) and create a GPU provisioner (a.k.a. Provisioner manifest) that requests GPU instance types when pods request nvidia.com/gpu. See the Karpenter docs for the exact provider fields for AWS. (Karpenter docs)

Minimal example Provisioner (replace placeholders; consult Karpenter docs before applying):
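A sketch against the karpenter.sh/v1alpha5 Provisioner API (the one that uses ttlSecondsAfterEmpty, referenced again in Phase J); the instance types, limits and providerRef name are placeholders, and newer Karpenter releases replace Provisioner with NodePool / EC2NodeClass:

```yaml
# Hypothetical GPU Provisioner (karpenter.sh/v1alpha5). Instance types, limits and the
# providerRef name are placeholders; adapt to your AWS setup per the Karpenter docs.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g5.xlarge", "g5.2xlarge"]   # placeholder GPU instance types
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule                    # keep non-GPU pods off these nodes
  limits:
    resources:
      nvidia.com/gpu: 4                     # cap total GPUs this provisioner may create
  ttlSecondsAfterEmpty: 300                 # scale idle GPU nodes down after 5 minutes
  providerRef:
    name: gpu-nodes                         # AWSNodeTemplate with subnets/AMI/IAM (not shown)
```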

(Use the official Karpenter docs for the exact fields and IAM setup.)


Phase C — Model serving on EKS

  1. Choose an inference engine: TGI, vLLM, or NVIDIA NIM (the options listed in the architecture above).

  2. Docker image / Deployment: run the TGI container (or the vLLM operator). Request nvidia.com/gpu: 1 in the container resources and add tolerations/affinity for GPU nodes.

  3. Expose a stable internal DNS (K8s Service) and optionally an API Gateway + auth in front.

Example deployment (TGI) snippet:
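A sketch of the Deployment plus Service, assuming the official TGI container image and that the NVIDIA device plugin is available on GPU nodes; the model ID, image tag and resources are placeholders to adapt:

```yaml
# Hypothetical TGI Deployment + Service; model ID, image tag and sizes are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      tolerations:
        - key: nvidia.com/gpu          # matches the taint set by the GPU provisioner
          operator: Exists
          effect: NoSchedule
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args: ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2", "--port", "8080"]
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1        # GPU request triggers Karpenter node provisioning
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-service                    # referenced below as http://tgi-service:8080
spec:
  selector:
    app: tgi
  ports:
    - port: 8080
      targetPort: 8080
```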

Then call via POST http://tgi-service:8080/v1/chat/completions. Example TGI usage is documented by Hugging Face. (Hugging Face docs)


Phase D — Vector DB & storage

  • Pick Milvus (scale, enterprise features) or Qdrant (lightweight, easy to run). Deploy via Helm on EKS or use a managed service (if acceptable). Configure persistence via PVCs (EBS, or EFS for multi-AZ access). (Milvus docs)

  • Create collections with an appropriate distance metric (cosine for most embeddings) and fields for metadata: source, doc_id, chunk_id, created_at, url, confidence, etc.
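For Qdrant, for example, collection creation and payload indexes for metadata filtering might look like the sketch below (collection name, vector size and field names are illustrative; Milvus/Weaviate have equivalent schema APIs):

```python
# Sketch: create a Qdrant collection with cosine distance and index the metadata
# fields used for filtering (collection name, vector size and fields are assumptions).
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PayloadSchemaType

client = QdrantClient(url="http://qdrant:6333")

client.recreate_collection(
    collection_name="company_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 = all-MiniLM-L6-v2 dim
)

# Index the metadata fields you plan to filter on (e.g. team=HR, doc_type=policy)
for field in ["source", "doc_id", "team", "doc_type"]:
    client.create_payload_index(
        collection_name="company_docs",
        field_name=field,
        field_schema=PayloadSchemaType.KEYWORD,
    )
```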


Phase E — Ingestion pipeline (connectors + embedding)

Goal: pull data, split into chunks, embed, upsert.

  1. Databases (Postgres/ClickHouse)

    • Export text columns (e.g., ticket body, comments, commit messages). Use an incremental watermark (a last_updated timestamp) to avoid reprocessing everything; see the sketch after this list.

    • Sanitize & redact PII before embedding if required.

  2. PDFs & files

    • Use an extractor (Unstructured / PDFMiner / Tika), and run OCR (Tesseract) when pages are scanned images.

    • Chunk text into ~500-token chunks with 20–50% overlap.

  3. Confluence / Jira / Tickets

    • Use official REST APIs (Confluence, Jira) or LangChain / LlamaIndex loaders (ConfluenceLoader, JiraReader) to fetch pages & attachments. These loaders exist and are widely used. (LangChain docs)

  4. Embeddings

    • Locally host an embedding model (sentence-transformers / small HF embedding model) or use your LLM inference provider’s embedding microservice. Generate vectors for each chunk.

  5. Upsert to Vector DB

    • Upsert id + vector + metadata. Keep a pointer to the original location for source attribution.
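For the database connector in item 1, a minimal incremental-fetch sketch (the table, column names and DSN are hypothetical; ClickHouse would follow the same watermark pattern with its own driver):

```python
# Sketch: pull only rows changed since the last run, using a last_updated watermark.
# Table/column names and the DSN are assumptions for illustration.
import psycopg2

def fetch_changed_tickets(dsn: str, since: str):
    """Yield (id, text, last_updated) for tickets updated after `since` (ISO timestamp)."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, title, body, last_updated "
                "FROM tickets WHERE last_updated > %s ORDER BY last_updated",
                (since,),
            )
            for ticket_id, title, body, last_updated in cur:
                yield ticket_id, f"{title}\n{body}", last_updated

# Usage: persist the max last_updated you processed and pass it as `since` on the next run.
```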

Python ingestion example (LangChain + Qdrant + sentence-transformers):
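A minimal sketch, assuming a Qdrant instance at http://qdrant:6333, the company_docs collection created in Phase D, and a local sentence-transformers model; the file path and chunk sizes are illustrative, and the pypdf step could be swapped for a LangChain document loader:

```python
# Sketch: extract a PDF, chunk it, embed the chunks locally, and upsert to Qdrant.
# File path, collection name and chunk sizes are illustrative assumptions.
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

SOURCE = "policies/leave-policy.pdf"   # hypothetical document

# 1. Extract raw text
reader = PdfReader(SOURCE)
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Chunk (~500 tokens ≈ 2000 characters here; tune size/overlap to your content)
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=400)
chunks = splitter.split_text(text)

# 3. Embed each chunk with a local sentence-transformers model (384-dim)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)

# 4. Upsert into the existing collection, keeping metadata for source attribution.
# NOTE: use stable, globally unique ids (e.g. a hash of source + chunk index) in production.
client = QdrantClient(url="http://qdrant:6333")
client.upsert(
    collection_name="company_docs",
    points=[
        PointStruct(
            id=i,
            vector=vec.tolist(),
            payload={"source": SOURCE, "chunk_id": i, "doc_type": "policy", "text": chunk},
        )
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
)
```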

(You can do the same for Confluence/Jira using the ConfluenceLoader / JiraReader documented in LangChain/LlamaIndex.) (LangChain docs)


Phase F — Retriever + RAG pipeline

  1. Retriever: query the vector DB (top_k), optionally apply metadata filters (team=HR, doc_type=policy), and re-rank results with a small cross-encoder or a BM25 hybrid (see the re-ranking sketch after this list).

  2. Context assembly: concatenate top chunks until you hit the token budget (e.g., 2,000 tokens), and include metadata snippets (source URLs).

  3. Prompt design: system prompt (role + instructions), then the retrieved context, then the user question. Instruct the model to cite sources (return a list of chunk ids/URLs).

  4. Call the model endpoint (TGI/vLLM) with messages or inputs. Support streaming for better UX.
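The re-ranking step from item 1 can be a single cross-encoder pass over the retrieved candidates; a sketch (the model name is just one common example):

```python
# Sketch: re-rank vector-search hits with a cross-encoder before prompt assembly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], keep: int = 5) -> list[str]:
    """Score (query, passage) pairs and return the top `keep` passages."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:keep]]
```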

Simple retrieval + call (Python sketch):
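A sketch assuming the tgi-service endpoint from Phase C (recent TGI releases expose an OpenAI-compatible chat route there), the company_docs collection, and the same embedding model as ingestion; the prompt wording and names are illustrative:

```python
# Sketch: embed the question, retrieve top-k chunks from Qdrant, assemble the prompt,
# and call TGI's OpenAI-compatible chat route. URLs, names and the prompt are assumptions.
import requests
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # same model as ingestion
qdrant = QdrantClient(url="http://qdrant:6333")
TGI_URL = "http://tgi-service:8080/v1/chat/completions"

def answer(question: str, top_k: int = 5) -> dict:
    # 1. Retrieve the most similar chunks (add metadata filters here if needed)
    hits = qdrant.search(
        collection_name="company_docs",
        query_vector=embedder.encode(question, normalize_embeddings=True).tolist(),
        limit=top_k,
    )
    context = "\n\n".join(
        f"[{h.payload['source']}#{h.payload['chunk_id']}] {h.payload['text']}" for h in hits
    )

    # 2. Assemble the prompt and call the model ("model" is a placeholder field for TGI)
    resp = requests.post(
        TGI_URL,
        json={
            "model": "tgi",
            "messages": [
                {"role": "system",
                 "content": "Answer only from the provided context. Cite sources as [source#chunk]."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
            "max_tokens": 512,
        },
        timeout=60,
    )
    resp.raise_for_status()

    # 3. Return the answer, source pointers and similarity scores for confidence checks
    return {
        "answer": resp.json()["choices"][0]["message"]["content"],
        "sources": [f"{h.payload['source']}#{h.payload['chunk_id']}" for h in hits],
        "scores": [h.score for h in hits],
    }
```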

Key: always return source pointers and a confidence score or flag low-confidence answers for human review. (LangChain docs)


Phase G — API & UI

  • Build a lightweight API (FastAPI; see the sketch after this list) that:

    • Validates user & tenant (RBAC).

    • Calls retriever & model.

    • Returns answer + sources[] + thoughts (optional intermediate reasoning if you want auditing).

  • Add rate-limiting and per-user quotas.

  • Optionally stream tokens to the UI for low-latency UX (TGI supports streaming). (Hugging Face docs)
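A sketch of that API layer, reusing the answer() helper from the retrieval sketch in Phase F; the auth check and the rag_pipeline module name are hypothetical placeholders:

```python
# Sketch: thin FastAPI wrapper exposing /ask with a placeholder auth dependency.
# `rag_pipeline.answer` is the hypothetical helper from the Phase F retrieval sketch.
from typing import Optional

from fastapi import Depends, FastAPI, Header, HTTPException
from pydantic import BaseModel

from rag_pipeline import answer  # hypothetical module containing the retrieval sketch

app = FastAPI()

class AskRequest(BaseModel):
    question: str
    top_k: int = 5

def current_user(authorization: Optional[str] = Header(None)) -> str:
    # Placeholder: validate the bearer token against OAuth2 / Cognito / Keycloak
    # and enforce RBAC / tenant checks here.
    if not authorization:
        raise HTTPException(status_code=401, detail="missing credentials")
    return "user-id"  # replace with the resolved identity

@app.post("/ask")
def ask(req: AskRequest, user: str = Depends(current_user)):
    result = answer(req.question, top_k=req.top_k)
    return {"answer": result["answer"], "sources": result["sources"]}
```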


Phase H — Security, privacy & compliance

  • Network: run EKS in private subnets, use private NLB & VPC endpoints (S3, Secrets Manager).

  • Auth: front the API with OAuth2 / Cognito / Keycloak; use IAM roles for service accounts for AWS access.

  • Encryption: encrypt data at rest (KMS) and in transit (TLS).

  • PII redaction: run a PII detection step before indexing, or mark sensitive fields as non-searchable (see the sketch after this list).

  • Audit logs: keep query logs, responses, and source pointers (with retention policy) for compliance.
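A minimal redaction sketch using regular expressions (the patterns shown are illustrative and not exhaustive; a dedicated PII detector such as Presidio gives better coverage):

```python
# Sketch: redact a few common PII patterns before chunks are embedded and indexed.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matches with a typed placeholder, e.g. user@example.com -> [REDACTED_EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```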


Phase I — Observability & SLOs

  • Metrics: request latency P50/P95, vector DB query time, model generation time, GPU utilization, pod spin-up time (Karpenter).

  • Tracing: OpenTelemetry to trace query → retrieval → model inference.

  • Alerts: node provisioning failures, high eviction rates, high error rate.


Phase J — Scaling & cost optimizations

  • Use Karpenter for on-demand GPU nodes; set ttlSecondsAfterEmpty to scale down idle nodes. (Karpenter docs)

  • Use batching & speculative decoding to improve throughput (vLLM/TGI features). (DataCamp)

  • Use quantized models (4-bit) or smaller models for cheaper latency-sensitive routes.

  • Cache responses to common queries at the API layer (see the sketch after this list).
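A sketch of a simple in-process response cache (the TTL and size are placeholders, and answer() is the hypothetical Phase F helper; a shared Redis cache is the multi-replica equivalent):

```python
# Sketch: cache answers for repeated questions with a small in-memory TTL cache.
# maxsize/ttl are placeholders; use Redis or similar if the API runs multiple replicas.
from cachetools import TTLCache

_cache: TTLCache = TTLCache(maxsize=1024, ttl=300)  # 5-minute TTL

def cached_answer(question: str, top_k: int = 5) -> dict:
    key = (question.strip().lower(), top_k)
    if key not in _cache:
        _cache[key] = answer(question, top_k=top_k)  # the Phase F retrieval helper
    return _cache[key]
```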


Phase K — Testing & evaluation

  • Unit tests for ingestion connectors (DB exports, Confluence) and integration tests for upsert & retrieval.

  • Evaluate answer quality vs. ground truth (accuracy, hallucination rate).

  • Run adversarial tests (prompt injection, data poisoning scenarios).

  • UAT with a small internal group, iterate prompts & retriever settings.


Phase L — Ops: CI/CD, model & index versioning

  • Model artifacts in S3 with version tags; use a model registry (MLflow or simple S3 + manifest).

  • Rebuild indexes when the model/embedding changes. Keep a mapping between each vector DB collection and the embedding model version that produced it (see the manifest sketch after this list).

  • Canary deploy new model/inference engine with a fraction of traffic.
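One lightweight way to keep that mapping is a small versioned manifest stored next to the model artifacts; a sketch (all field values are placeholders):

```yaml
# Hypothetical index manifest: records which embedding model built which collection,
# so ingestion and retrieval stay in lockstep and rebuilds are traceable.
collection: company_docs_v3
embedding_model: sentence-transformers/all-MiniLM-L6-v2
embedding_dim: 384
chunking:
  chunk_size: 2000
  chunk_overlap: 400
built_from:
  - "postgres://tickets (watermark 2024-01-01T00:00:00Z)"
  - "s3://company-docs/policies/"
built_at: "2024-01-15T12:00:00Z"
llm_endpoint: tgi-service:8080
```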


3 — Minimal end-to-end example (fast prototype)

  1. Deploy TGI on a single GPU node (EKS + Karpenter will create the node automatically when the pod requests a GPU). (Hugging Face docs)

  2. Run Qdrant locally or as a K8s deployment. (Qdrant docs)

  3. Run a Python script: load a PDF, embed with sentence-transformers, upsert to Qdrant, then run a simple FastAPI endpoint that queries Qdrant and calls TGI. (Examples above.)


4 — Example checklist (what to build in which order)

  1. Plan & pick model + vector DB.

  2. Create infra: EKS cluster + Karpenter + storage (S3 + PVC). (Karpenter docs)

  3. Deploy vector DB (Qdrant/Milvus). (Milvus docs)

  4. Deploy LLM inference (TGI / vLLM / NIM). (Hugging Face docs)

  5. Implement ingestion pipelines (DB, PDFs, Confluence, Jira). (LangChain docs)

  6. Build retriever & RAG API.

  7. Add auth, monitoring, and run pilot.


5 — Helpful references & tutorials (to follow next)

Start with the vendor documentation cited throughout: the Karpenter docs (provisioning), the Hugging Face TGI docs (serving & streaming), the Milvus/Qdrant docs (vector DB setup), and the LangChain / LlamaIndex docs (loaders and RAG scaffolding).

Final notes & next steps

  • Want me to:

    1. Generate the exact Kubernetes manifests for TGI + Qdrant + Karpenter provisioner (with your preferred instance types)?

    2. Produce a working repo scaffold (FastAPI + ingestion + LangChain example) that you can run locally and then port to EKS?

    3. Or do you want a short checklist tailored to your infra (you mentioned ClickHouse; would you like a ClickHouse ingestion example)?

Pick one and I’ll produce the code/manifests immediately.
