RAG Example
Nice — below is a detailed, runnable roadmap to build a production RAG system that connects your self-hosted LLM on EKS (Karpenter) to company data (Postgres/ClickHouse, PDFs, Confluence/Jira, ticket systems), plus code examples you can copy-paste and adapt.
TL;DR
Deploy LLM inference service (TGI / vLLM / NIM) on EKS (GPU nodes provisioned by Karpenter). (Amazon Web Services, Inc.)
Build an ingestion pipeline: fetch docs (DB rows, PDFs, Confluence, Jira), split & clean, generate embeddings, upsert into a vector DB (Milvus/Qdrant/Weaviate). (Milvus)
Implement a retriever → composer → LLM pipeline (prompt + context assembly + call model endpoint). Use LangChain / LlamaIndex for scaffolding. (LangChain)
Secure, monitor, and autoscale with Karpenter, HPA/custom metrics, IAM, TLS and logging. (Amazon Web Services, Inc.)
1 — High-level architecture (components)
Model serving — containerized LLM inference (TGI / vLLM / NVIDIA NIM) behind a Kubernetes Service (ClusterIP) + Ingress / NLB. (Hugging Face)
Vector DB — Milvus / Qdrant / Weaviate (self-hosted on EKS or separate VMs) to store embeddings + metadata. (Milvus)
Ingestion pipeline — workers (K8s Jobs/CronJobs) that read source systems (Postgres / ClickHouse / PDFs / Confluence / Jira / Zendesk), chunk text, create embeddings, and upsert to vector DB. (LangChain)
API / App layer — FastAPI/Flask service that: (a) accepts user query, (b) retrieves top K chunks from vector DB, (c) assembles prompt + context and calls model endpoint, (d) returns answer + source attribution.
Ops & infra — EKS + Karpenter (GPU node provisioning), monitoring (Prometheus/Grafana), logs (ELK/CloudWatch), CI/CD (ArgoCD / GitHub Actions). (Karpenter)
2 — Step-by-step implementation (detailed)
Phase A — Plan & requirements (do this first)
Identify the key use-cases and SLAs: latency (P50/P95), throughput (QPS), allowed data-leak risk, retention policies.
Choose model size by latency/quality tradeoff (7–8B for low cost, 13–70B for higher quality). Note GPU memory requirements.
Decide vector DB (Milvus/Qdrant/Weaviate) and embedding model (local HF model or hosted embedding service). (Milvus)
Phase B — EKS + Karpenter baseline
Provision EKS cluster (private subnets, control plane).
Install Karpenter (follow the official guide) and create a GPU provisioner (a Provisioner manifest; NodePool in newer releases) that requests GPU instance types whenever pods request nvidia.com/gpu. See the Karpenter docs for the exact AWS provider fields. (Karpenter)
Minimal example Provisioner (replace placeholders; consult Karpenter docs before applying):
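A minimal sketch, assuming the legacy karpenter.sh/v1alpha5 Provisioner API (newer Karpenter releases use NodePool + EC2NodeClass instead); the instance types, limits, and provider reference are placeholders:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g5.xlarge", "g5.2xlarge"]   # placeholder GPU instance types
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule                    # only GPU workloads land on these nodes
  limits:
    resources:
      nvidia.com/gpu: "8"                   # cap total GPUs this provisioner may create
  ttlSecondsAfterEmpty: 300                 # scale idle GPU nodes back down
  providerRef:
    name: gpu-nodes                         # matches an AWSNodeTemplate you define separately
```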
(Use the official Karpenter docs for the exact fields and IAM setup). (Karpenter)
Phase C — Model serving on EKS
Choose inference engine:
Hugging Face TGI (easy HTTP/gRPC API, supports streaming, chat API /v1/chat/completions). Good production choice. (Hugging Face)
vLLM (ultra-fast inference engine; good for batching).
NVIDIA NIM (if you want operator-managed NIM microservices on EKS). (Amazon Web Services, Inc.)
Docker image / Deployment: run the TGI container (or the vLLM operator). Request nvidia.com/gpu: 1 in the container resources and add tolerations/affinity.
Expose a stable internal DNS name (K8s Service) and optionally put an API Gateway + auth in front.
Example deployment (TGI) snippet:
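A minimal sketch, assuming the stock TGI container image and a placeholder 7B model; adjust the image tag, model id, GPU count, and storage/shared-memory settings for your cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      tolerations:
        - key: nvidia.com/gpu            # matches the taint on the GPU provisioner
          operator: Exists
          effect: NoSchedule
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args: ["--model-id", "mistralai/Mistral-7B-Instruct-v0.3", "--port", "8080"]
          ports:
            - containerPort: 8080
          env:
            - name: HF_TOKEN             # only needed for gated models
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1          # triggers Karpenter to provision a GPU node
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-service
spec:
  selector:
    app: tgi
  ports:
    - port: 8080
      targetPort: 8080
```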
Then call via POST http://tgi-service:8080/v1/chat/completions. Example TGI usage documented by Hugging Face. (Hugging Face)
Phase D — Vector DB & storage
Pick Milvus (scale, enterprise features) or Qdrant (lightweight, easy to run). Deploy via Helm on EKS or use managed (if acceptable). Configure persistence via PVCs (EBS / EFS for multi-AZ access). (Milvus)
Create collections with an appropriate distance metric (cosine for most embeddings) and fields for metadata:
source, doc_id, chunk_id, created_at, url, confidence, etc.
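A minimal sketch of collection creation with the Qdrant Python client, assuming a 384-dimensional sentence-transformers embedding model and a placeholder collection name:

```python
# Create a cosine-distance collection plus a keyword index on the `source`
# metadata field. The vector size must match your embedding model (384 here).
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://qdrant:6333")
client.create_collection(
    collection_name="company_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.create_payload_index(
    collection_name="company_docs",
    field_name="source",
    field_schema="keyword",
)
```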
Phase E — Ingestion pipeline (connectors + embedding)
Goal: pull data, split into chunks, embed, upsert.
Databases (Postgres/ClickHouse)
Export text columns (e.g., ticket body, comments, commit messages). Use incremental changelog (last_updated timestamp) to avoid reprocessing everything.
Sanitize & redact PII before embedding if required.
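A minimal sketch for the two database bullets above, assuming psycopg2, a hypothetical tickets table with a last_updated column, and a toy redaction rule:

```python
# Incremental export of ticket text from Postgres, keyed on a last_updated
# watermark so re-runs only pick up changed rows. Table/column names and the
# redact() rule are placeholders for your schema.
import re
from datetime import datetime

import psycopg2


def redact(text: str) -> str:
    # Toy PII scrub; replace with a proper detector if compliance requires it.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)


def fetch_changed_rows(dsn: str, since: datetime):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, subject, body, last_updated
                FROM tickets
                WHERE last_updated > %s
                ORDER BY last_updated
                """,
                (since,),
            )
            for row_id, subject, body, updated in cur:
                yield {
                    "doc_id": f"ticket-{row_id}",
                    "text": redact(f"{subject}\n\n{body or ''}"),
                    "created_at": updated.isoformat(),
                    "source": f"tickets/{row_id}",
                }
    finally:
        conn.close()
```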
PDFs & files
Use an extractor (Unstructured / PDFMiner / Tika). OCR with Tesseract when scans are images.
Chunk text into ~500 token chunks with 20–50% overlap.
Confluence / Jira / Tickets
Use official REST APIs (Confluence, Jira) or LangChain / LlamaIndex loaders (ConfluenceLoader, JiraReader) to fetch pages & attachments. These loaders exist and are widely used. (LangChain)
Embeddings
Locally host an embedding model (sentence-transformers / small HF embedding model) or use your LLM inference provider’s embedding microservice. Generate vectors for each chunk.
Upsert to Vector DB
Upsert id + vector + metadata. Keep a pointer to original location for source attribution.
Python ingestion example (LangChain + Qdrant + sentence-transformers):
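A minimal sketch; the import paths follow recent langchain-community / langchain-text-splitters packaging, and the file, model, and collection names are placeholders:

```python
# PDF -> chunks -> embeddings -> Qdrant upsert.
import uuid

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer

COLLECTION = "company_docs"

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim
client = QdrantClient(url="http://qdrant:6333")

# 1. Load and chunk the document (~500 tokens is roughly ~2000 characters).
docs = PyPDFLoader("hr_policy.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=400)
chunks = splitter.split_documents(docs)

# 2. Embed every chunk in one batch.
vectors = embedder.encode([c.page_content for c in chunks]).tolist()

# 3. Upsert id + vector + metadata so answers can cite their source.
points = [
    PointStruct(
        id=str(uuid.uuid4()),
        vector=vec,
        payload={
            "text": chunk.page_content,
            "source": chunk.metadata.get("source"),
            "page": chunk.metadata.get("page"),
            "doc_type": "policy",
        },
    )
    for chunk, vec in zip(chunks, vectors)
]
client.upsert(collection_name=COLLECTION, points=points)
print(f"Upserted {len(points)} chunks into '{COLLECTION}'")
```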
(You can do the same for Confluence/Jira using ConfluenceLoader / JiraReader documented in LangChain/LlamaIndex). (LangChain)
Phase F — Retriever + RAG pipeline
Retriever: Query vector DB (top_k), optionally apply metadata filters (team=HR, doc_type=policy), and re-rank results with a small cross-encoder or BM25 hybrid.
Context assembly: concatenate top chunks until you hit token budget (e.g., 2,000 tokens), include metadata snippets (source URLs).
Prompt design: system prompt (role + instructions), include retrieved context, then user question. Provide instruction to the model to cite sources (return list of chunk ids/URLs).
Call the model endpoint (TGI/vLLM) with messages or inputs. Support streaming for better UX.
Simple retrieval + call (sketch):
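A sketch of that flow, reusing the placeholder collection, embedding model, and tgi-service endpoint from earlier; the system prompt wording is illustrative only:

```python
import requests
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

TGI_URL = "http://tgi-service:8080/v1/chat/completions"
COLLECTION = "company_docs"

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
qdrant = QdrantClient(url="http://qdrant:6333")


def answer(question: str, top_k: int = 5) -> dict:
    # 1. Retrieve the top-k chunks for the question.
    hits = qdrant.search(
        collection_name=COLLECTION,
        query_vector=embedder.encode(question).tolist(),
        limit=top_k,
    )
    # 2. Assemble the context block with source pointers for attribution.
    context = "\n\n".join(
        f"[{i}] (source: {h.payload.get('source')}) {h.payload.get('text')}"
        for i, h in enumerate(hits, 1)
    )
    # 3. Call the model with a system prompt that demands citations.
    resp = requests.post(
        TGI_URL,
        json={
            "model": "tgi",
            "messages": [
                {
                    "role": "system",
                    "content": "Answer only from the context below and cite sources as [n]. "
                               "Say 'I don't know' if the context is insufficient.\n\n" + context,
                },
                {"role": "user", "content": question},
            ],
            "max_tokens": 512,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return {
        "answer": resp.json()["choices"][0]["message"]["content"],
        "sources": [h.payload.get("source") for h in hits],
    }
```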
Key: always return source pointers and a confidence score or flag low-confidence answers for human review. (LangChain)
Phase G — API & UI
Build a lightweight API (FastAPI; a minimal sketch follows this list) that:
Validates user & tenant (RBAC).
Calls retriever & model.
Returns answer + sources[] + thoughts (optional intermediate reasoning if you want auditing).
Add rate-limiting and per-user quotas.
Optionally stream tokens to UI for low-latency UX (TGI supports streaming). (Hugging Face)
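A minimal sketch of that API surface, reusing the answer() helper from the Phase F sketch (the rag_pipeline module name is hypothetical); auth, quotas, and streaming are intentionally left out here:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from rag_pipeline import answer  # hypothetical module holding the Phase F helper

app = FastAPI()


class Query(BaseModel):
    question: str
    top_k: int = 5


@app.post("/ask")
def ask(query: Query) -> dict:
    # Retrieval + prompt assembly + model call all happen inside answer().
    result = answer(query.question, top_k=query.top_k)
    return {"answer": result["answer"], "sources": result["sources"]}
```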
Phase H — Security, privacy & compliance
Network: run EKS in private subnets, use private NLB & VPC endpoints (S3, Secrets Manager).
Auth: front the API with OAuth2 / Cognito / Keycloak; use IAM roles for service accounts for AWS access.
Encryption: encrypt data at rest (KMS) and in transit (TLS).
PII redaction: run a PII detection step before indexing or mark sensitive fields as non-searchable.
Audit logs: keep query logs, responses, and source pointers (with retention policy) for compliance.
Phase I — Observability & SLOs
Metrics: request latency P50/P95, vector DB query time, model generation time, GPU utilization, pod spin-up time (Karpenter).
Tracing: OpenTelemetry to trace query → retrieval → model inference.
Alerts: node provisioning failures, high eviction rates, high error rate.
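As a starting point for the metrics bullet above, a small sketch with prometheus_client histograms (metric names and the port are placeholders); Prometheus scrapes /metrics and Grafana derives P50/P95 from the histograms:

```python
from prometheus_client import Histogram, start_http_server

RETRIEVAL_LATENCY = Histogram("rag_retrieval_seconds", "Vector DB query time")
GENERATION_LATENCY = Histogram("rag_generation_seconds", "Model generation time")

start_http_server(9100)  # exposes /metrics on :9100 for Prometheus to scrape


def instrumented_answer(question: str, retriever, generator) -> str:
    # Wrap the two slow stages of the pipeline; the context manager records
    # elapsed seconds into the histogram when each block exits.
    with RETRIEVAL_LATENCY.time():
        hits = retriever(question)
    with GENERATION_LATENCY.time():
        return generator(question, hits)
```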
Phase J — Scaling & cost optimizations
Use Karpenter for on-demand GPU nodes; set ttlSecondsAfterEmpty (or the consolidation settings in newer releases) to scale down idle nodes. (Karpenter)
Use batching & speculative decoding to improve throughput (vLLM/TGI features). (DataCamp)
Use quantized models (4-bit) or smaller models for cheaper latency-sensitive routes.
Cache responses to common queries at the API layer.
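One way to do that (a toy in-process cache; a shared Redis cache with TTLs is more realistic across replicas). The answer() import is the hypothetical Phase F helper:

```python
from functools import lru_cache

from rag_pipeline import answer  # hypothetical Phase F helper


@lru_cache(maxsize=1024)
def cached_answer(normalized_question: str) -> str:
    # Cache only the answer text; sources can be re-fetched if needed.
    return answer(normalized_question)["answer"]


def handle(question: str) -> str:
    # Normalize whitespace/case so trivially different phrasings share an entry.
    return cached_answer(" ".join(question.lower().split()))
```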
Phase K — Testing & evaluation
Unit tests for ingestion connectors (DB exports, Confluence) and integration tests for upsert & retrieval.
Evaluate answer quality vs. ground truth (accuracy, hallucination rate).
Run adversarial tests (prompt injection, data poisoning scenarios).
UAT with a small internal group, iterate prompts & retriever settings.
Phase L — Ops: CI/CD, model & index versioning
Model artifacts in S3 with version tags; use a model registry (MLflow or simple S3 + manifest).
Rebuild indexes when model/embedding changes. Keep mapping between vector DB collection and embedding model version.
Canary deploy new model/inference engine with a fraction of traffic.
3 — Minimal end-to-end example (fast prototype)
Deploy TGI on a single GPU node (EKS + Karpenter will create the node automatically when the pod requests a GPU). (Hugging Face)
Run Qdrant locally / as a K8s deployment. (Qdrant)
Run a Python script: load a PDF, embed with sentence-transformers, upsert to Qdrant, then run a simple FastAPI endpoint that queries Qdrant and calls TGI. (Examples above.)
4 — Example checklist (what to build in which order)
Plan & pick model + vector DB.
Create infra: EKS cluster + Karpenter + storage (S3 + PVC). (Karpenter)
Deploy vector DB (Qdrant/Milvus). (Milvus)
Deploy LLM inference (TGI / vLLM / NIM). (Hugging Face)
Implement ingestion pipelines (DB, PDFs, Confluence, Jira). (LangChain)
Build retriever & RAG API.
Add auth, monitoring, and run pilot.
5 — Helpful references & tutorials (to follow next)
LangChain RAG tutorial (good for pipeline scaffolding). (LangChain)
LlamaIndex RAG intros / low-level ingestion guides. (LlamaIndex)
Milvus RAG guide (vector DB). (Milvus)
Qdrant RAG quickstart. (Qdrant)
AWS blog: building RAG assistant on Amazon EKS + NVIDIA NIM (real EKS+Karpenter examples). (Amazon Web Services, Inc.)
Hugging Face TGI docs (exact HTTP API / streaming examples). (Hugging Face)
Final notes & next steps
Want me to:
Generate the exact Kubernetes manifests for TGI + Qdrant + Karpenter provisioner (with your preferred instance types)?
Produce a working repo scaffold (FastAPI + ingestion + LangChain example) that you can run locally and then port to EKS?
Or do you want a short checklist tailored to your infra (you mentioned Postgres and ClickHouse; would you like a ClickHouse ingestion example)?
Pick one and I’ll produce the code/manifests immediately.