RAG Example

Nice — below is a detailed, runnable roadmap to build a production RAG system that connects your self-hosted LLM on EKS (Karpenter) to company data (Postgres/ClickHouse, PDFs, Confluence/Jira, ticket systems), plus code examples you can copy-paste and adapt.


TL;DR

  1. Deploy LLM inference service (TGI / vLLM / NIM) on EKS (GPU nodes provisioned by Karpenter). (AWS docs)

  2. Build an ingestion pipeline: fetch docs (DB rows, PDFs, Confluence, Jira), split & clean, generate embeddings, upsert into a vector DB (Milvus/Qdrant/Weaviate). (Milvus docs)

  3. Implement a retriever → composer → LLM pipeline (prompt + context assembly + call model endpoint). Use LangChain / LlamaIndex for scaffolding. (LangChain docs)

  4. Secure, monitor, and autoscale with Karpenter, HPA/custom metrics, IAM, TLS and logging. (AWS docs)


1 — High-level architecture (components)

  • Model serving — containerized LLM inference (TGI / vLLM / NVIDIA NIM) behind a Kubernetes Service (ClusterIP) + Ingress / NLB. (Hugging Face docs)

  • Vector DB — Milvus / Qdrant / Weaviate (self-hosted on EKS or separate VMs) to store embeddings + metadata. (Milvus docs)

  • Ingestion pipeline — workers (K8s Jobs/CronJobs) that read source systems (Postgres / ClickHouse / PDFs / Confluence / Jira / Zendesk), chunk text, create embeddings, and upsert to the vector DB. (LangChain docs)

  • API / App layer — FastAPI/Flask service that: (a) accepts the user query, (b) retrieves the top K chunks from the vector DB, (c) assembles prompt + context and calls the model endpoint, (d) returns the answer + source attribution.

  • Ops & infra — EKS + Karpenter (GPU node provisioning), monitoring (Prometheus/Grafana), logs (ELK/CloudWatch), CI/CD (ArgoCD / GitHub Actions). (Karpenter docs)


2 — Step-by-step implementation (detailed)

Phase A — Plan & requirements (do this first)

  • Identify key use-cases and SLAs: latency (P50/P95), throughput (QPS), acceptable data-leak risk, retention policies.

  • Choose model size by the latency/quality tradeoff (7–8B for low cost, 13–70B for higher quality). Note GPU memory requirements: at fp16 the weights alone take roughly 2 bytes per parameter, so a 7B model needs about 14 GB plus KV-cache headroom.

  • Decide on a vector DB (Milvus/Qdrant/Weaviate) and an embedding model (local HF model or hosted embedding service). (Milvus docs)


Phase B — EKS + Karpenter baseline

  1. Provision EKS cluster (private subnets, control plane).

  2. Install Karpenter (follow the official guide) and create a GPU provisioner (a.k.a. Provisioner manifest) that requests GPU instance types when pods request nvidia.com/gpu. See the Karpenter docs for the exact provider fields for AWS. (Karpenter docs)

Minimal example Provisioner (replace placeholders; consult Karpenter docs before applying):
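A sketch against the karpenter.sh/v1alpha5 Provisioner API (the one that uses ttlSecondsAfterEmpty, referenced again in Phase J); the instance types, limits and providerRef name are placeholders, and newer Karpenter releases replace Provisioner with NodePool / EC2NodeClass:

```yaml
# Hypothetical GPU Provisioner (karpenter.sh/v1alpha5). Instance types, limits and the
# providerRef name are placeholders; adapt to your AWS setup per the Karpenter docs.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g5.xlarge", "g5.2xlarge"]   # placeholder GPU instance types
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule                    # keep non-GPU pods off these nodes
  limits:
    resources:
      nvidia.com/gpu: 4                     # cap total GPUs this provisioner may create
  ttlSecondsAfterEmpty: 300                 # scale idle GPU nodes down after 5 minutes
  providerRef:
    name: gpu-nodes                         # AWSNodeTemplate with subnets/AMI/IAM (not shown)
```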

(Use the official Karpenter docs for the exact fields and IAM setup.)


Phase C — Model serving on EKS

  1. Choose an inference engine: TGI, vLLM, or NVIDIA NIM (the options listed in the architecture above).

  2. Docker image / Deployment: run the TGI container (or the vLLM operator). Request nvidia.com/gpu: 1 in the container resources and add tolerations/affinity for GPU nodes.

  3. Expose a stable internal DNS (K8s Service) and optionally an API Gateway + auth in front.

Example deployment (TGI) snippet:
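A sketch of the Deployment plus Service, assuming the official TGI container image and that the NVIDIA device plugin is available on GPU nodes; the model ID, image tag and resources are placeholders to adapt:

```yaml
# Hypothetical TGI Deployment + Service; model ID, image tag and sizes are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      tolerations:
        - key: nvidia.com/gpu          # matches the taint set by the GPU provisioner
          operator: Exists
          effect: NoSchedule
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args: ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2", "--port", "8080"]
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1        # GPU request triggers Karpenter node provisioning
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-service                    # referenced below as http://tgi-service:8080
spec:
  selector:
    app: tgi
  ports:
    - port: 8080
      targetPort: 8080
```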

Then call via POST http://tgi-service:8080/v1/chat/completions. Example TGI usage is documented by Hugging Face. (Hugging Face docs)


Phase D — Vector DB & storage

  • Pick Milvus (scale, enterprise features) or Qdrant (lightweight, easy to run). Deploy via Helm on EKS or use a managed service (if acceptable). Configure persistence via PVCs (EBS, or EFS for multi-AZ access). (Milvus docs)

  • Create collections with an appropriate distance metric (cosine for most embeddings) and fields for metadata: source, doc_id, chunk_id, created_at, url, confidence, etc.
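For Qdrant, for example, collection creation and payload indexes for metadata filtering might look like the sketch below (collection name, vector size and field names are illustrative; Milvus/Weaviate have equivalent schema APIs):

```python
# Sketch: create a Qdrant collection with cosine distance and index the metadata
# fields used for filtering (collection name, vector size and fields are assumptions).
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PayloadSchemaType

client = QdrantClient(url="http://qdrant:6333")

client.recreate_collection(
    collection_name="company_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 = all-MiniLM-L6-v2 dim
)

# Index the metadata fields you plan to filter on (e.g. team=HR, doc_type=policy)
for field in ["source", "doc_id", "team", "doc_type"]:
    client.create_payload_index(
        collection_name="company_docs",
        field_name=field,
        field_schema=PayloadSchemaType.KEYWORD,
    )
```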


Phase E — Ingestion pipeline (connectors + embedding)

Goal: pull data, split into chunks, embed, upsert.

  1. Databases (Postgres/ClickHouse)

    • Export text columns (e.g., ticket body, comments, commit messages). Use an incremental watermark (a last_updated timestamp) to avoid reprocessing everything; see the sketch after this list.

    • Sanitize & redact PII before embedding if required.

  2. PDFs & files

    • Use an extractor (Unstructured / PDFMiner / Tika), and run OCR (Tesseract) when pages are scanned images.

    • Chunk text into ~500-token chunks with 20–50% overlap.

  3. Confluence / Jira / Tickets

    • Use official REST APIs (Confluence, Jira) or LangChain / LlamaIndex loaders (ConfluenceLoader, JiraReader) to fetch pages & attachments. These loaders exist and are widely used. (LangChain docs)

  4. Embeddings

    • Locally host an embedding model (sentence-transformers / small HF embedding model) or use your LLM inference provider’s embedding microservice. Generate vectors for each chunk.

  5. Upsert to Vector DB

    • Upsert id + vector + metadata. Keep a pointer to the original location for source attribution.
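For the database connector in item 1, a minimal incremental-fetch sketch (the table, column names and DSN are hypothetical; ClickHouse would follow the same watermark pattern with its own driver):

```python
# Sketch: pull only rows changed since the last run, using a last_updated watermark.
# Table/column names and the DSN are assumptions for illustration.
import psycopg2

def fetch_changed_tickets(dsn: str, since: str):
    """Yield (id, text, last_updated) for tickets updated after `since` (ISO timestamp)."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, title, body, last_updated "
                "FROM tickets WHERE last_updated > %s ORDER BY last_updated",
                (since,),
            )
            for ticket_id, title, body, last_updated in cur:
                yield ticket_id, f"{title}\n{body}", last_updated

# Usage: persist the max last_updated you processed and pass it as `since` on the next run.
```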

Python ingestion example (LangChain + Qdrant + sentence-transformers):
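A minimal sketch, assuming a Qdrant instance at http://qdrant:6333, the company_docs collection created in Phase D, and a local sentence-transformers model; the file path and chunk sizes are illustrative, and the pypdf step could be swapped for a LangChain document loader:

```python
# Sketch: extract a PDF, chunk it, embed the chunks locally, and upsert to Qdrant.
# File path, collection name and chunk sizes are illustrative assumptions.
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

SOURCE = "policies/leave-policy.pdf"   # hypothetical document

# 1. Extract raw text
reader = PdfReader(SOURCE)
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Chunk (~500 tokens ≈ 2000 characters here; tune size/overlap to your content)
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=400)
chunks = splitter.split_text(text)

# 3. Embed each chunk with a local sentence-transformers model (384-dim)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)

# 4. Upsert into the existing collection, keeping metadata for source attribution.
# NOTE: use stable, globally unique ids (e.g. a hash of source + chunk index) in production.
client = QdrantClient(url="http://qdrant:6333")
client.upsert(
    collection_name="company_docs",
    points=[
        PointStruct(
            id=i,
            vector=vec.tolist(),
            payload={"source": SOURCE, "chunk_id": i, "doc_type": "policy", "text": chunk},
        )
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
)
```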

(You can do the same for Confluence/Jira using the ConfluenceLoader / JiraReader documented in LangChain/LlamaIndex.) (LangChain docs)


Phase F — Retriever + RAG pipeline

  1. Retriever: query the vector DB (top_k), optionally apply metadata filters (team=HR, doc_type=policy), and re-rank results with a small cross-encoder or a BM25 hybrid (see the re-ranking sketch after this list).

  2. Context assembly: concatenate top chunks until you hit the token budget (e.g., 2,000 tokens), and include metadata snippets (source URLs).

  3. Prompt design: system prompt (role + instructions), then the retrieved context, then the user question. Instruct the model to cite sources (return a list of chunk ids/URLs).

  4. Call the model endpoint (TGI/vLLM) with messages or inputs. Support streaming for better UX.
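The re-ranking step from item 1 can be a single cross-encoder pass over the retrieved candidates; a sketch (the model name is just one common example):

```python
# Sketch: re-rank vector-search hits with a cross-encoder before prompt assembly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], keep: int = 5) -> list[str]:
    """Score (query, passage) pairs and return the top `keep` passages."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:keep]]
```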

Simple retrieval + call (Python sketch):
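A sketch assuming the tgi-service endpoint from Phase C (recent TGI releases expose an OpenAI-compatible chat route there), the company_docs collection, and the same embedding model as ingestion; the prompt wording and names are illustrative:

```python
# Sketch: embed the question, retrieve top-k chunks from Qdrant, assemble the prompt,
# and call TGI's OpenAI-compatible chat route. URLs, names and the prompt are assumptions.
import requests
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # same model as ingestion
qdrant = QdrantClient(url="http://qdrant:6333")
TGI_URL = "http://tgi-service:8080/v1/chat/completions"

def answer(question: str, top_k: int = 5) -> dict:
    # 1. Retrieve the most similar chunks (add metadata filters here if needed)
    hits = qdrant.search(
        collection_name="company_docs",
        query_vector=embedder.encode(question, normalize_embeddings=True).tolist(),
        limit=top_k,
    )
    context = "\n\n".join(
        f"[{h.payload['source']}#{h.payload['chunk_id']}] {h.payload['text']}" for h in hits
    )

    # 2. Assemble the prompt and call the model ("model" is a placeholder field for TGI)
    resp = requests.post(
        TGI_URL,
        json={
            "model": "tgi",
            "messages": [
                {"role": "system",
                 "content": "Answer only from the provided context. Cite sources as [source#chunk]."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
            "max_tokens": 512,
        },
        timeout=60,
    )
    resp.raise_for_status()

    # 3. Return the answer, source pointers and similarity scores for confidence checks
    return {
        "answer": resp.json()["choices"][0]["message"]["content"],
        "sources": [f"{h.payload['source']}#{h.payload['chunk_id']}" for h in hits],
        "scores": [h.score for h in hits],
    }
```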

Key: always return source pointers and a confidence score or flag low-confidence answers for human review. (LangChain docs)


Phase G — API & UI

  • Build a lightweight API (FastAPI; see the sketch after this list) that:

    • Validates user & tenant (RBAC).

    • Calls retriever & model.

    • Returns answer + sources[] + thoughts (optional intermediate reasoning if you want auditing).

  • Add rate-limiting and per-user quotas.

  • Optionally stream tokens to the UI for low-latency UX (TGI supports streaming). (Hugging Face docs)
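A sketch of that API layer, reusing the answer() helper from the retrieval sketch in Phase F; the auth check and the rag_pipeline module name are hypothetical placeholders:

```python
# Sketch: thin FastAPI wrapper exposing /ask with a placeholder auth dependency.
# `rag_pipeline.answer` is the hypothetical helper from the Phase F retrieval sketch.
from typing import Optional

from fastapi import Depends, FastAPI, Header, HTTPException
from pydantic import BaseModel

from rag_pipeline import answer  # hypothetical module containing the retrieval sketch

app = FastAPI()

class AskRequest(BaseModel):
    question: str
    top_k: int = 5

def current_user(authorization: Optional[str] = Header(None)) -> str:
    # Placeholder: validate the bearer token against OAuth2 / Cognito / Keycloak
    # and enforce RBAC / tenant checks here.
    if not authorization:
        raise HTTPException(status_code=401, detail="missing credentials")
    return "user-id"  # replace with the resolved identity

@app.post("/ask")
def ask(req: AskRequest, user: str = Depends(current_user)):
    result = answer(req.question, top_k=req.top_k)
    return {"answer": result["answer"], "sources": result["sources"]}
```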


Phase H — Security, privacy & compliance

  • Network: run EKS in private subnets, use private NLB & VPC endpoints (S3, Secrets Manager).

  • Auth: front the API with OAuth2 / Cognito / Keycloak; use IAM roles for service accounts for AWS access.

  • Encryption: encrypt data at rest (KMS) and in transit (TLS).

  • PII redaction: run a PII detection step before indexing, or mark sensitive fields as non-searchable (see the sketch after this list).

  • Audit logs: keep query logs, responses, and source pointers (with retention policy) for compliance.
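A minimal redaction sketch using regular expressions (the patterns shown are illustrative and not exhaustive; a dedicated PII detector such as Presidio gives better coverage):

```python
# Sketch: redact a few common PII patterns before chunks are embedded and indexed.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matches with a typed placeholder, e.g. user@example.com -> [REDACTED_EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```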


Phase I — Observability & SLOs

  • Metrics: request latency P50/P95, vector DB query time, model generation time, GPU utilization, pod spin-up time (Karpenter).

  • Tracing: OpenTelemetry to trace query → retrieval → model inference.

  • Alerts: node provisioning failures, high eviction rates, high error rate.


Phase J — Scaling & cost optimizations

  • Use Karpenter for on-demand GPU nodes; set ttlSecondsAfterEmpty to scale down idle nodes. (Karpenter docs)

  • Use batching & speculative decoding to improve throughput (vLLM/TGI features). (DataCamp)

  • Use quantized models (4-bit) or smaller models for cheaper latency-sensitive routes.

  • Cache responses to common queries at the API layer (see the sketch after this list).
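A sketch of a simple in-process response cache (the TTL and size are placeholders, and answer() is the hypothetical Phase F helper; a shared Redis cache is the multi-replica equivalent):

```python
# Sketch: cache answers for repeated questions with a small in-memory TTL cache.
# maxsize/ttl are placeholders; use Redis or similar if the API runs multiple replicas.
from cachetools import TTLCache

_cache: TTLCache = TTLCache(maxsize=1024, ttl=300)  # 5-minute TTL

def cached_answer(question: str, top_k: int = 5) -> dict:
    key = (question.strip().lower(), top_k)
    if key not in _cache:
        _cache[key] = answer(question, top_k=top_k)  # the Phase F retrieval helper
    return _cache[key]
```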


Phase K — Testing & evaluation

  • Unit tests for ingestion connectors (DB exports, Confluence) and integration tests for upsert & retrieval.

  • Evaluate answer quality vs. ground truth (accuracy, hallucination rate).

  • Run adversarial tests (prompt injection, data poisoning scenarios).

  • UAT with a small internal group, iterate prompts & retriever settings.


Phase L — Ops: CI/CD, model & index versioning

  • Model artifacts in S3 with version tags; use a model registry (MLflow or simple S3 + manifest).

  • Rebuild indexes when the model/embedding changes. Keep a mapping between each vector DB collection and the embedding model version that produced it (see the manifest sketch after this list).

  • Canary deploy new model/inference engine with a fraction of traffic.
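One lightweight way to keep that mapping is a small versioned manifest stored next to the model artifacts; a sketch (all field values are placeholders):

```yaml
# Hypothetical index manifest: records which embedding model built which collection,
# so ingestion and retrieval stay in lockstep and rebuilds are traceable.
collection: company_docs_v3
embedding_model: sentence-transformers/all-MiniLM-L6-v2
embedding_dim: 384
chunking:
  chunk_size: 2000
  chunk_overlap: 400
built_from:
  - "postgres://tickets (watermark 2024-01-01T00:00:00Z)"
  - "s3://company-docs/policies/"
built_at: "2024-01-15T12:00:00Z"
llm_endpoint: tgi-service:8080
```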


3 — Minimal end-to-end example (fast prototype)

  1. Deploy TGI on a single GPU node (EKS + Karpenter will create the node automatically when the pod requests a GPU). (Hugging Face docs)

  2. Run Qdrant locally or as a K8s deployment. (Qdrant docs)

  3. Run a Python script: load a PDF, embed with sentence-transformers, upsert to Qdrant, then run a simple FastAPI endpoint that queries Qdrant and calls TGI. (Examples above.)


4 — Example checklist (what to build in which order)

  1. Plan & pick model + vector DB.

  2. Create infra: EKS cluster + Karpenter + storage (S3 + PVC). (Karpenter docs)

  3. Deploy vector DB (Qdrant/Milvus). (Milvus docs)

  4. Deploy LLM inference (TGI / vLLM / NIM). (Hugging Face docs)

  5. Implement ingestion pipelines (DB, PDFs, Confluence, Jira). (LangChain docs)

  6. Build retriever & RAG API.

  7. Add auth, monitoring, and run pilot.


5 — Helpful references & tutorials (to follow next)

Start with the vendor documentation cited throughout: the Karpenter docs (provisioning), the Hugging Face TGI docs (serving & streaming), the Milvus/Qdrant docs (vector DB setup), and the LangChain / LlamaIndex docs (loaders and RAG scaffolding).

Final notes & next steps

  • Want me to:

    1. Generate the exact Kubernetes manifests for TGI + Qdrant + Karpenter provisioner (with your preferred instance types)?

    2. Produce a working repo scaffold (FastAPI + ingestion + LangChain example) that you can run locally and then port to EKS?

    3. Or do you want a short checklist tailored to your infra (you mentioned ClickHouse; would you like a ClickHouse ingestion example)?

Pick one and I’ll produce the code/manifests immediately.
