Deploy an LLM

Cloud-deployed LLM via Amazon Bedrock

Self-hosting an LLM on EKS using Karpenter

How to deploy an LLM

Here are real-world, well-documented examples of LLMs being self-hosted (or nearly so) on AWS EKS + Karpenter, or similar setups. They are useful for understanding what people actually run in production or near-production/testing environments.


✅ Examples of LLMs / models deployed with EKS + Karpenter

| Model / Setup | Tools / Frameworks used | Highlights & learnings |
| --- | --- | --- |
| Meta’s Llama-3-8B (Instruct) | NVIDIA NIM container + Amazon EKS + Karpenter; deployed as a StatefulSet with autoscaling via HPA (AWS) | GPU nodes are provisioned by Karpenter when the Llama-3-8B model is deployed. Efficient scale up/down; shared storage via EFS; ingress via NLB + Kubernetes Service. Useful for serving medium-sized LLMs. (AWS) |
| DeepSeek R1 | vLLM on EKS + GPU nodes via Karpenter + Terraform + Helm (GitHub) | Deploys the “DeepSeek-R1-Distill-Llama-8B” model. Good illustration of custom GPU node pools, node classes, and Karpenter node scaling. (GitHub) |
| Mistral 7B | vLLM + EKS + Karpenter + Hugging Face TGI (Text Generation Inference) server (Medium) | Shows deployment of small/mid-size LLMs (7B) with GPU provisioning, monitoring (NVIDIA device plugins, etc.), and PVCs for model storage. Useful when you need faster spin-up. (Medium) |
| Small LLMs on Graviton (CPU) + Ray | Ray / Ray Serve + EKS + Karpenter node pools for CPU + GPU; deploying llama.cpp, vLLM, etc. (GitHub) | If you don’t have GPUs (or want cheaper inference), some workloads run on CPU Graviton nodes. Useful for understanding the trade-offs. (GitHub) |
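
For a concrete feel of the serving layer used in several of the rows above, here is a minimal vLLM sketch. The model name is just an example, and in the EKS setups the server typically runs as a long-lived OpenAI-compatible HTTP service inside the pod rather than being called in-process like this:

```python
# Minimal vLLM sketch: load a small instruct model and generate a completion.
# In the EKS examples above, vLLM usually runs as a long-lived server inside the
# pod (e.g. an OpenAI-compatible endpoint) instead of this in-process call.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model; needs a GPU node
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what Karpenter does in one paragraph."], params)
print(outputs[0].outputs[0].text)
```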


⚠️ What is mostly used in practice

From these cases, here are the patterns that appear to be common best practice:

  • Model size & type: 7-8B parameter models are very common (e.g. Llama-3-8B, Mistral 7B) because they balance performance against cost and infrastructure demands. Larger models (e.g. 30-70B) are possible but require much more GPU memory and careful infrastructure (sharding, multiple GPUs per node).

  • Serving frameworks: vLLM, NVIDIA NIM, Ray Serve, and Hugging Face Text Generation Inference (TGI) are popular choices. They support GPU acceleration, efficient batching, etc.

  • Autoscaling & dynamic nodes: Karpenter spins up GPU-enabled nodes when needed and scales them down when idle. Many examples also use the HPA (Horizontal Pod Autoscaler) driven by metrics (Prometheus, etc.) to scale pods, plus shared storage (EFS, PVCs) to cache model artifacts and cut startup latency. (AWS) A minimal GPU NodePool sketch follows this list.

  • Storage & model caching: Because model artifacts are large (tens of GB), solutions often use persistent volumes (NFS, EFS, EBS, FSx, etc.) or cache images locally to reduce cold-start times. (AWS)
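
As a rough sketch of the dynamic-GPU-nodes pattern, the snippet below registers a Karpenter GPU NodePool using the Kubernetes Python client. The pool name, instance families, taints, limits, and the EC2NodeClass reference are all placeholder assumptions; the exact schema depends on your Karpenter version.

```python
# Sketch: register a GPU NodePool with Karpenter (v1 API) so GPU nodes are
# provisioned on demand and consolidated when idle. All names/values are examples.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

gpu_node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "gpu"},
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["on-demand"]},
                    {"key": "karpenter.k8s.aws/instance-family", "operator": "In", "values": ["g5", "g6"]},
                ],
                # Assumed EC2NodeClass named "default" exists in the cluster.
                "nodeClassRef": {"group": "karpenter.k8s.aws", "kind": "EC2NodeClass", "name": "default"},
                # Taint so only GPU workloads (with a matching toleration) land here.
                "taints": [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}],
            }
        },
        "limits": {"nvidia.com/gpu": 8},  # cap total GPUs this pool may provision
        "disruption": {"consolidationPolicy": "WhenEmpty", "consolidateAfter": "60s"},
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=gpu_node_pool
)
```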



What to do with a deployed LLM

Hosting an LLM on EKS with Karpenter means you have built the infrastructure to serve the model, but the real value is what you do with it afterward. Here it is broken down in simple terms:


🔹 How you actually use your self-hosted LLM

Once deployed on EKS:

  1. The LLM runs inside a container (vLLM, TGI, NVIDIA NIM, etc.).

  2. You expose it as an API endpoint (usually REST or gRPC).

    • Example: POST /generate with { "prompt": "Hello world" } (a minimal client sketch follows this list).

  3. Your apps (web, mobile, internal tools, microservices) can call this endpoint to get model outputs.

  4. You scale automatically with Karpenter when traffic grows.

So, the LLM becomes a private API service for your organization.
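
Here is a minimal client sketch, assuming a hypothetical in-cluster service name and a /generate route; the real path and payload shape depend on the serving framework you chose (vLLM and TGI each expose their own API):

```python
import requests

# Hypothetical in-cluster service URL; the exact path and payload schema
# depend on the serving framework (vLLM, TGI, NIM, ...).
LLM_ENDPOINT = "http://llm-service.llm.svc.cluster.local:8000/generate"

def generate(prompt: str, max_tokens: int = 256) -> str:
    resp = requests.post(
        LLM_ENDPOINT,
        json={"prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]  # response field assumed; check your server's schema

if __name__ == "__main__":
    print(generate("Hello world"))
```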


🔹 Where you can use it in real life

1. Enterprise Chatbot / Internal Copilot

  • Customer support bot for your website.

  • Internal IT/helpdesk chatbot (HR policies, troubleshooting, FAQs).

  • Developer assistant (like GitHub Copilot, but self-hosted for privacy).


2. RAG (Retrieval-Augmented Generation) Apps

  • Connect the LLM with your company’s data (databases, PDFs, Confluence, Jira, tickets).

  • Employees can “chat with company data” securely (a minimal retrieval sketch follows this list).

  • Example: “Show me Shiprocket’s sales for last month” → Query ClickHouse/Postgres and explain.
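
A minimal RAG sketch under stated assumptions: the embedding model, the in-cluster endpoint, and the response field are all illustrative, and a real setup would use a vector database instead of in-memory numpy.

```python
# Minimal RAG sketch: embed a few documents, retrieve the best match for a
# question, and ask the self-hosted LLM to answer using that context.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

LLM_ENDPOINT = "http://llm-service.llm.svc.cluster.local:8000/generate"  # hypothetical

documents = [
    "Refund policy: customers can return items within 30 days of delivery.",
    "Shipping SLA: metro orders are delivered within 2 business days.",
    "Support hours: the helpdesk is available 9am-6pm IST on weekdays.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def answer(question: str) -> str:
    q_vector = embedder.encode([question], normalize_embeddings=True)[0]
    best_doc = documents[int(np.argmax(doc_vectors @ q_vector))]  # cosine similarity
    prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
    resp = requests.post(LLM_ENDPOINT, json={"prompt": prompt, "max_tokens": 200}, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"]  # response field assumed

print(answer("How long do customers have to return an item?"))
```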


3. Document Processing

  • Summarize contracts, invoices, or financial reports.

  • Extract structured data (names, addresses, order details), as sketched below.

  • Automate compliance reporting.
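
A small sketch of the structured-extraction idea: prompt the LLM for JSON only, then parse it. The endpoint, response field, and extracted field names are assumptions.

```python
# Sketch: ask the self-hosted LLM to return structured JSON from free text,
# then parse it. Endpoint and response field are assumptions.
import json
import requests

LLM_ENDPOINT = "http://llm-service.llm.svc.cluster.local:8000/generate"  # hypothetical

def extract_order_details(document_text: str) -> dict:
    prompt = (
        "Extract the customer name, address, and order id from the text below. "
        'Respond with JSON only, e.g. {"name": "...", "address": "...", "order_id": "..."}.\n\n'
        + document_text
    )
    resp = requests.post(LLM_ENDPOINT, json={"prompt": prompt, "max_tokens": 200}, timeout=60)
    resp.raise_for_status()
    # May need retries/validation if the model returns malformed JSON.
    return json.loads(resp.json()["text"])

invoice = "Ship to Priya Sharma, 12 MG Road, Bengaluru. Order #SR-10492, 3 items."
print(extract_order_details(invoice))
```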


4. Code & DevOps Use Cases

  • Review Terraform or Kubernetes YAML files for errors.

  • Generate SQL queries from natural language (see the sketch after this list).

  • Automate writing CI/CD pipeline configs.

  • Assist in debugging logs or Kubernetes events.
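
A hedged sketch of the natural-language-to-SQL flow: the endpoint, table schema, and database credentials are placeholders, and generated SQL should always be reviewed and run under a read-only role.

```python
# Sketch: turn a natural-language question into SQL with the LLM, then run it
# read-only against Postgres. Endpoint, schema, and credentials are placeholders.
import requests
import psycopg2

LLM_ENDPOINT = "http://llm-service.llm.svc.cluster.local:8000/generate"  # hypothetical
SCHEMA = "orders(id, customer_id, amount, created_at)"

def nl_to_sql(question: str) -> str:
    prompt = (
        f"Given the table {SCHEMA}, write a single read-only SQL query for: "
        f"{question}\nReturn only the SQL."
    )
    resp = requests.post(LLM_ENDPOINT, json={"prompt": prompt, "max_tokens": 150}, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"].strip()  # response field assumed

sql = nl_to_sql("Total order amount per day for the last 7 days")
print(sql)

# Execute with a read-only database role; review the query before running it.
with psycopg2.connect("dbname=shop user=readonly") as conn:
    with conn.cursor() as cur:
        cur.execute(sql)
        print(cur.fetchall())
```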


5. Domain-specific AI assistants

  • Healthcare: Clinical note summarizer (privacy-preserving if self-hosted).

  • E-commerce: Personalized product recommendations, customer Q&A.

  • Finance: Risk analysis assistant, fraud detection with explanations.

  • Logistics: Query shipments, delays, ETAs in natural language.


🔹 Why self-host on EKS instead of API services?

  • Privacy: Data never leaves your infra (important for sensitive industries).

  • Cost: If you have constant high traffic, self-hosting may be cheaper than paying per-token API.

  • Customization: Fine-tune or LoRA-adapt the model on your own data.

  • Control: Decide scaling rules, GPUs, optimization techniques.


👉 So think of your self-hosted LLM as a private ChatGPT-like API that you can embed into any business workflow: customer support, analytics dashboards, dev tools, or even IoT systems.

A typical real-life architecture connects the LLM on EKS behind an API gateway, with a RAG layer over your own database, so the pieces fit together end-to-end.
