Deploying an LLM
Cloud-hosted LLM via Amazon Bedrock
Self-hosted LLM on EKS using Karpenter
How to deploy an LLM
Here are real-world, well-documented examples of LLMs being self-hosted (or nearly so) on AWS EKS + Karpenter, or similar setups. They are useful for understanding what people actually run in production or near-production testing. If you want, I can pull up code repos too.
✅ Examples of LLMs / models deployed with EKS + Karpenter
Meta’s Llama-3-8B (instruct)
NVIDIA NIM container on Amazon EKS with Karpenter; deployed as a StatefulSet with autoscaling via HPA. (Amazon Web Services, Inc.)
GPU nodes are provisioned by Karpenter when the Llama-3-8B model is deployed. Efficient scale-up/scale-down; shared storage via EFS; ingress via an NLB and a Kubernetes Service. Useful for serving medium-sized LLMs. (Amazon Web Services, Inc.)
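To make the node-provisioning part concrete, here is a minimal sketch of a Karpenter NodePool for GPU nodes. The names, instance families, and limits are assumptions for illustration, not values from the AWS post:

```yaml
# Sketch (assumed names/values): a Karpenter v1 NodePool that provisions GPU nodes
# only when GPU-requesting pods are pending, and consolidates them when idle.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu                       # assumed EC2NodeClass defined separately (AMI, subnets, etc.)
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]          # NVIDIA GPU instance families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: nvidia.com/gpu           # keep non-GPU pods off these nodes
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 8                   # cap the total GPUs this pool can provision
  disruption:
    consolidationPolicy: WhenEmpty      # scale nodes back down when idle
    consolidateAfter: 5m
```

When a pod requesting nvidia.com/gpu is pending, Karpenter launches a matching GPU instance; once the pool is empty again, the nodes are consolidated away.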
DeepSeek R1
vLLM on EKS with GPU nodes provisioned by Karpenter, deployed via Terraform and Helm. (GitHub)
The example deploys the “DeepSeek-R1-Distill-Llama-8B” model and is a good illustration of custom GPU node pools, node class configuration, and Karpenter node scaling. (GitHub)
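Here is a minimal sketch of what the vLLM workload itself could look like as a plain Kubernetes Deployment. The image tag, resource values, and PVC name are assumptions for illustration, not taken from the repo (which drives this through Terraform and Helm):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deepseek
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-deepseek
  template:
    metadata:
      labels:
        app: vllm-deepseek
    spec:
      tolerations:
        - key: nvidia.com/gpu             # allow scheduling onto the tainted GPU nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest  # OpenAI-compatible vLLM server
          args:
            - "--model"
            - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1           # one GPU per replica; this request is what Karpenter reacts to
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface   # cache downloaded weights between restarts
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache        # assumed shared PVC, e.g. backed by EFS (see storage sketch below)
```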
⚠️ What is commonly used in practice
From these cases, here are the patterns that appear to be common practice:
Model size & type: 7-8B-parameter models are very common (e.g. Llama 3 8B, Mistral 7B) because they balance performance against cost and infrastructure demands. Larger models (e.g. 30-70B) are possible but require much more GPU memory and careful infrastructure (sharding, multiple GPUs per node).
Serving frameworks: vLLM, NVIDIA NIM, Ray Serve, and Hugging Face Text Generation Inference (TGI) are popular choices. They support GPU acceleration, efficient batching, etc.
Autoscaling & dynamic nodes: Karpenter spins up GPU-enabled nodes when needed and scales them down when idle. Many examples also use the HPA (Horizontal Pod Autoscaler), driven by metrics (Prometheus, etc.), to scale the serving pods themselves. (Amazon Web Services, Inc.)
Storage & model caching: Because model artifacts are large (tens of GB), solutions often use shared persistent volumes (NFS, EFS, EBS, FSx, etc.) or cache images locally to reduce cold-start times; see the sketch after this list. (Amazon Web Services, Inc.)
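A sketch of the autoscaling and caching patterns above, with assumed names. The HPA here scales on CPU only as a placeholder; real setups typically scale on Prometheus custom metrics such as request queue depth:

```yaml
# Assumed names: "model-cache" PVC and "vllm-deepseek" Deployment from the earlier sketch.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteMany"]     # shared across pods, e.g. via an EFS StorageClass
  storageClassName: efs-sc           # assumed StorageClass name
  resources:
    requests:
      storage: 100Gi                 # model artifacts are tens of GB
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-deepseek
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deepseek
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu                    # placeholder; swap for a custom metric in practice
        target:
          type: Utilization
          averageUtilization: 60
```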
If you like, I can gather open-source repos of companies running large models (30B+) on EKS + Karpenter, so you can see how they manage big production workloads. Do you want me to fetch those?
What to do with a deployed LLM
Great follow-up 👍 Hosting an LLM on EKS with Karpenter means you’ve built the infrastructure to serve the model — but the real value is what you do with it afterward. Let me break it down in simple terms:
🔹 How you actually use your self-hosted LLM
Once deployed on EKS:
The LLM runs inside a container (vLLM, TGI, NVIDIA NIM, etc.).
You expose it as an API endpoint (usually REST or gRPC); see the Service sketch below.
Example:
POST /generate with { "prompt": "Hello world" }.
Your apps (web, mobile, internal tools, microservices) can call this endpoint to get model outputs.
You scale automatically with Karpenter when traffic grows.
So, the LLM becomes a private API service for your organization.
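To make the “expose it as an API endpoint” step concrete, here is a hedged sketch of an internal, NLB-backed Kubernetes Service in front of the vLLM pods. It assumes the AWS Load Balancer Controller is installed and reuses the assumed names from the earlier sketches:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-deepseek
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"    # handled by the AWS Load Balancer Controller
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"  # keep the API private to your VPC
spec:
  type: LoadBalancer
  selector:
    app: vllm-deepseek
  ports:
    - port: 80
      targetPort: 8000       # vLLM's HTTP port
```

The exact request path depends on the serving framework: TGI and vLLM's simple demo server expose /generate, while vLLM's OpenAI-compatible server exposes /v1/completions and /v1/chat/completions.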
🔹 Where you can use it in real life
1. Enterprise Chatbot / Internal Copilot
Customer support bot for your website.
Internal IT/helpdesk chatbot (HR policies, troubleshooting, FAQs).
Developer assistant (like GitHub Copilot, but self-hosted for privacy).
2. RAG (Retrieval-Augmented Generation) Apps
Connect the LLM with your company’s data (databases, PDFs, Confluence, Jira, tickets).
Employees can “chat with company data” securely.
Example: “Show me Shiprocket’s sales for last month” → Query ClickHouse/Postgres and explain.
3. Document Processing
Summarize contracts, invoices, or financial reports.
Extract structured data (names, addresses, order details).
Automate compliance reporting.
4. Code & DevOps Use Cases
Review Terraform or Kubernetes YAML files for errors.
Generate SQL queries from natural language.
Automate writing CI/CD pipeline configs.
Assist in debugging logs or Kubernetes events.
5. Domain-specific AI assistants
Healthcare: Clinical note summarizer (privacy-preserving if self-hosted).
E-commerce: Personalized product recommendations, customer Q&A.
Finance: Risk analysis assistant, fraud detection with explanations.
Logistics: Query shipments, delays, ETAs in natural language.
🔹 Why self-host on EKS instead of API services?
Privacy: Data never leaves your infra (important for sensitive industries).
Cost: If you have constant high traffic, self-hosting may be cheaper than paying per-token API.
Customization: Fine-tune or LoRA-adapt the model on your own data.
Control: Decide scaling rules, GPUs, optimization techniques.
👉 So think of your self-hosted LLM as a private ChatGPT-like API that you can embed into any business workflow: customer support, analytics dashboards, dev tools, or even IoT systems.
Would you like me to sketch a real-life architecture diagram (LLM on EKS + API Gateway + RAG with your database) so you can see how businesses actually connect it end-to-end?