Technology · Ebook
Volcano-Kthena: LLM Inference Serving on Kubernetes
by Shriira Press
Volcano-Kthena (Kthena) is the Kubernetes-native LLM inference serving system in the Volcano ecosystem. It provides efficient, scalable serving of large language models with intelligent routing, autoscaling, and GPU-efficient techniques, built on top of optimized inference engines such as vLLM and SGLang. This free book teaches it from the ground up: the LLM inference serving problem and what Kthena is; LLM inference concepts (prefill/decode, KV cache, batching); Kthena's architecture (controller, router, inference workloads); deploying models (the serving resource); inference engines (vLLM, SGLang, and integration); intelligent routing (LLM-aware, KV-cache-aware); autoscaling (LLM-aware metrics, scale-to-zero); advanced serving (disaggregated prefill/decode, KV cache management, model parallelism); multi-model serving, multi-tenancy, and GPU efficiency; and using Kthena in practice. Ten focused chapters with clear diagrams make LLM serving concrete: leverage optimized engines, route for KV cache reuse, autoscale with demand, and maximize GPU utilization (the dominant cost), so that you can serve many models for many tenants cost-effectively and with good performance on Kubernetes.
Contents
- Preface
- Chapter 1: What Volcano-Kthena Is
- Chapter 2: LLM Inference Concepts
- Chapter 3: Volcano-Kthena Architecture
- Chapter 4: Deploying Models
- Chapter 5: Inference Engines
- Chapter 6: Intelligent Routing
- Chapter 7: Autoscaling
- Chapter 8: Advanced Serving
- Chapter 9: Multi-Model, Multi-Tenancy, and GPU Efficiency
- Chapter 10: Volcano-Kthena in Practice
