Volcano-Kthena: LLM Inference Serving on Kubernetes

by Shriira Press

4.8 (296) · 188 pages · Published 2026

Volcano-Kthena (Kthena) is the Kubernetes-native LLM inference serving system in the Volcano ecosystem. It provides efficient, scalable serving of large language models with intelligent routing, autoscaling, and GPU-efficient techniques, built on top of optimized inference engines such as vLLM and SGLang. This free book teaches it from the ground up: the LLM inference serving problem and what Kthena is; LLM inference concepts (prefill/decode, KV cache, batching); Kthena's architecture (controller, router, inference workloads); deploying models (the serving resource); inference engines (vLLM, SGLang, and their integration); intelligent routing (LLM-aware, KV-cache-aware); autoscaling (LLM-aware metrics, scale-to-zero); advanced serving (disaggregated prefill/decode, KV cache management, model parallelism); multi-model serving, multi-tenancy, and GPU efficiency; and using Kthena in practice. Ten focused chapters with clear diagrams make LLM serving concrete: leverage optimized engines, route for KV cache reuse, autoscale with demand, and maximize GPU utilization (the dominant cost), so that you can serve many models for many tenants cost-effectively and with good performance on Kubernetes.

Preface
Chapter 1: What Volcano-Kthena Is
Chapter 2: LLM Inference Concepts
Chapter 3: Volcano-Kthena Architecture
Chapter 4: Deploying Models
Chapter 5: Inference Engines
Chapter 6: Intelligent Routing
Chapter 7: Autoscaling
Chapter 8: Advanced Serving
Chapter 9: Multi-Model, Multi-Tenancy, and GPU Efficiency
Chapter 10: Volcano-Kthena in Practice

Contents