Technology · Ebook

Volcano-Kthena: LLM Inference Serving on Kubernetes

by Shriira Press

4.8 (296) · 188 pages · Published 2026

Volcano-Kthena (Kthena) is the Kubernetes-native LLM inference serving system in the Volcano ecosystem. It provides efficient, scalable serving of large language models with intelligent routing, autoscaling, and GPU-efficient techniques, built on top of optimized inference engines such as vLLM and SGLang. This free book teaches it from the ground up: the LLM inference serving problem and what Kthena is; LLM inference concepts (prefill/decode, KV cache, batching); Kthena's architecture (controller, router, inference workloads); deploying models (the serving resource); inference engines (vLLM, SGLang, and integration); intelligent routing (LLM-aware, KV-cache-aware); autoscaling (LLM-aware metrics, scale-to-zero); advanced serving (disaggregated prefill/decode, KV cache management, model parallelism); multi-model serving, multi-tenancy, and GPU efficiency; and using Kthena in practice. Ten focused chapters with clear diagrams make LLM serving concrete: leverage optimized engines, route for KV cache reuse, autoscale with demand, and maximize GPU utilization (the dominant cost), so that you can serve many models for many tenants cost-effectively and with good performance on Kubernetes.

Contents

  1. Preface
  2. Chapter 1: What Volcano-Kthena Is
  3. Chapter 2: LLM Inference Concepts
  4. Chapter 3: Volcano-Kthena Architecture
  5. Chapter 4: Deploying Models
  6. Chapter 5: Inference Engines
  7. Chapter 6: Intelligent Routing
  8. Chapter 7: Autoscaling
  9. Chapter 8: Advanced Serving
  10. Chapter 9: Multi-Model, Multi-Tenancy, and GPU Efficiency
  11. Chapter 10: Volcano-Kthena in Practice