Reading: Volcano-Kthena: LLM Inference Serving on Kubernetes

Serve large language models efficiently on Kubernetes. LLM-aware routing, autoscaling, and GPU-efficient techniques on top of optimized inference engines like vLLM and SGLang.

Welcome to Volcano-Kthena: LLM Inference Serving on Kubernetes.

Volcano-Kthena (Kthena) is the Kubernetes-native LLM inference serving system in the Volcano ecosystem. It provides efficient, scalable serving of large language models through intelligent routing, autoscaling, and GPU-efficient techniques, built on top of optimized inference engines like vLLM and SGLang. This free book teaches it from the ground up: the LLM inference serving problem and what Kthena is; LLM inference concepts (prefill/decode, KV cache, batching); Kthena's architecture (controller, router, inference workloads); deploying models (the serving resource); inference engines (vLLM, SGLang, and integration); intelligent routing (LLM-aware, KV-cache-aware); autoscaling (LLM-aware metrics, scale-to-zero); advanced serving (disaggregated prefill/decode, KV cache management, model parallelism); multi-model serving, multi-tenancy, and GPU efficiency; and using Kthena in practice. Its ten focused chapters use clear diagrams to make LLM serving concrete. They show how to leverage optimized engines, route for KV cache reuse, autoscale with demand, and maximize GPU utilization (the dominant cost), so that you can serve many models for many tenants on Kubernetes cost-effectively and with good performance.

This title is part of the ShriIra library and is free to read in full, right here. It is our small contribution to making world-class knowledge easy to reach.

A note on reading it: open the Contents menu at the top of the reader to jump between chapters, and use the Aa menu to set a comfortable text size, theme (light, sepia, or night), and single- or two-page layout. Your place is saved automatically, so you can always pick up where you left off.

We hope it serves you well.

— Shriira Press

Preface

Contents