KServe: Model Serving on Kubernetes

Serve ML models — including LLMs — at scale. Deploy, autoscale (to zero), and manage inference on Kubernetes with KServe.

Welcome to KServe: Model Serving on Kubernetes.

KServe is a standardized, serverless model inference platform for Kubernetes that makes it simple to deploy, scale, and manage machine learning models in production. This free book teaches it from the ground up in ten focused chapters: the model serving problem and what KServe is; the serving landscape (training vs. serving); KServe's architecture (the InferenceService and its serverless foundation); the InferenceService abstraction; serving runtimes (multi-framework support); the standard inference protocol (the Open Inference Protocol); autoscaling and serverless operation (scale-to-zero); advanced inference (transformers, explainers, inference graphs); production serving (canary rollouts, monitoring, payload logging); and operating KServe in practice (including LLM serving with vLLM and OpenAI-compatible APIs). Clear diagrams demystify model serving throughout, showing how to turn trained models into scalable, standard, cost-efficient production inference, including the LLMs at the center of modern AI.

This title is part of the ShriIra library and is free to read in full, right here — our small contribution to making world-class knowledge easy to reach.

A note on reading it: open the Contents menu at the top of the reader to jump between chapters, and use the Aa menu to set a comfortable text size, a theme (light, sepia, or night), and a single- or two-page layout. Your place is saved automatically, so you can always pick up where you left off.

We hope it serves you well.

— Shriira Press
