Generative AI · Ebook

Text-to-Speech: From Phonemes to Neural Voices

by Shriira Press

4.4 (7,420) · 99 pages · Published 2026

A comprehensive, self-contained guide to how machines learn to turn text into a speaking voice — from the linguistic frontend that decides what to say to the neural acoustic models, vocoders, codec language models, and diffusion systems that decide how it sounds. This is the fifth volume in a series; it blends intuition, mathematics, and runnable code, and builds on its companions on machine learning, image generation, video generation, and especially music generation.

Preface
Chapter 1 — What Is Text-to-Speech?
Chapter 2 — Speech, Sound, and Language
Chapter 3 — The Text Frontend
Chapter 4 — Acoustic Models I: Sequence-to-Sequence and Tacotron
Chapter 5 — Acoustic Models II: Non-Autoregressive TTS
Chapter 6 — Vocoders: From Spectrogram to Waveform
Chapter 7 — End-to-End and Flow-Based TTS
Chapter 8 — Neural Codec Language Models for Speech
Chapter 9 — Diffusion Models for Speech
Chapter 10 — Voice Cloning, Speaker Adaptation, and Expressive Control
Chapter 11 — Evaluating Synthetic Speech
Chapter 12 — Systems, Real-Time, and Deployment
Chapter 13 — Ethics: Voice Cloning, Consent, and Deepfakes
Appendix A — Notation and Symbols
Appendix B — Further Reading

Contents