Generative AI · Ebook
Text-to-Speech: From Phonemes to Neural Voices
by Shriira Press
4.4 (7,420) · 99 pages · Published 2026
A comprehensive, self-contained guide to how machines learn to turn text into a speaking voice — from the linguistic frontend that decides what to say to the neural acoustic models, vocoders, codec language models, and diffusion systems that decide how it sounds. This is the fifth volume in a series; it blends intuition, mathematics, and runnable code, and builds on its companions on machine learning, image generation, video generation, and especially music generation.
Contents
- Preface
- Chapter 1 — What Is Text-to-Speech?
- Chapter 2 — Speech, Sound, and Language
- Chapter 3 — The Text Frontend
- Chapter 4 — Acoustic Models I: Sequence-to-Sequence and Tacotron
- Chapter 5 — Acoustic Models II: Non-Autoregressive TTS
- Chapter 6 — Vocoders: From Spectrogram to Waveform
- Chapter 7 — End-to-End and Flow-Based TTS
- Chapter 8 — Neural Codec Language Models for Speech
- Chapter 9 — Diffusion Models for Speech
- Chapter 10 — Voice Cloning, Speaker Adaptation, and Expressive Control
- Chapter 11 — Evaluating Synthetic Speech
- Chapter 12 — Systems, Real-Time, and Deployment
- Chapter 13 — Ethics: Voice Cloning, Consent, and Deepfakes
- Appendix A — Notation and Symbols
- Appendix B — Further Reading
