Generative AI · Ebook

Text-to-Speech: From Phonemes to Neural Voices

by Shriira Press

4.4 (7,420) · 99 pages · Published 2026

A comprehensive, self-contained guide to how machines learn to turn text into a speaking voice, from the linguistic frontend that decides what to say to the neural acoustic models, vocoders, codec language models, and diffusion systems that decide how it sounds. The fifth volume in a series, it blends intuition, mathematics, and runnable code, and builds on its companions on machine learning, image generation, video generation, and especially music generation.

Contents

  1. Preface
  2. Chapter 1 — What Is Text-to-Speech?
  3. Chapter 2 — Speech, Sound, and Language
  4. Chapter 3 — The Text Frontend
  5. Chapter 4 — Acoustic Models I: Sequence-to-Sequence and Tacotron
  6. Chapter 5 — Acoustic Models II: Non-Autoregressive TTS
  7. Chapter 6 — Vocoders: From Spectrogram to Waveform
  8. Chapter 7 — End-to-End and Flow-Based TTS
  9. Chapter 8 — Neural Codec Language Models for Speech
  10. Chapter 9 — Diffusion Models for Speech
  11. Chapter 10 — Voice Cloning, Speaker Adaptation, and Expressive Control
  12. Chapter 11 — Evaluating Synthetic Speech
  13. Chapter 12 — Systems, Real-Time, and Deployment
  14. Chapter 13 — Ethics: Voice Cloning, Consent, and Deepfakes
  15. Appendix A — Notation and Symbols
  16. Appendix B — Further Reading