Text-to-Speech: From Phonemes to Neural Voices

Shriira Press

Preface

Welcome to Text-to-Speech: From Phonemes to Neural Voices.

This book is a comprehensive, self-contained guide to how machines learn to turn text into a speaking voice, from the linguistic frontend that decides what to say to the neural acoustic models, vocoders, codec language models, and diffusion systems that decide how it sounds. It is the fifth volume in a series. It blends intuition, mathematics, and runnable code, and builds on its companion volumes on machine learning, image generation, video generation, and especially music generation.

This title is part of the Shriira library and is free to read in full, right here: our small contribution to making world-class knowledge easy to reach.

A note on reading it: open the Contents menu at the top of the reader to jump between chapters, and use the Aa menu to set a comfortable text size, a theme (light, sepia, or night), and a single- or two-page layout. Your place is saved automatically, so you can always pick up where you left off.

We hope it serves you well.

— Shriira Press

Contents

  Chapter 1: What Is Text-to-Speech?
  Chapter 2: Speech, Sound, and Language
  Chapter 3: The Text Frontend
  Chapter 4: Acoustic Models I: Sequence-to-Sequence and Tacotron
  Chapter 5: Acoustic Models II: Non-Autoregressive TTS
  Chapter 6: Vocoders: From Spectrogram to Waveform
  Chapter 7: End-to-End and Flow-Based TTS
  Chapter 8: Neural Codec Language Models for Speech
  Chapter 9: Diffusion Models for Speech
  Chapter 10: Voice Cloning, Speaker Adaptation, and Expressive Control
  Chapter 11: Evaluating Synthetic Speech
  Chapter 12: Systems, Real-Time, and Deployment
  Chapter 13: Ethics: Voice Cloning, Consent, and Deepfakes
  Appendix A: Notation and Symbols
  Appendix B: Further Reading