Paperback available now · Kindle pre-order (Jul 1, 2026)

Building Speech AI

Speech Representation, Understanding & Synthesis
A Practitioner's Guide

From the physics of pressure waves to production voice systems. Twelve chapters that connect acoustic fundamentals, modern transformer architectures, and the engineering trade-offs that decide whether speech AI ships. Every concept ships with runnable code.

12 Chapters · 12 Notebooks · 12 CLI Scripts · 320+ Pages

About the Book

The book I wish I'd had when I started.

Speech AI sits at an unusual intersection: signal processing, linguistics, machine learning, and systems engineering. Academic papers assume you know Fourier analysis; DSP textbooks ignore neural networks; ML courses treat audio as just another input modality.

This is a builder's book. Whether you're a machine-learning engineer exploring audio for the first time, a software developer integrating voice into a product, a researcher pushing the state of the art, or a technical leader evaluating voice strategies, you'll find the conceptual foundations and the practical implementations together.

If you can't run it, you don't really understand it.

Every concept ships with working code. Not toy examples, but real implementations you can run, modify, and extend. We resist the temptation to hide complexity behind library calls. When we use Whisper or wav2vec 2.0, we understand what's happening inside.

Four Parts. One Arc.

Sound → Meaning → Voice → Production.

01

Foundations of Sound

Pressure waves, sampling theory, spectrograms, and the perceptual quirks of human hearing that shape how machines represent audio.

Chapters 1–3
02

Speech Recognition

From CTC and HMM-GMM through attention-based encoder–decoders. Whisper, wav2vec 2.0, HuBERT, Conformer — and what each gets right.

Chapters 4–6
03

Speech Synthesis

Neural TTS from WaveNet to VITS. Voice cloning, prosody, voice conversion — and the ethics that have to ride alongside the math.

Chapter 7
04

Production Voice AI

Audio language models, real-time pipelines, edge deployment, deepfakes, bias, and the engineering trade-offs that decide what ships.

Chapters 8–12
Inside the Book

Twelve chapters of speech AI, from first principles to production.


01

Introduction to Speech & Audio AI

The convergence of four disciplines. The human instrument. Applications that listen and speak.

What to look for
  • The four disciplines you must hold at once: signal processing, linguistics, ML, systems
  • Why "the human instrument" sets a ceiling speech tech still chases
  • The central transformation: pressure waves in air → numbers in RAM
02

Understanding Audio Data

Why audio is fundamentally different from text and images: temporal essence, the fragile dance with noise.

What to look for
  • Why audio is not like text or images — the fundamental divide
  • The 16,000-samples-per-second computational storm, made tangible
  • Noise as a feature of the medium, not a bug to remove
03

Signal Processing Fundamentals

The Fourier transform. Spectrograms. Filtering. The mathematical foundations that don't go away.

What to look for
  • The Fourier intuition without the calculus pain
  • Spectrogram as the Rosetta Stone between time and frequency
  • Filtering as selective attention for a machine
04

The Evolution of Speech Tech

HMM-GMM, n-grams, the neural revolution, and the path to end-to-end learning.

What to look for
  • HMM-GMM as a snapshot of pre-deep-learning reasoning
  • The n-gram arithmetic that ran search and dictation for a decade
  • Why end-to-end neural ate the field
05

Modern ASR Architectures

wav2vec 2.0, HuBERT, Conformer, Whisper. Streaming ASR. Self-supervised learning as the dark matter of audio AI.

What to look for
  • Whisper's "weak supervision at scale" trick — and why it generalizes
  • Self-supervised learning as the field's dark matter
  • Streaming vs offline: the engineering you can't skip
06

Audio Representations & Embeddings

The mel scale, vector quantization, contrastive learning. What embeddings encode about content, speaker, and emotion.

What to look for
  • What embeddings encode — content, speaker, emotion are separable
  • Why the mel scale mirrors the cochlea
  • Contrastive learning as cheap supervision
07

Text-to-Speech Synthesis

WaveNet, FastSpeech, diffusion, VITS, VALL-E. Voice cloning, voice conversion, and prosody.

What to look for
  • The vocoder revolution: WaveNet → HiFi-GAN → diffusion
  • VITS as the first end-to-end neural TTS that just works
  • Voice cloning ethics ride alongside the math, every paragraph
08

Advanced Applications

Diarization, verification, audio generation, multimodal fusion, speech translation, deepfakes, bias, privacy.

What to look for
  • Diarization as a clustering problem in disguise
  • Audio deepfakes — detection lags generation, always
  • Bias in speech AI is bias in the data, recorded
09

Audio Language Models

Discrete audio tokens, SpeechGPT, AudioLM. Real-time conversational voice AI from pipelines to native models.

What to look for
  • The cascade problem: ASR → LLM → TTS leaks latency and meaning
  • Discrete audio tokens as a first-class language
  • Why real-time speech-to-speech changes the API itself
10

Ethics, Society & the Future

Robustness, bias mitigation, edge AI, brain-computer interfaces. A speech-AI timeline you can hold in one hand.

What to look for
  • Robustness ≠ accuracy — and why it matters more in voice
  • Edge AI: where compute, privacy, and latency converge
  • A speech-AI timeline you can hold in one hand
11

Voice as HCI

Why voice is different from text. Habit inertia, edge architectures, keyword spotting, model compression.

What to look for
  • Why voice is fundamentally different from text as an interface
  • The habit-inertia problem — why we still don't say "computer, ..."
  • Latency budgets for conversational systems
12

Hands-On Implementation

The capstone. Whisper, wav2vec 2.0, SpeechT5, embeddings, end-to-end pipeline, and notebook-to-production deployment.

★ Companion Code
What to look for
  • Notebook → production: the gap that kills most projects
  • Memory management for 8 GB GPUs is a real, learnable skill
  • A field guide to today's open-source models you'll actually use
Companion Codebase

Twelve runnable notebooks. Zero magic.

Every chapter has a pre-executed Jupyter notebook and a CLI script you can run in minutes. The notebooks are committed with their outputs, so you can read every result on GitHub before installing a thing.

01
CPU

Foundations

GPU detection, audio as numbers, sampling theory, the Nyquist theorem in code.

  • torch
  • torchaudio
  • numpy
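The "Nyquist theorem in code" idea can be illustrated with a minimal sketch. This is not the companion notebook's code; the function name `aliased_frequency` and the 16 kHz sample rate are assumptions chosen for illustration. It computes the frequency at which a pure tone appears after sampling: tones up to half the sample rate survive, and anything higher folds back into that band.

```python
def aliased_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Return the apparent frequency of a pure tone after sampling.

    Tones at or below the Nyquist frequency (sample_rate / 2) are preserved;
    tones above it fold back into the 0..Nyquist band.
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    nyquist = sample_rate_hz / 2.0
    # Sampling cannot distinguish f from f + k * sample_rate.
    folded = tone_hz % sample_rate_hz
    return folded if folded <= nyquist else sample_rate_hz - folded


if __name__ == "__main__":
    fs = 16_000  # assumed speech sample rate, for illustration only
    for tone in (440, 7_000, 9_000, 15_000):
        apparent = aliased_frequency(tone, fs)
        print(f"{tone} Hz sampled at {fs} Hz appears as {apparent} Hz")
```

In this sketch a 9 kHz tone sampled at 16 kHz is indistinguishable from a 7 kHz tone, which is why audio is low-pass filtered before it is downsampled.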
02
CPU

Whisper Basics

Loading Whisper, basic transcription, language detection, and the cost of getting it wrong.

  • whisper
  • transformers
  • ffmpeg
03
4 GB GPU

Whisper Advanced

Beam vs. greedy decoding. Word-level timestamps. Long-audio chunking and the silence problem.

  • whisper
  • pyannote
  • beam search
04
CPU

wav2vec 2.0 + CTC

Frame-by-frame CTC decoding visualized. See what self-supervised models actually predict at every 20 ms tick.

  • wav2vec2
  • CTC
  • matplotlib
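As a rough illustration of frame-by-frame CTC decoding, the sketch below applies the greedy CTC collapse rule to a sequence of per-frame labels: merge consecutive repeats, then drop blanks. It is a simplification, not the notebook's implementation. A real model emits one vector of logits per frame and the label is taken by argmax; here the labels are given directly, and the blank symbol `_` is a hypothetical choice for this example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def ctc_greedy_collapse(frame_labels):
    """Collapse per-frame labels into text using the greedy CTC rule.

    Step 1: merge runs of identical consecutive labels into one label.
    Step 2: remove every blank label.
    """
    merged = [label for label, _run in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


if __name__ == "__main__":
    # One label per frame; the blank between the two "l" runs is what lets
    # CTC emit a genuine double letter.
    frames = list("hh_e_ll_l_oo")
    print(ctc_greedy_collapse(frames))
```

The blank is doing real work: without it between the two runs of "l", the repeated frames would merge into a single letter.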
05
4 GB GPU

SpeechT5 TTS

Text → mel-spectrogram → audio. Speaker embeddings as a control surface. Four distinct voices from one model.

  • SpeechT5
  • HiFi-GAN
  • x-vectors
06
4 GB GPU

Audio Embeddings

Content vs. speaker. WavLM-SV verifier. Why "the same words in two voices" land far apart in embedding space.

  • WavLM
  • cosine sim
  • x-vectors
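Cosine similarity is the usual way to compare two embeddings. The sketch below implements it in plain Python. The three-dimensional vectors are made-up toy values, not outputs of WavLM or any real model, and the names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy stand-ins for speaker embeddings (hypothetical values).
    speaker_a_clip_1 = [0.9, 0.1, 0.2]
    speaker_a_clip_2 = [0.8, 0.2, 0.1]
    speaker_b_clip_1 = [0.1, 0.9, 0.3]
    print(cosine_similarity(speaker_a_clip_1, speaker_a_clip_2))
    print(cosine_similarity(speaker_a_clip_1, speaker_b_clip_1))
```

A verifier built on such scores would typically accept a pair when the similarity exceeds a threshold tuned on held-out data; the threshold itself is application-specific.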
07
4 GB GPU · FLAGSHIP

Voice Assistant Pipeline

The full cascade: ASR → agent → TTS. End-to-end voice loop on a single GPU, microphone in, synthesized voice out.

  • Whisper
  • SpeechT5
  • cascade
08
CPU

Audio Visualization

STFT, mel-spectrograms, MFCCs, energy envelopes. The visual language for debugging audio models.

  • librosa
  • matplotlib
  • STFT
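To make the mel scale behind mel-spectrograms concrete, here is a small sketch of the Hz-to-mel conversion. It assumes the common HTK-style formula, mel = 2595 · log10(1 + f / 700); libraries may default to other variants, so treat the constants as one convention rather than the definition. The helper `mel_band_edges` is a hypothetical name introduced for this example.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel using the HTK-style formula (assumed convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int):
    """Return n_bands + 2 edge frequencies (Hz), equally spaced in mel."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 2)]


if __name__ == "__main__":
    for edge in mel_band_edges(0.0, 8000.0, 10):
        print(f"{edge:8.1f} Hz")
```

Printing the edges shows the bands packed tightly at low frequencies and stretched wide at high ones, which is the perceptual warping a mel-spectrogram applies to the STFT.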
09
8 GB GPU

Fine-tuning Whisper

Adapting Whisper to Hindi using FLEURS. A gentle introduction to fine-tuning, with WER you can chart against the baseline.

  • FLEURS
  • HF Trainer
  • WER
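Word error rate is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below is a minimal pure-Python version. It assumes whitespace tokenization and no text normalization, both simplifications; real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "the quick brown fox"
    print(word_error_rate(reference, "the quick brown box"))
    print(word_error_rate(reference, "the brown fox"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.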
10
8 GB GPU · ETHICS

Voice Cloning with F5-TTS

Zero-shot cloning, ethics-first. We build the system, then we discuss what should and shouldn't be built with it.

  • F5-TTS
  • flow-matching
  • consent
11
16 GB GPU · DEEP DIVE

Whisper-large + LoRA

A dozen fine-tune runs charted side by side. Full fine-tune vs. LoRA vs. frozen-encoder. When does parameter-efficient win?

  • Whisper-large-v3
  • peft / LoRA
  • scaling
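A back-of-the-envelope calculation shows why LoRA is called parameter-efficient. For a weight matrix of shape d_out × d_in, full fine-tuning trains d_out · d_in values, while a rank-r LoRA adapter trains two small matrices with r · (d_out + d_in) values in total. The sketch below works that arithmetic for a single matrix. The 1280 × 1280 shape and the rank of 8 are assumptions for illustration, not settings taken from the notebook.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_model, rank = 1280, 8  # assumed projection size and rank
    full = full_trainable_params(d_model, d_model)
    lora = lora_trainable_params(d_model, d_model, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The count covers one adapted matrix only; the total saving depends on which layers receive adapters, and trainable-parameter count alone does not predict final accuracy.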
12
8 GB GPU

Audio Language Models

Qwen2-Audio with bitsandbytes 4-bit quantization. The next paradigm after the cascade — audio as a first-class token.

  • Qwen2-Audio
  • bnb 4-bit
  • multimodal
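The appeal of 4-bit quantization is easiest to see as arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8 bytes. The sketch below computes that estimate. The 7-billion-parameter figure is a round number assumed for illustration, not a measured size of any specific model, and the estimate ignores quantization metadata, activations, and the KV cache.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    assumed_params = 7_000_000_000  # hypothetical round figure
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_memory_gb(assumed_params, bits):.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage to a quarter, which is what brings a model of this scale within reach of an 8 GB card.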


Browse all 12 notebooks →
Free Companion Resources

Speech AI infographics — print, pin, share.

Three free one-page guides pulled from across the book. Designed to be screenshot-able, printable, and immediately useful at your desk. Scan any QR to come back to this site.

Pulled from chapters 5–9. Task router (ASR / TTS / VC / Audio LMs / real-time agents) → ASR decision tree by labeled data budget → TTS picker → 11-model reference card.
Pulled from chapters 6 & 9. The four-stage pipeline (sample → encode → quantize → stack) with code-block visualizations of the tensor at each step, plus EnCodec / SoundStream / DAC / HuBERT / AudioLM / Mimi compared.
Cross-chapter reference. Eight models graded A–F across six dimensions (accuracy, speed, multilingual, open-weights, production readiness, cost), the four use-case winners, and a "pick by your hard constraint" matrix.

More cheat sheets coming: Voice Agent Latency Budget · Speech AI Ethics Checklist

Quick Start

From clone to transcription in five steps.

1

Clone the companion repo

# Bring it home
git clone https://github.com/prdeepakbabu/building-speech-ai.git
cd building-speech-ai
2

Create a virtual environment

python3 -m venv .venv
# macOS / Linux
source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1
3

Install

pip install -r requirements.txt
# Optional — for fine-tuning chapters 9–12
pip install -r requirements-extras.txt
4

Verify your environment

python examples/01_device_setup.py
# → CUDA / MPS / CPU detected, sample rate sanity checks
5

Synthesize, then transcribe

# Generate a 7-second test clip with SpeechT5
python examples/05_speecht5_tts.py \
  --text "The quick brown fox jumps over the lazy dog." \
  --output samples/test.wav

# Transcribe it with Whisper-tiny
python examples/02_whisper_basic.py --audio samples/test.wav

Hardware reality check

CPU
Notebooks 1, 2, 4, 8: runs anywhere Python runs.
4 GB GPU
Notebooks 3, 5, 6, 7: an entry-level GPU is plenty.
8 GB GPU
Notebooks 9, 10, 12: fine-tuning and audio LMs.
16 GB GPU
Notebook 11: Whisper-large LoRA scaling runs.

Verified end-to-end on an NVIDIA Tesla T4 (16 GB). Apple Silicon supports notebooks 1–8 via MPS; CUDA is required for the fine-tuning notebooks.

About the Author

Written by a builder, for builders.


Deepak Babu Piskala

Speech & Language AI · San Francisco

I've spent years building speech and language systems, from developing speech recognition and language understanding models at Alexa to architecting search and information-retrieval systems that evolved from bag-of-words to today's BERT-like embeddings. I've watched the field transform from hand-engineered features and HMM-GMM pipelines to end-to-end neural systems that match or exceed human performance.

I've seen models fail spectacularly in production and others succeed beyond anyone's expectations. This book distills what I've learned along the way — the conceptual foundations and the engineering trade-offs that decide whether a speech AI system actually ships.

prdeepakbabu.github.io

Get the Book

Three formats. One ASIN per format. Pick your fit.

Available now

Paperback

$44.95
  • 320 pages
  • 7 × 10 in trim
  • Standard color interior
  • ISBN: 979-8-249-50140-2
Buy on Amazon →
Pre-order · Releases Jul 1, 2026

Kindle eBook

$39.99
  • EPUB / Kindle format
  • Reflowable text
  • MathML for equations
  • ASIN: B0GX2WG2ZC
Pre-order on Amazon →
Coming soon

Hardcover

~$69.99
  • 320 pages
  • 7 × 10 in trim
  • Premium color, case-laminate
  • ISBN: 979-8-196-45507-0
In review at KDP

Worldwide availability via Amazon (US, UK, EU, Canada, Australia, Japan, India*). Each Amazon marketplace converts the price to local currency at order time. *India: paperback & Kindle only; hardcover not yet supported on Amazon.in.

For Academic Use

How to cite this book.

If this book helped your research, teaching, or production work, please cite it. Citations help self-published technical books reach the readers who'd benefit most.

APA 7
Piskala, D. B. (2026). Building Speech AI: A practitioner's guide to speech recognition, synthesis, and audio language models with Python. Independently published. ISBN 979-8-249-50140-2.
MLA 9
Piskala, Deepak Babu. Building Speech AI: A Practitioner's Guide to Speech Recognition, Synthesis, and Audio Language Models with Python. Independently published, 2026.
Chicago (author-date)
Piskala, Deepak Babu. 2026. Building Speech AI: A Practitioner's Guide to Speech Recognition, Synthesis, and Audio Language Models with Python. Independently published.
BibTeX
@book{piskala2026buildingspeech,
  author    = {Piskala, Deepak Babu},
  title     = {Building Speech AI: A Practitioner's Guide to Speech Recognition,
               Synthesis, and Audio Language Models with Python},
  year      = {2026},
  publisher = {Independently published},
  isbn      = {979-8-249-50140-2},
  url       = {https://www.amazon.com/dp/B0H1ZXP6YS},
  note      = {Companion code: \url{https://github.com/prdeepakbabu/building-speech-ai}}
}

Paperback ISBN 979-8-249-50140-2 · Hardcover ISBN 979-8-196-45507-0 · Kindle ASIN B0GX2WG2ZC. Cite the paperback ISBN by default — it is the canonical print edition.

Read the book. Run the code. Ship the system.

The future of how humans and machines communicate is being written now, by people like you.