Beyond Text: A Speech-First Future for Foundation Models

The Position

One sentence to remember.

Our Position

The ML community should prioritize building speech-native foundation models — architectures that learn from audio as a first-class modality rather than grafting speech onto text-pretrained systems — because the data ecosystem underlying AI is poised to shift toward speech-first knowledge generation.

"Text feels natural largely because it is habitual. Speech, by contrast, has been treated as a convenience feature — yet it is humanity's most fundamental communication modality, preceding writing by millennia."

This page distills the paper's argument into interactive visualizations: the data-distribution shift, the technical readiness, the habit barrier, the counterarguments we address, and the call to action for researchers, industry, and the broader community.

§1 · The Data Shift

Present is keyboards. Future is everywhere else.

As interaction shifts from keyboard- and screen-based inputs toward speech-first wearables and ambient devices, the dominant data modality transitions from text-heavy corpora to audio-rich representations. Models trained on yesterday's data distribution will be increasingly misaligned with tomorrow's.

Figure 1. Evolution of human–computer interaction interfaces, from the keyboard era to the ambient era, and their implications for training data distributions. The dominant data modality transitions from text-heavy corpora to audio-rich representations.

The feedback loop that favored text — for decades.

① Interfaces shape expression. Search boxes condition users to express queries as keyword lists; voice enables natural language questions with context and nuance.

② Expression shapes data. What users externalize becomes the training corpus for future models.

③ Data shapes models. The modalities, structures, and biases in training data determine what AI systems learn and how they represent knowledge.

Each step reinforces the next. Speech-first interfaces are how we reroute the loop.

The feedback loop that favored text: interfaces → expression → data → models.

§2 · Why Speech, Why Now

As complexity grows, typing breaks down.

For short queries, both modalities are roughly comparable. But as expressiveness rises — emotion, multi-turn reasoning, long compositions — typing scales non-linearly. Speech remains nearly flat. This crossover is where adoption tips.

Speech, on mobile: ~3× faster than smartphone typing, at ~153 WPM spoken vs ~52 WPM typed (Ruan et al., 2016).

Speech doesn't just save time. It changes what users are willing to express in the first place.
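The ~3× figure follows directly from the two quoted rates. The sketch below reproduces that arithmetic and applies it to a 100-word message, which is a hypothetical example rather than a case from the paper. It assumes constant rates, so it does not model the complexity-dependent curve described above.

```python
# Rates quoted above (Ruan et al., 2016).
SPEECH_WPM = 153
TYPING_WPM = 52


def seconds_to_enter(words: int, wpm: float) -> float:
    """Seconds needed to enter `words` words at a steady rate of `wpm`."""
    return words / wpm * 60


speedup = SPEECH_WPM / TYPING_WPM
spoken_s = seconds_to_enter(100, SPEECH_WPM)  # hypothetical 100-word message
typed_s = seconds_to_enter(100, TYPING_WPM)

print(f"speedup: {speedup:.2f}x")                                    # speedup: 2.94x
print(f"100 words: {spoken_s:.0f} s spoken vs {typed_s:.0f} s typed")  # 39 s vs 115 s
```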

Figure 2. As query complexity increases, speech becomes increasingly efficient relative to typing (series: Typing, Speech).

§3 · Technical Readiness

The barriers have already fallen.

Speech recognition has transformed from a research curiosity into infrastructure: accurate enough for deployment, efficient enough for on-device, scalable enough to process the world's audio. The remaining gap is architectural — speech is accessible to LLMs, but not yet native.

Cascaded vs. Speech-Native

Legacy · Cascaded

ASR → LLM → TTS

Audio is converted to text, processed by a text-trained LLM, and synthesized back to speech. Paralinguistic information — tone, emphasis, hesitation, emotion — is discarded at the bottleneck.

🎙 Audio → 📝 ASR text → 🧠 LLM → 📝 Text reply → 🔊 TTS

Total latency ≈ 1,200 ms

Speech-Native · GPT-4o, Gemini Live

End-to-end audio

Audio tokens are processed directly by the foundation model alongside text tokens; the model emits interleaved audio + text and a vocoder synthesizes voice. Paralinguistic information is preserved.

🎙 Audio tokens → 🧠 Audio-Text LMM → 🔊 Audio tokens → 🎵 Vocoder

Total latency ≈ 232 ms, within human range
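To illustrate the bottleneck in the cascaded design, here is a toy sketch of what a text transcript throws away. `Utterance`, its `tone` field, and `toy_asr` are hypothetical names invented for this example. No real ASR system is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    words: str  # lexical content
    tone: str   # toy stand-in for paralinguistic cues (tone, emphasis, hesitation)


def toy_asr(utterance: Utterance) -> str:
    """Hypothetical ASR stage: keeps only the words, as a plain transcript would."""
    return utterance.words


sincere = Utterance("oh, great", tone="sincere")
sarcastic = Utterance("oh, great", tone="sarcastic")

# The two inputs differ, yet the text handed to the LLM is identical.
print(sincere != sarcastic)                    # True
print(toy_asr(sincere) == toy_asr(sarcastic))  # True
```

Once both utterances map to the same string, nothing downstream of the ASR stage can tell them apart. A model that consumes audio tokens directly does not have this constraint.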

Cascaded latency budget — where the milliseconds go

Each stage of a cascaded pipeline contributes to total latency, far exceeding the ~200ms human conversational norm.

Cascaded pipeline: 1,200 ms. Six times the human norm. Conversation feels broken.

Speech-native (GPT-4o): 232 ms. End-to-end audio, within human turn-taking range.

Figure 6. End-to-end voice assistant pipeline showing latency budget.
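The comparison against the human norm is simple division. The sketch below uses only the figures quoted in this section and treats ~200 ms as exactly 200 ms for the purpose of the calculation.

```python
HUMAN_GAP_MS = 200  # approximate conversational norm quoted above

latencies_ms = {"cascaded": 1200, "speech-native": 232}

# Ratio of each pipeline's total latency to the human turn-taking gap.
ratios = {name: ms / HUMAN_GAP_MS for name, ms in latencies_ms.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the human gap")
# cascaded: 6.00x the human gap
# speech-native: 1.16x the human gap
```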

Inside a speech-native foundation model

A key architectural pattern: an LLM checkpoint serves as the foundational backbone, extended with custom modal tokens. Audio is discretized into tokens (via HuBERT or wav2vec 2.0) and interleaved with text. The model autoregressively produces both — audio passed through a vocoder for synthesis, text emitted directly.

Figure 4. Speech-native architecture: audio encoder → joint audio+text token stream → bootstrapped LMM → interleaved audio+text output → vocoder + detokenizer.
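The discretize-and-interleave step can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's recipe. The nearest-centroid lookup stands in for a learned audio tokenizer. The vocabulary layout (`TEXT_VOCAB_SIZE`, `BEGIN_AUDIO`, `END_AUDIO`, `AUDIO_OFFSET`) and the toy 2-D frames are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]

# Hypothetical vocabulary layout: text ids first, then two modal tokens,
# then one id per audio unit. None of these values come from the paper.
TEXT_VOCAB_SIZE = 1000
BEGIN_AUDIO = TEXT_VOCAB_SIZE       # custom modal token opening an audio span
END_AUDIO = TEXT_VOCAB_SIZE + 1     # custom modal token closing an audio span
AUDIO_OFFSET = TEXT_VOCAB_SIZE + 2  # first id reserved for audio units


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature frame to the index of its nearest codebook vector."""
    def dist2(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: dist2(frame, codebook[k]))
            for frame in frames]


def interleave(text_ids: Sequence[int], audio_units: Sequence[int]) -> List[int]:
    """Append an audio span, wrapped in modal tokens, to a run of text ids."""
    audio_ids = [AUDIO_OFFSET + unit for unit in audio_units]
    return [*text_ids, BEGIN_AUDIO, *audio_ids, END_AUDIO]


codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # toy 2-D centroids
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]   # toy encoder outputs
units = quantize(frames, codebook)
stream = interleave([5, 7], units)

print(units)   # [0, 1, 2]
print(stream)  # [5, 7, 1000, 1002, 1003, 1004, 1001]
```

Shifting audio units past the text vocabulary lets a single embedding table and a single autoregressive head cover both modalities, which is one way to read "extended with custom modal tokens".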

The trajectory of foundation speech models

From supervised speech-only systems to self-supervised multimodal architectures jointly processing audio, text, and vision — fewer labels, more modalities.

Speech-only → Speech + Text → Audio + Vision + Text

Figure 3. Evolution of speech and multimodal models, from supervised speech-only systems to self-supervised multimodal architectures (Baevski et al., 2020; Hsu et al., 2021; Girdhar et al., 2023).

§4 · Habit Inertia

Not a tech problem. A human one.

If the technical barriers have largely been removed, why hasn't voice replaced text? The cognitive and behavioral cost of switching from a familiar workflow to an unfamiliar one — even when the new workflow is objectively superior — is the primary friction. This is a human problem with technical consequences.

A 4× gap in turn-taking timing

Human conversation has gaps clustered tightly around ~200ms. Today's voice assistants typically respond in ~900–1,000ms — feeling slow, prompting interrupts and disengagement.

Frequency of response times. Adapted from Stivers et al. (2009); Levinson & Torreira (2015).

~200 ms: human turn-taking gap (median, multiple languages)

~900 ms: typical assistant response delay (cascaded pipelines)

Three classes of friction

Technical: largely solved

Accuracy & robustness · Latency & endpointing · Noise & environmental factors · Out-of-domain generalization. ASR is at near-human WER on standard benchmarks; remaining gaps are accent and domain coverage.

Social: context-dependent

Public awkwardness · Privacy perception · Shared spaces. 100% of voice users in our survey reported using it at home, but only 25% at work and even fewer in public.

Behavioral: primary remaining barrier

Habit inertia · Discoverability · Trust calibration. Decades of typing have made it feel "natural" — but naturalness is acquired, not inherent. Many users simply forget the capability exists.

Implication for ML: why this matters here

As long as users default to text, the data ecosystem stays text-dominated, reinforcing text-centric model development. Breaking the cycle requires either compelling interfaces or models that can learn from speech that already exists — podcasts, meetings, lectures, voice messages.

Figure 5. Taxonomy of barriers to voice adoption — adapted from Klein et al. (2024).

§4.1 · Empirical Evidence

What 200 voice users actually told us.

We conducted an informal survey (N=200) of voice interface users — illustrative evidence, not statistically representative. The pattern is revealing: the top barriers (slowness, accuracy) are technical issues that are rapidly improving. The persistent ones are behavioral and social.

Primary barriers cited

Too Slow / Prefer Typing: 48%
Poor Quality: 43%
Single-Turn Limitation: 38%
Social Discomfort: 33%
Privacy Concerns: 29%
Not Intuitive: 19%

Figure 8. Survey results — speed and quality lead, but social and behavioral factors persist.

Adoption & perceived quality

Use voice: 76%
Don't use: 24%

Quality "Great": 6%
Quality "Average": 69%
Quality "Poor": 25%

100% use voice at home
25% use voice at work
63% invoke via wake word
6% rate recognition "Great"

Among those who rated recognition quality, only 6% described it as "Great" while 69% rated it "Average." Technical improvements alone are insufficient: voice interfaces must also address user perception, social context, and discoverability to achieve mainstream adoption.

§5 · Counterarguments

The strongest objections — and our response.

Position papers earn their keep by engaging serious counterarguments. Here are the four strongest objections to a speech-first agenda, and how we address each.

Text is more computationally efficient.

Our response

This argument optimizes for the current cost ratio, which is not fixed. Self-supervised learning has already collapsed audio processing costs: wav2vec 2.0 and HuBERT enable training on audio without transcription labels. Hierarchical tokenization (AudioLM, SoundStorm) compresses audio to 50–75 tokens/sec, approaching text density. The question is not which modality is cheaper today, but which will be more representative tomorrow.
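To make the quoted token rates concrete, sequence length is duration times rate. The sketch below applies the 50–75 tokens/sec range to one hour of audio. The one-hour duration and the helper name `audio_tokens` are hypothetical choices for illustration.

```python
def audio_tokens(duration_s: float, tokens_per_s: float) -> int:
    """Number of audio tokens produced for a recording of `duration_s` seconds."""
    return round(duration_s * tokens_per_s)


ONE_HOUR_S = 3600  # hypothetical recording length

low = audio_tokens(ONE_HOUR_S, 50)   # lower end of the quoted range
high = audio_tokens(ONE_HOUR_S, 75)  # upper end of the quoted range

print(f"one hour of audio: {low:,} to {high:,} tokens")
# one hour of audio: 180,000 to 270,000 tokens
```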

Paralinguistic information doesn't justify the cost.

Our response

This reflects a text-centric definition of "what matters." For many real-world applications — customer service, healthcare, education, mental-health support — emotional state and speaker intent are central, not peripheral. Sarcasm, rhetorical questions, and emphasis change meaning without changing words.

User preference for text reflects genuine advantages.

Our response

We agree text has real advantages — privacy in shared spaces, scannability, archival. Our position is not that speech replaces text. It's that the proportion of knowledge generated via speech will grow, and models should be prepared for a data ecosystem where both modalities are common, rather than treating text as the default and speech as an edge case.

The habit shift may never happen.

Our response

QWERTY persisted because the switching cost exceeded the benefit. For speech, the switching cost is lower (no new skill to learn) and the benefit higher (3× input speed, hands-free, accessibility). The shift is already visible: voice search queries exceed 20% of mobile searches; smart-speaker adoption is growing; voice messaging is exploding worldwide. The question is not whether speech usage will grow, but how fast — and whether the ML community will be prepared.

§6 · Call to Action

Concrete steps toward a speech-native future.

The preceding analysis motivates specific actions. Three audiences, three agendas — but a shared direction.

For researchers

1. Develop audio-native benchmarks

Current evaluation suites (GLUE, SuperGLUE, MMLU) are text-centric. We need benchmarks that measure reasoning over audio: comprehension, summarization, and QA on audio inputs without intermediate text. ProfASR-Bench (Piskala, 2025) is one example.

2. Explore speech-first pretraining

Rather than fine-tuning text-trained models on audio, investigate what happens when models are pretrained on audio from the start. Does it produce different representational structures? Does it generalize differently downstream?

3. Design semantic audio tokens

Move beyond phonetic tokenization (HuBERT, wav2vec) toward representations that preserve meaning and paralinguistic information: audio encoders whose latent spaces reflect semantic similarity, not just acoustic similarity.

For industry

1. Invest in speech-data infrastructure

The podcasts, meetings, and voice messages being generated today are tomorrow's training data. Building pipelines to collect, clean, and curate speech, with appropriate consent and privacy protections, is a strategic investment.

2. Deploy speech-native interfaces

Rather than retrofitting voice onto text-designed products, design experiences where speech is the primary modality. This generates training signal for speech-native models and shifts user habits.

3. Open-source speech-native models

GPT-4o and Gemini Live are proprietary. The research community needs open alternatives to study speech-native architectures, identify failure modes, and iterate on designs.

For the broader community

1. Diversify speech datasets

Current speech corpora overrepresent English and standard accents. Expanding to underrepresented languages, dialects, and speaking styles ensures speech-native models serve global populations.

2. Address privacy concerns

Speech data is uniquely sensitive (voiceprints, ambient sounds, emotional state). Privacy-preserving techniques such as federated learning, on-device models, and differential privacy are an ethical imperative and an adoption requirement.

3. Engage with HCI researchers

The habit barriers we identified require interdisciplinary solutions. ML researchers should collaborate with HCI, cognitive science, and design communities to build interfaces users actually want to speak to.

§7 · Conclusion

The transition won't happen overnight. But it's already underway.

The technical barriers have largely been removed. Modern ASR achieves near-human accuracy. Speech-native models demonstrate sub-300ms latency. Self-supervised learning has made audio processing tractable at scale. What remains is primarily habit inertia — users default to text not because it is superior, but because decades of keyboard-centric interfaces have made it familiar.

As voice becomes habitual, the data ecosystem underlying ML will shift toward speech-first knowledge generation. Models designed for text-centric assumptions will be increasingly misaligned with future training distributions. The transition will not happen overnight, and text will remain important — but the ML community should anticipate this shift rather than react to it.

Models that can learn from the speech data that already exists — podcasts, meetings, lectures, voice messages — and architectures that preserve paralinguistic information natively will be better positioned for a future where humans increasingly speak their knowledge before writing it down.

BibTeX

@inproceedings{dbpiska2026beyondtext,
  title     = {Position: The Text-Centric Bias in Foundation Models
               Must Be Revisited for a Speech-First Future},
  author    = {Deepak Babu Piskala},
  booktitle = {Proceedings of the 43rd International Conference
               on Machine Learning (ICML)},
  year      = {2026},
  note      = {Spotlight Presentation (top 5\% of accepted papers)}
}
