ElevenLabs co-founder and CEO Mati Staniszewski agrees. “The core technology behind synthetic voices is on a clear trajectory towards commoditization, even as we continue to invest heavily in building world-class audio models.” Put differently: today’s advantage is still the model, but defensibility will move to everything else wrapped around it—data, safety, multimodal orchestration, distribution, and product experience.
Several factors are driving this commoditization. Speech synthesis and voice-cloning architectures are converging: diffusion, transformer, and flow-based models are being refined across labs, while techniques like distillation, quantization, and on-device caching are dramatically reducing inference latency. Just as the price per token plummeted in text LLMs once competitors entered the market, audio is following suit: per-minute cost is dropping rapidly, and inference keeps getting cheaper as GPU utilization improves with newer Nvidia generations and runtime specialization. Open research is accelerating the curve.

Meta’s AudioCraft family, academic research on expressive TTS and zero-shot cloning, and open-source vocoders are only the beginning: the “hows” of generating speech are becoming common knowledge. Staniszewski’s conclusion follows a well-trodden pattern in AI infrastructure.
Data rights, safety, and provenance will decide winners
Proprietary data and rights management will be decisive as well. Consent-based, licensed voice corpora, plus the tooling to manage them, create defensibility in a way that raw model IP does not. Enterprise buyers increasingly demand provenance and opt-in records, especially for celebrity voices, brand voices, and brand ambassadors, or in regulated sectors such as financial services and healthcare.
Safety and provenance will also separate winners. The FCC has moved to crack down on AI voice robocalls, and the EU AI Act mandates clear labeling of synthetic media. Watermarking, detection, and C2PA-aligned content credentials will shift from “nice to have” to procurement checkboxes. The vendors that can guarantee low false-positive rates while preserving audio quality will win the trust of studios, platforms, and public institutions.
Finally, the product matters. As Staniszewski hinted with the Apple analogy above, the moat forms where model and application meet: script-to-screen pipelines for dubbing, one-click localization at broadcast quality, voice-over interfaces tuned for game engines, or CRM-integrated conversational agents. Owning the workflow and the use case matters more than owning a single model, and strong multilingual support multiplies the value of every one of those workflows.

Real-time voice and orchestration are raising the bar
The next phase is fused systems. Voice models will closely pair with large language models for real-time turn-taking, emotion control, and tool use, and with video models for synchronized lip movements and scene timing. Google’s Veo 3 shows how generative video models benefit when multiple models align; the same is true for voice-led experiences, including virtual hosts, live customer support, and interactive learning.
Recent rollouts of real-time voice in flagship assistants—from conversational modes in leading LLMs to enterprise contact center platforms—show how latency budgets under 300 milliseconds and dynamic prosody control are becoming table stakes. In this world, orchestration, not any single model, determines the experience.
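To make the orchestration point concrete, here is a minimal sketch of a voice-turn loop with a latency budget. The transcribe, respond, and synthesize functions are hypothetical stand-ins for whatever ASR, LLM, and TTS components a team wires together; they are not any vendor’s actual API, and the 300 ms figure is simply the rough target cited above.

```python
# Sketch: one conversational "turn" assembled from swappable components,
# measured against a latency budget. All component functions are placeholders.
import time

LATENCY_BUDGET_MS = 300  # rough real-time target cited above


def transcribe(audio_chunk: bytes) -> str:
    """Placeholder ASR step."""
    return "user utterance"


def respond(text: str) -> str:
    """Placeholder LLM step (turn-taking, tool use, emotion tags would live here)."""
    return f"reply to: {text}"


def synthesize(text: str) -> bytes:
    """Placeholder TTS step (prosody controls would be parameters here)."""
    return text.encode("utf-8")


def handle_turn(audio_chunk: bytes) -> bytes:
    start = time.perf_counter()
    reply_audio = synthesize(respond(transcribe(audio_chunk)))
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > LATENCY_BUDGET_MS:
        # A production system would stream partial audio or degrade gracefully
        # (shorter replies, cached prosody) rather than just log the overrun.
        print(f"turn exceeded budget: {elapsed_ms:.0f} ms")
    return reply_audio


if __name__ == "__main__":
    handle_turn(b"\x00\x01")
```

The point of the sketch is that the experience is defined by how the pieces are sequenced and budgeted, not by any single model in the chain.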
Partnerships and an open-source strategy to integrate the best tools
ElevenLabs plans to pair its audio stack with third-party and open-source components. That hybrid approach mirrors how the ecosystem is maturing: teams mix commercial TTS with open ASR, dubbing with studio-grade denoisers, or a smaller on-device voice for privacy with a cloud LLM for reasoning. The goal is practical: reduce cost, meet compliance, and ship faster without compromising quality.
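One way such a hybrid stack shows up in practice is routing: privacy-sensitive requests stay on a small local voice model while everything else goes to a cloud stack. The sketch below is illustrative only; the component names and the routing rule are assumptions, not any company’s actual architecture.

```python
# Sketch: route requests between a hypothetical on-device voice and a
# hypothetical cloud TTS/LLM stack. Names and rules are illustrative.
from dataclasses import dataclass


@dataclass
class VoiceRequest:
    text: str
    contains_pii: bool = False
    needs_reasoning: bool = False


def synthesize_on_device(text: str) -> bytes:
    """Stand-in for a small local TTS model kept for privacy-sensitive content."""
    return text.encode("utf-8")


def synthesize_cloud(text: str) -> bytes:
    """Stand-in for a commercial cloud TTS paired with a cloud LLM."""
    return text.upper().encode("utf-8")


def route(request: VoiceRequest) -> bytes:
    # Keep personal data local; send reasoning-heavy or quality-critical
    # requests to the cloud stack.
    if request.contains_pii and not request.needs_reasoning:
        return synthesize_on_device(request.text)
    return synthesize_cloud(request.text)
```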
The company’s stance also reflects its market reality. ElevenLabs, which has raised significant capital and built a recognizable brand for natural-sounding voices and dubbing, needs to compete on breadth—support for more languages and styles—while enabling developers to swap in components as needs change. Being the best integrator can matter as much as being the best model builder.
What it means for developers and brands adopting voice AI
- Plan for portability. Use abstraction layers so you can switch voice providers without refactoring your app (see the sketch after this list). Track latency, cost, and failure modes across vendors, and run periodic bake-offs, especially if you operate at scale or across regions with different privacy rules.
- Invest in consent and compliance early. Capture voice rights, talent contracts, and disclosure flows up front.
- Adopt watermarking and C2PA metadata to future-proof distribution on platforms that will increasingly require provenance signals.
- Design for multimodal from day one. Define guardrails for tone, emotion, and timing, and instrument experiments to measure engagement and trust, not just cost per minute.
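As a minimal sketch of the portability idea in the first bullet, the application can depend on one small interface rather than on any vendor SDK. The provider classes below are hypothetical adapters, not real vendor APIs.

```python
# Sketch: a thin provider abstraction so swapping vendors is a config change,
# not a refactor. Vendor adapters here are placeholders, not real SDK calls.
from abc import ABC, abstractmethod


class VoiceProvider(ABC):
    """The interface the application codes against."""

    @abstractmethod
    def synthesize(self, text: str, voice_id: str) -> bytes:
        ...


class VendorAAdapter(VoiceProvider):
    def synthesize(self, text: str, voice_id: str) -> bytes:
        # A real adapter would call vendor A's API here and record latency,
        # cost, and failures so bake-offs compare vendors on the same metrics.
        return f"[vendor-a:{voice_id}] {text}".encode("utf-8")


class VendorBAdapter(VoiceProvider):
    def synthesize(self, text: str, voice_id: str) -> bytes:
        # A second adapter; the application code above it never changes.
        return f"[vendor-b:{voice_id}] {text}".encode("utf-8")


def get_provider(name: str) -> VoiceProvider:
    providers = {"vendor_a": VendorAAdapter, "vendor_b": VendorBAdapter}
    return providers[name]()


if __name__ == "__main__":
    audio = get_provider("vendor_a").synthesize("Hello, world", voice_id="narrator")
```

Keeping vendor-specific code confined to adapters is also where consent records, watermarking calls, and provenance metadata can be attached consistently, regardless of which provider generated the audio.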
The value curve in synthetic audio is shifting beyond models
The takeaway from Staniszewski’s forecast is not that models don’t matter today—they do—but that the value curve is moving. As audio generation gets cheaper and more uniform, the durable edge will live in licensed data, safety, distribution, and the polished products that make synthetic voice feel seamless, useful, and responsibly deployed.
