Text-to-speech latency: the jambonz leaderboard

Dave Horton

The emergence of AI and Large Language Models (LLMs) onto the tech landscape promises to reshape everything: how we work, how we play, and how we engage with others. Of course - let’s be honest: not much of that has happened yet. Someday we’ll surely experience the “sonic boom” moment when the actual rate of progress catches up to the hype, but sorry folks, we’re not there yet.

Instead, the most notable impact to date has been the refocusing of huge amounts of private and public capital into any and all product categories thought to either benefit from or drive AI technologies. Those of us laboring to make our daily bread in the CX/AI space find ourselves the lucky beneficiaries of this effect. We get to play with new speech technologies developed by startup companies newly flush with VC cash and eager to brag about how many NVIDIA GPUs they bought over the weekend. For those of us working adjacent to them and out of the VC spotlight, it’s like eating at the high school table with the rich kids who suddenly and inexplicably want to share their nicely packaged lunches with us.

I’ll be honest, the sprouting-up of new text-to-speech (TTS) vendors that we’ve seen over the past year or so was not something I expected because, quite frankly, I thought your dad’s Google TTS and Microsoft TTS were pretty damn fine, not to mention that the investment theme is lost on me when a market already has close to commodity-level pricing. Oh well, that just goes to show what I know.

In our upcoming jambonz 0.9.0 release we’ve added support for TTS services from a bunch of these sassy newcomers that want to challenge the giants, and we thought it would be a good time to put them to the test. What age-old story are we going to see here: the new upstarts disrupting the failing dinosaurs? Or would it be the well-heeled Daddy Warbucks incumbents quashing the neophytes? Let’s find out!

Introducing your contestants!

The jambonz open source voice gateway has been widely adopted by CX/AI providers, including the leading vendors in the space. Our “bring your own everything” design lets customers connect their preferred carriers and speech vendors, so we have always made it our mission to give them the broadest selection of speech vendors for both text-to-speech and speech-to-text.

We also strive to give customers detailed insights into the behavior of their service through an open telemetry observability framework that reports on data such as time-to-first-byte for TTS requests.

jambonz observability dashboard showing TTS time-to-first-byte telemetry

In the upcoming 0.9.0 release we added several new vendors for text-to-speech, and we’ve also made an effort to support streaming APIs where available to reduce the latency experienced by users, so it seemed like a good time to do some benchmark testing and establish a leaderboard. In our testing we compared Deepgram, Elevenlabs, Google*, Microsoft, PlayHT, RimeLabs, and Whisper.

*With all other vendors we measured time-to-first-byte; however with Google we were forced to measure time-to-last-byte as we have not implemented a proper streaming API integration for them (yet).

The testbed

We ran the tests using a jambonz server running in the AWS us-east-1 region on a single EC2 t2.medium instance. We ran against the hosted SaaS service for each of the vendors. We tested two different scenarios, both common to conversational AI:

We tested five variations each of short and long prompts on each TTS engine, all in English:

short prompts

longer prompts

In all cases (except Google, as described above) we measured the time from sending the request to the service to receiving the first byte of audio. We give more details on the configuration of each TTS service later in this blog post.
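As a rough illustration of this measurement, the Python sketch below starts a clock when the request is sent and stops it when the first non-empty audio chunk arrives. It is not taken from jambonz: `fake_tts_stream` is a hypothetical stand-in for a vendor's streaming response, and the 50 ms delay is an arbitrary value chosen for the demo.

```python
import time
from typing import Iterable, Iterator, Optional


def time_to_first_byte_ms(chunks: Iterable[bytes], started_at: float) -> Optional[float]:
    """Return milliseconds from started_at until the first non-empty chunk."""
    for chunk in chunks:
        if chunk:  # skip empty chunks, they carry no audio
            return (time.perf_counter() - started_at) * 1000.0
    return None  # the stream ended without delivering any audio


def fake_tts_stream(first_chunk_delay_s: float) -> Iterator[bytes]:
    """Hypothetical vendor stream: waits, then yields two audio chunks."""
    time.sleep(first_chunk_delay_s)  # simulated synthesis and network delay
    yield b"\x00\x00" * 160
    yield b"\x01\x00" * 160


started = time.perf_counter()  # the clock starts when the request is sent
ttfb = time_to_first_byte_ms(fake_tts_stream(0.05), started)
print(f"time to first byte: {ttfb:.0f} ms")
```

For a non-streaming integration, such as the Google case described above, the same clock would instead be stopped once the complete response has been received, which is why that number is a time-to-last-byte.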

Results

Before we review the results, there is one additional subtlety to be aware of when measuring latency. Here we are measuring time to first byte, which is an important metric. However, all providers send a small amount of silence at the beginning of the generated audio, and we found that the amount differs by provider. What the user actually experiences is the time to first byte plus the duration of the leading silence. In our experience, the vendors fell into two categories:

Keep these in mind as we review the results.
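To make the "time to first byte plus leading silence" arithmetic concrete, here is a small Python sketch. It assumes 16-bit signed mono PCM in native byte order and treats any sample whose magnitude is at or below an arbitrary threshold as silence. The threshold, the sample audio, and the 100 ms TTFB are made-up illustrative values, not figures from our tests.

```python
import array


def leading_silence_ms(pcm: bytes, sample_rate_hz: int, threshold: int = 200) -> float:
    """Milliseconds of audio before the first sample louder than threshold.

    Assumes 16-bit signed mono PCM in the machine's native byte order.
    """
    samples = array.array("h")
    samples.frombytes(pcm[: len(pcm) - (len(pcm) % 2)])  # drop a trailing odd byte
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            return index * 1000.0 / sample_rate_hz
    return len(samples) * 1000.0 / sample_rate_hz  # the whole buffer is silent


def perceived_latency_ms(ttfb_ms: float, silence_ms: float) -> float:
    """What the listener waits for: network/synthesis delay plus leading silence."""
    return ttfb_ms + silence_ms


# 640 silent samples at 8 kHz (80 ms) followed by 160 loud samples
audio = array.array("h", [0] * 640 + [5000] * 160).tobytes()
silence = leading_silence_ms(audio, 8000)
print(perceived_latency_ms(100.0, silence))  # 100 ms TTFB + 80 ms silence = 180.0
```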

Without further ado, here are the results from the tests using the short audio requests.

Bar chart of short-prompt time-to-first-byte by vendor

and here are the results from testing the longer audio segments.

Bar chart of long-prompt time-to-first-byte by vendor

And here is the detailed data from the tests.

short audio tests - time to first byte (ms)

Prompt  Deepgram  Elevenlabs  Google  Microsoft  PlayHT  RimeLabs  Whisper
1            324         481     183        127     121       326      405
2            348         451     173        345      61       248      649
3            355         613     185        293      75       250      601
4            324         645     274        427      59       201      472
5            355         471     192        318      50       187      470
avg.         341         532     201        302      73       242      519

long audio tests - time to first byte (ms)

Prompt  Deepgram  Elevenlabs  Google  Microsoft  PlayHT  RimeLabs  Whisper
1            460         771     450        293      67       642      600
2            363         739     420        345      50       320      772
3            357         833     465        356     168       306      581
4            435        1404     338        364      65       322      781
5            472         783     367        409     111       340      830
avg.         417         906     408        353      92       386      712
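
For readers who want to re-derive the averages or re-rank the vendors, the following Python sketch recomputes the per-vendor means from the per-prompt numbers above and sorts them fastest first. The variable and function names are ours for illustration. The table shows whole milliseconds, so the unrounded means can differ from it in the last digit, and the Google column remains a time-to-last-byte figure.

```python
from statistics import mean

VENDORS = ["Deepgram", "Elevenlabs", "Google", "Microsoft", "PlayHT", "RimeLabs", "Whisper"]

# One row per prompt, columns in VENDORS order (values in ms, copied from the tables)
SHORT_MS = [
    [324, 481, 183, 127, 121, 326, 405],
    [348, 451, 173, 345, 61, 248, 649],
    [355, 613, 185, 293, 75, 250, 601],
    [324, 645, 274, 427, 59, 201, 472],
    [355, 471, 192, 318, 50, 187, 470],
]
LONG_MS = [
    [460, 771, 450, 293, 67, 642, 600],
    [363, 739, 420, 345, 50, 320, 772],
    [357, 833, 465, 356, 168, 306, 581],
    [435, 1404, 338, 364, 65, 322, 781],
    [472, 783, 367, 409, 111, 340, 830],
]


def leaderboard(rows):
    """Return (vendor, mean ms) pairs sorted from fastest to slowest."""
    columns = zip(*rows)  # transpose: one tuple of measurements per vendor
    averages = [(vendor, mean(col)) for vendor, col in zip(VENDORS, columns)]
    return sorted(averages, key=lambda pair: pair[1])


for title, rows in (("short prompts", SHORT_MS), ("long prompts", LONG_MS)):
    print(title)
    for vendor, avg in leaderboard(rows):
        print(f"  {vendor:<11}{avg:7.1f} ms")
```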

Our findings

Wow! We were not expecting this.

Summary

Our main takeaway is how fast all of these vendors are. A year ago, we would have been happy with sub-second results. Now we are hungering for, and in some cases getting, time-to-first-byte results of less than 100 milliseconds. All of these vendors provide great products that are worth evaluating for those planning their CX/AI rollout. We’re looking forward to the vendors polishing things like speech cadence and the minor imperfections that we encountered.

We should note that we are also happy to work directly with any of these vendors to collaborate on testing or on fine-tuning our integrations if necessary to improve performance and overall user experience. We will update our leaderboard from time to time, and we are always adding support for new vendors so reach out to us if you provide a TTS service and would like to be included in future reports.

Also, feel free to create a free account on the jambonz cloud to try out jambonz!

Appendix: Notes on our configuration

A few notes on how we configured each speech service.

vendor      model            voice
Deepgram    Aura             Asteria
Elevenlabs  turbo-v2         Serena
Google      -                Wavenet-C
Microsoft   -                Ava (multi-lingual)
PlayHT      PlayHT2.0-Turbo  Jennifer
RimeLabs    Mist             Abby
Whisper     tts-1            Alloy