Continua's Past, Present, and Future with Podcast Audio Generation
Today’s post rounds out our series on podcast creation and covers audio generation. Future posts will discuss the work beyond the podcast itself: interactivity, building out a UI, and more. See our previous posts here: [1], [2], [3], [4]. While arguably the largest piece of a podcast, the audio is the piece we have the least control over and the area where we see the greatest potential for future exploration. In this post, I’ll walk you through how we generated audio in the past, the model we use now and some of its quirks, and what we plan to do in the future.
What We Tried Before
When we first started working on generative podcasts back in October, there was no speech dialogue model available. Instead, we used single-voice models to generate each host’s lines independently and stitched the segments together. We began by experimenting with OpenAI’s audio API as well as Google’s single-speaker Studio voices. Below are examples of the same dialogue generated by both single-speaker models.
OpenAI:
Google Single-Speaker Studio:
Our preference was Google’s voices, so we started using them in production. The main issue with Google’s TTS models is that the better the voice quality, the lower the controllability. For instance, Google’s older models accept SSML input, which lets you control pauses, pronunciation, emphasis, and other details of speech. The single-speaker Studio voice allows some customization, but not the <emphasis> and <prosody> tags, which control how excited a speaker sounds, how quickly they speak, and the pitch of their voice. It’s also important to note that neither the single- nor the two-speaker Studio voices support streaming, which is essential for real-time speech.
In the following clip, I’m using the Standard Google voices with SSML input. If you listen closely, you’ll note subtle differences in speed, pitch, and volume throughout the dialogue.
Google Standard voice:
"<speak><prosody rate=\"fast\" pitch=\"+5st\">Oh my gosh, I just heard the most insane news.</prosody></speak>"
"<speak><prosody rate=\"x-slow\" pitch=\"-5st\">What's going on?!</prosody></speak>"
"<speak><prosody rate=\"x-fast\" pitch=\"+10st\">They found life on Mars!!</prosody></speak>"
"<speak><prosody volume=\"soft\">No way. Where'd you even hear that?</prosody></speak>"
Regardless of controllability or lack thereof, single-speaker models still lack the flow of natural dialogue. When humans talk to one another, we naturally add disfluencies (uh, um, mmm), we laugh or sigh, and we talk over each other. In fact, overlapping speech accounts for anywhere from 10-20% of spoken conversation! The first two features we could maybe code into a single-speaker model, but the last is impossible to capture when we process audio segments independently. Lucky for us, Google released their two-speaker Studio voices a couple of months into developing our podcast product – thank you, friends!
What We’re Doing Now
Google’s two-speaker Studio voices automatically infer tone from a transcript, vary each speaker’s pitch, and add disfluencies and overlapping speech. Here’s the same transcript as before, this time generated by the two-speaker Studio model.
Google two-speaker Studio voice:
You’ll notice that the model automatically makes the female voice sound excited as she reveals that she heard insane news. The male voice responds with an intrigued “oooo”. The female voice then pitches up as she reveals that there’s life on Mars. The contextual inference happens under the hood, and generally, the model works pretty well. However, we have no control over the output. There’s no documented way to force a tone or behavior, but there are a few things we’ve reverse-engineered since adopting the model. These could change with new releases or model updates, but as of today’s post, they’re valid hacks.
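Before getting into those hacks, here’s a minimal sketch of what a plain request to the two-speaker Studio voice can look like, using the multi-speaker markup in the v1beta1 Python client. The voice name and speaker labels below come from the public API documentation, not our production setup.

from google.cloud import texttospeech_v1beta1 as texttospeech

client = texttospeech.TextToSpeechClient()

# Each turn carries a speaker label; tone, pacing, and overlap are inferred by the model.
turns = [
    ("R", "Oh my gosh, I just heard the most insane news."),
    ("S", "What's going on?!"),
    ("R", "They found life on Mars!!"),
    ("S", "No way. Where'd you even hear that?"),
]

markup = texttospeech.MultiSpeakerMarkup(
    turns=[
        texttospeech.MultiSpeakerMarkup.Turn(speaker=speaker, text=text)
        for speaker, text in turns
    ]
)

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(multi_speaker_markup=markup),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Studio-MultiSpeaker",
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("dialogue.mp3", "wb") as f:
    f.write(response.audio_content)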
You can induce whispering by including some form of {{ whisper }} or [[whisper]] and wrapping lines with that marker. Notice that in the sample, all of the lines come out whispered, even though only the first three are wrapped with {{ whisper }}.
{"speaker": "Clara", "text": "{{ whisper }} Oh my gosh, I just heard the most insane news {{ whisper }}"},
{"speaker": "Leo", "text": "{{ whisper }} What's going on?!{{ whisper }}"},
{"speaker": "Clara", "text": "{{ whisper }} They found life on Mars!!{{ whisper }}"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}
You can also induce amusement by wrapping lines with {{ laugh }}.
{"speaker": "Clara", "text": "{{ laugh }} Oh my gosh, I just heard the most insane news. {{ laugh }}"},
{"speaker": "Leo", "text": "What's going on?!"},
{"speaker": "Clara", "text": "{{ laugh }} They found life on Mars!! {{ laugh }}"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}
But if you don’t wrap the lines fully, it breaks the output. In this example, including {{ laugh }} only at the end of Clara’s first line causes the model to glitch and assign the laugh to Leo instead. Because the model alternates between the two voices, the remaining lines have their speakers swapped as well.
{"speaker": "Clara", "text": "Oh my gosh, I just heard the most insane news. {{ laugh }}"},
{"speaker": "Leo", "text": "What's going on?!"},
{"speaker": "Clara", "text": "{{ laugh }} They found life on Mars!!"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}
If you don’t want to risk introducing unintended speech artifacts by adding markers, you can simply change the context of the conversation. In the example below, the content itself is funny, and the model adds amusement to the speakers’ voices on its own.
{"speaker": "Clara", "text": "Oh my gosh, I just heard the funniest news."},
{"speaker": "Leo", "text": "What's going on?!"},
{"speaker": "Clara", "text": "I can't stop laughing. They found life on Mars!"},
{"speaker": "Leo", "text": "Stop. That's hilarious. Where'd you even hear that?"}
You can induce an air of surprise/shock with {{ gasp }}.
{"speaker": "Clara", "text": "{{ gasp }} Oh my gosh, I just heard the most insane news. {{ gasp }}"},
{"speaker": "Leo", "text": "{{ gasp }} What's going on?! {{ gasp }}"},
{"speaker": "Clara", "text": "They found life on Mars!!"},
{"speaker": "Leo", "text": "{{ gasp }} No way. {{ gasp }} Where'd you even hear that?"}
Fillers have the potential to make a conversation sound more natural. Interestingly, though, when we tried this, it annoyed our users that AI-generated speech was less than perfect.
{"speaker": "Clara", "text": "Oh my gosh, I just heard like the most insane news."},
{"speaker": "Leo", "text": "Um...uh...what's going on?"},
{"speaker": "Clara", "text": "Uh, they found life on Mars!!"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}
Sometimes when you get a marker wrong (?), it can really mess with the output. Is {{ sigh }} just a bad marker? I’m not sure, but I have no clue what happened to Leo’s voice in the second line or what happened at the end of this audio clip…
{"speaker": "Clara", "text": "Oh my gosh, I just heard the most insane news."},
{"speaker": "Leo", "text": "{{ sigh }} what's going on? {{ sigh }}"},
{"speaker": "Clara", "text": "They found life on Mars!!"},
{"speaker": "Leo", "text": "{{ sigh }} No way. Where'd you even hear that? {{ sigh }}"}
Which brings us to one of the biggest issues with the model: weird things happen all the time. Our hosts laugh at inappropriate moments, the audio at the end of the intro is often oddly faster than the rest of the dialogue, the hosts’ tones can be weird, characters like “-” are read out loud, “...” can trigger odd noises, numbers are read incorrectly, and things are constantly mispronounced. If you listen to our podcasts for long enough, you’ll start to catch onto some of the recurring quirks of the TTS model. The most frustrating part? We have no way to fix it.
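The most we can do is work around it by sanitizing transcripts before they reach the model. As a rough sketch, a pre-processing pass along these lines strips out the characters we’ve seen cause trouble; the specific substitutions are illustrative, not an exhaustive list of what our pipeline handles.

import re

def sanitize_for_tts(text: str) -> str:
    """Rewrite characters the two-speaker Studio model tends to mangle."""
    text = text.replace("...", ",")                  # "..." can trigger odd noises
    text = re.sub(r"\s[-–—]\s", ", ", text)          # dashes are sometimes read out loud
    text = re.sub(r"(\d),(\d{3})\b", r"\1\2", text)  # thousands separators trip up number reading
    return text.strip()


print(sanitize_for_tts("The rover cost $2,500 - give or take... a lot."))
# The rover cost $2500, give or take, a lot.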
Where We’re Going
So where does that leave us? The state of the art isn’t good enough, but as a startup, we have to find avenues to innovate quickly and cheaply. There are a couple of areas in which we see opportunity.
The first opportunity is in the dialogue model itself. A huge issue with TTS models is their inability to properly discern how to respond: they may know what to say, but they don’t always choose the right way to say it. This post by Sesame clearly articulates the issue. The Sesame team conducted a study that evaluated human vs. generated speech. In the first phase of the study, they presented listeners with two audio clips and asked, “Which rendition feels more like human speech?” When listeners had no knowledge of context, preference split 50/50, suggesting that the best TTS models produce speech of high enough quality to pass as natural. When listeners were given audio and text context and asked, “Which rendition feels like a more appropriate continuation of the conversation?”, the human speech won the vast majority of the time. Clearly, there’s still a gap to fill here. In fact, in our own user interviews, interviewees repeatedly told us the voices sounded real, but the way they said things or the sounds they made did not. Consider the following clip.
It’s easy for a human listener to recognize that something is off: Leo’s cheer sounds way too exaggerated, while Clara’s sounds devoid of emotion. Capturing that with a model is extremely complicated. Training our own dialogue speech model would mean allocating precious time and resources, and it’s something we’d have to consider carefully. On top of that, we know that larger companies are actively improving their own dialogue models. If you listen to NotebookLM’s podcasts, their audio is markedly better than ours, and it’s likely only a matter of time until Google releases an updated speech API that includes such advances.
Another area for development is interactivity, and we’re already seeing a lot of movement on this feature in the industry. NotebookLM recently released interactive podcasts (skip to 1 min to see an example). Users can press a button to interrupt their episode to ask a question or make a suggestion, and the remainder of the episode is re-generated accordingly. We envision a future with even less friction: no button clicks, and the ability to communicate in a free-form, back-and-forth conversation. Recall that Google's Studio voice APIs (both single- and two-speaker) do not support streaming. While Google recently released a single-speaker streaming model, we have yet to see a multi-agent conversation paradigm; we want to be able to converse with both podcast hosts simultaneously. Even the single-agent model has quality issues: Google notes that it may hallucinate or drop words and numbers. Another prominent example of a single-speaker interactive model is ChatGPT’s voice agent. The drawback we see here is the lack of controllability: users can’t control the agent’s personality, where it gets its information, or how it responds.
At Continua, we see interactivity as a hotbed of potential innovation, and we are actively working to incorporate it into our current podcast product. At the moment, we have a text-based episode chat where users can ask questions about their podcast content, and we see a path toward creating an amazing interactive speech experience that surmounts the problems listed above.
We’re extremely grateful to be able to build on top of models that were expensive to train, but we’re just as excited to build beyond them, addressing some of the limitations we’ve seen and pushing entirely new interaction patterns. We want to anticipate both how and why users want to interact and make the experience as seamless as possible. I think this post has made clear that speech is far from “solved.” At Continua, our ambition is to create the world’s best interactive podcast, with incredible naturalness and social etiquette. We hope you join us on our journey.