The buzz around GPT-3 and ChatGPT has us excited, and like many developers around the world, our team got its hands dirty exploring what’s possible. More precisely, we are looking at how GPT-3 can be a significant game-changer when paired with speech synthesis, or voice cloning.
Intro to GPT-3
GPT-3, by OpenAI, has 175 billion parameters, which made it the largest language model of its kind at its release. By far, the biggest achievement of GPT-3 is how well a generic language model, provided with just enough data, can solve natural language processing tasks that it has never encountered.
In simple terms, GPT-3 can be prompted with just a few examples (1-3) and start writing in the same style, with no retraining. This is a big accomplishment compared to previous natural language models, where developers had to put together datasets and fine-tune using expensive computing resources.
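To make the "few examples" idea concrete, here is a minimal sketch of how such a prompt could be assembled. The `Input:`/`Output:` layout, the function name `build_few_shot_prompt`, and the grammar-fixing examples are our own illustrative assumptions, not a format that GPT-3 requires.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format (input, output) pairs followed by an unanswered query.

    The "Input:/Output:" labels are an illustrative convention (an assumption),
    not something mandated by the model.
    """
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    # Leave the final "Output:" empty so the model continues from there.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Two hand-written examples are enough to show the pattern.
examples = [
    ("i has two cat", "I have two cats."),
    ("she go to school yesterday", "She went to school yesterday."),
]
prompt = build_few_shot_prompt(examples, "him and me was late")
print(prompt)
```

The resulting string would be sent to the model as-is; the examples steer the output, and no weights are updated.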
The list of tasks that GPT-3 can solve is massive, and developers are still discovering more as they dig further. We’ve seen it make presentations, write articles, generate financial statements, and even lay out a design in Figma.
In general, GPT-3 can handle chat, Q&A, summarization of complex text, grammar correction, parsing of unstructured data, and classification tasks.
Hooking up Voice Cloning to GPT-3
Out of the box, the thing that stood out to me the most was how GPT-3 was able to capture the persona of an individual and spit out noticeably different vocabulary that is tuned to the speaker (even double negatives! 🤯).
To test, we took a blurb from our Linus article and used it as the prompt to generate further output:
> With Resemble’s web platform, we were able to quickly iterate on the script, and fine-tune the parts of the performance that the team wanted to tweak. More than the Emotion Gradients that we introduced last month, we also demonstrated how CopyPaste, a new approach to speech synthesis could take another audio clip and extract speech patterns to apply directly to the target speaker. This enabled the team to get really creative with the audio output.
Here’s the first output that it generated:
> By using CopyPaste, not only can we manipulate words, but also apply audio-specific manipulation to particular parts of the sentence.
Not bad for a first try, as it actually understood what CopyPaste was. We then added an autocomplete button to our editor: it uses the existing document as a prompt and lets GPT-3 fill in the rest of the block.
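An autocomplete step like this could be sketched as follows. This is a rough illustration under stated assumptions, not our editor's actual code: `autocomplete_block`, the `complete` callable (a stand-in for whatever function sends a prompt to GPT-3 and returns its text), and the 2,000-character prompt budget are all hypothetical.

```python
from typing import Callable


def autocomplete_block(
    document: str,
    complete: Callable[[str], str],  # hypothetical: prompt in, model text out
    max_prompt_chars: int = 2000,    # assumed budget, not a real API limit
) -> str:
    """Extend the current block of `document` with a model continuation."""
    # Keep only the tail of the document so the prompt stays within budget.
    prompt = document[-max_prompt_chars:]
    continuation = complete(prompt)
    # Stop at the first blank line so only the current block is filled in.
    continuation = continuation.split("\n\n", 1)[0].strip()
    if not continuation:
        return document
    needs_space = bool(document) and not document.endswith((" ", "\n"))
    return document + (" " if needs_space else "") + continuation


def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return " This is a stand-in continuation.\n\nText after the blank line is dropped."


result = autocomplete_block("By using CopyPaste,", fake_complete)
print(result)
```

Passing the model call in as a parameter keeps the editor logic testable without network access, and makes it easy to swap in a different model later.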
Storyboarding with GPT-3
With Resemble, a major goal of ours is to let users quickly iterate on scripts and craft output in real time with highly realistic voices. One of the common issues we face when generating content is that the written words don’t mesh with the persona of the voice.
This is where GPT-3 is so powerful. Take the following example. Imagine we want to create a life lesson in the voice of Sri Sri Ravi Shankar, an Indian spiritual leader. Sri Sri has a unique way of speaking, with a serene voice that is perfect for meditation.
Let’s see if we can get GPT-3 to sound like him by asking it the meaning of life:
> Q. What is the meaning of life?
> H.H Sri Sri Ravi Shankar (GPT-3): Everyone is caught in the thought process of ‘I, Me and Mine’. All this does is create a thought process. When you reach your Self, you will know the meaning of life.
Now, let’s mix that with his voice generated by Resemble and hear what these words would sound like if they were coming from him:
The attributes of the voice and the words that were generated by GPT-3 go hand in hand to create a performance that sounds cohesive.
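The flow described here, persona prompt in, generated words out, then those words spoken in the matching voice, can be sketched as a two-step pipeline. Everything named below is a hypothetical placeholder: `generate_text` stands in for a GPT-3 call, `synthesize` stands in for a voice synthesis call, and `sri-sri-voice` is a made-up voice identifier. None of them reflect the real OpenAI or Resemble APIs.

```python
from typing import Callable


def build_persona_prompt(persona: str, question: str) -> str:
    """Lay out the question so the model answers in the persona's voice."""
    # Ending on "<persona>:" invites the model to continue as that speaker.
    return f"Q. {question}\n{persona}:"


def speak_as(
    persona: str,
    voice_id: str,
    question: str,
    generate_text: Callable[[str], str],      # hypothetical text-generation call
    synthesize: Callable[[str, str], bytes],  # hypothetical (voice_id, text) -> audio
) -> bytes:
    """Generate an answer in the persona's words, then render it in their voice."""
    prompt = build_persona_prompt(persona, question)
    answer = generate_text(prompt).strip()
    return synthesize(voice_id, answer)


# Offline stand-ins so the sketch runs without any external service.
def fake_generate(prompt: str) -> str:
    return " When you reach your Self, you will know the meaning of life.\n"


def fake_synthesize(voice_id: str, text: str) -> bytes:
    return f"{voice_id}|{text}".encode("utf-8")


audio = speak_as(
    "H.H Sri Sri Ravi Shankar",
    "sri-sri-voice",
    "What is the meaning of life?",
    fake_generate,
    fake_synthesize,
)
print(audio)
```

Keeping the persona name and the voice identifier side by side in one call is the point of the design: the words and the voice are chosen together, so they stay matched.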
Another example with Muhammad Ali about the right of free speech:
> Q. Do you resent what was done in Chicago? Do you believe they infringed on your right of free speech?
> Muhammad Ali (GPT-3): No, because I really know what happened. As far as right of free speech, they’re right there in the United States. The United States gave me this great body, and I’m not trying to take a privilege away from nobody.
Notice that at the end, GPT-3 includes a double negative. This is fascinating since double negatives are largely considered grammatically incorrect. However, in this scenario, you could picture Muhammad Ali emphasizing his point by using one.
The team was able to put together other interesting experiments that are being integrated into the Resemble product. Some examples:
Text to Phoneme (IPA) conversion
Text: it’s soon flies over the present failure and begins to hope again
IPA: ɪts suːn flaɪz oʊvɚ ðə pɹɛzənt feɪlɪɹ ænd bɪɡɪnz tə hoʊp ɐɡɛn
Question answering
Q: Jimmy owns a bicycle and he sells it to Jane. Jane decides later to sell it to Tom. Who owns the bike?
A: ‘ Tom does because ownership transfer was from Jane to Tom’
Emotion classification
Hello darkness, my old friend. : [Sad]
Why would you ever think that is ok? : [Anger]
I can’t wait for tonight! : [Excited]
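The `sentence : [Label]` lines above are easy to use programmatically. Below is a small sketch that builds a classification prompt from labeled examples and parses a labeled line back into its parts. The helper names and the regular expression are our own assumptions for illustration; only the line format comes from the examples above.

```python
import re
from typing import List, Optional, Tuple

# Matches lines shaped like:  some sentence : [Label]
LABELED_LINE = re.compile(r"^(?P<text>.+?)\s*:\s*\[(?P<label>[^\]]+)\]\s*$")


def parse_labeled_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (sentence, label), or None if the line is not in that format."""
    match = LABELED_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("text"), match.group("label")


def build_emotion_prompt(examples: List[Tuple[str, str]], sentence: str) -> str:
    """List labeled examples, then leave the last label open for the model."""
    lines = [f"{text} : [{label}]" for text, label in examples]
    # The trailing "[" cues the model to produce only a label next.
    lines.append(f"{sentence} : [")
    return "\n".join(lines)


examples = [
    ("Hello darkness, my old friend.", "Sad"),
    ("Why would you ever think that is ok?", "Anger"),
]
emotion_prompt = build_emotion_prompt(examples, "I can’t wait for tonight!")
parsed = parse_labeled_line("I can’t wait for tonight! : [Excited]")
print(emotion_prompt)
print(parsed)
```

A parsed label like this could then drive the emotion setting of the synthesized voice, though how that mapping is done is outside the scope of this sketch.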