

Build Custom Neural Voices through Resemble’s API

One of Resemble’s most powerful features is the ability to build voices programmatically through our API. Over the last few months, we’ve helped dozens of businesses scale the way they create neural synthetic voices by letting them feed data into Resemble automatically. From call center agents to health applications, creating dozens or hundreds of custom voices lets applications deliver a more immersive experience for their end users.

We’re now rolling out API access for voice building to more of our enterprise and startup users. With that, we want to briefly go over how it works and how you can get started.


Verify and Gather Consent

Before you begin collecting and uploading data, you must provide consent from the voice talent (if you are looking to clone a third party’s voice) in order to use the API to build voices. Access can be acquired by emailing us at [email protected]. We will ask for video or contractual proof from the voice talent before you can use the API. For more information, please visit our ethics statement.


Collect the Data

When you’re ready to create a custom voice, the first step is to gather the appropriate data in a suitable format. There are multiple formats that are acceptable for uploading data:


Name | Description | Use Case | Quantity Needed
Raw Audio | A raw audio file (.wav) of a single speaker. | When you have a large amount of previously recorded audio that is unannotated and unsegmented. | 10+ minutes recommended for English, 1+ hours for other languages.
Audio with Transcriptions | A collection of audio clips (no longer than 20 seconds each), paired with transcripts. Acceptable file formats are .zip or .tar.gz. | When you have professional line-by-line recordings with transcriptions. | Minimum 50 audio clips recommended.
Short Audio Clips without Transcriptions | A collection of audio clips (no longer than 20 seconds each) without any transcripts. Acceptable file formats are .zip or .tar.gz. | When you have only the recorded audio files. | Minimum 50 audio clips recommended.


Each audio file in those formats should abide by the following guidelines:


Property | Value
File Format | RIFF (.wav)
Sampling Rate | One of 8,000 Hz, 16,000 Hz, 22,050 Hz, 44,100 Hz, or 48,000 Hz
Sample Format | PCM, 16-bit
Number of Channels | Mono
File Name | Alphanumeric with .wav extension
Audio Length | Between 1.5 and 20 seconds
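
The guidelines above can be checked locally before you upload anything. The following is a minimal sketch using Python's standard wave module; the function name check_clip is hypothetical, and reading "alphanumeric" as letters and digits only is an assumption rather than something the API documents here.

```python
import re
import wave
from pathlib import Path

# Values taken from the guidelines table above.
ALLOWED_RATES = {8000, 16000, 22050, 44100, 48000}
MIN_SECONDS, MAX_SECONDS = 1.5, 20.0


def check_clip(path):
    """Return a list of guideline violations for one .wav file (empty if none)."""
    path = Path(path)
    problems = []
    # Assumption: "alphanumeric" means letters and digits only in the file stem.
    if path.suffix != ".wav" or not re.fullmatch(r"[A-Za-z0-9]+", path.stem):
        problems.append("file name must be alphanumeric with a .wav extension")
    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()  # bytes per sample
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # The wave module only reads uncompressed PCM RIFF files.
        return problems + [f"not a readable PCM .wav file: {exc}"]
    if rate not in ALLOWED_RATES:
        problems.append(f"unsupported sampling rate: {rate} Hz")
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    duration = frames / rate if rate else 0.0
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f}s is outside 1.5 to 20 seconds")
    return problems
```

Running check_clip over every file in a folder and printing any non-empty results gives a quick pre-upload report. The length check applies to individual clips, so a long raw-audio upload would need that rule relaxed.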


If you pick the option with transcripts, the folder has to be structured as follows:


Where metadata.csv contains all of your transcriptions in the following format:

file1|This is the text that is included in file one.

*Note that the filename does not contain the .wav extension!
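
As a rough illustration of that format, the sketch below writes a metadata.csv from a mapping of file names to transcripts and strips the .wav extension as the note requires. The helper name write_metadata, the one-entry-per-line layout, the UTF-8 encoding, and the rejection of pipe characters inside transcripts are all assumptions; the post does not say how a literal | should be escaped.

```python
from pathlib import Path


def write_metadata(transcripts, out_path="metadata.csv"):
    """Write 'name|text' lines, where name is the file name without '.wav'.

    transcripts: dict mapping a clip's file name to its transcript.
    Returns the list of lines written.
    """
    lines = []
    for filename, text in transcripts.items():
        # The file name in metadata.csv must not carry the .wav extension.
        stem = filename[:-4] if filename.lower().endswith(".wav") else filename
        # Collapse newlines and repeated spaces so each entry stays on one line.
        text = " ".join(text.split())
        if "|" in text:
            # Assumption: the delimiter cannot appear inside a transcript.
            raise ValueError(f"transcript for {filename} contains '|'")
        lines.append(f"{stem}|{text}")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, write_metadata({"file1.wav": "This is the text that is included in file one."}) produces the single line shown above.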

Upload Voice Data to Resemble

Once you have your data ready, it’s time to send it over to us. You can always upload data directly on our beautiful web platform on the voices page. But since you’re a programmer, I know you’d rather take the programmatic path.

Uploading data through the API involves three simple steps:

  1. A POST request to retrieve a signed URL.
  2. A PUT request to upload your data to the signed URL returned in step 1, using the headers returned by step 1.
  3. A POST request with the voice UUID from step 1 to trigger the build.

Let’s go through each step. For simplicity, we’ll write these out with cURL, but it should be fairly trivial to port to any language of your choice.

For step 1, we’ll make a POST request to retrieve a signed URL. The request body is a JSON blob that includes name for the voice name, filename for the name of the file you’re uploading, byte_size for the file size in bytes, and checksum for the file’s MD5 checksum. Optionally, you can provide a callback_uri to be notified when your voice is ready. Here’s what that will look like:


curl --request POST '' \
 -H 'Authorization: Token token=YOURAPIKEY' \
 -H 'Content-Type: application/json' \
 --data-raw '{
   "filename": "<YOURFILENAME.wav>",
   "byte_size": 78183182,
   "checksum": "ce1231231231231231",
   "content-type": "audio/x-wav",
   "name": "<NAME OF YOUR VOICE>"
 }'

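The byte_size and checksum fields have to be computed from the file itself. Here is a minimal sketch of building the step 1 request body in Python. The function name build_upload_payload is hypothetical, and the checksum encoding is an assumption: the post does not say whether the API expects a hex or a Base64 MD5 digest. The sketch defaults to Base64 because the standard Content-MD5 header (RFC 1864) uses it, and offers hex as an alternative, so confirm against the API docs.

```python
import base64
import hashlib
import json
from pathlib import Path


def build_upload_payload(path, voice_name, checksum_encoding="base64",
                         content_type="audio/x-wav", callback_uri=None):
    """Build the JSON body for the signed-URL request in step 1."""
    md5 = hashlib.md5()
    size = 0
    # Hash in 1 MiB chunks so large uploads are not loaded into memory at once.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            size += len(chunk)
    if checksum_encoding == "base64":
        checksum = base64.b64encode(md5.digest()).decode("ascii")
    elif checksum_encoding == "hex":
        checksum = md5.hexdigest()
    else:
        raise ValueError("checksum_encoding must be 'base64' or 'hex'")
    payload = {
        "filename": Path(path).name,
        "byte_size": size,
        "checksum": checksum,
        "content-type": content_type,
        "name": voice_name,
    }
    if callback_uri is not None:
        payload["callback_uri"] = callback_uri  # optional, per the text above
    return json.dumps(payload)
```

The returned string can be passed as the body of the POST request in place of the hand-written JSON in the cURL command.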

You’ll receive a JSON response. Remember to save the URL and headers since we’ll be using those in the next step.

{
   "url": "Signed URL",
   "headers": {
     "Content-Type": "audio/x-wav",
     "Content-MD5": "FILE_MD5"
   }
}


For step 2, you’ll make a PUT request to the signed URL from step 1. Include all of the headers that were returned in step 1 as the headers in this request:


curl --request PUT 'URL_FROM_STEP_1' \
     --header 'Content-Type: audio/x-wav' \
     --header 'Content-MD5: <CONTENT-MD5 FROM STEP 1>' \
     --header 'Content-Length: <CONTENT-LENGTH FROM STEP 1>' \
     --data-binary '@/path/to/your/file'
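
The same upload can be sketched in Python with the standard library's urllib.request. The helper names build_put_request and upload_file are hypothetical, and the sketch assumes the headers object from the step 1 response is passed through unchanged. urllib fills in Content-Length automatically when the body is a bytes object.

```python
import urllib.request


def build_put_request(signed_url, signed_headers, data):
    """Create a PUT request carrying the headers returned in step 1."""
    return urllib.request.Request(
        signed_url,
        data=data,
        headers=dict(signed_headers),
        method="PUT",
    )


def upload_file(signed_url, signed_headers, path):
    """Upload the file at path to the signed URL and return the HTTP status."""
    with open(path, "rb") as f:
        data = f.read()
    request = build_put_request(signed_url, signed_headers, data)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping request construction separate from the network call makes it easy to inspect the headers before sending anything.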

For step 3, you’ll trigger a build by making a POST request. This will start a job on one of our servers to train your voice. As a parameter, pass in the voice UUID:


curl --request POST '' \
     -H 'Authorization: Token token=YOURAPIKEY' \
     -H 'Content-Type: application/json' \
     --data-raw '{
         "voice": "VOICE UUID"
     }'


If you have more data to upload in the future, you can view the documentation for updating a voice in our API docs.


Check the Quality of Your Voice

Once your voice has completed training, we provide metrics on how well we were able to learn the voice and where the pitfalls were. These metrics often show at a glance where you can improve the data to obtain a better AI voice. We surface four values at the moment:

  • voice_similarity: Between 0 and 1. Tells how effectively the voice attributes were captured by the AI. Higher is better. A low voice similarity typically means noisy data.
  • fluency: Between 0 and 1. Tells how well the AI is generating speech without hiccups. Higher is better. A low fluency score typically suggests data with multiple speakers, slurring, or non-speech elements.
  • pauses: Between 0 and 1. Tells how often the AI inserts unintended pauses compared to the ground truth data. Lower is better. A high score typically suggests a lot of pauses in the original data provided.
  • resemble_score: Between 0 and 1. Overall score of the voice, including other elements such as transcription quality.
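
One way to act on these values is to turn them into data-cleanup hints. The sketch below is illustrative only: the function name diagnose_voice, the flat dictionary shape, and the 0.5 cutoffs are all assumptions, since the post gives no thresholds and does not show the response format.

```python
def diagnose_voice(metrics, low=0.5, high=0.5):
    """Map the training metrics to likely data problems.

    metrics: dict with voice_similarity, fluency, and pauses, each in [0, 1].
    low and high are hypothetical cutoffs, not values published by Resemble.
    """
    hints = []
    if metrics["voice_similarity"] < low:
        hints.append("voice_similarity is low: check the recordings for noise")
    if metrics["fluency"] < low:
        hints.append(
            "fluency is low: check for multiple speakers, slurring, "
            "or non-speech sounds"
        )
    if metrics["pauses"] > high:  # for pauses, lower is better
        hints.append("pauses is high: check for long or frequent pauses in the data")
    return hints
```

An empty list means none of the three per-dimension scores crossed the chosen cutoffs; resemble_score is left out because it is an aggregate rather than a pointer to a specific data problem.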

If you’re interested in getting verified to use the voice building API, reach out to [email protected].
