One of Resemble’s power features is the ability to build voices programmatically through our API. Over the last few months, we’ve helped dozens of businesses scale the way they create neural synthetic voices by enabling them to feed data into Resemble automatically. From agents within call centers to health applications, creating dozens or hundreds of custom voices lets applications offer a more immersive experience to their end users.

We’re now rolling out API access for voice building to more of our enterprise and startup users. With that, we wanted to briefly go over how it works and how you can get started.

 

Verify and Gather Consent

Before you begin collecting and uploading data, you must provide proof of consent from the voice talent (if you are looking to clone a third party’s voice) to use the API to build voices. Access can be requested by emailing us at team@resemble.ai. We will ask for video or contractual proof from the voice talent before you can use the API. For more information, please visit our ethics statement.

 

Collect the Data

When you’re ready to create a custom voice, the first step is to gather the appropriate data in a suitable format. There are multiple formats that are acceptable for uploading data:

 

| Name | Description | Use Case | Quantity Needed |
|---|---|---|---|
| Raw Audio | Upload a raw audio file (.wav) of a single speaker. | When you have tons of unannotated and unsegmented raw audio previously recorded. | 10+ minutes recommended for English, 1+ hours for other languages. |
| Audio with Transcriptions | A collection of audio clips (no longer than 20 seconds each), paired with transcripts. Acceptable file formats are .zip or .tar.gz. | When you have professional line-by-line recordings with the transcriptions. | Minimum 50 audio clips recommended. |
| Short Audio Clips without Transcriptions | A collection of audio clips (no longer than 20 seconds each) without any transcripts. Acceptable file formats are .zip or .tar.gz. | Only audio files recorded. | Minimum 50 audio clips recommended. |

 

Each audio file in those formats should abide by the following guidelines:

 

| Property | Value |
|---|---|
| File Format | RIFF (.wav) |
| Sampling Rate | One of 8,000 Hz, 16,000 Hz, 22,050 Hz, 44,100 Hz, or 48,000 Hz |
| Sample Format | PCM, 16-bit |
| Number of Channels | Mono |
| File Name | Alphanumeric with .wav extension |
| Audio Length | Between 1.5 and 20 seconds |
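
The guidelines above can be checked locally before anything is uploaded. The following is a minimal sketch using Python's standard wave module; the function name check_wav is hypothetical, and it assumes "alphanumeric" means letters and digits only and that the length limit applies to every file you pass in.

```python
import re
import wave
from pathlib import Path

# Sampling rates listed in the guidelines table.
ALLOWED_RATES = {8000, 16000, 22050, 44100, 48000}


def check_wav(path):
    """Return a list of guideline violations for one .wav file (empty if none)."""
    path = Path(path)
    problems = []
    # Assumption: "alphanumeric" is read strictly as letters and digits only.
    if path.suffix != ".wav" or not re.fullmatch(r"[A-Za-z0-9]+", path.stem):
        problems.append("file name must be alphanumeric with a .wav extension")
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("audio must be mono")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("samples must be 16-bit PCM")
        rate = wav.getframerate()
        if rate not in ALLOWED_RATES:
            problems.append(f"unsupported sampling rate: {rate} Hz")
        seconds = wav.getnframes() / rate
        if not 1.5 <= seconds <= 20:
            problems.append(f"length {seconds:.2f}s is outside 1.5 to 20 seconds")
    return problems
```

Running check_wav over every file in wavs/ and fixing anything it reports is a cheap way to catch problems before a build is triggered.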

 

If you pick the option with transcripts, the folder has to be structured as follows:

data/
 metadata.csv
 wavs/
   file1.wav
   file2.wav
   file3.wav

Where metadata.csv contains all of your transcriptions in the following format:

file1|This is the text that is included in file one.

*Note that the filename does not contain the .wav extension!
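
If your transcripts already live in code or a spreadsheet export, metadata.csv can be generated rather than typed by hand. This sketch assumes a mapping of clip names to text; the helper name write_metadata is hypothetical, and rejecting transcripts that contain a pipe or newline is a precaution on our part, since the format above does not describe any escaping.

```python
from pathlib import Path


def write_metadata(data_dir, transcripts):
    """Write <data_dir>/metadata.csv from a {clip_name: text} mapping.

    Each line is `name|text`, where name has no .wav extension.
    Returns the lines that were written.
    """
    data_dir = Path(data_dir)
    lines = []
    for name, text in transcripts.items():
        # Strip the extension: metadata.csv refers to clips by bare name.
        stem = name[:-4] if name.endswith(".wav") else name
        if not (data_dir / "wavs" / f"{stem}.wav").is_file():
            raise FileNotFoundError(f"wavs/{stem}.wav is missing")
        # Assumption: no escaping is defined, so refuse ambiguous text.
        if "|" in text or "\n" in text:
            raise ValueError(f"transcript for {stem} contains '|' or a newline")
        lines.append(f"{stem}|{text}")
    (data_dir / "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, write_metadata("data", {"file1.wav": "This is the text that is included in file one."}) writes the single line shown above.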

Upload Voice Data to Resemble

Once you have your data ready, it’s time to send it over to us. You can always upload data directly on our beautiful web platform on the voices page. But since you’re a programmer, I know you’d rather take the programmatic path.

Uploading data through the API involves 3 very simple steps:

  1. POST request to https://app.resemble.ai/api/v1/voices to retrieve a signed URL.
  2. PUT request to upload your data to the signed URL returned in step 1, with the headers returned in the step 1 response.
  3. POST request to https://app.resemble.ai/api/v1/voices/build with the voice uuid from step 1.

Let’s go through each step. For simplicity, we’ll write these out with cURL, but it should be fairly trivial to port to any language of your choice.

For step 1, we’ll retrieve a signed URL by making a POST request to https://app.resemble.ai/api/v1/voices with a JSON blob that includes name for the voice name, filename for the name of the file you’re uploading, byte_size for its size in bytes, checksum for its MD5 checksum, and content-type for its MIME type. Optionally, you can provide a callback_uri to be notified when your voice is ready. Here’s what that will look like:

 

curl --request POST 'https://app.resemble.ai/api/v1/voices' \
 -H 'Authorization: Token token=YOURAPIKEY' \
 -H 'Content-Type: application/json' \
 --data-raw '{
   "filename": "<YOURFILENAME.wav>",
   "byte_size": 78183182,
   "checksum": "ce1231231231231231",
   "content-type": "audio/x-wav",
   "name": "<NAME OF YOUR VOICE>",
   "callback_uri": "<OPTIONAL URI FOR CALLBACK WHEN VOICE IS READY>"
 }'

 

You’ll receive a JSON response. Remember to save the URL and headers since we’ll be using those in the next step.

{
   "url": "Signed URL",
   "headers": {
     "Content-Type": "audio/x-wav",
     "Content-MD5": "FILE_MD5"
   }
}
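
The byte_size and checksum fields in the step 1 request are the easiest ones to get wrong by hand, so it can help to compute them from the file itself. The sketch below builds the JSON body in Python; the function name build_voice_payload is hypothetical, and it assumes the checksum is the base64-encoded MD5 digest (the encoding the standard Content-MD5 header uses), which you should confirm against the API docs.

```python
import base64
import hashlib
from pathlib import Path


def build_voice_payload(path, voice_name, content_type="audio/x-wav", callback_uri=None):
    """Build the JSON body for the step 1 POST from a local file."""
    path = Path(path)
    data = path.read_bytes()
    # ASSUMPTION: base64 of the raw MD5 digest, as used by Content-MD5.
    checksum = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    payload = {
        "filename": path.name,
        "byte_size": len(data),
        "checksum": checksum,
        "content-type": content_type,
        "name": voice_name,
    }
    if callback_uri:
        # Optional: only sent when a callback is wanted.
        payload["callback_uri"] = callback_uri
    return payload
```

The returned dictionary can be serialized with json.dumps and sent as the body of the request shown in step 1.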

 

For step 2, you’ll make a PUT request to the signed URL from step 1. Include all of the headers that were returned in step 1 as the headers in this request:

 

curl --request PUT 'URL_FROM_STEP_1' \
     --header 'Content-Type: audio/x-wav' \
     --header 'Content-MD5: <CONTENT-MD5 FROM STEP 1>' \
     --header 'Content-Length: <CONTENT-LENGTH FROM STEP 1>' \
     --data-binary '@/path/to/your/file'
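
The same upload can be sketched in Python with the standard library. The function names here are hypothetical; the only thing taken from the steps above is that the PUT goes to the signed URL and carries the headers returned in step 1. urllib fills in Content-Length from the body when it is not supplied.

```python
import urllib.request


def build_upload_request(signed_url, headers, file_bytes):
    """Build (but do not send) the PUT request for step 2."""
    return urllib.request.Request(
        signed_url,
        data=file_bytes,
        headers=dict(headers),  # pass through every header from step 1
        method="PUT",
    )


def upload(signed_url, headers, path):
    """Send the file to the signed URL and return the HTTP status code."""
    with open(path, "rb") as f:
        request = build_upload_request(signed_url, headers, f.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the request construction separate from the network call makes it easy to inspect the headers before anything is sent.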

For step 3, you’ll trigger a build by making a POST request to https://app.resemble.ai/api/v1/voices/build. This starts a job on one of our servers to train your voice. As a parameter, pass in the voice UUID:

 

curl --request POST 'https://app.resemble.ai/api/v1/voices/build' \
     -H 'Authorization: Token token=YOURAPIKEY' \
     -H 'Content-Type: application/json' \
     --data-raw '{
         "voice": "VOICE UUID"
     }'
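
Steps 1 and 3 are both authenticated JSON POSTs, so in Python they can share one small helper. This is a sketch under assumptions: the names build_api_post and trigger_build are hypothetical, and since the response body of the build endpoint is not described here, the example returns only the HTTP status code.

```python
import json
import urllib.request

API_ROOT = "https://app.resemble.ai/api/v1"


def build_api_post(path, body, api_key):
    """Build (but do not send) an authenticated JSON POST to the API."""
    return urllib.request.Request(
        f"{API_ROOT}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Token token={api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def trigger_build(voice_uuid, api_key):
    """Send the step 3 request and return the HTTP status code."""
    request = build_api_post("/voices/build", {"voice": voice_uuid}, api_key)
    with urllib.request.urlopen(request) as response:
        return response.status
```

The same build_api_post helper works for step 1 by passing "/voices" as the path and the step 1 JSON blob as the body.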

 

If you have more data to upload in the future, you can view the documentation for updating a voice in our API docs.

 

Check the Quality of Your Voice

Once your voice has completed training, we provide metrics on how well we were able to learn the voice and where the pitfalls were. These metrics often give a good indication of where you can improve the data to obtain a better AI voice. We surface 4 values at the moment:

  • voice_similarity: Between 0 and 1. Tells how effectively the voice attributes were captured by the AI. Higher is better. A low voice similarity typically means noisy data.
  • fluency: Between 0 and 1. Tells how well the AI is generating speech without hiccups. Higher is better. A low fluency score typically suggests data with multiple speakers, slurring, or non-speech elements.
  • pauses: Between 0 and 1. Tells how often the AI puts unintended pauses compared to the ground truth data. Lower is better. A high score typically suggests a lot of pauses in the original data provided.
  • resemble_score: Between 0 and 1. Overall score of the voice including other elements such as transcription quality.
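
If you build many voices, it can be handy to turn these metrics into a short list of things to check in the data. The sketch below assumes the four values arrive as a flat dictionary, which is a guess about the response shape, and the 0.5 thresholds are purely illustrative, not values recommended by Resemble.

```python
def review_metrics(metrics, low=0.5, high_pauses=0.5):
    """Map voice metrics to data-quality hints based on the descriptions above.

    The thresholds are hypothetical defaults; tune them to your own voices.
    """
    notes = []
    if metrics["voice_similarity"] < low:
        notes.append("voice_similarity is low: check for noisy data")
    if metrics["fluency"] < low:
        notes.append(
            "fluency is low: check for multiple speakers, slurring, "
            "or non-speech elements"
        )
    if metrics["pauses"] > high_pauses:  # for pauses, lower is better
        notes.append("pauses is high: check for long pauses in the recordings")
    return notes
```

An empty list means none of the three diagnostic metrics crossed the chosen thresholds; resemble_score is left out because it is an overall score rather than a pointer to a specific data issue.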

If you’re interested in getting verified to use the voice building API, reach out to team@resemble.ai.