Creating a Voice Clone with ElevenLabs
Have you ever wondered what it's like to clone your own voice using AI and whether it can actually replace a human narrator?
Well, we're going to get into that here.
Bringing to this challenge my 11+ years of experience as a professional voiceover artist, as well as a more than 12-year background as a solutions engineer in the technology space, I set out to test the ElevenLabs voice cloning option using a sample of three hours of professional-quality recordings.
This post will outline the setup process and what I learned.
If you prefer to listen to this article, please use the media player below:
I personally use a Rode NT1 for my in-home studio setup, paired with a Focusrite Scarlett Solo audio interface. When I was shopping for my equipment, I found the most competitive deal at B&H Photo, but there are a number of places where you can find Rode microphones for purchase. I bought the bundle that includes a shock mount and a metal pop filter. There is a lot of discussion around metal pop filters versus the more traditional nylon-based type. Metal filters are often credited with producing cleaner recordings, specifically with plosives such as harsh “p” and “b” sounds. They are also easier to clean and more durable, while nylon filters are a more budget-friendly option.
For a more detailed article on microphone types and to play samples from different microphones I’ve tested, please see: (will be updated when posted).
There are a number of DAWs available, all with different UI experiences and advantages. DAW is an acronym for “digital audio workstation,” or in common terms, a software interface for audio recording and editing. Some of the common options you’ll see used in professional studios include Pro Tools, Logic Pro, Ableton Live, and Adobe Audition. I use Audacity, a free option compatible with both Mac and PC platforms.
When creating an audio sample for an ElevenLabs voice clone, more audio is better. Though the platform offers options to create clones from recordings of under a minute, around 2 minutes, or around 5 minutes, my minimum recommended recording length is 45 minutes. This ensures there is enough content to train the model for reasonable consistency in the output.
For a more detailed article on how I configure my DAW and post-production techniques with progression samples from audio recordings, please see: (will be updated when posted).
To create a voice clone in ElevenLabs, go to Voices in the left-hand menu; on the right-hand side of the page you will see the option to “create or clone a voice.” You will need to be on at least the “Creator Plan” to create a Professional Voice Clone. In the cloning interface, upload your produced audio file. If you have produced it well, you should not need the option to remove background noise, and using it could actually distort the quality of the sample input. Once the file is uploaded and processed, click “Next” to go to the page where you can give your voice clone a name, language label, and description, and save it.
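If you prefer working programmatically, ElevenLabs also exposes a REST API for adding voices. The sketch below is an assumption on my part rather than part of my workflow: it assembles the multipart request for the v1 “add voice” endpoint, with the key, file path, and voice name all placeholders you would swap for your own.

```python
# Sketch: cloning a voice via the ElevenLabs REST API instead of the web UI.
# The endpoint, key, and file names below are placeholders/assumptions.
ENDPOINT = "https://api.elevenlabs.io/v1/voices/add"

def build_voice_request(name: str, description: str, audio_paths: list):
    """Assemble the form fields and file tuples for a multipart POST."""
    data = {"name": name, "description": description}
    files = [
        ("files", (path, open(path, "rb"), "audio/mpeg"))
        for path in audio_paths
    ]
    return data, files

# Actual upload (requires a Creator-plan API key and a produced audio file):
# import requests
# data, files = build_voice_request("Jenn", "Professional voice clone", ["produced.mp3"])
# resp = requests.post(ENDPOINT, headers={"xi-api-key": "YOUR_KEY"}, data=data, files=files)
# print(resp.json())
```

The upload itself is left commented out so you can review the payload before spending API credits; a successful call returns the new voice ID in the JSON response.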
Wow, this actually worked.
But can you tell that this isn't me recording, but instead audio generated from a clone of my voice?
Crazy, right?
My professional voice clone is available on the ElevenLabs platform here: ElevenLabs - Jenn
Have fun testing it with your own text!
ElevenLabs is an impressive platform that provides the fluency of natural speech in its outputs. I personally would leverage it for short projects, such as providing an audio recording of this article for readers to listen to. It is also a fun tool for creative projects that might require voices you don’t have access to (e.g., accents or character inflections). However, challenges may come up in the following areas:
Pitching for different characters
The tool can start to guess where an audiobook narrator might pitch up or down, but the change isn’t always obvious, and the similarity to the main voice can make it hard for a listener to tell which character is talking. An alternative would be to leverage the other voices on the platform, creating a distinct voice type for each character. However, in editing and producing the audio file, that can be a lot to manage.
Emotive technique
The allure of a tool that generates an audio output for you, especially on large multi-hour projects, is compelling. Most people don't realize that the minimum ratio for audio production is at least 3 to 1: for every minute of produced audio, at least 3 minutes went into recording it, listening back, and making any necessary adjustments. Often, it takes even more time than that. Even so, at this point I would still opt to narrate my own audiobooks for the performative quality. Changes of pitch and tempo can be created in the tool, but they will not be as organic as what a voice actor can create, and with the amount of effort involved in tuning them, it may actually be easier to create the voice project through a standard recording process.