Battle of the Clones
In a previous post about creating an AI voice clone: ElevenLabs Voice Clone, I mentioned that the more audio you provide ElevenLabs, the better the quality of the output tends to be. It’s a simple equation: more data equals more accuracy. But how much of a difference does it really make? Is the leap from a 10-second sample to a 3-hour professional recording as dramatic as it sounds? Today, we’re diving into the nitty-gritty of voice cloning by testing audio samples of varying lengths—10 seconds, 1 minute, 2 minutes, and a professional-grade 3-hour recording. Let’s see how these clones stack up against each other and, of course, against my real voice.
To ensure this test is as controlled as possible, I used the same audio file for all the samples. The only difference was the length of the clip. This eliminates variability in input quality and ensures that any differences in the clones are due to the amount of data ElevenLabs had to work with, not the quality of the recording itself.
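If you want to set up a similar test yourself, the slicing step is trivial. Here's a minimal sketch using pydub that cuts one source recording down to the different sample lengths; the file names are placeholders, and any audio editor or ffmpeg works just as well:

```python
# Minimal sketch: slice one clean source recording into the sample lengths used in this test.
# File names are placeholders; pydub needs ffmpeg installed to read and write mp3/wav.
from pydub import AudioSegment

source = AudioSegment.from_file("master_recording.wav")  # the single source file

lengths_ms = {
    "sample_10_seconds": 10 * 1000,       # pydub measures time in milliseconds
    "sample_1_minute": 60 * 1000,
    "sample_2_minutes": 2 * 60 * 1000,
}

for name, length_ms in lengths_ms.items():
    clip = source[:length_ms]             # same opening audio, just cut shorter
    clip.export(f"{name}.mp3", format="mp3")
```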
One important note: I opted not to use ElevenLabs’ background noise removal feature when uploading the audio. While this feature can be helpful, it can also distort the vocal sample depending on the thresholds set by the system. Instead, I relied on the post-production work I had already done on the recordings. If you’re curious about creating clean recordings, I cover that in another article: post-production techniques.
Now, let’s get to the fun part: the clones.
Let’s start with the bare minimum—a 10-second audio sample. Honestly, I didn’t have high hopes for this one. Ten seconds is barely enough time to say, “Hi, I’m Jenn,” let alone provide the nuances of tone, inflection, and cadence that make a voice unique. Surprisingly, it wasn’t as bad as I expected. The clone managed to capture the general pitch and rhythm of my voice, but it felt... off. The pronunciation was overly precise, almost like a robot trying to sound human. There was a hint of emotion, but it came across as forced, like an actor over-enunciating their lines. It’s passable for a quick, fun experiment, but it’s not fooling anyone into thinking it’s human.
With a full minute of audio, the clone had more to work with. The result was noticeably better than the 10-second version. The voice sounded more natural, with improved pacing and a slightly more human-like quality. However, it still lacked the warmth and subtle variations that make a voice feel natural. One thing I noticed was a flatness in the delivery, especially at the end of sentences. Questions, for example, didn’t have the natural upward inflection you’d expect. It’s as if the clone was trying to mimic my voice but missed the emotional cues that give it personality.
At two minutes, the clone started to feel more like “me.” The nasal quality I noticed in the shorter samples was less pronounced, and the cadence was closer to my natural speech pattern. That said, there was still something off. It sounded human, but not quite right.
If I had to describe it, I’d say it was like hearing a recording of myself played back through a distorted speaker. It’s recognizable, but it doesn’t quite capture the full range of my voice. Still, for most practical purposes, this version could pass in a pinch.
Now, for the pièce de résistance: the 3-hour professional-grade recording. With this much data, ElevenLabs had everything it needed to create a near-perfect replica of my voice. The result was, frankly, uncanny. The clone captured not just the pitch and rhythm of my voice but also the subtle inflections and emotional nuances that make it uniquely mine. Listening to this version was a little eerie; for short recordings, I would struggle to hear a difference between the clone and something I recorded myself. However, as covered in the ElevenLabs Voice Clone article, even this clone still can't quite capture performative elements, like pitching a voice up or down for characterization. Audiobook narrators: you don't have to worry about AI replacing your jobs yet.
As a bonus, I decided to test ElevenLabs’ text-to-voice feature. This option doesn’t require any audio input; instead, you describe the voice you want, and the AI generates it. To give it the best possible parameters, I ran a recording through ChatGPT and asked it to describe the audio as a way to refine the prompt. Please ignore how I misspelled "audio" in the file name. We all make typos sometimes. And, unlike the voice clones, I am human.
This is what was used to generate the voice:
"A mid-30s female voice that is warm and approachable, clear and articulate, and a neutral American accent. "
The result? Let’s just say it was... interesting. The voice was nowhere near a match for mine, but it was a fun experiment. This feature might be better suited for creating fictional characters or experimenting with accents rather than replicating a specific voice.
For the ultimate comparison, I've also included a sample of my own voice - an actual manual recording with no AI involved. This "control" serves as a comparison to stack each clone against the genuine article, highlighting just how close, or how far, each rendition came to truly replicating my unique vocal fingerprint. Hearing them side-by-side really clarifies the subtle differences the earlier comparisons hinted at.
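If you'd rather script that side-by-side comparison than click around the ElevenLabs site, here's a rough sketch that generates the same test line through each cloned voice via the ElevenLabs text-to-speech API. The API key, voice IDs, and test line are placeholders, and the endpoint details reflect my reading of the ElevenLabs docs at the time of writing, so double-check them against the current documentation:

```python
# Sketch: generate the same test line with each cloned voice for a side-by-side comparison.
# The API key and voice IDs are placeholders - swap in the values from your own ElevenLabs account.
import requests

API_KEY = "your-elevenlabs-api-key"   # placeholder
TEST_LINE = "Hi, I'm Jenn, and this is a voice clone comparison."

voices = {
    "clone_10_seconds": "VOICE_ID_10_SEC",   # placeholder voice IDs
    "clone_1_minute": "VOICE_ID_1_MIN",
    "clone_2_minutes": "VOICE_ID_2_MIN",
    "clone_3_hours": "VOICE_ID_3_HOUR",
}

for label, voice_id in voices.items():
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": API_KEY},
        json={"text": TEST_LINE, "model_id": "eleven_multilingual_v2"},
    )
    response.raise_for_status()
    with open(f"{label}.mp3", "wb") as f:
        f.write(response.content)            # the endpoint returns raw audio bytes
    print(f"Saved {label}.mp3")
```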
How much of a difference does the length of the audio sample make? (And I mean that strictly in terms of audio samples, no innuendos.) In short: a lot. The 10-second clone was fun but far from convincing, a great example of the result reflecting the effort: minimal effort in, minimal quality out. The 1-minute and 2-minute clones showed significant improvement, but they still fell short of feeling natural in cadence and quality. The 3-hour clone is concerningly convincing. Remind me to call my parents and make sure they know the secret word in case a scammer ever tries to impersonate me on a phone call. Which brings up a bit of digital hygiene in addition to what was already covered in the post Data Hygiene: having a family-specific "password" is a good practice to employ as digital tools develop and bad actors try to take advantage of them. Unfortunately, not everyone follows Spider-Man principles.
If you’re considering creating a voice clone, my advice is simple: the more data, the better. While shorter samples can work for casual use, they won’t hold up under scrutiny. For professional applications, investing the time to create a longer recording is worth it. And if you’re curious about how to create clean recordings, the post-production techniques article covers that; for optimizing your audio specifically for cloning, stay tuned for my upcoming posts!