Polly want a TTS Comparison?
There are several notable text-to-speech options on the market, and today we’re going to compare ElevenLabs, Speechify, and Amazon Polly.
Now, there are differences in how these solutions are positioned, so before we get into active comparisons, let’s build a bit of background with an overview of each tool and the use cases to which it is most closely aligned.
If you prefer to listen to this article, please use the media player below:
ElevenLabs, an audio research and deployment company, was co-founded in 2022 by Piotr Dąbkowski (CTO, ex-Google machine learning engineer) and Mati Staniszewski (CEO, ex-Palantir deployment strategist). The two were childhood friends who grew up in Poland, and the inspiration for the company is rumored to have come from watching inadequately dubbed American films. Since I can’t claim any expertise in Polish, the tests for this post will be run in English, but it should be noted that ElevenLabs supports a variety of languages.
The beta platform was released in 2023. Browser-based, ElevenLabs offers text-to-speech built on either the voices available through the platform or a voice clone end users can create by uploading a sample. For more on this topic and process, see the post: “Creating a Voice Clone with Eleven Labs.” The offering supports both accessibility and marketing, with the platform being used for content creation as well as for providing lifelike speech to users who want a tool that reads text out loud. Among the use cases I found during my research was providing AI voices for individuals who have lost theirs to conditions such as ALS. With the cloning option, one could also reasonably anticipate that this platform might be a way for some to preserve the voices of loved ones who have passed.
By the way, speaking of morbid things, if you haven’t set up legacy settings for your Google account, I would recommend doing so as part of good digital hygiene. More information on this topic can be found here: Google Legacy Settings
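Tangent aside, if you would rather drive ElevenLabs from code than from the browser, the platform also exposes a REST API. Here is a minimal sketch of a text-to-speech request in Python, assuming the v1 text-to-speech endpoint and the eleven_multilingual_v2 model; the API key and voice ID below are placeholders you would swap for your own.

```python
import requests

# Placeholders: use your own API key (from the ElevenLabs dashboard) and a
# voice ID, which can be a platform voice or a clone you created.
API_KEY = "YOUR_ELEVENLABS_API_KEY"
VOICE_ID = "YOUR_VOICE_OR_CLONE_ID"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "model_id": "eleven_multilingual_v2",
    },
)
response.raise_for_status()

# The endpoint returns audio bytes (MP3 by default), which we save to disk.
with open("elevenlabs_sample.mp3", "wb") as f:
    f.write(response.content)
```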
If you go into your Amazon Console, there are 230 services you can choose from. I had to copy and paste the list into an Excel spreadsheet and remove the headers to get you this number. If you try to Google it, the answer is simply that “you can deploy multiple services in an AWS instance.” We’ll call it 200+ for accuracy, since the number of services is likely to change. Also, admittedly, my count might be slightly off because my foster kitten decided to pounce on my keyboard while I was working with the spreadsheet. Here’s a picture of the little orange boy if anyone needs a visual representation.
Amazon Polly was officially launched on November 29, 2016, which makes it the most mature solution in our comparison (based on time at market). It should also be noted that, in terms of technology timelines, this is an impressively exact date.
Polly was created to provide developers with an accessible and scalable way to integrate lifelike speech into their applications. The core idea was to leverage Amazon's deep learning expertise to convert text into natural-sounding audio. Because AI carries a cost that is directly correlated with compute power, the scalability ethos on which AWS services are built translates into an offering of four engine options. The base engine is called “standard,” and quality and generation cost step up through “neural” and “long-form” to “generative,” with the naturalness of the output also progressing with each engine.
Polly is only available in certain regions, and not every engine is available in every region, so make sure you consider that when setting it up. I used us-east-1 (N. Virginia) for my instance. This service is part of the AWS free tier if you want to try it out, which is exactly what I did for the purposes of this test.
For more information on Polly and pricing, here is the link to the AWS overview: https://aws.amazon.com/polly/
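To give a sense of what generating these samples looks like in practice, here is a minimal boto3 sketch that loops over the four engines, assuming a us-east-1 client and default AWS credentials. The voice/engine pairings are illustrative only; not every voice supports every engine in every region, so check the Polly voice list before running this.

```python
import boto3

# Minimal sketch: synthesize the same line with each of Polly's four engines.
# Voice/engine pairings below are illustrative; verify support in your region.
polly = boto3.client("polly", region_name="us-east-1")

voices = {
    "standard": "Joanna",
    "neural": "Joanna",
    "long-form": "Danielle",
    "generative": "Ruth",
}

text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

for engine, voice in voices.items():
    response = polly.synthesize_speech(
        Engine=engine,
        VoiceId=voice,
        OutputFormat="mp3",
        Text=text,
    )
    # The audio comes back as a streaming body; write one file per engine.
    with open(f"polly_{engine}.mp3", "wb") as f:
        f.write(response["AudioStream"].read())
```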
Admittedly, I was really excited to find Speechify. After weeks of slogging through technical documentation, I wanted a tool that would read articles to me from my browser. There are free Chrome extensions that do this, but their tinny, robotic voices only made me feel itchy and uncomfortable while I listened. Now, I have a whole commentary on what I think of the Speechify platform, compliments for the soothing richness of Henry’s voice, and thoughts on how this product is positioned, but in the interest of staying organized, you can find all of that here: (link to article) and we’ll stay high-level for this bit of background.
Founded by Cliff Weitzman, a dyslexic student who attended Brown University, Speechify is positioned as a platform that makes reading accessible, particularly for those with learning differences. One of the value propositions behind this tool is improved focus, and I would agree: tying two sensory inputs into learning new content, the visual input of reading alongside the auditory input of listening, promotes better retention. There is also a handy option to adjust the pace at which text is read, with increases translated into an efficiency metric. For example, listening at a pace of 1.10 makes you 10% more efficient at getting through material.
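For anyone who likes to see the arithmetic behind that claim, here is a quick back-of-the-envelope check; this is my own math, not anything from Speechify’s documentation.

```python
# At 1.10x playback you cover 10% more material per minute of listening,
# which works out to roughly 9% less time for the same article.
article_minutes = 60        # length of the article at normal (1.0x) pace
playback_speed = 1.10       # Speechify pace setting

listening_minutes = article_minutes / playback_speed
saved = article_minutes - listening_minutes
print(f"At {playback_speed}x, a {article_minutes}-minute read takes "
      f"{listening_minutes:.1f} minutes ({saved:.1f} minutes saved).")
```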
There is a bit of debate as to whether the company was founded in 2015 or 2017, but that tends to happen with tech companies, as startups try to seem more mature in their early years. It could also be in part because Weitzman was developing the system during his college years, and his initial self-built version was the genesis of Speechify. Alongside celebrity voices such as Snoop Dogg and Gwyneth Paltrow, you can also choose to have Cliff himself read you a page of text with his solution.
Browser-based with a Chrome extension, this app is geared towards accessibility and text narration.
For the test of these services, I thought it would be fun to give the AIs the challenge of a tongue twister, since this is something even most people struggle with. To showcase emotive quality, we started off with a question. The test text was as follows:
“How much wood would a woodchuck chuck if a woodchuck could chuck wood? Well, my friends, it turns out the answer is zero, because these adorable, chunky rodents, also known as groundhogs, are far more interested in munching on plants and tunneling through the earth than engaging in any lumberjack activities. But, did you know that a woodchuck's burrows can be quite extensive, sometimes reaching up to 66 feet in length with multiple entrances? These creatures are also surprisingly fast runners despite their stocky build. Not to be confused with their beaver cousins, woodchucks also happen to be excellent swimmers.
When they’re not munching or running or swimming, they’re usually sleeping. Woodchucks are among a group of animals that participate in the “great sleep,” slowing their heart rate and body temperature dramatically to hibernate during the cold months. Which brings us to a particularly famous woodchuck and furry meteorologist, Punxsutawney Phil. Punxsutawney Phil, is at the heart of Groundhog Day. Every February 2nd, this little fellow is roused from his winter slumber in Punxsutawney, Pennsylvania. Seeing his shadow and running back into his snug hole would suggest we're in for six more weeks of winter. If he doesn't, spring is supposedly just around the corner. While Phil's forecasting accuracy has a high False Negative Rate (FNR), his annual appearance undeniably provides a much-needed dose of mid-winter whimsy and a reminder that even the most grounded of us can, in certain moments, inspire a bit of fun.”
I did not make any adjustments to the audio outputs although that is an option with these tools. This was intended to be a raw test of capabilities, specifically looking at fluency and naturalness of speech output.
With all the outputs, I was impressed that none of them stumbled on words such as “Punxsutawney,” a pronunciation I would likely find difficult myself. That is probably because it is an established proper noun with a Wikipedia page: https://en.wikipedia.org/wiki/Punxsutawney,_Pennsylvania and a pronunciation guide somewhere. AI tools do have issues with made-up words, which I go into in more detail in my guide to adjusting ElevenLabs pronunciation: (insert link when written)
All the tools did struggle with the hyphen though, breaking “mid-winter” into “mid” and “winter” in a way that makes the audio samples easy to identify as not human. Which, admittedly, was part of the test. If you encounter a similar issue, you can simply remove the hyphen for a better output.
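If you would rather handle that automatically than edit by hand, a small pre-processing step does the trick. Here is a hypothetical helper of my own (not part of any of these tools) that swaps a hyphen between two words for a space before the text goes to the TTS engine.

```python
import re

def soften_hyphens(text: str) -> str:
    """Replace hyphens between word characters with a space so the TTS
    engine reads 'mid-winter' as 'mid winter' instead of breaking it."""
    return re.sub(r"(\w)-(\w)", r"\1 \2", text)

print(soften_hyphens("a much-needed dose of mid-winter whimsy"))
# -> "a much needed dose of mid winter whimsy"
```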
Also, did anyone else notice that the AWS long-form engine cut out what was in the parenthetical? I wasn’t able to find anything in the AWS documentation suggesting this would be expected, which I thought was interesting since the other samples did not have that issue.
There is a great comparison in the samples between the ElevenLabs voice clone and the Speechify voice clone, as they were based on recordings of the same voice but of different lengths. The ElevenLabs clone for this test was built from 3 hours of audio compared to Speechify’s 1-minute input. Input quantity versus output quality is a topic we look at in detail in another blog post: (insert when finished). To summarize what I noticed with this example, the Speechify clone had more of what I would call “robotic edges” to its vocal quality; it didn’t sound as natural or energetic as the ElevenLabs output.
Another interesting comparison is across the AWS engines. I tried to use a good cross-section of male and female voices to provide examples of each. Regardless of voice, there is a significant drop in fluidity moving from the Generative and Long-Form engines to the Neural and Standard outputs.
ElevenLabs Voice Clone
ElevenLabs Nature Narrator
AWS Polly Standard
AWS Polly Neural
AWS Polly Long Form
AWS Polly Generative
Speechify Custom Voice
Speechify Snoop Dogg
Speechify Gwyneth Paltrow
Of course we had to use a nature narrator voice for ElevenLabs, and, David Attenborough enthusiast that I am, that voice was my favorite of the bunch. Still, being critical, if I were producing something of a more finished quality, I would be inclined to speed up the pace, given the narration feels a bit unnaturally slow. This is easily accomplished in the ElevenLabs interface by adjusting the speed, as you can see in the screenshot below. Attenborough gravitations aside, my impression of these tools, stack-ranked by quality and fluency, is ElevenLabs first, followed by Speechify, followed by Polly, given that I simply hear more robot in the AWS options than in the other two. However, from a cost perspective, Polly is the more economically competitive and integrated solution, especially for deployments already in AWS.