They have been developed by IVONA and feature their new BrightVoice™ technology which employs a new language model to provide intelligent interpretation of text, and crystal clear sound achieved using noise and distortion reduction. Not only that, the new text to speech engine is up to 10x faster than the previous version, which is a real advantage when converting lengthy ebooks into audiobooks.
I have also decided to remove from sale all of the Nuance RealSpeak voices apart from the Australian voices Karen and Lee. The main reason is the licensing agreement placed on us by Nuance, which makes it uneconomical to continue to offer a large selection of RealSpeak voices. However, now that we can offer the premium IVONA voices in both US and UK accents, I don’t really think they will be missed. Note that existing customers who have purchased RealSpeak voices will still be supported.
2011 is going to be an exciting year for text to speech. As ebooks continue to take off, there is going to be more and more choice available to convert to audiobooks. I’m also hoping to finally be able to offer some high quality French, German and Spanish voices for use with Text2Go this year.
Happy New Year!
The other day I needed to splice some voice samples together for my post on RealSpeak Voice Pronunciation. I was using the free audio editing tool Audacity and happened to notice something disturbing about the waveform that had been generated. I was using the RealSpeak Samantha voice and it was quite clear that a certain amount of audio clipping had occurred.
You can see this in the regions I’ve highlighted in red, where the natural shape of the waveform looks to be cut off or clipped.
If we zoom right in so the individual waves are visible, you can clearly see that each peak has been chopped off.
Does this matter?
Yes. I’m no audio expert but we’re actually throwing away part of the signal and this will produce some audio distortion.
Can it be fixed?
Yes. The fix is as simple as adjusting the volume of the voice (don’t confuse this with the volume on your PC). You can adjust the volume of an individual voice using the Text2Go Options page. By default, the volume of all voices is set to Normal. By lowering this a couple of notches, the output for Samantha will no longer be clipped.
Converting the same text to speech produced the following waveform.
You can see that the waveform is no longer clipped at the top or the bottom.
Similarly, when we zoom in, each peak is nicely rounded and no longer chopped off.
Do other RealSpeak Voices suffer the same problem?
Serena and Tom also suffer some clipping, so if you use these voices make sure you adjust the volume setting down one or two notches. The other RealSpeak voices are not clipped at the Normal volume setting and don’t need to be adjusted.
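If you'd rather check for clipping without eyeballing a waveform, the idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the audio is 16-bit, clipping shows up as runs of three or more consecutive samples stuck at full scale, and the `synthesize` function is a hypothetical stand-in (a plain sine wave) for a voice, not anything Text2Go or RealSpeak actually exposes.

```python
import math

FULL_SCALE = 32767  # largest magnitude a 16-bit sample can hold


def synthesize(gain, n=200, period=50):
    """Hypothetical stand-in for a voice: a sine wave at the given gain.

    Anything louder than full scale is hard-limited, which is what
    produces the chopped-off peaks described above.
    """
    samples = []
    for i in range(n):
        value = gain * FULL_SCALE * math.sin(2 * math.pi * i / period)
        samples.append(max(-FULL_SCALE, min(FULL_SCALE, int(round(value)))))
    return samples


def clipped_fraction(samples, full_scale=FULL_SCALE, min_run=3):
    """Return the fraction of samples lying in flat runs at full scale.

    A single sample touching full scale is not counted; only runs of
    at least `min_run` consecutive full-scale samples are treated as clipping.
    """
    if not samples:
        return 0.0
    clipped = 0
    run = 0
    for s in samples:
        if abs(s) >= full_scale:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    if run >= min_run:  # account for a run that reaches the end
        clipped += run
    return clipped / len(samples)


if __name__ == "__main__":
    print("too loud:", clipped_fraction(synthesize(1.3)))
    print("turned down:", clipped_fraction(synthesize(0.8)))
```

The point it illustrates is the same as the fix above: turning the level down before the audio is generated avoids the flat tops altogether, whereas lowering the volume of a file that is already clipped only makes the flat tops quieter.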
During the course of adding a pronunciation editor to Text2Go, I’ve discovered some of the strengths and weaknesses of the RealSpeak voices when it comes to pronunciation. Pronunciation errors are quite rare, making it hard to build up a large collection of mispronounced words. Text2Go’s new pronunciation editor makes this very easy.
Now that I’ve identified an extensive list of mispronounced words, it’s possible to spot some trends and discover which voice is the most accurate.
Firstly, I’ve found that compound words can cause problems (e.g. afterword, longterm, screenshot). Most common compound words are fine, but brand names made up of two words run together are often mispronounced. It’s very easy to correct these mispronunciations: you just separate the two words with a hyphen or space (e.g. after-word, long term, screen-shot). This occurs often enough that I’ve added a way to identify compound words in the pronunciation editor. I’ve found that Samantha is significantly better at pronouncing compound words than all the other RealSpeak voices.
A similar problem occurs with words having the prefix re-, for example reprogram, repurposed and rereleased. In these cases the re- is not identified as the re- prefix. Again the solution is simple: just add a hyphen after the re (e.g. re-program, re-purposed, re-released). Once again, Samantha does a better job of pronouncing re- prefixed words.
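To make the idea concrete, here is a minimal sketch of how corrections like these could be applied to text before it is handed to a voice. It is not how Text2Go's pronunciation editor is implemented; the `CORRECTIONS` table and the `apply_corrections` function are hypothetical names of my own, filled in with the example words from this post.

```python
import re

# Hypothetical corrections table, using the example words from this post.
CORRECTIONS = {
    "afterword": "after-word",
    "longterm": "long term",
    "screenshot": "screen-shot",
    "reprogram": "re-program",
    "repurposed": "re-purposed",
    "rereleased": "re-released",
}

# Match whole words only, ignoring case, so "Screenshot" is caught
# but a longer word such as "screenshots" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CORRECTIONS) + r")\b",
    re.IGNORECASE,
)


def apply_corrections(text):
    """Replace each known mispronounced word with its corrected spelling."""

    def replace(match):
        word = match.group(0)
        fixed = CORRECTIONS[word.lower()]
        # Preserve a leading capital, e.g. at the start of a sentence.
        if word[0].isupper():
            fixed = fixed[0].upper() + fixed[1:]
        return fixed

    return _PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(apply_corrections("Take a screenshot, then reprogram it."))
    # prints "Take a screen-shot, then re-program it."
```

Matching whole words keeps the table predictable, at the cost of needing a separate entry for each plural or other variant.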
In order to hear the differences for yourself, I’ve chosen 10 mispronounced words and 4 voices. The table below contains each voice’s attempt to speak the word without any correction applied. Note – I’ve used a dash to indicate a passable but not perfect pronunciation.
The Uncorrected row contains the voice’s uncorrected pronunciation attempt and the Corrected row contains the pronunciation after corrections have been applied. Notice that once corrections have been applied, all voices pronounce all words correctly.
One set of results that surprised me was Tom’s. When I started writing up this post I was sure that Samantha was way ahead of the other voices. However, this result shows that Tom is also a worthy contender. I’m still sure that Samantha has the most accurate pronunciation, but the margin is not as great as I imagined.
My hunch is that Samantha is based on slightly newer technology and it’s the reason why the Samantha voice file is around 110MB in size whereas the others are around 70-90MB.
So does this mean that Samantha is the best voice and the one I should always use? What about regional differences?
Do voices from different regions pronounce words differently? Most definitely! Take the Australian voices Karen and Lee as examples. Not only do they have Australian accents, they correctly pronounce local Australian place names, whereas the other English voices can be way off. Listen to the following Australian place names (of Aboriginal origin) spoken first by Samantha (US English) and then Karen (Australian English).
Pronunciation is only one criterion on which to choose a voice. I believe it’s more important to choose a voice you enjoy listening to. If you like the sound of Samantha then definitely choose her, but if you prefer the sound of one of the other voices, go with them. Remember that pronunciation errors in normal day to day text are quite rare for any of the RealSpeak voices.
To test these assumptions, I dug up some of my recent sales data for computerized voices. When you purchase Text2Go you can also choose to purchase one or more high quality RealSpeak voices. I looked at all purchases that included a single voice. I disregarded any purchase that included both male and female voices and any subsequent voice purchases. I also disregarded all sales of the Indian English female voice Sangeeta as there is no corresponding male Indian English voice.
It’s clear that men have a strong preference for the female voice, which is what I expected. What is surprising is that women also prefer the female voice, albeit to a much lesser extent. In fact the statistics show women don’t have a strong preference one way or the other. My wife certainly prefers the male computerized voices, so I’d expected this to be the case for the general population.
Not wanting to let hard data get in the way of assumptions and gut feelings, here are a couple of reasons that may explain the unexpected results for the women.
1. There are characteristics of the female voice that make it easier to produce a more natural sounding computerized voice.
2. We currently sell 7 female voices but only 3 male voices. This extra choice may give some bias to female voices. It may also be an indication that the developers of computerized voices have recognised the popularity of the female voice. Note that the US, UK and Australian accents have both male and female voices.
What’s your preference? Have a listen to the computerized voices on the Text2Go website and let me know.
We’ve found that sometimes the RealSpeak text to speech voices have a tendency to speak too quickly, making some text difficult to follow. Here is an example of two passages of text read by RealSpeak Karen, an Australian English voice, the first spoken at normal speed, the second spoken at a slightly slower speed.
Photo courtesy of Tub Gurnard
You can easily adjust the speed of a voice using the Text2Go options page. The slower speed sample was generated with the voice speed reduced a couple of notches. You can adjust the voice speed and hear the effect immediately by using the play button to listen to the sample text on the options page.
You may find that your preferences change over time. As you become more accustomed to listening to a voice, you may wish to increase its speed again.
I know that many visually-impaired users adjust the voices of their screen readers to speak at very high rates so they can efficiently navigate around the screen.