eSpeak and Festival

jsd@xxxxxxxxxxx (Jonathan Duddington) · Sun, 13 Jul 2008 12:54:19 +0100

On 13 Jul, Hynek Hanke <hanke at brailcom.org> wrote:

> It would be great if somebody who thinks that Festival
> is actually worse than eSpeak in quality of speech
> could try to elaborate more about the reasons.

It depends what you mean by "quality".

There is no doubt that the good Festival voices sound more human than
eSpeak.

I'm not blind, but I use text-to-speech a lot for reading blogs, news
articles, etc.  The main reasons why I prefer to listen to eSpeak
rather than Festival are:

1.  Clarity.  The eSpeak voice (I use British English) sounds clearer,
sharper, and more articulated.  An alternative description might be
"artificial and harsh".

The perceived quality of eSpeak may depend on your loudspeakers.  I use
a domestic sound system with big speakers and it sounds good to me. 
But eSpeak has less "bass" and more mid-frequencies than other
synthesizers, and perhaps that's less suitable for small computer
speakers where it sounds more "harsh"?  People have experimented with
new eSpeak "voice variants" with changes to the "tone" and "formant"
parameters to change the tonal balance.

2.  Intonation (the changes in pitch during a sentence).  Festival
seems more "flat" or "boring".  I prefer eSpeak's more lively
intonation (although that may not sound good for some languages). 
Perhaps it's possible to make a new improved intonation algorithm in
Festival.

Note that you can use eSpeak as a front-end to an Mbrola diphone voice,
so you get eSpeak's intonation with a more natural sounding voice
(intonation with Mbrola was improved in eSpeak version 1.31 and later).
See http://espeak.sf.net/mbrola.html
Try comparing Festival with eSpeak+Mbrola.
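
For anyone who wants to run that comparison on the eSpeak side, here is
a minimal sketch that builds the two command lines.  It assumes the
espeak command-line program is on your PATH, that "en" selects the
English voice, and that Mbrola and its "en1" voice data are installed
so that the "mb-en1" voice works.  The helper names and the sample
sentence are made up for illustration; they are not part of eSpeak.

```python
import shutil
import subprocess


def espeak_command(text, voice="en", wav_path=None):
    """Build an argument list for the espeak command-line program.

    -v selects the voice and -w writes a WAV file instead of playing
    the audio.  Nothing is executed here.
    """
    cmd = ["espeak", "-v", voice]
    if wav_path is not None:
        cmd += ["-w", wav_path]
    cmd.append(text)
    return cmd


def speak(text, voice="en", wav_path=None):
    """Run espeak, failing early if the program is not installed."""
    if shutil.which("espeak") is None:
        raise RuntimeError("espeak was not found on PATH")
    subprocess.run(espeak_command(text, voice, wav_path), check=True)


# Hypothetical sample sentence for a side-by-side listening test.
SAMPLE = "The quick brown fox jumps over the lazy dog."

# eSpeak's own synthesis versus eSpeak as a front-end to Mbrola.
plain_cmd = espeak_command(SAMPLE, voice="en", wav_path="plain.wav")
mbrola_cmd = espeak_command(SAMPLE, voice="mb-en1", wav_path="mbrola.wav")
```

Calling speak(SAMPLE, "en", "plain.wav") and then
speak(SAMPLE, "mb-en1", "mbrola.wav") should leave two WAV files to
listen to next to the same sentence from Festival.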

> This is why eSpeak is the current default in Speech Dispatcher
> because it is initially easier to get running and it covers a great
> span of languages. The documentation however strongly suggests
> users whose language is supported by Festival to try it as their
> primary synthesizer for a better voice quality.

That is good advice, especially since the quality of different
languages in eSpeak is very variable.