Sorry, my English is not good. How can I make it speak Chinese? Could you
please give a sample command line? Thank you!
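[Editor's note: a hedged sketch of what such a command line might look like. Coqui TTS ships pretrained models for several languages, and `tts --list_models` prints the available names; the Mandarin model name below is taken from Coqui's published model list at the time and may differ in your version, so verify it against the `--list_models` output.]

```shell
# List every model this TTS installation knows about;
# Chinese models appear under the "zh-CN" language tag.
tts --list_models

# Synthesize Mandarin with a zh-CN model (exact model name is an
# assumption -- check your own --list_models output first).
tts --model_name "tts_models/zh-CN/baker/tacotron2-DDC-GST" \
    --text "你好，这是一个测试句子。" \
    --out_path hello.wav
```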
On Wed, 9 Feb 2022, Linux for blind general discussion wrote:
Date: Wed, 09 Feb 2022 11:59:26 +0000
From: Linux for blind general discussion <blinux-list@xxxxxxxxxx>
To: blinux-list@xxxxxxxxxx
Subject: Re: Coqui TTS has blew my mind!
Hello Chrys,
I think the problem is that Python 3.10 is not supported as of now.
https://pypi.org/project/TTS/
Though I'm not sure why. Maybe some of the backing libraries are not
yet compatible; I remember this being a problem in the past with new
releases of TensorFlow.
Perhaps a virtual environment with a lower Python version could do the trick?
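[Editor's note: a minimal sketch of that workaround, assuming an older interpreter such as Python 3.9 is already installed on the system; the venv path is arbitrary.]

```shell
# Create an isolated environment pinned to the older interpreter.
python3.9 -m venv ~/tts-venv
source ~/tts-venv/bin/activate

# Inside the venv, pip resolves packages against Python 3.9,
# so the TTS distribution should be found again.
pip install TTS
```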
Best regards
Rastislav
On 9 Feb 2022 at 11:48, Linux for blind general discussion wrote:
Howdy,
just want to try coqui again (after a while) and just got this:
$ pip3 install TTS
Defaulting to user installation because normal site-packages is not
writeable
ERROR: Could not find a version that satisfies the requirement TTS
ERROR: No matching distribution found for TTS
any ideas?
cheers chrys
On 09.02.22 at 11:40, Linux for blind general discussion wrote:
Howdy Rastislav,
Yeah, Coqui is awesome. It was initially part of Mozilla's TTS and STT efforts.
We really should have a speech-dispatcher driver for that :).
By the way, keep up your great work! Just take a look at the C#
speech-dispatcher bindings.
cheers chrys
On 09.02.22 at 11:25, Linux for blind general discussion wrote:
Hello everyone,
Maybe I've discovered America, but yesterday I quite randomly came
across:
https://erogol.github.io/ddc-samples/
And the voice completely blew my mind!
Like, I knew the TTS area had advanced significantly in recent
years, but I thought the new neural voices were mostly closed features of
companies like Google or Microsoft.
I had no idea we had something so beautiful on Linux and completely
open source!
Plus, it's not just the license that makes this so interesting, but also
the usability.
There were the DeepMind papers even before, and some open projects trying
to implement them, but the level of completeness and usability varied
significantly. Even when a project was usable, getting it to work required
some effort (at least for the projects I saw).
With Coqui, the situation is completely different.
As the above-mentioned blog says, all you need to do is:
$ pip3 install TTS
$ tts --text "Hello, this is an experimental sentence."
And you have a synthesized result!
Or you can launch the server:
$ tts-server
And play in the web browser. Note that the audio is sent only after it's
fully synthesized, so you'll need to wait a bit to hear it.
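[Editor's note: in the versions I have seen, the demo server also exposes a plain HTTP endpoint that scripts can hit directly; the default port 5002 and the `/api/tts` route are assumptions here, so check the server's startup output before relying on them.]

```shell
# Start the demo server in the background
# (listens on http://localhost:5002 by default).
tts-server &

# Once the model has loaded, fetch synthesized speech over HTTP.
# --data-urlencode takes care of escaping the text for the URL.
curl -G 'http://localhost:5002/api/tts' \
     --data-urlencode 'text=Hello from the command line.' \
     -o hello.wav
```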
The only problematic part is the limit of decoder steps, which is set to
500 by default.
I'm not sure why they set it so low; with this value, the TTS is
unable to speak longer sentences.
Fortunately, the fix is very easy. All I needed to do was to open
~/.local/lib/python3.8/site-packages/TTS/tts/configs/tacotron_config.py
and modify the line:
max_decoder_steps: int = 500
to
max_decoder_steps: int = 0
which seems to disable the limit.
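[Editor's note: the same edit can be scripted. A sketch, assuming the user-level install path quoted above (adjust the Python version in the path to your setup); `sed -i.bak` keeps a backup of the original file.]

```shell
CONFIG="$HOME/.local/lib/python3.8/site-packages/TTS/tts/configs/tacotron_config.py"

if [ -f "$CONFIG" ]; then
  # Replace the 500-step limit with 0, which disables it.
  sed -i.bak 's/max_decoder_steps: int = 500/max_decoder_steps: int = 0/' "$CONFIG"
  grep 'max_decoder_steps' "$CONFIG"
else
  echo "Config not found -- adjust CONFIG for your Python version." >&2
fi
```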
After this step, I can synthesize very long sentences, and the quality
is absolutely glamorous!
So I wanted to share. I may actually be the last person here to discover
it, though I did not see it mentioned in TTS discussions on this list.
I've even thought about creating a speech-dispatcher version of this. It
would certainly be doable, though I'm afraid of what the synthesis would
sound like with the irregularities of navigating with a screen reader.
These voices are intended for reading longer texts and consistent
phrases, with punctuation, complete information etc.
The intonation would probably get a bit weird with, for example, just
half a sentence, as happens when navigating a document or webpage line by
line.
Another limitation would be speed. On my laptop, the realtime
factor (processing duration / audio length) is around 0.8, which means it
could handle real-time synthesis at the default speed without delays.
The situation would get more complicated with higher speeds, though.
It wouldn't be impossible, but one would need a GPU to handle
significantly higher speech rates.
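[Editor's note: a quick back-of-the-envelope check of that claim; the 0.8 figure is the one measured in the message above and is hardware-dependent.]

```python
# Real-time factor: time spent synthesizing divided by the duration of
# the audio produced. RTF < 1 means synthesis outpaces playback.
rtf = 0.8  # measured value from the message; varies per machine

# At a playback speed multiplier s, audio is consumed s times faster,
# so keeping up requires rtf * s < 1, i.e. s < 1 / rtf.
max_speedup = 1 / rtf
print(f"Keeps up with playback only up to about {max_speedup:.2f}x speed")
# → Keeps up with playback only up to about 1.25x speed
```

So even a modest rate increase pushes the synthesis past real time on this hardware, which is why higher speech rates would call for a GPU.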
So I wonder.
But anyway, this definitely made my day. :)
Best regards
Rastislav
_______________________________________________
Blinux-list mailing list
Blinux-list@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/blinux-list