Hello everyone,

Maybe I've discovered America, but yesterday, more or less by accident, I came across: https://erogol.github.io/ddc-samples/

And the voice completely blew my mind! I knew the TTS field had advanced significantly in recent years, but I thought the new neural voices were mostly closed features of companies like Google or Microsoft. I had no idea we had something so beautiful on Linux, and completely open source!

Plus, it's not just the license that makes this so interesting, but also the usability. There were the DeepMind papers before, and some open projects trying to implement them, but the level of completeness and usability varied significantly; even when a project was usable, getting it to work required some effort (at least in the projects I saw). With Coqui, the situation is completely different. As the above-mentioned blog says, all you need to do is:

$ pip3 install TTS
$ tts --text "Hello, this is an experimental sentence."

And you have a synthesized result! Or you can launch the server:

$ tts-server

And play with it in the web browser. Note that the audio is sent only after it has been fully synthesized, so you'll need to wait a bit before you can listen to it.

The only problematic part is the limit on decoder steps, which is set to 500 by default. I'm not sure why they set it so low; with this value, the TTS is unable to speak longer sentences. Fortunately, the fix is very easy. All I needed to do was open ~/.local/lib/python3.8/site-packages/TTS/tts/configs/tacotron_config.py and change the line:

max_decoder_steps: int = 500

to:

max_decoder_steps: int = 0

which seems to disable the limit. After this step, I can synthesize very long sentences, and the quality is absolutely glamorous!

So I wanted to share. I may actually be the last person here to discover it, though I haven't seen it mentioned in the TTS discussions on this list.

I've even thought about creating a Speech Dispatcher module for it. It would certainly be doable, though I'm afraid of what the synthesis would sound like with the irregularities of navigating with a screen reader. These voices are intended for reading longer texts and consistent phrases, with punctuation, complete information, etc. The intonation would probably get a bit weird with, for example, just half a sentence, as happens when navigating a document or web page line by line.

Another limitation would be speed. On my laptop, the real-time factor (processing duration / audio duration) is around 0.8, which means it can handle real-time synthesis at the default speech rate without delays. The situation would get more complicated at higher rates, though. It wouldn't be impossible, but one would need a GPU to handle significantly higher speech rates. So I wonder.

But anyway, this definitely made my day. :)

Best regards

Rastislav
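
P.S. A few sketches in case anyone wants to script this; these are untested guesses on my part, not something from the post above or the official docs. First, calling the synthesis from Python instead of the tts command. I'm assuming a version of the package that ships the high-level TTS.api interface, and guessing the model name from the samples page (the English Tacotron2-DDC model):

# minimal sketch: synthesize to a file from Python instead of the CLI
# (TTS.api and the model name below are my assumptions, version-dependent)
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello, this is an experimental sentence.",
                file_path="hello.wav")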
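
If you have tts-server running, you should also be able to fetch the audio over HTTP. Port 5002 and the /api/tts endpoint are what I believe the server uses by default, so double-check them on your install:

# rough sketch: request a WAV from a locally running tts-server
# (port and endpoint are assumptions about the server's defaults)
import urllib.parse
import urllib.request

text = urllib.parse.quote("Hello, this is an experimental sentence.")
with urllib.request.urlopen("http://localhost:5002/api/tts?text=" + text) as r:
    with open("hello_server.wav", "wb") as f:
        f.write(r.read())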
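
To check the max_decoder_steps value without digging through site-packages by hand, the import path follows directly from the file I edited above, so something like this should print the current default:

# prints 500 on a stock install, 0 after the edit described above
from TTS.tts.configs.tacotron_config import TacotronConfig

print(TacotronConfig().max_decoder_steps)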
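
And for the real-time factor I mentioned, a rough way to measure it yourself (same TTS.api assumption as in the first sketch): time the synthesis, then divide by the duration of the produced audio. Anything below 1.0 keeps up with playback at the default rate; at roughly double the rate you'd need about half that.

# rough sketch: real-time factor = processing time / audio duration
import time
import wave

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

start = time.time()
tts.tts_to_file(text="A reasonably long sentence gives a more stable "
                     "measurement than a very short one.",
                file_path="rtf_test.wav")
processing = time.time() - start

# read the duration of the WAV we just wrote
with wave.open("rtf_test.wav", "rb") as w:
    audio_seconds = w.getnframes() / w.getframerate()

print(f"real-time factor: {processing / audio_seconds:.2f}")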