Love this.
It says MIT license but then readme has a separate section on prohibited use that maybe adds restrictions to make it nonfree? Not sure the legal implications here.
Good quality, but unfortunately it's English-only.
I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.
Cool tech demo though!
Nice!
Just made it an MCP server so Claude can tell me when it's done with something :)
https://github.com/Marviel/speak_when_done
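For anyone curious what the "speak when done" wiring looks like, here is a rough stdlib-only sketch of handling an MCP-style `tools/call` request that shells out to a local TTS. This is an illustration, not the linked repo's actual code: the real MCP protocol also has an initialization handshake and tool discovery, and the `pocket-tts` command name is a placeholder.

```python
import json
import subprocess
import sys

def handle_request(req: dict) -> dict:
    """Dispatch one JSON-RPC request; supports a single 'speak' tool call."""
    if req.get("method") == "tools/call" and req["params"]["name"] == "speak":
        text = req["params"]["arguments"]["text"]
        try:
            # Shell out to a local TTS CLI (command name is a placeholder).
            subprocess.run(["pocket-tts", text], check=False)
        except FileNotFoundError:
            pass  # No TTS binary installed; still report what would be spoken.
        result = {"content": [{"type": "text", "text": f"spoke: {text}"}]}
    else:
        result = {"error": "unknown method"}
    return {"jsonrpc": "2.0", "id": req.get("id"), "result": result}

if __name__ == "__main__":
    # Minimal stdio loop: one JSON-RPC message per line.
    for line in sys.stdin:
        print(json.dumps(handle_request(json.loads(line))), flush=True)
```

The agent then calls the `speak` tool as its last step, so the notification is just another tool invocation from its point of view.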
Oh this is sweet, thanks for sharing! I've been a huge fan of Kokoro and even set up my own fully-local voice assistant [1]. Will definitely give Pocket TTS a go!
[1] https://github.com/acatovic/ova
Kokoro is better for TTS by far.
For voice cloning I can't tell, since Pocket TTS's cloning is gated.
Thanks for sharing your repo, it looks super cool. I'm planning to try it out. Is it based on MLX or just HF Transformers?
Thank you, just transformers.
It's cool how lightweight it is. Recently added Pocket support to Vision Agents. https://github.com/GetStream/Vision-Agents/tree/main/plugins...
Is there something similar for STT? I'm using distilled Whisper models and they work OK, but sometimes they get what I say completely wrong.
from the other day https://github.com/cjpais/Handy
I love that everyone is making their own TTS model, since they aren't as expensive to train as many other models. There are also plenty of different architectures.
Another recent example: https://github.com/supertone-inc/supertonic
Another one is Soprano-1.1.
It seems like it is being trained by one person, and it is surprisingly natural for such a small model.
I remember when TTS always meant the most robotic, barely comprehensible voices.
https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...
https://huggingface.co/ekwek/Soprano-1.1-80M
In-browser demo of Supertonic with WASM:
https://huggingface.co/spaces/Supertone/supertonic-2
Thank you. Very good suggestion with code available and bindings for so many languages.
Relative to AmigaOS translator.device + narrator.device, this sure seems bloated.