VibeVoice: A Frontier Open-Source Text-to-Speech Model

(microsoft.github.io)

448 points | by lastdong 6 days ago

94 comments

  • simiones 6 days ago

    I read the comments praising these voices as very lifelike, and went to the page primed to hear very convincing voices. That is not at all what I heard though.

    The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI generated voice you hear all over YouTube shorts is at least as good as most of the samples on this page.

    The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample, that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that, and (2) the different character systems make it extremely clear that the model needs to switch between different languages. Peut-être que cela n'aurait pas été si simple if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).

    And, of course, the singing part is painfully bad; I am very curious why they even included it.

    • Uehreka 6 days ago

      Their comments about the singing and background music are odd. It’s been a while since I’ve done academic research, but something about those comments gave me a strong “we couldn’t figure out how to make background music go away in time for our paper submission, so we’re calling it a feature” vibe as opposed to a “we genuinely like this and think it’s a differentiator” vibe.

    • jstummbillig 6 days ago

      Is there any better model you can point at? I would be interested in having a listen.

      There are people – and it does not matter what it's about – that will overstate the progress made (and others will understate it, case in point). Neither should put a damper on progress. This is the best I personally have heard so far, but I certainly might have missed something.

    • rcarmo 6 days ago

      One of the things this model is actually quite good at is voice cloning. Drop a recorded sample of your voice into the voices folder, and it just works.

    • IshKebab 6 days ago

      I agree. For some reason the female voices are waaay more convincing than the male ones too, which sound barely better than speech synthesis from a decade ago.

    • iansinnott 6 days ago

      The English/Mandarin section was VERY impressive. The accents of both the woman speaking English and the man speaking Chinese were spot on. Both sound very convincingly like they are speaking a second language, which anyone here can hear in the voice of the Chinese woman speaking English. I'd like to add that the foreigner speaking Chinese was also spot on.

    • odie5533 6 days ago

      It's good but not the best free model. I find Chatterbox to be more realistic, with no robotic sound and better (though not perfect) intonation.

    • echelon 6 days ago

      This is close to SOTA emotional performance, at least the female voices.

      I trust the human scores in the paper. At least my ear aligns with that figure.

      With stuff like this coming out in the open, I wonder if ElevenLabs will maintain its huge ARR lead in the field. I really don't see how they can continue to maintain a lead when their offering is getting trounced by open models.

    • skripp 6 days ago

      The male Chinese speakers had THICK American accents. Nothing really wrong with the language, but think of the stereotypical German speaking English. That was kind of strange to me.

    • mclau157 6 days ago

      ElevenLabs has a much more convincing voice model.

    • johanyc 6 days ago

      The Chinese is good. In the Mandarin-to-English example, she sounds native. The English-to-Mandarin one sounds good too, but he does have an English speaker's accent, which I think is intentional.

    • MengerSponge 6 days ago

      > (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that

      https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect

  • giancarlostoro 6 days ago

    I really hope someone within Microsoft is naming their open source coding agent Microsoft VibeCode. Let this be a thing. It's either that or "Lo", then you can have Lo work with Phi, so you can Vibe code with Lo Phi.

    https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...

    • simiones 6 days ago

      Knowing the history of Microsoft marketing, it will either be called something like "Microsoft Copilot Code Generator for VSCode" or something like "Zunega"...

    • watsonmusic 6 days ago

      genius

  • malnourish 6 days ago

    This is clearly high quality but there's something about the voices, the male voices in particular, which immediately register as computer generated. My audio vocabulary is not rich enough to articulate what it is.

    • heeton 6 days ago

      I'm no audio engineer either, but those computer voices sound "saw-tooth"y to me.

      From what I understand, it's more basic models/techniques that are undersampling, so there is a series of audio pulses which give it that buzzy quality. Better models produce smoother output.

      https://www.perfectcircuit.com/signal/difference-between-wav...

    • codebastard 6 days ago

      I would describe it as blocky: if we visualise the sound wave, it seems to have no peaks, as though it were cut off at the top and bottom, producing a metallic, boxy echo.

    • lvncelot 6 days ago

      After hearing them myself, I think I know what you mean. The voices get a bit warbly and sound at times like they are very mp3-compressed.

  • strangescript 6 days ago

    The male voices seem much worse than the female voices, borderline robotic. Every sample on their website starts with a female voice. They clearly are aware of the issue.

    • jsomedon 6 days ago

      I felt the same, the male voice feels kinda artificial.

  • davorak 6 days ago

    Any insight on why the code and the large model were removed? Some copies are floating around and are MIT licensed. In cases like this I do not know why the projects are yanked. If the project was mistakenly released under MIT and copied elsewhere, is any damage control possible by yanking the copies you have control over? Mostly seems like bad PR, if minor.

    • androiddrew 5 days ago

      OK, does anyone have a link to the code and weights?

    • fivestones 6 days ago

      Wondering this too.

  • aargh_aargh 6 days ago

    Is there a current, updated list (ideally, a ranking) of the best open weights TTS models?

    I'm actually more interested in STT (ASR) but the choices there are rather limited.

  • TheAceOfHearts 6 days ago

    Unfortunately it's not usable if you're GPU-poor. Couldn't figure out how to run this with an old 1080. I tried VibeVoice-1.5B on my old CPU with torch.float32 and it took 832 seconds to generate a 66 second audio clip. Switching from torch.bfloat16 also introduced some weird sound artifacts in the audio output. If you're GPU-poor the best TTS model I've tried so far is Kokoro.
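
    For scale, the figures above work out to a real-time factor of about 12.6, meaning roughly 12.6 seconds of compute for every second of audio. A minimal sketch of that arithmetic follows; the function name is made up for illustration and is not part of VibeVoice or PyTorch.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures quoted above: 832 s of CPU time to generate a 66 s clip.
rtf = real_time_factor(832, 66)
print(f"RTF = {rtf:.1f} (values above 1 are slower than real time)")
```

    Anything above 1 cannot keep up with playback, so this setup is only practical for offline batch generation.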

    Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and it generates an annotated output, which can be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything correctly in a single pass.

    • tempodox 6 days ago

      This is ludicrous. macOS has had text-to-speech for ages with acceptable quality, and they never needed energy- and compute-expensive models for it. And it reacts instantly, not after ridiculous delays. I cannot believe this hype about “AI”, it’s just too absurd.

  • baxuz 6 days ago

    Looking forward to the day when TTS and speech recognition will work for Croatian, or other less prevalent languages.

    It seems that it's only variants of English, Spanish and Chinese which are somewhat working.

    • lukax 6 days ago

      Have you tried Soniox for speech recognition? It supports Croatian. Or are you just looking for self-hosted open-source models? Soniox is very cheap ($0.10/h for async, $0.12/h for real-time) and you get $200 in free credits on signup.

      https://soniox.com/

      Disclaimer: I used to work for Soniox

  • Insanity 6 days ago

    What an odd name to me, because "Vibe" is, in my mind, equal to somewhat poor quality. Like "Vibe Coding". But that's probably just some bias from my side.

    • mxfh 6 days ago

      Vibe coding just became a term this spring. I doubt that the substantial parts, like giving it a project code name and getting company approval for this research project, started after that. It's not like vibe has a negative connotation in general yet.

    • andrew_lettuce 6 days ago

      Vibe always meant "specific feel" and makes sense related to AI coding "by touch" vs. understanding what's actually happening. It's just the results have now made the word pejorative.

  • rafaelmn 6 days ago

    The Spontaneous Emotion dialog sounds like a team member venting through LLMs.

    They could have skipped the singing part, it would be better if the model did not try to do that :)

  • Meneth 6 days ago

    Open-source, eh? Where's the training data, then?

    • Joel_Mckay 6 days ago

      Scraped data is often full of copyright, usage agreement, and privacy law violations.

      Making it "open" would be unwise for a commercial entity. =3

  • crvdgc 6 days ago

    Very impressive that it can reproduce the Mandarin accent when speaking English and English accent when speaking Mandarin.

  • stuffoverflow 6 days ago

    VibeVoice-Large is the first local TTS that can produce convincing Finnish speech with little to no accent. I tinkered with it yesterday and was pleasantly surprised at how good the voice cloning is and how it "clones" the emotion in the speech as well.

  • lxe 6 days ago

    There are 2 "best" TTS models out right now: HiggsAudio and VibeVoice. I found that Higgs is both faster and much higher fidelity than Vibe. Can't speak to expressiveness, but don't sleep on it.

  • data-ottawa 6 days ago

    Looks like the repo went private

    https://github.com/microsoft/VibeVoice

    I was trying to get this working on Strix Halo.

  • glenstein 6 days ago

    Very good and I could see how I might believe they are real people if I let my guard down. The male voice sounded a little sedated though and there was a smoothness to it that could be samey over long stretches.

    Still not at the astonishing level of Google Notebook text to speech which has been out for a while now. I still can't believe how good that one is.

  • regularfry 6 days ago

    Ok, this is nit-picking, but it's very obvious that the sample voices these were trained with were captured in different audio environments. There's noticeable reverb on the male voice that's not there on the other.

    So that's a useful next step: for multi-voice TTS models, make them sound like they're in the same room.

  • viggity 6 days ago

    I feel like this is a step in the right direction, but a lot of emotive text-to-speech models are only changing the duration and loudness of each word; the timing and pauses are better too.

    I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.

  • mpaepper 6 days ago

    Unfortunate naming, given that 7 months ago I named my repo, which does open-source, locally running speech-to-text, vibevoice:

    https://github.com/mpaepper/vibevoice

  • cush 6 days ago

    To me this is like early generative AI art, where the images came out very "smooth" and visually buttery; here, there's no timbre to the voices. Intonation issues aside, these models could use a touch of vocal fry and some body to be more believable.

  • bityard 6 days ago

    I thought the name sounded familiar. I'm guessing it's no relation to this project, which has been around for 7 months? https://github.com/mpaepper/vibevoice

  • ndkap 6 days ago

    Here is AI being as close as possible to the most animated person I know and here I am sounding robotic in every conversation I have, despite my best efforts to sound otherwise. Sometimes, I just wish I could have an AI speak for me

  • faxmeyourcode 6 days ago

    I tried the Colab notebook that they link to and couldn't replicate the quality for whatever reason. I just swapped out the text and let it run on the introduction paragraph of Metamorphosis by Franz Kafka and it seemingly could not handle the intricacies.

  • wewewedxfgdf 6 days ago

    I'm really hoping one day there will be a TTS that does really nice British accents - I've surveyed them all deeply, none do.

    Most that claim to do a British accent end up sounding like Kelsey Grammer - sort of an American accent pretending to be British.

    • specproc 6 days ago

      I'd like one that really nails Brummie.

    • xp84 5 days ago

      I’m just a yank, but a lot of the AI-voiced videos on YouTube that I’ve been listening to while I’m falling asleep lately have British voices that sound quite nice to me.

  • lyu07282 6 days ago

    Did they delete the repo? It's 404 for me now: https://github.com/microsoft/VibeVoice

    • RealtyDAO 5 days ago

      they must have removed it.. been down for hrs.

  • bazlan 6 days ago

    Sad to not see vui in the comparisons!

    A 100M podcast model

    https://huggingface.co/spaces/fluxions/vui-space

  • ementally 6 days ago

    they vibecoded their demo website? the text is invisible on Firefox.

    • double_one 6 days ago

      Same problem here. A quick refresh solved it for me — maybe try that?

    • recursive 6 days ago

      Works for me

  • anarticle 6 days ago

    The first example sounds like a cry for help.

    Some of them have tone wobbles which iirc was more common in early TTS models. Looks like the huge context window is really helping out here.

  • qwertytyyuu 6 days ago

    Woah, they even imitate the western Chinese accent well

  • baal80spam 6 days ago

    Wow. I admit that I am not a native speaker, but this looks (or rather, sounds) VERY impressive and I could mistake it for hearing two people talking.

    • x187463 6 days ago

      The giveaway is they will never talk over each other. Only one speaker at a time, consistently.

    • tracker1 6 days ago

      Yeah, a lot of the TTS has gotten really impressive in general. Definitely a clear leap from the TTS stuff I worked with for training simulations a bit over a decade ago. Aside: Installing a sound card (unused) on a Windows server just to be able to generate TTS was interesting. It was required by the platform, even if it wasn't used for it.

      I generally don't like a lot of the AI generated slop that's starting to pop up on YouTube these days... I do enjoy some of the reddit story channels, but have completely stopped with it all now. With the AI stuff, it really becomes apparent with dates/ages and when numbers are spoken. Dates/ages/timelines are just off as far as story generation, and really should be human tweaked. As to the voice gen, saying a year or measurement is just not how English speakers (US or otherwise) speak.

  • ml_basics 6 days ago

    what's the relationship between this work and the recently announced voice models from Microsoft AI? https://microsoft.ai/news/two-new-in-house-models/

  • ehutch79 6 days ago

    The examples are kind of off-putting. We're definitely in uncanny valley territory here.

  • nextworddev 6 days ago

    Still haven’t found anything better than Kokoro TTS. Anyone know something better?

  • egorfine 6 days ago

    [deleted - I'm an idiot]

    • x187463 6 days ago

      Whisper is speech-to-text. VibeVoice is text-to-speech.

  • weeb 6 days ago

    does anyone know of recent TTS options that let you specify IPA rather than written words? Azure lets you do this, but something local (and better than existing OS voices) would be great for my project.

    • andybug 6 days ago

      I'm using Kokoro via https://github.com/remsky/Kokoro-FastAPI. It has a `generate_audio_from_phonemes()` endpoint that I'm sure maps to the Kokoro library if you want to use it directly.

      My usage is for Chinese, but the phonemes it generated looked very much like IPA.

  • swiftcoder 6 days ago

    Ah, yes, the Furious 7 soundtrack. Definitely something everyone recalls

    • closewith 6 days ago

      The most popular song of the year from one of the most popular movie franchises that had been in the global news due to the death of its star. Probably the most memorable song from a soundtrack of the century so far.

  • tehlike 6 days ago

    The comments in the HTML code are in Chinese, which is very interesting.

  • throwaw12 6 days ago

    Will there be support for SSML to give more control over the conversation?

  • Havoc 6 days ago

    MIT license - very nice!

    • ComputerGuru 6 days ago

      The application of known FOSS licenses to what is effectively a binary-only release is misleading and borderline meaningless.

    • em-bee 6 days ago

      what does that mean in this context? it seems to depend on an LLM. so can i run this completely offline? if i have to sign up and pay for an LLM to make it work, then it's not really more useful than any other non-free system

    • watsonmusic 6 days ago

      Microsoft is cool

  • lagniappe 6 days ago

    Bots should never sing.

  • agos 6 days ago

    seemingly supports only English, Indian and Chinese

    • plingamp 6 days ago

      Indian and Chinese are not languages

  • cush 6 days ago

    I tried using the demo but it just errors out

  • amelius 6 days ago

    I tried some TTS models a while ago, but I noticed that none of them allowed putting markup statements in the text. For example, it would be nice to do something like:

         Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]
    
    etc.

    In fact, I think this kind of thing is absolutely necessary if you want to use this to replace a voice actor.
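
    A minimal sketch of what a preprocessing step for such markup could look like, assuming a hypothetical square-bracket syntax like the example above. The name `parse_markup` and the segment format are invented for illustration; no existing TTS library is claimed to accept this input.

```python
import re
from typing import List, Tuple

# Hypothetical syntax: a performance cue in square brackets, e.g. "[giggles]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def parse_markup(text: str) -> List[Tuple[str, str]]:
    """Split text into ordered ("text", ...) and ("tag", ...) segments."""
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in TAG_PATTERN.finditer(text):
        # Plain text between the previous tag (or the start) and this tag.
        plain = text[position:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1).strip()))
        position = match.end()
    # Whatever follows the last tag.
    trailing = text[position:].strip()
    if trailing:
        segments.append(("text", trailing))
    return segments


example = "Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]"
for kind, value in parse_markup(example):
    print(f"{kind:>4}: {value}")
```

    A model would still have to be trained to act on the tags; the sketch only shows that separating cues from spoken text is the easy part.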

  • sciencesama 6 days ago

    Need this for mac

    • double_one 6 days ago

      I tried it on my MacBook Pro — works great!

  • watsonmusic 6 days ago

    one of the best models built by Microsoft

  • enigma101 5 days ago

    only microsoft could come up with such a name rofl

    • defrost 5 days ago

      Lippy got vetoed.