Finetune Guide

#3
by rastegar - opened

Hi there
Is there any finetune guide or example code with example dataset?

Since this is based on StyleTTS, it is quite straightforward to finetune. I assume you want to do something like expand language coverage. You happen to be talking to the right guy: I fully reverse-engineered Kokoro, which is also based on StyleTTS, and expanded it to speak 72 languages. The only catch is that you would need to perform additional training runs with this model to get more phonemes into it, so that the vocoder can actually formulate the phonemes you want it to -- then you can do things like cross-phoneme mapping and post-hoc tuning and training. Let me know if you have any questions and I can guide you.

@comey113 wow, this is the first time I’ve heard something like this. How were you able to reverse engineer Kokoro and train it in other languages? That’s a big achievement, could you share more details? I have around four to five hundred thousand hours of data and I want to try training experiments with smaller models.

Kitten ML org

@rastegar we will likely release the ability to get custom models with custom voices after our next launch. Right now the highest priority is to launch the next model next month. We don't have the bandwidth to release training code right now, unfortunately :( maybe in May or so this will change

@cmoney113 would you know how to generate new style vectors that can be added to voices.npz? I was trying to use the StyleTTS2 encoders but the results are terrible.

@androiddrew Yes. I have built out an entire infrastructure for exactly this on a StyleTTS2 architecture. Can you DM me on Discord (cmoney112) or shoot me an email at [email protected], and I'd be happy to help.
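For anyone else poking at this in the meantime, here is a minimal sketch of working with a voices.npz-style file. It assumes the archive simply maps voice names to fixed-size float32 style vectors -- the names, the 4-dim vectors in the example, and the blending helper are all illustrative assumptions, not the actual KittenTTS or Kokoro format.

```python
import numpy as np

def blend(npz_path, a, b, weight=0.5):
    """Linearly interpolate between two existing style vectors.

    A cheap way to get a "new" voice without an encoder: blend two
    voices the model already knows. `weight` of 0.5 averages them.
    """
    voices = np.load(npz_path)
    return weight * voices[a] + (1 - weight) * voices[b]

def add_voice(npz_path, name, style_vector, out_path):
    """Write a copy of the archive with one extra named style vector."""
    voices = dict(np.load(npz_path))  # load all existing entries
    voices[name] = style_vector.astype(np.float32)
    np.savez(out_path, **voices)
```

Blending tends to degrade more gracefully than vectors produced by running the StyleTTS2 style encoder on out-of-distribution audio, since the result stays inside the convex hull of embeddings the decoder was trained on.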

@Anilosan15 Yes, honestly I have never seen anyone do what I have done. And I have gone back and forth about releasing it all in a GitHub repo, or as something like a modified Kokoro model on here, or what to do.
Also, if you have such a significant dataset, we could really use that. Contact me on Discord or via email. This could be exciting, frankly. We could create a generalized phonemizer that could speak every language on earth. This is why the g2p architecture is so underrated. It would be a matter of training the model on a targeted set of phonemes representing all 600 phonemes that exist across the world's languages, performing targeted cross-phoneme mapping, and giving it the ability to do dynamic prosody adjustments per language and per-speaker embedding. I have already done the last two parts. The main thing is training it on a clean dataset so that the model has the 230 phonemes that could be cross-linked to represent all 600. This would be fun and honestly potentially groundbreaking.
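To make the cross-phoneme mapping idea concrete, here is a toy sketch: phonemes outside the model's trained inventory get routed to the acoustically closest phoneme the model does know. The inventory and the fallback table below are illustrative examples only, not the real 230-phoneme set described above.

```python
# Phonemes the (hypothetical) model was actually trained on.
KNOWN = {"a", "e", "i", "o", "u", "s", "t", "k", "n"}

# Fallbacks for unseen phonemes -- illustrative choices only.
FALLBACK = {
    "\u0251": "a",  # ɑ: open back vowel -> open front vowel
    "\u026a": "i",  # ɪ: near-close vowel -> close vowel
    "\u03b8": "s",  # θ: dental fricative -> alveolar fricative
    "q": "k",       # uvular stop -> velar stop
}

def map_phonemes(seq):
    """Map a phoneme sequence onto the model's known inventory."""
    out = []
    for p in seq:
        if p in KNOWN:
            out.append(p)
        elif p in FALLBACK:
            out.append(FALLBACK[p])
        else:
            raise ValueError(f"no mapping for phoneme {p!r}")
    return out
```

In practice the table would be built from articulatory features (place, manner, voicing, vowel height/backness) rather than by hand, and the per-language prosody and speaker-embedding adjustments mentioned above would sit on top of this mapping.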
