Finetune Guide

#3
by rastegar - opened

Hi there
Is there any finetune guide or example code with example dataset?

Since this is based on StyleTTS, it is quite straightforward to finetune. I assume you want to do something like expand language coverage. You happen to be talking to the right guy: I fully reverse-engineered Kokoro, which is also based on StyleTTS, and expanded it to speak 72 languages. The only catch is that you would need to perform additional training runs with this model to get more phonemes into it, so that the vocoder can actually formulate the phonemes you want it to -- then you can do things like cross-phoneme mapping and post-hoc tuning and training. Let me know if you have any questions and I can guide you.

@comey113 wow, this is the first time I’ve heard something like this. How were you able to reverse engineer Kokoro and train it in other languages? That’s a big achievement, could you share more details? I have around four to five hundred thousand hours of data and I want to try training experiments with smaller models.

Kitten ML org

@rastegar we will likely release the ability to get custom models with custom voices after our next launch. Right now the highest priority is to launch the next model next month. We don't have the bandwidth to release training code right now, unfortunately :( maybe in May or so this will change

@cmoney113 would you know how to generate new style vectors that can be added to voices.npz? I was trying to use the StyleTTS2 encoders but the results are terrible.

@androiddrew Yes. I have built out an entire infrastructure for exactly this on a StyleTTS2 architecture. Can you DM me on Discord (cmoney112) or shoot me an email at [email protected], and I'd be happy to help.
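For anyone else poking at this in the meantime, here is a minimal sketch of working with a voices.npz-style file. It assumes the archive simply maps voice names to fixed-size float32 style vectors -- the names, the 4-dim vectors in the example, and the blending helper are all illustrative assumptions, not the actual KittenTTS or Kokoro format.

```python
import numpy as np

def blend(npz_path, a, b, weight=0.5):
    """Linearly interpolate between two existing style vectors.

    A cheap way to get a "new" voice without an encoder: blend two
    voices the model already knows. `weight` of 0.5 averages them.
    """
    voices = np.load(npz_path)
    return weight * voices[a] + (1 - weight) * voices[b]

def add_voice(npz_path, name, style_vector, out_path):
    """Write a copy of the archive with one extra named style vector."""
    voices = dict(np.load(npz_path))  # load all existing entries
    voices[name] = style_vector.astype(np.float32)
    np.savez(out_path, **voices)
```

Blending tends to degrade more gracefully than vectors produced by running the StyleTTS2 style encoder on out-of-distribution audio, since the result stays inside the convex hull of embeddings the decoder was trained on.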

@Anilosan15 Yes, honestly I have never seen anyone do what I have done. And I have gone back and forth about releasing it all in a GitHub repo, or as something like a modified Kokoro model on here, or what to do.
Also, if you have such a significant dataset, we could really use that. Contact me on Discord or via email. This could be exciting, frankly. We could create a generalized phonemizer that could speak every language on earth. This is why the g2p architecture is so underrated. It would be a matter of training the model on a targeted set of phonemes representing all 600 phonemes that exist across the world's languages, performing targeted cross-phoneme mapping, and giving it the ability to do dynamic prosody adjustments per language and per-speaker embedding. I have already done the last two parts. The main thing is training it on a clean dataset so that the model has the 230 phonemes that could be cross-linked to represent all 600. This would be fun and honestly potentially groundbreaking.
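To make the cross-phoneme mapping idea concrete, here is a toy sketch: phonemes outside the model's trained inventory get routed to the acoustically closest phoneme the model does know. The inventory and the fallback table below are illustrative examples only, not the real 230-phoneme set described above.

```python
# Phonemes the (hypothetical) model was actually trained on.
KNOWN = {"a", "e", "i", "o", "u", "s", "t", "k", "n"}

# Fallbacks for unseen phonemes -- illustrative choices only.
FALLBACK = {
    "\u0251": "a",  # ɑ: open back vowel -> open front vowel
    "\u026a": "i",  # ɪ: near-close vowel -> close vowel
    "\u03b8": "s",  # θ: dental fricative -> alveolar fricative
    "q": "k",       # uvular stop -> velar stop
}

def map_phonemes(seq):
    """Map a phoneme sequence onto the model's known inventory."""
    out = []
    for p in seq:
        if p in KNOWN:
            out.append(p)
        elif p in FALLBACK:
            out.append(FALLBACK[p])
        else:
            raise ValueError(f"no mapping for phoneme {p!r}")
    return out
```

In practice the table would be built from articulatory features (place, manner, voicing, vowel height/backness) rather than by hand, and the per-language prosody and speaker-embedding adjustments mentioned above would sit on top of this mapping.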
