🇳🇵 Chatterbox Nepali TTS

Fine-tuned Nepali text-to-speech model based on Chatterbox-Multilingual-500M. Supports high-quality zero-shot voice cloning from a 5-10 second reference clip.

🚀 Google Colab

Option 1: Quick Inference (T4 / Free Tier)

Run the full pipeline directly in Colab. Set the runtime type to T4 GPU for best results; the code below falls back to CPU if no GPU is available.

Cell 1 — Install Dependencies

!pip install -q git+https://github.com/Imbatmann/chatterbox-nepali.git safetensors librosa

Cell 2 — Load Model

import torch, torchaudio
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from IPython.display import Audio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device)

# Load Nepali fine-tuned weights
ckpt = hf_hub_download("Imbatmann/chatterbox-nepali-tts", "t3_mtl_nepali_final.safetensors")
sd = load_file(ckpt)
cleaned = {k.replace("patched_model.", "").replace("model.", ""): v for k, v in sd.items()}
model.t3.load_state_dict(cleaned, strict=False)
model.t3.to(device).eval()
print("✅ Model loaded!")
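Because `load_state_dict(..., strict=False)` skips keys that do not line up without raising an error, it can be worth checking how many checkpoint keys actually matched after the prefix clean-up. The sketch below is an illustration only and not part of the repository: `compare_keys` is a hypothetical helper and the key names are invented.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (matched, missing, unexpected) key lists, sorted for stable output.

    missing    = keys the model expects but the checkpoint lacks
    unexpected = keys in the checkpoint that the model does not have
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    matched = sorted(model_keys & checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return matched, missing, unexpected


# Invented key names, for illustration only.
model_keys = ["tfmr.layers.0.weight", "text_emb.weight"]
raw_ckpt_keys = [
    "patched_model.tfmr.layers.0.weight",
    "patched_model.text_emb.weight",
]

# Same clean-up expression as in the loading cell above.
cleaned_keys = [
    k.replace("patched_model.", "").replace("model.", "") for k in raw_ckpt_keys
]

matched, missing, unexpected = compare_keys(model_keys, cleaned_keys)
print(f"matched={len(matched)} missing={len(missing)} unexpected={len(unexpected)}")
```

In the notebook, the same comparison could be run on `model.t3.state_dict().keys()` and `cleaned.keys()`. PyTorch's `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys` fields, which gives the same information directly.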

Cell 3 — Download Reference Audio

!wget -q https://huggingface.co/Imbatmann/chatterbox-nepali-tts/resolve/main/ref.wav -O ref.wav
# Or upload your own 5-10s reference clip
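If you upload your own clip, a quick length check can catch files that fall outside the suggested 5-10 second window. This is a minimal sketch using only the standard library; `wav_duration_seconds` and `is_good_reference` are hypothetical helper names, and the `wave` module is assumed to be able to read the file (it handles plain PCM WAV only).

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_good_reference(path, min_s=5.0, max_s=10.0):
    """True if the clip length falls inside the suggested 5-10 s window."""
    return min_s <= wav_duration_seconds(path) <= max_s


# Self-check on a synthetic 6-second silent clip (16-bit mono, 24 kHz).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_ref.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(b"\x00\x00" * 24000 * 6)

print(wav_duration_seconds(demo_path), is_good_reference(demo_path))
```

For a real clip, call `is_good_reference("ref.wav")` before passing the path as `audio_prompt_path`.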

Cell 4 — Generate Nepali Speech

text = "नमस्ते, म नेपाली एआई हुँ। मलाई तपाईंसँग कुरा गर्न पाउँदा खुसी लागेको छ।"

wav = model.generate(
    text=text,
    language_id="ne",
    audio_prompt_path="ref.wav",
    exaggeration=0.5,
    temperature=0.8,
)

torchaudio.save("output.wav", wav, model.sr)
print(f"✅ Saved: output.wav ({wav.shape[1]/model.sr:.1f}s)")
Audio("output.wav")

Cell 5 — Batch Generation (Multiple Texts)

texts = [
    "नेपाल हिमाल, पहाड र तराईले भरिएको सुन्दर देश हो।",
    "काठमाडौं उपत्यकाको ऐतिहासिक र सांस्कृतिक महत्त्व धेरै ठूलो छ।",
    "नेपाली भाषा धेरै मीठो र गम्भीर छ।",
]

for i, txt in enumerate(texts):
    w = model.generate(txt, "ne", audio_prompt_path="ref.wav", exaggeration=0.5, temperature=0.8)
    torchaudio.save(f"batch_{i}.wav", w, model.sr)
    print(f"✅ batch_{i}.wav — {w.shape[1]/model.sr:.1f}s")
    display(Audio(f"batch_{i}.wav"))
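For a longer passage, one way to build the `texts` list is to split the paragraph into sentences first. The sketch below is an assumption rather than something the repository provides: `split_nepali_sentences` is a hypothetical helper that splits after the Devanagari danda (।), a question mark, or an exclamation mark.

```python
import re


def split_nepali_sentences(text):
    """Split text after each danda (।), question mark, or exclamation mark."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[।?!])\s+", text.strip())
    return [p for p in parts if p]


paragraph = "नेपाल सुन्दर देश हो। काठमाडौं राजधानी हो। तपाईंलाई कस्तो छ?"
texts = split_nepali_sentences(paragraph)
for sentence in texts:
    print(sentence)
```

The resulting `texts` list can be fed to the batch loop above unchanged.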

Option 2: Gradio Web UI in Colab

# Cell 1
!git clone https://github.com/Imbatmann/chatterbox-nepali.git
%cd chatterbox-nepali
!pip install -q -e . gradio

# Cell 2
!python gradio_nepali.py --share
# Click the gradio.live link when it appears

Quickstart

# pip install chatterbox-tts
import torch, torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device)

# Load Nepali fine-tuned weights
ckpt = hf_hub_download("Imbatmann/chatterbox-nepali-tts", "t3_mtl_nepali_final.safetensors")
sd = load_file(ckpt)
cleaned = {k.replace("patched_model.", "").replace("model.", ""): v for k, v in sd.items()}
model.t3.load_state_dict(cleaned, strict=False)
model.t3.to(device).eval()

# Generate Nepali speech (expects a 5-10 s reference clip saved as ref.wav)
text = "नमस्ते, म नेपाली एआई हुँ। मलाई तपाईंसँग कुरा गर्न पाउँदा खुसी लागेको छ।"
wav = model.generate(
    text=text,
    language_id="ne",
    audio_prompt_path="ref.wav",
    exaggeration=0.5,
    temperature=0.8,
)
ta.save("output-nepali.wav", wav, model.sr)

# Clone a different voice
wav = model.generate(
    text="काठमाडौं उपत्यकाको ऐतिहासिक र सांस्कृतिक महत्त्व धेरै ठूलो छ।",
    language_id="ne",
    audio_prompt_path="YOUR_VOICE.wav",
    exaggeration=0.5,
    temperature=0.8,
)
ta.save("output-cloned.wav", wav, model.sr)

Using the CLI

# Install
pip install -U chatterbox-tts

# Generate
python -c "
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
import torch, torchaudio

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = ChatterboxMultilingualTTS.from_pretrained(device)

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
sd = load_file(hf_hub_download('Imbatmann/chatterbox-nepali-tts', 't3_mtl_nepali_final.safetensors'))
cleaned = {k.replace('patched_model.','').replace('model.',''):v for k,v in sd.items()}
model.t3.load_state_dict(cleaned, strict=False)
model.t3.to(device).eval()

wav = model.generate('नमस्ते संसार', 'ne', audio_prompt_path='ref.wav')
torchaudio.save('out.wav', wav, model.sr)
"

Gradio Web UI

git clone https://github.com/Imbatmann/chatterbox-nepali.git
cd chatterbox-nepali
pip install -e .
python gradio_nepali.py

Training Your Own

# 1. Prepare dataset (JSONL manifest, one JSON object per line)
# data/train.jsonl: {"audio_path": "wavs/001.wav", "text": "नमस्ते संसार"}

# 2. Run training
bash run_train.sh
# or directly:
python src/chatterbox/train_nepali.py \
  --manifest data/train.jsonl \
  --device cuda \
  --batch_size 16 \
  --accum_steps 2 \
  --epochs 50 \
  --save_every 5 \
  --resume_t3_weights results/t3_nepali_epoch_25.pt
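The manifest format shown in step 1 can be produced and sanity-checked with a few lines of Python. This is a sketch under the assumption that each record needs exactly the two fields from the example, `audio_path` and `text`; the training script may expect more. `write_manifest` and `read_manifest` are hypothetical helpers, not part of the repository.

```python
import json
import os
import tempfile


def write_manifest(rows, path):
    """Write (audio_path, text) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for audio_path, text in rows:
            record = {"audio_path": audio_path, "text": text}
            # ensure_ascii=False keeps Devanagari readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Load the manifest and check that every line has both fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not record.get("audio_path") or not record.get("text"):
                raise ValueError(f"line {line_no}: missing audio_path or text")
            records.append(record)
    return records


# Round-trip check in a temporary directory.
manifest_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_manifest([("wavs/001.wav", "नमस्ते संसार")], manifest_path)
records = read_manifest(manifest_path)
print(records)
```

Running `read_manifest` on `data/train.jsonl` before training is a cheap way to catch malformed lines early.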

Model Details

| Parameter | Value |
|---|---|
| Architecture | Token-to-Token Transformer (LLaMA 520M) |
| Languages | Nepali (ne) |
| Sample Rate | 24,000 Hz |
| Frame Rate | 25 Hz (speech tokens) |
| Vocoder | S3Gen (CFM + HiFiGAN) |
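The sample rate and frame rate above imply a fixed nominal ratio between speech tokens and audio samples. The arithmetic below is a sketch of that relationship only; the helper names are hypothetical, and actual generated lengths may differ slightly from these nominal figures.

```python
SAMPLE_RATE_HZ = 24_000  # output audio sample rate
TOKEN_RATE_HZ = 25       # speech tokens per second

# Each speech token nominally covers this many audio samples.
SAMPLES_PER_TOKEN = SAMPLE_RATE_HZ // TOKEN_RATE_HZ


def tokens_for_duration(seconds):
    """Nominal number of speech tokens for a clip of the given length."""
    return round(seconds * TOKEN_RATE_HZ)


def samples_for_tokens(n_tokens):
    """Nominal number of audio samples produced from n_tokens speech tokens."""
    return n_tokens * SAMPLES_PER_TOKEN


print(SAMPLES_PER_TOKEN)          # 960
print(tokens_for_duration(10))    # 250
print(samples_for_tokens(250))    # 240000
```

So a 10-second utterance corresponds to roughly 250 speech tokens, or 40 ms of audio per token.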

Features

  • Devanagari Support — Full Nepali script handling with NFKD normalization
  • Zero-shot Voice Cloning — Clone any voice from 5-10s reference audio
  • Emotion Control — Exaggeration parameter (0.0-1.0) for pacing/style
  • Gradio UI — Built-in web interface for easy testing
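To make the NFKD normalization mentioned above concrete, the sketch below shows what that Unicode form does to a precomposed Devanagari letter. It uses only Python's standard `unicodedata` module and illustrates the normalization form itself; it is an assumption that the model's text pipeline behaves the same way, and `normalize_nepali` is a hypothetical helper name.

```python
import unicodedata


def normalize_nepali(text):
    """Apply Unicode NFKD normalization to the input text."""
    return unicodedata.normalize("NFKD", text)


# U+0958 (DEVANAGARI LETTER QA) is a precomposed letter; NFKD splits it
# into the base letter KA (U+0915) plus the nukta sign (U+093C).
qa = "\u0958"
decomposed = normalize_nepali(qa)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0915', 'U+093C']
```

Normalizing is idempotent, so text that is already in NFKD form passes through unchanged.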

Sample Output

Listen to the generated Nepali speech samples in the Files and versions tab.

License

MIT License — Original architecture by Resemble AI, Nepali fine-tuning by Imbatmann.
