VibeVoice-Realtime-0.5B — ONNX

ONNX export of Microsoft's VibeVoice-Realtime-0.5B text-to-speech model for native C# / .NET / cross-platform inference without Python.

This repository contains the VibeVoice-Realtime-0.5B model exported to ONNX format as three subcomponents. They can be run with ONNX Runtime from C#, Python, C++, Java, JavaScript, or any other language with an ONNX Runtime binding. PyTorch is not required at runtime, and Python is optional.

📦 Source code & examples: github.com/elbruno/ElBruno.VibeVoiceTTS (see src/scenario-08-onnx-native/)

Model Overview

Property	Value
Original model	microsoft/VibeVoice-Realtime-0.5B
Parameters	~0.5B
Format	ONNX (opset 17)
License	MIT
Audio output	24 kHz, mono, 16-bit PCM
First audible latency	~300 ms (hardware dependent)
Voices	6 English presets (Carter, Davis, Emma, Frank, Grace, Mike)
Languages	English (primary), experimental multilingual
GitHub repo	elbruno/ElBruno.VibeVoiceTTS

Architecture — Three ONNX Subcomponents

VibeVoice uses a diffusion-based architecture whose denoising loop is iterative, so it is not exported as a single ONNX graph. Instead, the model is split into three stages, and the calling code drives the loop:

Text → [Tokenize] → text_encoder.onnx → hidden states
                                           ↓
Noise → diffusion_step.onnx (×5 steps) → clean latents
                                           ↓
               acoustic_decoder.onnx → 24kHz WAV audio

File	Description	Approx. Size
`text_encoder.onnx`	LLM backbone (Qwen2.5) — text tokens → hidden states	~400 MB
`diffusion_step.onnx`	Single DDPM denoising step — called iteratively	~200 MB
`acoustic_decoder.onnx`	σ-VAE decoder — latents → 24kHz waveform	~100 MB
`tokenizer.json`	HuggingFace BPE tokenizer vocabulary	~2 MB
`voices/`	6 English voice presets (.npy format)	~5 MB each
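
The iterative middle stage can be driven from host code as a plain loop. The sketch below is illustrative only: `run_denoising_loop` and `step_fn` are hypothetical names, `step_fn` is assumed to wrap one call to diffusion_step.onnx plus whatever scheduler update the real pipeline applies, and the evenly spaced 999 to 0 timestep schedule is an assumption rather than the schedule used by the reference implementation.

```python
import numpy as np

def run_denoising_loop(step_fn, cond, latent_shape, num_steps=5, seed=0):
    """Start from Gaussian noise and apply `step_fn` once per timestep.

    step_fn(latents, timestep, cond) -> latents is a hypothetical callable;
    in practice it would wrap a call to the diffusion_step.onnx session.
    """
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal(latent_shape).astype(np.float32)

    # Assumed schedule: evenly spaced integer timesteps, most noisy first.
    timesteps = np.linspace(999, 0, num_steps).round().astype(np.int64)

    for t in timesteps:
        latents = step_fn(latents, t, cond)
    return latents
```

Keeping the loop outside the graph means the step count can be changed without re-exporting the model.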

Quick Start — Python (onnxruntime)

import onnxruntime as ort
import numpy as np
from huggingface_hub import hf_hub_download

# Download model files
repo_id = "elbruno/VibeVoice-Realtime-0.5B-ONNX"
text_encoder_path = hf_hub_download(repo_id, "text_encoder.onnx")
diffusion_path = hf_hub_download(repo_id, "diffusion_step.onnx")
decoder_path = hf_hub_download(repo_id, "acoustic_decoder.onnx")

# Load ONNX sessions
text_encoder = ort.InferenceSession(text_encoder_path)
diffusion = ort.InferenceSession(diffusion_path)
decoder = ort.InferenceSession(decoder_path)

# Inspect the input names each session expects (see example_inference.py for the full pipeline)
print("✅ All ONNX models loaded successfully!")
print(f"Text encoder inputs: {[i.name for i in text_encoder.get_inputs()]}")
print(f"Diffusion inputs: {[i.name for i in diffusion.get_inputs()]}")
print(f"Decoder inputs: {[i.name for i in decoder.get_inputs()]}")
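
Because this card does not list the tensor names, shapes, or dtypes of each graph, it can help to dump them before wiring the stages together. The helper below is a small sketch: `describe_io` is a hypothetical name, and it relies only on the `get_inputs()` / `get_outputs()` methods of an ONNX Runtime session and the `name`, `shape`, and `type` attributes of the objects they return.

```python
def describe_io(session):
    """Return the name, shape, and element type of every input and output.

    `session` is expected to be an onnxruntime.InferenceSession (or anything
    exposing the same get_inputs()/get_outputs() interface).
    """
    def summarize(nodes):
        return [(n.name, n.shape, n.type) for n in nodes]

    return {
        "inputs": summarize(session.get_inputs()),
        "outputs": summarize(session.get_outputs()),
    }
```

For example, `describe_io(text_encoder)["inputs"]` would list the tensors the text encoder expects; dynamic dimensions are typically reported as strings or `None` rather than integers.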

Quick Start — C# (.NET / ONNX Runtime)

using Microsoft.ML.OnnxRuntime;

// Load ONNX models (download from HuggingFace or local path)
using var textEncoder = new InferenceSession("text_encoder.onnx");
using var diffusion = new InferenceSession("diffusion_step.onnx");
using var decoder = new InferenceSession("acoustic_decoder.onnx");

Console.WriteLine("✅ All ONNX models loaded!");
// See example_csharp.md for the full inference pipeline

NuGet package: Microsoft.ML.OnnxRuntime (1.17+)

For the complete C# inference pipeline with tokenizer, diffusion scheduler, and audio output, see: ElBruno.VibeVoiceTTS/scenario-08-onnx-native

How This Was Created

The ONNX files were exported from the original PyTorch model using torch.onnx.export() with opset version 17. Each subcomponent was traced and exported individually:

Text Encoder — The LLM backbone (Qwen2.5-based) wrapped as a standalone module
Diffusion Step — A single denoising step of the DDPM head, exported with timestep and conditioning inputs
Acoustic Decoder — The σ-VAE decoder that converts latent representations to audio waveforms

Voice presets were converted from PyTorch .pt tensors to NumPy .npy format.
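
Since the presets are stored as .npy files, they can in principle be read with NumPy alone. This is a minimal sketch under assumptions: `load_voice_preset` is a hypothetical helper, the file naming inside voices/ is not specified here, and each preset is assumed to be a single numeric array (if a file holds pickled objects instead, the call below raises rather than unpickling it).

```python
import numpy as np

def load_voice_preset(path):
    """Load one voice preset from a .npy file as a float32 array."""
    # allow_pickle=False refuses pickled object arrays, which is the
    # safer default for files downloaded from the internet.
    preset = np.load(path, allow_pickle=False)
    return np.ascontiguousarray(preset, dtype=np.float32)
```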

Export scripts: ElBruno.VibeVoiceTTS/scenario-08-onnx-native/export

Inference Pipeline

The inference pipeline (implemented in your language of choice) follows these steps:

Tokenize — Encode input text to BPE token IDs using tokenizer.json
Text Encoder — Run text_encoder.onnx to get hidden states
Diffusion Loop — Starting from Gaussian noise, run diffusion_step.onnx for 5 iterations (DDPM denoising), conditioned on hidden states + voice preset
Acoustic Decoder — Run acoustic_decoder.onnx to convert clean latents to 24kHz audio
Save WAV — Write float audio samples as 16-bit PCM WAV
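
The final step can be sketched with the Python standard library. The code assumes the decoder returns float samples nominally in the range [-1, 1]; values outside that range are clipped. `float_to_pcm16` and `save_wav` are hypothetical helper names, not part of the repository.

```python
import wave
import numpy as np

def float_to_pcm16(samples):
    """Convert float samples (assumed in [-1, 1]) to little-endian int16."""
    clipped = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    return (clipped * 32767.0).round().astype("<i2")

def save_wav(path, samples, sample_rate=24000):
    """Write mono 16-bit PCM audio to `path`."""
    pcm = float_to_pcm16(samples).reshape(-1)  # flatten any batch/channel dims
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
```

Scaling by 32767 rather than 32768 keeps the output symmetric around zero and avoids overflow at +1.0.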

Voice Presets

Voice	Gender	Style
Carter	Male	Clear American English
Davis	Male	Warm tone
Emma	Female	Clear articulation
Frank	Male	Deep voice
Grace	Female	Soft, natural
Mike	Male	Conversational

Evaluation Results

Results from the original model (from microsoft/VibeVoice-Realtime-0.5B):

LibriSpeech test-clean

Model	WER (%) ↓	Speaker Similarity ↑
VALL-E 2	2.40	0.643
Voicebox	1.90	0.662
VibeVoice-Realtime-0.5B	2.00	0.695

SEED test-en

Model	WER (%) ↓	Speaker Similarity ↑
MaskGCT	2.62	0.714
CosyVoice2	2.57	0.652
VibeVoice-Realtime-0.5B	2.05	0.633

Note: ONNX conversion may introduce small numerical differences (~1e-4 tolerance). Benchmark results should be verified independently on the ONNX variant.
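
One way to check the numerical drift on your own hardware is to run the same inputs through PyTorch and ONNX Runtime and compare the resulting arrays. The helper below is a sketch of such a check; `compare_outputs` is a hypothetical name, and the default tolerance simply mirrors the ~1e-4 figure quoted above.

```python
import numpy as np

def compare_outputs(reference, candidate, atol=1e-4):
    """Return (max_abs_diff, within_tolerance) for two same-shaped arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(
            f"shape mismatch: {reference.shape} vs {candidate.shape}"
        )
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    return max_abs_diff, max_abs_diff <= atol
```

A per-stage comparison (text encoder, one diffusion step, decoder) tends to be more informative than comparing only the final waveform, because the diffusion loop starts from random noise.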

Responsible Usage

This section is reproduced from the original model card per Microsoft's responsible AI guidelines.

Intended Uses

The VibeVoice-Realtime model is intended for research purposes exploring real-time highly realistic audio generation as detailed in the technical report.

Out-of-Scope Uses

This release is NOT intended or licensed for:

Voice impersonation without explicit, recorded consent — including cloning a real individual's voice for satire, advertising, ransom, social engineering, or authentication bypass
Disinformation or impersonation — creating audio presented as genuine recordings of real people or events
Real-time voice conversion — telephone or video-conference "live deep-fake" applications
Circumventing safeguards — any act to disable watermarking, AI disclaimers, or security controls
Unsupported languages — the model is trained only on English data; outputs in other languages are unsupported
Non-speech audio — music, Foley, or ambient sound generation

Safety Mitigations

Microsoft has implemented the following safeguards:

Removed acoustic tokenizer to prevent users from creating voice embeddings for cloning
Audible AI disclaimer automatically embedded in every synthesized audio file
Imperceptible watermark added to generated audio for provenance verification

Recommendation

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. If you use this model to generate speech, please disclose to the end user that they are listening to AI-generated content.

Limitations

ONNX-specific: Small numerical differences (~1e-4) compared to PyTorch inference
English only: Other languages may produce unpredictable results
No overlapping speech: Does not model or generate overlapping speech
No code/formulas: Cannot read code, mathematical formulas, or uncommon symbols
Single speaker: For multi-speaker, use VibeVoice-1.5B

Technical Details

LLM Backbone: Qwen2.5-0.5B
Acoustic Tokenizer: σ-VAE variant (from LatentLM), decoder with ~340M parameters
Diffusion Head: 4 layers, ~40M parameters, DDPM with DPM-Solver inference
Context Length: Up to 8,192 tokens
Frame Rate: 7.5 Hz (ultra-low for efficiency)
ONNX Opset: 17
Precision: float32
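
The frame rate and the output sample rate together imply how much audio each latent frame covers. The arithmetic below is a sketch that assumes every latent frame decodes to a fixed number of samples; the constant and function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 24_000   # audio output rate from the model overview
FRAME_RATE_HZ = 7.5       # latent frame rate from the technical details

def samples_per_frame():
    """Audio samples covered by one latent frame: 24000 / 7.5 = 3200."""
    return SAMPLE_RATE_HZ / FRAME_RATE_HZ

def frame_duration_ms():
    """Duration of one latent frame in milliseconds (about 133.3 ms)."""
    return 1000.0 / FRAME_RATE_HZ

def frames_for_duration(seconds):
    """Latent frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * FRAME_RATE_HZ)
```

Under that assumption, ten seconds of speech corresponds to 75 latent frames, which is why the frame rate is described as ultra-low.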

Citation

@article{vibevoice2025,
  title={VibeVoice Technical Report},
  author={Microsoft Research},
  journal={arXiv preprint arXiv:2508.19205},
  year={2025},
  url={https://arxiv.org/abs/2508.19205}
}

Contact

For issues with the ONNX conversion, open an issue at ElBruno.VibeVoiceTTS.

For issues with the original VibeVoice model, contact VibeVoice@microsoft.com.


Related Papers

VibeVoice Technical Report (arXiv:2508.19205, published Aug 26, 2025)
Multimodal Latent Language Modeling with Next-Token Diffusion (arXiv:2412.08635, published Dec 11, 2024)
