🚀 Xoron-Dev: State-of-the-Art Multimodal MoE


Xoron-Dev is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a Mixture of Experts (MoE) backbone with DeepSeek-style shared expert isolation and integrates SOTA encoders (SigLIP-2 with TiTok + Dual-Stream Attention) and generators (MoE-DiT with Flow Matching) for comprehensive any-to-any capabilities.

🌟 Model Highlights

  • Architecture: Mixture of Experts (8 Experts + 1 Shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
  • Vision Encoder: SigLIP-2 (384px) with TiTok-style 1D tokenization, Dual-Stream Attention, and 2D-RoPE for images; 3D-RoPE + Temporal MoE for video (up to 16 frames).
  • Image Generation: MoE-DiT (Diffusion Transformer with MoE) using Flow Matching, 2D-RoPE, and Symmetric Dual-Stream Attention (SD3/Flux-style).
  • Video Generation: 3D Causal Transformers with Flow Matching, 3D-RoPE for (x,y,t) positions, and Temporal Expert Routing.
  • Audio (Speech-to-Speech): Conformer encoder with RMLA and Raw Waveform Tokenizer for ASR; Direct waveform decoder (no vocoder needed!) with MAS for TTS; Zero-Shot Speaker Cloning with In-Context Audio Prompting. Talk to it, and it talks back!
  • Agentic: Trained for tool calling, file operations, and code execution with uncertainty estimation.
  • Context: Efficient 128K context using Ring Attention (4096-token chunks); a single-device sketch of the chunking idea follows this list.
  • Fine-tuning: LoRA variants including rsLoRA, DoRA, and LoRA+ with configurable learning rate ratio.
  • Multimodal Fusion: Cross-Attention layers (4 layers, 8 heads) for deep multimodal integration.
  • Performance: Flash Attention support with FP16-native numerical stability.
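
Ring Attention keeps the 128K context tractable by processing attention in 4096-token chunks and exchanging key/value blocks between devices. Below is a minimal, single-process sketch of the chunked online-softmax accumulation that underlies this idea; the distributed ring exchange is omitted, and everything except the chunk size is illustrative rather than the repository's actual implementation.

```python
import torch

def chunked_causal_attention(q, k, v, chunk_size=4096):
    """Causal attention computed chunk by chunk with an online softmax,
    so the full (seq x seq) score matrix is never materialized.
    q, k, v: (batch, heads, seq_len, head_dim)."""
    B, H, S, D = q.shape
    out = torch.empty_like(q)
    scale = D ** -0.5
    for qs in range(0, S, chunk_size):
        qe = min(qs + chunk_size, S)
        qi = q[:, :, qs:qe] * scale
        m = q.new_full((B, H, qe - qs, 1), float("-inf"))   # running max
        l = q.new_zeros((B, H, qe - qs, 1))                 # running normalizer
        acc = q.new_zeros((B, H, qe - qs, D))               # running weighted sum
        for ks in range(0, qe, chunk_size):                 # causal: skip KV chunks past the queries
            ke = min(ks + chunk_size, qe)
            scores = qi @ k[:, :, ks:ke].transpose(-1, -2)
            if ke > qs:                                      # diagonal block needs a causal mask
                qpos = torch.arange(qs, qe, device=q.device)[:, None]
                kpos = torch.arange(ks, ke, device=q.device)[None, :]
                scores = scores.masked_fill(kpos > qpos, float("-inf"))
            m_new = torch.maximum(m, scores.amax(-1, keepdim=True))
            alpha = torch.exp(m - m_new)                     # rescale previously accumulated stats
            p = torch.exp(scores - m_new)
            l = l * alpha + p.sum(-1, keepdim=True)
            acc = acc * alpha + p @ v[:, :, ks:ke]
            m = m_new
        out[:, :, qs:qe] = acc / l
    return out
```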

🔬 Architecture Deep Dive

🧠 LLM Backbone (MoE)

| Component | Specification |
|---|---|
| Hidden Size | 1024 |
| Layers | 12 |
| Attention Heads | 16 |
| MoE Experts | 8 + 1 shared (DeepSeek-style isolation) |
| Experts per Token | 2 (top-2 routing) |
| MoE Layer Frequency | Every 2 layers |
| Routing | Aux-Lossless MoE routing |
| Context Length | 128K positions |
| Attention | Ring Attention (4096-token chunks) + Flash Attention |
| Tokenizer | Qwen2.5 (151,643-token vocabulary) |
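
A minimal sketch of the routing pattern described above: top-2 gating over eight routed experts plus one always-active shared expert (DeepSeek-style isolation). The FFN width and the omission of the aux-loss-free router bias are assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    """Top-2 routing over 8 routed experts plus one always-on shared expert.
    d_ff is a placeholder; the aux-loss-free router bias is omitted for brevity."""
    def __init__(self, d_model=1024, d_ff=2816, n_experts=8, top_k=2):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = ffn()                                  # sees every token, never routed
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                    # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))
        gates, idx = self.router(tokens).softmax(-1).topk(self.top_k, dim=-1)
        gates = gates / gates.sum(-1, keepdim=True)          # renormalize the top-2 weights
        out = self.shared(tokens)                            # shared expert processes every token
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                                 # (N, top_k) bool
            rows = hit.any(-1).nonzero(as_tuple=True)[0]
            if rows.numel() == 0:
                continue
            w = (gates * hit)[rows].sum(-1, keepdim=True)    # gate weight assigned to expert e
            out = out.index_add(0, rows, w * expert(tokens[rows]))
        return out.view_as(x)
```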

👁️ Vision Encoder (SigLIP-2 + SOTA Extensions)

| Feature | Description |
|---|---|
| Base Model | google/siglip-so400m-patch14-384 |
| Input Resolution | 384×384 |
| TiTok Tokenization | 1D tokenization with 256 compressed tokens |
| Dual-Stream Attention | 2 symmetric dual-stream layers |
| Position Encoding | 2D-RoPE |
| Output Tokens | 64 tokens per image |
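
A minimal sketch of the TiTok-style 1D tokenization step, under the assumption that it is implemented as learnable latent tokens jointly encoded with the SigLIP patch sequence, after which only the latents are kept. Layer count, width, and head count are illustrative.

```python
import torch
import torch.nn as nn

class TiTok1DTokenizer(nn.Module):
    """Learnable latent tokens are concatenated to the patch sequence, jointly encoded,
    and kept as the compressed 1D representation (256 latents as in the table above)."""
    def __init__(self, d_model=1152, n_latents=256, n_layers=2, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patch_embeds):              # (batch, n_patches, d_model) from SigLIP
        b = patch_embeds.size(0)
        lat = self.latents.unsqueeze(0).expand(b, -1, -1)
        joint = torch.cat([patch_embeds, lat], dim=1)
        joint = self.encoder(joint)
        return joint[:, -lat.size(1):]            # keep only the latent (1D) tokens
```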

🎬 Video Encoder (3D Causal Transformers)

| Feature | Description |
|---|---|
| Max Frames | 16 frames |
| Position Encoding | 3D-RoPE for (x, y, t) coordinates |
| Attention | 3D Causal Self-Attention |
| Expert Routing | Temporal MoE (4 experts, temporally aware) |
| Encoder Layers | 4 layers |
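
The 3D-RoPE entry above can be read as applying ordinary rotary embeddings along three axes. A minimal sketch under that assumption; the split of the head dimension into equal groups is illustrative.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding over the last dimension using 1D positions."""
    half = x.size(-1) // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = pos[..., None].to(x.dtype) * freqs              # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(x, xs, ys, ts):
    """3D-RoPE sketch: split the head dim into three even-sized groups and rotate
    each with the token's x, y, or t coordinate.
    x: (batch, heads, seq, head_dim); xs/ys/ts: (seq,) integer coordinates."""
    d = (x.size(-1) // 6) * 2                                 # even per-axis block size
    gx, gy, gt, rest = x[..., :d], x[..., d:2*d], x[..., 2*d:3*d], x[..., 3*d:]
    return torch.cat([rope_1d(gx, xs), rope_1d(gy, ys), rope_1d(gt, ts), rest], dim=-1)
```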

🎨 Image Generation (MoE-DiT + Flow Matching)

| Feature | Description |
|---|---|
| Architecture | MoE-DiT (Diffusion Transformer with MoE) |
| Scheduler | Flow Matching (not DDPM) |
| Output Resolution | 384×384 |
| Position Encoding | 2D-RoPE |
| Attention | Symmetric Dual-Stream Attention (SD3/Flux-style) |
| MoE Experts | 4 experts in DiT blocks |
| Inference Steps | 50 |
| Guidance Scale | 7.5 (CFG) |
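
A sketch of how Flow Matching sampling with CFG at guidance 7.5 over 50 steps could look. `velocity_model` and its signature are hypothetical stand-ins for the MoE-DiT, and the zero null-conditioning and the time convention (noise at t=0, data at t=1) are assumptions, not the repository's actual API.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_model, cond, shape, steps=50, guidance=7.5, device="cuda"):
    """Euler integration of a learned velocity field from Gaussian noise to an image."""
    x = torch.randn(shape, device=device)                 # start from pure noise
    null_cond = torch.zeros_like(cond)                    # assumed unconditional embedding for CFG
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v_cond = velocity_model(x, t, cond)
        v_uncond = velocity_model(x, t, null_cond)
        v = v_uncond + guidance * (v_cond - v_uncond)     # classifier-free guidance
        x = x + (ts[i + 1] - ts[i]) * v                   # Euler step along the flow
    return x
```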

📹 Video Generation (3D Causal + Flow Matching)

| Feature | Description |
|---|---|
| Output Resolution | 256×256 |
| Output Frames | 16 (default), up to 32 (max capacity) |
| Scheduler | Flow Matching |
| Position Encoding | 3D-RoPE for (x, y, t) |
| Attention | Factorized Spatial-Temporal (3D Causal) |
| Expert Routing | Temporal MoE (4 experts) |
| Guidance Scale | 7.5 (CFG) |
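
A minimal sketch of the factorized spatial-temporal attention listed above: full attention within each frame, then causal attention across frames at each spatial location. Module names and sizes are illustrative, and the 3D-RoPE and Temporal-MoE pieces are omitted.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalAttention(nn.Module):
    """Spatial attention per frame followed by causal temporal attention per position."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = x.shape
        # spatial attention: each frame attends over its own tokens
        xs = x.reshape(b * t, n, d)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # temporal attention: each spatial position attends causally over frames
        xt = xs.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        xt = xt + self.temporal(xt, xt, xt, attn_mask=causal, need_weights=False)[0]
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
```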

🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)

| Feature | Description |
|---|---|
| Sample Rate | 16 kHz |
| Encoder (ASR) | Raw Waveform Tokenizer → Conformer blocks with RMLA |
| Waveform Decoder | BigVGAN-style with Snake activation + MRF (no external vocoder) |
| KV Compression | LoRA-style KV compression (rank 256) |
| Decoder Alignment | MAS (Monotonic Alignment Search) for text-to-audio alignment |
| Voice Cloning | Zero-Shot Speaker Cloning with 256-dim speaker embedding |
| In-Context Prompting | Enabled for voice cloning from reference audio |
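
The rank-256, LoRA-style KV compression can be pictured as a down-projection whose latent is the only tensor stored in the KV cache, with K and V reconstructed from it on the fly. A sketch under that assumption; class and attribute names, head count, and head size are illustrative.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Compress hidden states to a rank-256 latent; reconstruct K/V from the latent."""
    def __init__(self, d_model=1024, kv_rank=256, n_heads=16, head_dim=64):
        super().__init__()
        self.down = nn.Linear(d_model, kv_rank, bias=False)            # compression step
        self.up_k = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.up_v = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.n_heads, self.head_dim = n_heads, head_dim

    def forward(self, h):                        # h: (batch, seq, d_model)
        latent = self.down(h)                    # (batch, seq, 256) -- only this is cached
        k = self.up_k(latent).unflatten(-1, (self.n_heads, self.head_dim)).transpose(1, 2)
        v = self.up_v(latent).unflatten(-1, (self.n_heads, self.head_dim)).transpose(1, 2)
        return latent, k, v                      # (B, S, r), (B, H, S, Dh), (B, H, S, Dh)
```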

🔊 Waveform Decoder (SOTA BigVGAN-style)

Direct audio output without an external vocoder:

| Feature | Description |
|---|---|
| Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
| Snake Activation | x + sin²(αx)/α, preserves audio periodicity |
| Multi-Receptive Field Fusion | Parallel residual stacks (kernels 3, 7, 11) |
| Weight Normalization | Stable training, faster convergence |
| Upsampling | 256× (rates: 8, 8, 2, 2) |
| Streaming | stream_decode() for low-latency real-time output |
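
A sketch of two ingredients named above, the Snake activation and Multi-Receptive Field Fusion, assuming a standard HiFi-GAN/BigVGAN-style formulation; channel counts and stack depth are illustrative.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation x + sin^2(alpha*x)/alpha with a learnable per-channel alpha,
    which preserves periodic structure in the signal."""
    def __init__(self, channels, alpha_init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))

    def forward(self, x):                                  # x: (batch, channels, time)
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

class MRFBlock(nn.Module):
    """Multi-Receptive Field Fusion: parallel residual stacks with kernel sizes 3/7/11,
    averaged and added back to the input."""
    def __init__(self, channels, kernels=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                Snake(channels),
                nn.utils.weight_norm(nn.Conv1d(channels, channels, k, padding=k // 2)),
                Snake(channels),
                nn.utils.weight_norm(nn.Conv1d(channels, channels, k, padding=k // 2)),
            )
            for k in kernels
        )

    def forward(self, x):
        return x + sum(branch(x) for branch in self.branches) / len(self.branches)
```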

🗣️ Speech-to-Speech API

The model provides three main methods for voice interaction:

| Method | Description |
|---|---|
| model.listen(audio) | Encode speech to embeddings (ASR) |
| model.speak(text) | Generate playable audio from text (TTS) |
| model.listen_and_respond(audio) | Full conversation: listen → think → speak back |

```python
# Example: Talk to the model and it talks back
response_audio = model.listen_and_respond(your_audio)  # Returns playable waveform

# Example: Make the model say something
audio = model.speak(tokenizer.encode("Hello, how can I help you?"))

# Save as WAV file
import soundfile as sf
sf.write("response.wav", audio.cpu().numpy(), 16000)

# Streaming for real-time (low latency)
for chunk in model.waveform_decoder.stream_decode(features, chunk_size=10):
    play_audio(chunk)  # Play each chunk as it's generated
```

🎯 Training Pipeline for Speech

The model learns to speak using these datasets and losses:

| Dataset | Type | Purpose |
|---|---|---|
| openslr/librispeech_asr | ASR | Learn to transcribe speech |
| blabble-io/libritts_r | TTS | Learn to generate speech |
| parler-tts/mls_eng_10k | TTS | Multi-speaker variety |
| MikhailT/hifi-tts | TTS | High-fidelity speech |

Training Losses:

  • Mel Loss: MSE between predicted and target mel spectrograms
  • Duration Loss: MSE for MAS-predicted durations
  • Waveform L1 Loss: Time-domain reconstruction
  • Multi-Scale STFT Loss: Frequency-domain quality (512/1024/2048 FFT)
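
As an illustration of the last item, a minimal multi-resolution STFT loss over the 512/1024/2048 FFT sizes, combining spectral-convergence and log-magnitude terms. Hop and window lengths are typical defaults, not necessarily those used in the actual training code.

```python
import torch

def multi_scale_stft_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    """pred, target: (batch, samples) waveforms. Returns the averaged loss over scales."""
    loss = 0.0
    for n_fft in fft_sizes:
        hop, win = n_fft // 4, n_fft
        window = torch.hann_window(win, device=pred.device)
        spec_p = torch.stft(pred, n_fft, hop, win, window, return_complex=True).abs()
        spec_t = torch.stft(target, n_fft, hop, win, window, return_complex=True).abs()
        # spectral convergence: relative Frobenius-norm error of the magnitudes
        sc = torch.norm(spec_t - spec_p, p="fro") / torch.norm(spec_t, p="fro").clamp(min=1e-8)
        # log-magnitude L1 term
        mag = torch.nn.functional.l1_loss(torch.log(spec_p.clamp(min=1e-7)),
                                          torch.log(spec_t.clamp(min=1e-7)))
        loss = loss + sc + mag
    return loss / len(fft_sizes)
```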

📚 Training Data

Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.

🌐 Open Source Datasets

We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:

  • Text & Code: Includes Code-Feedback, HumanEvalPack, OpenOrca, and AgentInstruct for robust coding and reasoning capabilities.
  • Tool Use: Datasets like Function-Calling-ChatML, Synth-APIGen, and Tool-Calls-MultiTurn enable precise tool invocation across single and multi-turn interactions.
  • Vision (Image/Video): Visual understanding is grounded in ScienceQA, Video-MME, and VideoInstruct-100K.
  • Generation: Text-to-Image/Video capabilities are fine-tuned on Stable-Diffusion-Prompts, Sora-Likert-Scoring datasets by Rapidata, and WebVid-10M.
  • Audio: Speech tasks are powered by LibriSpeech, LibriTTS-R, and HiFi-TTS.

🧪 Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets focus on complex behaviors often missing from public corpora:

| Category | Description |
|---|---|
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), provide citations (Synth-Citation), express uncertainty (Synth-Uncertainty), and ground responses (Synth-GroundedResponse). |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution, including shell errors, timeouts, and multi-step debugging workflows, to teach the model how to recover from errors. |
| Git Operations | Simulated version-control tasks including committing, handling diffs, resolving merge conflicts, and understanding repository context. |
| Chain-of-Thought | Explicit Synth-CoT data to encourage internal reasoning before generating final answers. |
| File Operations | Document handling, FIM (Fill-in-the-Middle), and edit operations for precise file manipulation. |