🚀 Xoron-Dev: State-of-the-Art Multimodal MoE


Xoron-Dev is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a Mixture of Experts (MoE) backbone with DeepSeek-style shared expert isolation and integrates SOTA encoders (SigLIP-2 with TiTok + Dual-Stream Attention) and generators (MoE-DiT with Flow Matching) for comprehensive any-to-any capabilities.

🌟 Model Highlights

  • Architecture: Mixture of Experts (8 Experts + 1 Shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
  • Vision Encoder: SigLIP-2 (384px) with TiTok-style 1D tokenization, Dual-Stream Attention, and 2D-RoPE for images; 3D-RoPE + Temporal MoE for video (up to 16 frames).
  • Image Generation: MoE-DiT (Diffusion Transformer with MoE) using Flow Matching, 2D-RoPE, and Symmetric Dual-Stream Attention (SD3/Flux-style).
  • Video Generation: 3D Causal Transformers with Flow Matching, 3D-RoPE for (x,y,t) positions, and Temporal Expert Routing.
  • Audio (Speech-to-Speech): Conformer encoder with RMLA and Raw Waveform Tokenizer for ASR; Direct waveform decoder (no vocoder needed!) with MAS for TTS; Zero-Shot Speaker Cloning with In-Context Audio Prompting. Talk to it, and it talks back!
  • Agentic: Trained for tool calling, file operations, and code execution with uncertainty estimation.
  • Context: Efficient 128K context using Ring Attention (4096-token chunks); a single-device sketch of the chunking idea follows this list.
  • Fine-tuning: LoRA variants including rsLoRA, DoRA, and LoRA+ with configurable learning rate ratio.
  • Multimodal Fusion: Cross-Attention layers (4 layers, 8 heads) for deep multimodal integration.
  • Performance: Flash Attention support with FP16-native numerical stability.
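
Ring Attention keeps the 128K context tractable by processing attention in 4096-token chunks and exchanging key/value blocks between devices. Below is a minimal, single-process sketch of the chunked online-softmax accumulation that underlies this idea; the distributed ring exchange is omitted, and everything except the chunk size is illustrative rather than the repository's actual implementation.

```python
import torch

def chunked_causal_attention(q, k, v, chunk_size=4096):
    """Causal attention computed chunk by chunk with an online softmax,
    so the full (seq x seq) score matrix is never materialized.
    q, k, v: (batch, heads, seq_len, head_dim)."""
    B, H, S, D = q.shape
    out = torch.empty_like(q)
    scale = D ** -0.5
    for qs in range(0, S, chunk_size):
        qe = min(qs + chunk_size, S)
        qi = q[:, :, qs:qe] * scale
        m = q.new_full((B, H, qe - qs, 1), float("-inf"))   # running max
        l = q.new_zeros((B, H, qe - qs, 1))                 # running normalizer
        acc = q.new_zeros((B, H, qe - qs, D))               # running weighted sum
        for ks in range(0, qe, chunk_size):                 # causal: skip KV chunks past the queries
            ke = min(ks + chunk_size, qe)
            scores = qi @ k[:, :, ks:ke].transpose(-1, -2)
            if ke > qs:                                      # diagonal block needs a causal mask
                qpos = torch.arange(qs, qe, device=q.device)[:, None]
                kpos = torch.arange(ks, ke, device=q.device)[None, :]
                scores = scores.masked_fill(kpos > qpos, float("-inf"))
            m_new = torch.maximum(m, scores.amax(-1, keepdim=True))
            alpha = torch.exp(m - m_new)                     # rescale previously accumulated stats
            p = torch.exp(scores - m_new)
            l = l * alpha + p.sum(-1, keepdim=True)
            acc = acc * alpha + p @ v[:, :, ks:ke]
            m = m_new
        out[:, :, qs:qe] = acc / l
    return out
```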

🔬 Architecture Deep Dive

🧠 LLM Backbone (MoE)

| Component | Specification |
|---|---|
| Hidden Size | 1024 |
| Layers | 12 |
| Attention Heads | 16 |
| MoE Experts | 8 + 1 shared (DeepSeek-style isolation) |
| Experts per Token | 2 (top-2 routing) |
| MoE Layer Frequency | Every 2 layers |
| Routing | Aux-Lossless MoE routing |
| Context Length | 128K positions |
| Attention | Ring Attention (4096-token chunks) + Flash Attention |
| Tokenizer | Qwen2.5 (151,643-token vocabulary) |
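
A minimal sketch of the routing pattern described above: top-2 gating over eight routed experts plus one always-active shared expert (DeepSeek-style isolation). The FFN width and the omission of the aux-loss-free router bias are assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    """Top-2 routing over 8 routed experts plus one always-on shared expert.
    d_ff is a placeholder; the aux-loss-free router bias is omitted for brevity."""
    def __init__(self, d_model=1024, d_ff=2816, n_experts=8, top_k=2):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = ffn()                                  # sees every token, never routed
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                    # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))
        gates, idx = self.router(tokens).softmax(-1).topk(self.top_k, dim=-1)
        gates = gates / gates.sum(-1, keepdim=True)          # renormalize the top-2 weights
        out = self.shared(tokens)                            # shared expert processes every token
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                                 # (N, top_k) bool
            rows = hit.any(-1).nonzero(as_tuple=True)[0]
            if rows.numel() == 0:
                continue
            w = (gates * hit)[rows].sum(-1, keepdim=True)    # gate weight assigned to expert e
            out = out.index_add(0, rows, w * expert(tokens[rows]))
        return out.view_as(x)
```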

👁️ Vision Encoder (SigLIP-2 + SOTA Extensions)

| Feature | Description |
|---|---|
| Base Model | google/siglip-so400m-patch14-384 |
| Input Resolution | 384×384 |
| TiTok Tokenization | 1D tokenization with 256 compressed tokens |
| Dual-Stream Attention | 2 symmetric dual-stream layers |
| Position Encoding | 2D-RoPE |
| Output Tokens | 64 tokens per image |
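
A minimal sketch of the TiTok-style 1D tokenization step, under the assumption that it is implemented as learnable latent tokens jointly encoded with the SigLIP patch sequence, after which only the latents are kept. Layer count, width, and head count are illustrative.

```python
import torch
import torch.nn as nn

class TiTok1DTokenizer(nn.Module):
    """Learnable latent tokens are concatenated to the patch sequence, jointly encoded,
    and kept as the compressed 1D representation (256 latents as in the table above)."""
    def __init__(self, d_model=1152, n_latents=256, n_layers=2, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patch_embeds):              # (batch, n_patches, d_model) from SigLIP
        b = patch_embeds.size(0)
        lat = self.latents.unsqueeze(0).expand(b, -1, -1)
        joint = torch.cat([patch_embeds, lat], dim=1)
        joint = self.encoder(joint)
        return joint[:, -lat.size(1):]            # keep only the latent (1D) tokens
```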

🎬 Video Encoder (3D Causal Transformers)

| Feature | Description |
|---|---|
| Max Frames | 16 frames |
| Position Encoding | 3D-RoPE for (x, y, t) coordinates |
| Attention | 3D Causal Self-Attention |
| Expert Routing | Temporal MoE (4 experts, temporally aware) |
| Encoder Layers | 4 layers |
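
The 3D-RoPE entry above can be read as applying ordinary rotary embeddings along three axes. A minimal sketch under that assumption; the split of the head dimension into equal groups is illustrative.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding over the last dimension using 1D positions."""
    half = x.size(-1) // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = pos[..., None].to(x.dtype) * freqs              # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(x, xs, ys, ts):
    """3D-RoPE sketch: split the head dim into three even-sized groups and rotate
    each with the token's x, y, or t coordinate.
    x: (batch, heads, seq, head_dim); xs/ys/ts: (seq,) integer coordinates."""
    d = (x.size(-1) // 6) * 2                                 # even per-axis block size
    gx, gy, gt, rest = x[..., :d], x[..., d:2*d], x[..., 2*d:3*d], x[..., 3*d:]
    return torch.cat([rope_1d(gx, xs), rope_1d(gy, ys), rope_1d(gt, ts), rest], dim=-1)
```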

🎨 Image Generation (MoE-DiT + Flow Matching)

| Feature | Description |
|---|---|
| Architecture | MoE-DiT (Diffusion Transformer with MoE) |
| Scheduler | Flow Matching (not DDPM) |
| Output Resolution | 384×384 |
| Position Encoding | 2D-RoPE |
| Attention | Symmetric Dual-Stream Attention (SD3/Flux-style) |
| MoE Experts | 4 experts in DiT blocks |
| Inference Steps | 50 |
| Guidance Scale | 7.5 (CFG) |
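
A sketch of how Flow Matching sampling with CFG at guidance 7.5 over 50 steps could look. `velocity_model` and its signature are hypothetical stand-ins for the MoE-DiT, and the zero null-conditioning and the time convention (noise at t=0, data at t=1) are assumptions, not the repository's actual API.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_model, cond, shape, steps=50, guidance=7.5, device="cuda"):
    """Euler integration of a learned velocity field from Gaussian noise to an image."""
    x = torch.randn(shape, device=device)                 # start from pure noise
    null_cond = torch.zeros_like(cond)                    # assumed unconditional embedding for CFG
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v_cond = velocity_model(x, t, cond)
        v_uncond = velocity_model(x, t, null_cond)
        v = v_uncond + guidance * (v_cond - v_uncond)     # classifier-free guidance
        x = x + (ts[i + 1] - ts[i]) * v                   # Euler step along the flow
    return x
```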

📹 Video Generation (3D Causal + Flow Matching)

| Feature | Description |
|---|---|
| Output Resolution | 256×256 |
| Output Frames | 16 (default), up to 32 (max capacity) |
| Scheduler | Flow Matching |
| Position Encoding | 3D-RoPE for (x, y, t) |
| Attention | Factorized Spatial-Temporal (3D Causal) |
| Expert Routing | Temporal MoE (4 experts) |
| Guidance Scale | 7.5 (CFG) |
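
A minimal sketch of the factorized spatial-temporal attention listed above: full attention within each frame, then causal attention across frames at each spatial location. Module names and sizes are illustrative, and the 3D-RoPE and Temporal-MoE pieces are omitted.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalAttention(nn.Module):
    """Spatial attention per frame followed by causal temporal attention per position."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = x.shape
        # spatial attention: each frame attends over its own tokens
        xs = x.reshape(b * t, n, d)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # temporal attention: each spatial position attends causally over frames
        xt = xs.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        xt = xt + self.temporal(xt, xt, xt, attn_mask=causal, need_weights=False)[0]
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
```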

🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)

| Feature | Description |
|---|---|
| Sample Rate | 16 kHz |
| Encoder (ASR) | Raw Waveform Tokenizer → Conformer blocks with RMLA |
| Waveform Decoder | BigVGAN-style with Snake activation + MRF (no external vocoder) |
| KV Compression | LoRA-style KV compression (rank 256) |
| Decoder Alignment | MAS (Monotonic Alignment Search) for text-to-audio alignment |
| Voice Cloning | Zero-Shot Speaker Cloning with 256-dim speaker embedding |
| In-Context Prompting | Enabled for voice cloning from reference audio |
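
The rank-256, LoRA-style KV compression can be pictured as a down-projection whose latent is the only tensor stored in the KV cache, with K and V reconstructed from it on the fly. A sketch under that assumption; class and attribute names, head count, and head size are illustrative.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Compress hidden states to a rank-256 latent; reconstruct K/V from the latent."""
    def __init__(self, d_model=1024, kv_rank=256, n_heads=16, head_dim=64):
        super().__init__()
        self.down = nn.Linear(d_model, kv_rank, bias=False)            # compression step
        self.up_k = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.up_v = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.n_heads, self.head_dim = n_heads, head_dim

    def forward(self, h):                        # h: (batch, seq, d_model)
        latent = self.down(h)                    # (batch, seq, 256) -- only this is cached
        k = self.up_k(latent).unflatten(-1, (self.n_heads, self.head_dim)).transpose(1, 2)
        v = self.up_v(latent).unflatten(-1, (self.n_heads, self.head_dim)).transpose(1, 2)
        return latent, k, v                      # (B, S, r), (B, H, S, Dh), (B, H, S, Dh)
```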

🔊 Waveform Decoder (SOTA BigVGAN-style)

Direct audio output without an external vocoder:

| Feature | Description |
|---|---|
| Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
| Snake Activation | x + sin²(αx)/α, preserves audio periodicity |
| Multi-Receptive Field Fusion | Parallel residual stacks (kernels 3, 7, 11) |
| Weight Normalization | Stable training, faster convergence |
| Upsampling | 256× (rates: 8, 8, 2, 2) |
| Streaming | stream_decode() for low-latency real-time output |
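
A sketch of two ingredients named above, the Snake activation and Multi-Receptive Field Fusion, assuming a standard HiFi-GAN/BigVGAN-style formulation; channel counts and stack depth are illustrative.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation x + sin^2(alpha*x)/alpha with a learnable per-channel alpha,
    which preserves periodic structure in the signal."""
    def __init__(self, channels, alpha_init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))

    def forward(self, x):                                  # x: (batch, channels, time)
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

class MRFBlock(nn.Module):
    """Multi-Receptive Field Fusion: parallel residual stacks with kernel sizes 3/7/11,
    averaged and added back to the input."""
    def __init__(self, channels, kernels=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                Snake(channels),
                nn.utils.weight_norm(nn.Conv1d(channels, channels, k, padding=k // 2)),
                Snake(channels),
                nn.utils.weight_norm(nn.Conv1d(channels, channels, k, padding=k // 2)),
            )
            for k in kernels
        )

    def forward(self, x):
        return x + sum(branch(x) for branch in self.branches) / len(self.branches)
```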

🗣️ Speech-to-Speech API

The model provides three main methods for voice interaction:

| Method | Description |
|---|---|
| model.listen(audio) | Encode speech to embeddings (ASR) |
| model.speak(text) | Generate playable audio from text (TTS) |
| model.listen_and_respond(audio) | Full conversation: listen → think → speak back |

```python
# Example: Talk to the model and it talks back
response_audio = model.listen_and_respond(your_audio)  # Returns playable waveform

# Example: Make the model say something
audio = model.speak(tokenizer.encode("Hello, how can I help you?"))

# Save as WAV file
import soundfile as sf
sf.write("response.wav", audio.cpu().numpy(), 16000)

# Streaming for real-time (low latency)
for chunk in model.waveform_decoder.stream_decode(features, chunk_size=10):
    play_audio(chunk)  # Play each chunk as it's generated
```

🎯 Training Pipeline for Speech

The model learns to speak using these datasets and losses:

| Dataset | Type | Purpose |
|---|---|---|
| openslr/librispeech_asr | ASR | Learn to transcribe speech |
| blabble-io/libritts_r | TTS | Learn to generate speech |
| parler-tts/mls_eng_10k | TTS | Multi-speaker variety |
| MikhailT/hifi-tts | TTS | High-fidelity speech |

Training Losses:

  • Mel Loss: MSE between predicted and target mel spectrograms
  • Duration Loss: MSE for MAS-predicted durations
  • Waveform L1 Loss: Time-domain reconstruction
  • Multi-Scale STFT Loss: Frequency-domain quality (512/1024/2048 FFT)
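
As an illustration of the last item, a minimal multi-resolution STFT loss over the 512/1024/2048 FFT sizes, combining spectral-convergence and log-magnitude terms. Hop and window lengths are typical defaults, not necessarily those used in the actual training code.

```python
import torch

def multi_scale_stft_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    """pred, target: (batch, samples) waveforms. Returns the averaged loss over scales."""
    loss = 0.0
    for n_fft in fft_sizes:
        hop, win = n_fft // 4, n_fft
        window = torch.hann_window(win, device=pred.device)
        spec_p = torch.stft(pred, n_fft, hop, win, window, return_complex=True).abs()
        spec_t = torch.stft(target, n_fft, hop, win, window, return_complex=True).abs()
        # spectral convergence: relative Frobenius-norm error of the magnitudes
        sc = torch.norm(spec_t - spec_p, p="fro") / torch.norm(spec_t, p="fro").clamp(min=1e-8)
        # log-magnitude L1 term
        mag = torch.nn.functional.l1_loss(torch.log(spec_p.clamp(min=1e-7)),
                                          torch.log(spec_t.clamp(min=1e-7)))
        loss = loss + sc + mag
    return loss / len(fft_sizes)
```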

📚 Training Data

Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.

🌐 Open Source Datasets

We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:

  • Text & Code: Includes Code-Feedback, HumanEvalPack, OpenOrca, and AgentInstruct for robust coding and reasoning capabilities.
  • Tool Use: Datasets like Function-Calling-ChatML, Synth-APIGen, and Tool-Calls-MultiTurn enable precise tool invocation across single and multi-turn interactions.
  • Vision (Image/Video): Visual understanding is grounded in ScienceQA, Video-MME, and VideoInstruct-100K.
  • Generation: Text-to-Image/Video capabilities are fine-tuned on Stable-Diffusion-Prompts, Sora-Likert-Scoring datasets by Rapidata, and WebVid-10M.
  • Audio: Speech tasks are powered by LibriSpeech, LibriTTS-R, and HiFi-TTS.

🧪 Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets focus on complex behaviors often missing from public corpora:

| Category | Description |
|---|---|
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), provide citations (Synth-Citation), express uncertainty (Synth-Uncertainty), and ground responses (Synth-GroundedResponse). |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution, including shell errors, timeouts, and multi-step debugging workflows, to teach the model how to recover from errors. |
| Git Operations | Simulated version-control tasks including committing, handling diffs, resolving merge conflicts, and understanding repository context. |
| Chain-of-Thought | Explicit Synth-CoT data to encourage internal reasoning before generating final answers. |
| File Operations | Document handling, FIM (Fill-in-the-Middle), and edit operations for precise file manipulation. |