| --- |
| language: |
| - en |
| tags: |
| - audio |
| - music |
| - codec |
| - neural-audio |
| - audio-compression |
| - transformers |
| pipeline_tag: audio-to-audio |
| library_name: transformers |
| inference: true |
| --- |
| |
|
|
| # XCodec Mini - Neural Audio Codec |
|
|
| ## Model Description |
|
|
| XCodec Mini is a neural audio codec for high-quality music compression and reconstruction. It combines semantic and acoustic encoding so that the compressed representation keeps both high-level musical features and detailed sound information. |
|
|
| ### Key Features |
|
|
| - **Dual Encoding Architecture** |
|   - Semantic encoder for high-level musical features |
|   - Acoustic encoder for detailed sound information |
|   - Multi-scale processing for efficient compression |
|
|
| - **Advanced Compression** |
|   - Multiple codebooks for flexible quality/size tradeoff |
|   - Support for 44.1kHz high-fidelity audio |
|   - Separate processing paths for vocals and instrumentals |
|
|
| - **Technical Specifications** |
|   - Input: Raw audio at 44.1kHz |
|   - Output: Compressed representations and reconstructed audio |
|   - Model Size: [Add total size] |
|   - Compression Ratio: [Add typical ratio] |
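
The quality/size tradeoff of a multi-codebook codec can be estimated with simple arithmetic: each frame emits one index per codebook, and each index costs `log2(codebook_size)` bits. The sketch below compares that token bitrate with raw 16-bit PCM at 44.1 kHz. The frame rate, codebook size, and codebook counts are hypothetical placeholders chosen for illustration; they are not XCodec Mini's published configuration, and the printed ratios say nothing about the real model.

```python
import math


def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


def token_bitrate(frames_per_second: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second of a token stream with one index per codebook per frame."""
    return frames_per_second * num_codebooks * math.log2(codebook_size)


def compression_ratio(raw_bps: float, coded_bps: float) -> float:
    """How many times smaller the token stream is than the raw PCM stream."""
    return raw_bps / coded_bps


# From this model card: 44.1 kHz, 16-bit input (mono in this example).
RAW_BPS = pcm_bitrate(sample_rate_hz=44_100, bit_depth=16, channels=1)

# Hypothetical values for illustration only, NOT XCodec Mini's configuration.
FRAMES_PER_SECOND = 50
CODEBOOK_SIZE = 1024

if __name__ == "__main__":
    for num_codebooks in (2, 4, 8):
        coded = token_bitrate(FRAMES_PER_SECOND, num_codebooks, CODEBOOK_SIZE)
        ratio = compression_ratio(RAW_BPS, coded)
        print(f"{num_codebooks} codebooks: {coded:.0f} bit/s, ratio {ratio:.1f}:1")
```

Using fewer codebooks lowers the bitrate and raises the ratio, at the cost of reconstruction detail.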
|
|
| ## Intended Uses |
|
|
| - High-quality music compression |
| - Audio archival and storage |
| - Music streaming applications |
| - Audio processing pipelines |
|
|
| ## Training Data |
|
|
| The model was trained on a diverse dataset of music, including: |
| - Various genres and styles |
| - Vocal and instrumental tracks |
| - High-quality studio recordings |
|
|
| ## Performance and Limitations |
|
|
| ### Strengths |
| - High-quality audio reconstruction |
| - Efficient compression ratios |
| - Separate handling of vocals and instrumentals |
| - Support for high sample rates |
|
|
| ### Limitations |
| - Computationally intensive for real-time applications |
| - Requires significant GPU memory |
| - Best suited for offline processing |
| - May introduce artifacts in extreme compression settings |
|
|
| ## Technical Specifications |
|
|
| ### Model Architecture |
| 1. **Semantic Encoder** |
|    - Based on HuBERT architecture |
|    - Captures high-level musical features |
|    - Outputs semantic tokens |
|
|
| 2. **Acoustic Encoder** |
|    - Multi-scale convolutional architecture |
|    - Processes detailed sound information |
|    - Generates acoustic tokens |
|
|
| 3. **Dual Decoders** |
|    - Separate decoders for vocals and instrumentals |
|    - Multi-stage reconstruction process |
|    - Quality-focused design |
|
|
| ### Input Requirements |
| - Audio Format: WAV/MP3 |
| - Sample Rate: 44.1kHz |
| - Channels: Mono/Stereo |
| - Bit Depth: 16-bit |
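
A quick pre-flight check can catch files that do not match the requirements above before they reach the model. The following is a minimal sketch using only Python's standard `wave` module, so it covers WAV input only (MP3 files would need a separate decoder). The helper names `read_wav_info`, `find_problems`, and `write_silence` are illustrative and are not part of XCodec Mini.

```python
import tempfile
import wave
from pathlib import Path
from typing import List, NamedTuple

# Values taken from the input requirements listed in this model card.
EXPECTED_SAMPLE_RATE = 44_100
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit
ALLOWED_CHANNELS = (1, 2)  # mono or stereo


class WavInfo(NamedTuple):
    sample_rate: int
    channels: int
    sample_width: int  # bytes per sample


def read_wav_info(path) -> WavInfo:
    """Read the header fields of a WAV file without loading the samples."""
    with wave.open(str(path), "rb") as wav:
        return WavInfo(wav.getframerate(), wav.getnchannels(), wav.getsampwidth())


def find_problems(info: WavInfo) -> List[str]:
    """Return one human-readable message per unmet requirement."""
    problems = []
    if info.sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {info.sample_rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if info.channels not in ALLOWED_CHANNELS:
        problems.append(f"{info.channels} channels, expected mono or stereo")
    if info.sample_width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"bit depth is {info.sample_width * 8}, expected 16")
    return problems


def write_silence(path, sample_rate: int, channels: int = 1, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit WAV file, used here only as demo input."""
    num_frames = int(sample_rate * seconds)
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * num_frames * channels)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.wav"
        write_silence(demo, sample_rate=22_050)  # deliberately the wrong rate
        for problem in find_problems(read_wav_info(demo)):
            print("Problem:", problem)
```

Files that fail the sample-rate check would need resampling to 44.1kHz before use.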
|
|
| ### Output Format |
| - Reconstructed Audio: 44.1kHz WAV |
| - Intermediate Representations: Compressed tokens |
|
|
| ## Usage Guidelines |
|
|
| ### Hardware Requirements |
| - GPU: NVIDIA GPU with 8GB+ VRAM |
| - RAM: 16GB+ recommended |
| - Storage: SSD recommended for faster processing |
|
|
| ### Software Requirements |
| - Python 3.8+ |
| - PyTorch 2.0+ |
| - CUDA 11.0+ |
| - Additional dependencies are listed in the installation guide |
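
The hardware and software requirements above can be checked programmatically before loading any weights. This is a rough sketch, not an official installer check: the thresholds mirror the lists in this card, the function names are illustrative, and the script only inspects the first visible GPU. It degrades gracefully when PyTorch is not installed.

```python
import importlib.util
import re
import sys
from typing import List, Tuple

# Thresholds copied from the requirements listed in this model card.
MIN_PYTHON = (3, 8)
MIN_TORCH = (2, 0)
MIN_CUDA = (11, 0)
# Compared in decimal gigabytes so that cards sold as "8GB" pass the check.
MIN_VRAM_BYTES = 8 * 10**9


def parse_version(text: str) -> Tuple[int, int]:
    """Return (major, minor) from strings such as '2.1.0+cu118' or '11.8'."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:2]]
    while len(numbers) < 2:
        numbers.append(0)
    return numbers[0], numbers[1]


def check_environment() -> List[str]:
    """Return a list of unmet requirements (empty if everything looks fine)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.8 or newer is required")

    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
        return problems

    import torch  # imported lazily so the check works without PyTorch

    if parse_version(torch.__version__) < MIN_TORCH:
        problems.append(f"PyTorch {torch.__version__} is older than 2.0")

    if torch.version.cuda is None:
        problems.append("this PyTorch build has no CUDA support")
    elif parse_version(torch.version.cuda) < MIN_CUDA:
        problems.append(f"CUDA {torch.version.cuda} is older than 11.0")

    if not torch.cuda.is_available():
        problems.append("no CUDA GPU is visible to PyTorch")
    else:
        vram = torch.cuda.get_device_properties(0).total_memory
        if vram < MIN_VRAM_BYTES:
            problems.append(f"GPU 0 has {vram / 10**9:.1f} GB of VRAM, 8 GB or more is recommended")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("Environment meets the listed requirements.")
```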
|
|
| ## Ethical Considerations |
|
|
| - **Copyright**: Users should ensure they have proper rights to process copyrighted material |
| - **Attribution**: Proper attribution should be given when using this model |
| - **Data Privacy**: Consider data privacy implications when processing sensitive audio |
|
|
|
|
| ## Additional Information |
|
|
| ### Model Weights |
| The model requires several checkpoint files: |
| - Semantic Encoder: `semantic_ckpts/hf_1_325000/pytorch_model.bin` |
| - Vocal Decoder: `decoders/decoder_131000.pth` |
| - Instrumental Decoder: `decoders/decoder_151000.pth` |
| - Final Checkpoint: `final_ckpt/ckpt_00360000.pth` |
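
Because the model depends on four separate files, it can help to verify that all of them are present before running inference. The sketch below checks the relative paths listed above under a root directory. The root directory name `xcodec_mini` and the helper `missing_checkpoints` are assumptions for illustration; point the root at wherever the checkpoints were actually downloaded.

```python
from pathlib import Path
from typing import Dict, List

# Relative paths exactly as listed in this model card.
CHECKPOINTS: Dict[str, str] = {
    "Semantic Encoder": "semantic_ckpts/hf_1_325000/pytorch_model.bin",
    "Vocal Decoder": "decoders/decoder_131000.pth",
    "Instrumental Decoder": "decoders/decoder_151000.pth",
    "Final Checkpoint": "final_ckpt/ckpt_00360000.pth",
}


def missing_checkpoints(root) -> List[str]:
    """Return 'name: path' entries for every checkpoint not found under root."""
    root = Path(root)
    return [
        f"{name}: {relative_path}"
        for name, relative_path in CHECKPOINTS.items()
        if not (root / relative_path).is_file()
    ]


if __name__ == "__main__":
    # Hypothetical download location; adjust to the local setup.
    missing = missing_checkpoints("xcodec_mini")
    if missing:
        print("Missing checkpoint files:")
        for entry in missing:
            print("  -", entry)
    else:
        print("All checkpoint files found.")
```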
|
|
| ### Contact |
| For issues and questions, please use the GitHub repository's issue tracker. |