| --- |
| library_name: transformers |
| license: cc-by-nc-4.0 |
| tags: |
| - audio-to-audio |
| pipeline_tag: audio-to-audio |
| --- |
| |
| # 🚨🚨 THIS IS A DRAFT, FOR THE LATEST VERSION SEE: [bezzam/xcodec2](https://huggingface.co/bezzam/xcodec2) |
|
|
| # Xcodec2 (Transformers-compatible version) |
|
|
|
|
| The X-Codec2 model was proposed in [Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis](https://huggingface.co/papers/2502.04128). |
|
|
| X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation. |
|
|
| Its architecture is based on X-Codec with several major differences: |
|
|
| - **Unified Semantic-Acoustic Tokenization**: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre). |
| - **Single-Stage Vector Quantization (VQ)**: Unlike the multi-layer residual VQ used in most approaches (e.g., X-Codec, DAC, EnCodec), X-Codec2 uses a single-layer Finite Scalar Quantization (FSQ) for stability and compatibility with causal, autoregressive LLMs. |
| - **Semantic Supervision During Training**: It adds a semantic reconstruction loss so that the discrete tokens preserve linguistic and emotional information, which is crucial for TTS tasks. |
| - **Transformer-Friendly Design**: The 1D token structure of X-Codec2 naturally aligns with the autoregressive modeling in LLMs like LLaMA, improving training efficiency and downstream compatibility. |
|
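To make the single-codebook idea concrete, here is a minimal, standard-library-only sketch of finite scalar quantization: each latent dimension is bounded, rounded to a small number of levels, and the per-dimension digits are packed into one integer token. The function names (`fsq_quantize`, `digits_to_token`, `token_to_digits`) and the level counts `[5, 5, 5]` are hypothetical choices for illustration; they are not taken from the X-Codec2 implementation or its configuration.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent dimension to one of `levels[i]` integer steps."""
    digits = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2        # odd level counts keep the grid symmetric around 0
        bounded = math.tanh(value) * half  # squash the value into (-half, half)
        digits.append(int(round(bounded) + half))  # shift to 0 .. num_levels - 1
    return digits


def digits_to_token(digits, levels):
    """Pack per-dimension digits into a single integer (mixed-radix encoding)."""
    token = 0
    for digit, num_levels in zip(digits, levels):
        token = token * num_levels + digit
    return token


def token_to_digits(token, levels):
    """Invert `digits_to_token`."""
    digits = []
    for num_levels in reversed(levels):
        digits.append(token % num_levels)
        token //= num_levels
    return digits[::-1]


levels = [5, 5, 5]  # illustrative only: 5 * 5 * 5 = 125 possible tokens
digits = fsq_quantize([0.0, 10.0, -10.0], levels)
token = digits_to_token(digits, levels)
print(digits, token)  # [2, 4, 0] 70
```

Because every frame maps to exactly one integer, the resulting token stream is a flat 1D sequence, which is the property the list above credits for compatibility with autoregressive LLMs.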
|
| ## Usage example |
|
|
| Xcodec2 is not yet merged into Transformers, so install it from source using the [corresponding fork](https://github.com/Deep-unlearning/transformers/tree/add-xcodec2): |
| ```bash |
| pip install git+https://github.com/Deep-unlearning/transformers.git@add-xcodec2 |
| ``` |
|
|
| Here is a quick example of how to encode and decode audio with this model: |
|
|
| ```python |
| >>> import torch |
| >>> from datasets import Audio, load_dataset |
| >>> from transformers import AutoFeatureExtractor, Xcodec2Model |
| |
| >>> torch_device = "cuda" if torch.cuda.is_available() else "cpu" |
| |
| >>> # load model and feature extractor |
| >>> model_id = "hf-audio/xcodec2" |
| >>> model = Xcodec2Model.from_pretrained(model_id).to(torch_device).eval() |
| >>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id) |
| |
| >>> # load data |
| >>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") |
| >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate)) |
| >>> audio = dataset[0]["audio"]["array"] |
| |
| >>> # prepare data |
| >>> inputs = feature_extractor(audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(torch_device) |
| |
| >>> # encode and decode |
| >>> audio_codes = model.encode(**inputs).audio_codes |
| >>> audio_values = model.decode(audio_codes).audio_values |
| >>> # or the equivalent with a forward pass |
| >>> model_output = model(**inputs) |
| >>> audio_codes = model_output.audio_codes |
| >>> audio_values = model_output.audio_values |
| ``` |
|
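To listen to a reconstruction, the decoded waveform can be written to a WAV file. The following is a minimal sketch using only the Python standard library. The helper `write_wav` is hypothetical, the 16 kHz rate is an assumption (in practice, read it from `feature_extractor.sampling_rate`), and a synthetic sine tone stands in for the model output so the snippet runs on its own.

```python
import math
import struct
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against out-of-range values
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


sampling_rate = 16000  # assumption; use feature_extractor.sampling_rate with the real model
# Stand-in for decoded audio: one second of a 440 Hz tone.
samples = [0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(sampling_rate)]
write_wav("reconstruction.wav", samples, sampling_rate)
```

With the real model, `samples` would come from the decoded tensor, for example `audio_values[0, 0].cpu().tolist()` if `audio_values` has shape `(batch, channels, samples)`; that shape is an assumption and is not stated in this card.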
|
| This model was contributed by [Steven Zheng](https://huggingface.co/Steveeeeeeen) and [Eric Bezzam](https://huggingface.co/bezzam). |
| The original code can be found [here](https://github.com/zhenye234/X-Codec-2.0), and original checkpoints [here](https://huggingface.co/HKUSTAudio/xcodec2). |
|
|
|
|
|
|
|
|