---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- agkphysics/AudioSet
language:
- en
---
|
|
# USAD: Universal Speech and Audio Representation via Distillation |
|
|
|
|
|
**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers. Trained on 126k hours of mixed-domain data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

[**Read Full Paper**](https://arxiv.org/abs/2506.18843)

---
|
|
|
|
|
## Models

All USAD models are Transformer encoders operating at a **50 Hz frame rate**. The teacher models are **WavLM Base+** and **ATST-Frame**.
|
|
|
|
|
| Model      | Parameters | Hidden Dim | Layers | Checkpoint                                        |
| ---------- | ---------- | ---------- | ------ | ------------------------------------------------- |
| USAD Small | 24M        | 384        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
| USAD Base  | 94M        | 768        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
| USAD Large | 330M       | 1024       | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
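
Because every variant shares the 50 Hz frame rate, the output sequence length depends only on the clip duration, not on the model size. A quick sanity check for 16 kHz input (320 samples per frame; the helper below is illustrative, not part of the released API):

```python
# 16 kHz audio at a 50 Hz frame rate means one output frame per 320 samples
def expected_num_frames(num_samples: int, sample_rate: int = 16_000) -> int:
    return int(num_samples / sample_rate * 50)

print(expected_num_frames(160_000))  # 10 s clip -> roughly 500 frames
```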
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## How To Use

**Installation**

```bash
pip install -U transformers
```
|
|
|
|
|
**Load Model and Extract Features**

```python
import torch
from transformers import AutoModel

# Load the pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16 kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model.
# You can also load waveforms directly with torchaudio.load (see the sketch below).

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]: final model output (batch_size, seq_len, encoder_dim)
# results["mel"]: mel fbank (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
```
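
As noted in the comments above, you can bypass `model.load_audio` and prepare the waveform yourself. A minimal sketch with `torchaudio` (assuming the mono 16 kHz input implied by the loading step above):

```python
import torchaudio

# Load the waveform; torchaudio returns (channels, wav_len) plus the file's sample rate
wav, sr = torchaudio.load("path/to/audio")

# Downmix to mono and resample to 16 kHz if needed
wav = wav.mean(dim=0, keepdim=True)  # (1, wav_len), i.e., a batch of one
if sr != 16_000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16_000)

# Move to the model's device and extract features as before
wav = wav.cuda()
with torch.no_grad():
    results = model(wav)
```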
|
|
|
|
|
See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model. |
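
For downstream tasks, a common pattern is to pool one of the intermediate layers into a clip-level embedding. A minimal sketch (the layer index is an arbitrary illustration, not a recommendation from the paper):

```python
# results["hidden_states"] holds one tensor per encoder layer,
# each of shape (batch_size, seq_len, encoder_dim)
layer_features = results["hidden_states"][8]  # arbitrary intermediate layer
clip_embedding = layer_features.mean(dim=1)   # (batch_size, encoder_dim), mean over time
```

Benchmarks such as SUPERB typically learn a weighted sum over all layers rather than picking a single one.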
|
|
|
|
|
--- |
|
|
|
|
|
## Citation
|
|
|
|
|
```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgement
|
|
|
|
|
Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories. |
|
|
|