---
tags:
- music-structure-annotation
- transformer
---
<p align="center">
<img src="https://github.com/ASLP-lab/SongFormer/blob/main/figs/logo.png?raw=true" width="50%" />
</p>
<h1 align="center">SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision</h1>
<div align="center">


[Paper](https://arxiv.org/abs/2510.02797)
[Code](https://github.com/ASLP-lab/SongFormer)
[Demo](https://huggingface.co/spaces/ASLP-lab/SongFormer)
[Model](https://huggingface.co/ASLP-lab/SongFormer)
[SongFormDB](https://huggingface.co/datasets/ASLP-lab/SongFormDB)
[SongFormBench](https://huggingface.co/datasets/ASLP-lab/SongFormBench)
[Discord](https://discord.gg/p5uBryC4Zs)
[ASLP Lab](http://www.npu-aslp.org/)
</div>
<div align="center">
<h3>
Chunbo Hao<sup>1*</sup>, Ruibin Yuan<sup>2,5*</sup>, Jixun Yao<sup>1</sup>, Qixin Deng<sup>3,5</sup>,<br>Xinyi Bai<sup>4,5</sup>, Wei Xue<sup>2</sup>, Lei Xie<sup>1†</sup>
</h3>
<p>
<sup>*</sup>Equal contribution <sup>†</sup>Corresponding author
</p>
<p>
<sup>1</sup>Audio, Speech and Language Processing Group (ASLP@NPU),<br>Northwestern Polytechnical University<br>
<sup>2</sup>Hong Kong University of Science and Technology<br>
<sup>3</sup>Northwestern University<br>
<sup>4</sup>Cornell University<br>
<sup>5</sup>Multimodal Art Projection (M-A-P)
</p>
</div>
----
SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by the large-scale, multilingual dataset SongFormDB and the high-quality benchmark SongFormBench, which together foster fair and reproducible research.

For a more detailed deployment guide, please refer to the [GitHub repository](https://github.com/ASLP-lab/SongFormer/).
## QuickStart
### Prerequisites
Before running the model, follow the instructions in the [GitHub repository](https://github.com/ASLP-lab/SongFormer/) to set up the required **Python environment**.
---
### Input: Audio File Path
You can perform inference by providing the path to an audio file:
```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model from Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Add the local directory to the path and set the environment variable
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Set device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```
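The loaded model can be reused across files, so the download and initialization steps only need to run once per session. The snippet below is a minimal sketch of such a batch run; the directory path and file extension are placeholders, and `songformer` is the model loaded above.

```python
from pathlib import Path

# Hypothetical batch run: analyze every WAV file in a directory with the
# already-loaded `songformer` model.
audio_dir = Path("path/to/audio")  # placeholder directory
results = {path.name: songformer(str(path)) for path in sorted(audio_dir.glob("*.wav"))}

# Inspect the predictions per file
for name, segments in results.items():
    print(name, segments)
```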
---
### Input: Tensor or NumPy Array
Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor:
```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download the model from Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Set up the environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure the device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate a dummy audio input (sampling rate: 24,000 Hz; here, 60 seconds of audio)
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```
> ⚠️ **Note:** The expected sampling rate for input audio is **24,000 Hz**.
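If your audio is stored at a different sampling rate, resample it to 24 kHz before inference. The snippet below is a minimal sketch assuming `librosa` is available (it is not a stated dependency of this repository); any resampling library works equally well.

```python
import librosa

# Load the file as a mono waveform and resample it to the 24 kHz the model expects.
# Using librosa here is an assumption; torchaudio or soundfile + resampy would also work.
audio, sr = librosa.load("path/to/audio/file.wav", sr=24000, mono=True)

# Inference with the `songformer` model loaded as in the snippets above
result = songformer(audio.astype("float32"))
```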
---
### Output Format
The model returns a structured list of segment predictions, with each entry containing timing and label information:
```json
[
  {
    "start": 0.0,    // Start time of the segment (in seconds)
    "end": 15.2,     // End time of the segment (in seconds)
    "label": "verse" // Predicted segment label
  },
  ...
]
```
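Because the result is a plain list of dictionaries, it can be iterated or serialized directly. The following sketch assumes the field names shown above and a `result` obtained from one of the inference snippets.

```python
import json

# Print each predicted segment with its time span and label.
for segment in result:
    print(f"{segment['start']:7.2f}s -> {segment['end']:7.2f}s  {segment['label']}")

# Save the full prediction to disk as JSON.
with open("structure.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
```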
## Notes
- The initialization logic of **MusicFM** has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.
## Citation
If you use **SongFormer** in your research or application, please cite our work:
```bibtex
@misc{hao2025songformer,
    title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
    author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
    year          = {2025},
    eprint        = {2510.02797},
    archivePrefix = {arXiv},
    primaryClass  = {eess.AS},
    url           = {https://arxiv.org/abs/2510.02797}
}
```