---
tags:
- music-structure-annotation
- transformer
---

<p align="center">
  <img src="https://github.com/ASLP-lab/SongFormer/blob/main/figs/logo.png?raw=true" width="50%" />
</p>

<h1 align="center">SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision</h1>

<div align="center">

![Python](https://img.shields.io/badge/Python-3.10-brightgreen)
![License](https://img.shields.io/badge/License-CC%20BY%204.0-lightblue)
[![arXiv Paper](https://img.shields.io/badge/arXiv-2510.02797-blue)](https://arxiv.org/abs/2510.02797)
[![GitHub](https://img.shields.io/badge/GitHub-SongFormer-black)](https://github.com/ASLP-lab/SongFormer)
[![HuggingFace Space](https://img.shields.io/badge/HuggingFace-space-yellow)](https://huggingface.co/spaces/ASLP-lab/SongFormer)
[![HuggingFace Model](https://img.shields.io/badge/HuggingFace-model-blue)](https://huggingface.co/ASLP-lab/SongFormer)
[![Dataset SongFormDB](https://img.shields.io/badge/HF%20Dataset-SongFormDB-green)](https://huggingface.co/datasets/ASLP-lab/SongFormDB)
[![Dataset SongFormBench](https://img.shields.io/badge/HF%20Dataset-SongFormBench-orange)](https://huggingface.co/datasets/ASLP-lab/SongFormBench)
[![Discord](https://img.shields.io/badge/Discord-join%20us-purple?logo=discord&logoColor=white)](https://discord.gg/p5uBryC4Zs)
[![lab](https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey)](http://www.npu-aslp.org/)

</div>

<div align="center">
  <h3>
    Chunbo Hao<sup>1*</sup>, Ruibin Yuan<sup>2,5*</sup>, Jixun Yao<sup>1</sup>, Qixin Deng<sup>3,5</sup>,<br>Xinyi Bai<sup>4,5</sup>, Wei Xue<sup>2</sup>, Lei Xie<sup>1†</sup>
  </h3>
  
  <p>
    <sup>*</sup>Equal contribution &nbsp;&nbsp; <sup>†</sup>Corresponding author
  </p>
  
  <p>
    <sup>1</sup>Audio, Speech and Language Processing Group (ASLP@NPU),<br>Northwestern Polytechnical University<br>
    <sup>2</sup>Hong Kong University of Science and Technology<br>
    <sup>3</sup>Northwestern University<br>
    <sup>4</sup>Cornell University<br>
    <sup>5</sup>Multimodal Art Projection (M-A-P)
  </p>
</div>

----

SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by SongFormDB, a large-scale multilingual dataset, and SongFormBench, a high-quality benchmark, both released to foster fair and reproducible research.

![](https://github.com/ASLP-lab/SongFormer/blob/main/figs/songformer.png?raw=true)

For a more detailed deployment guide, please refer to the [GitHub repository](https://github.com/ASLP-lab/SongFormer/).

## 🚀 QuickStart

### Prerequisites

Before running the model, follow the instructions in the [GitHub repository](https://github.com/ASLP-lab/SongFormer/) to set up the required **Python environment**. At a minimum, the examples below assume that `transformers`, `huggingface_hub`, `torch`, and `numpy` are installed.

---

### Input: Audio File Path

You can perform inference by providing the path to an audio file:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model from Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Add the local directory to path and set environment variable
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Set device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```

---

### Input: Tensor or NumPy Array

Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download model
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Setup environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure device
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate 60 seconds of dummy audio at the expected 24,000 Hz sampling rate
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```
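
As the section title suggests, the same call should also accept a PyTorch tensor; a minimal variant reusing `audio` from the block above:

```python
import torch

# Wrap the NumPy waveform in a tensor; the call mirrors the NumPy example.
audio_tensor = torch.from_numpy(audio)
result = songformer(audio_tensor)
```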

> โš ๏ธ **Note:** The expected sampling rate for input audio is **24,000 Hz**.

---

### Output Format

The model returns a structured list of segment predictions, with each entry containing timing and label information:

```json
[
  {
    "start": 0.0,          // Start time of segment (in seconds)
    "end": 15.2,           // End time of segment (in seconds)
    "label": "verse"       // Predicted segment label
  },
  ...
]
```
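
Because the output is plain Python data, post-processing is straightforward. A small sketch that prints a timeline and saves the predictions, assuming `result` matches the structure above:

```python
import json

# Print a human-readable timeline of the predicted song structure.
for seg in result:
    print(f"{seg['start']:7.2f}s -> {seg['end']:7.2f}s  {seg['label']}")

# Persist the predictions for later analysis.
with open("structure.json", "w") as f:
    json.dump(result, f, indent=2)
```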

## 🔧 Notes

- The initialization logic of **MusicFM** has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.

## 📚 Citation

If you use **SongFormer** in your research or application, please cite our work:

```bibtex
@misc{hao2025songformer,
  title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
  author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
  year          = {2025},
  eprint        = {2510.02797},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2510.02797}
}
```