Commit db44d98 (verified) by manoskary · 1 parent: 6525545

Update README.md

Files changed (1)
  1. README.md +2 -36
README.md CHANGED
@@ -25,25 +25,14 @@ and as a backbone for downstream generative tasks.
  - **Checkpoint**: 60000 steps
  - **Hidden size**: 1024
  - **Parameters**: ~330M
- - **Training loss**: unknown
- - **Validation loss**: 1.5264089107513428
+ - **Validation loss**: ~1.5
 
  ## Training Configuration
  - **Objective**: Masked language modeling with span-aware masking
- - **Dataset**: GigaMIDI (REMI tokens → BPE, vocab size 50000)
+ - **Dataset**: GigaMIDI (REMI tokens → BPE, vocab size 40000)
  - **Sequence length**: 1024
  - **Max events per MIDI**: 2048
- - **Per-device batch size**: 24
- - **Gradient accumulation**: 8
- - **Effective batch size**: 192
- - **Learning rate**: 5e-05
- - **Warmup steps**: 0
 
- ## Tokenizer
- - **Base REMI vocab size**: 532
- - **BPE vocab size**: 50000
- - Includes REMI control tokens for bar, position, tempo, velocity, program, and duration
- - Special tokens: `<PAD>`, `<MASK>`, `<SEP>`, `<CLS>`
 
  ## Inference Example
 
@@ -79,29 +68,6 @@ with torch.no_grad():
  print("Predicted token IDs:", predictions.tolist())
  ```
 
- ### Using with pre-tokenized sequences
- ```python
- from transformers import BertForMaskedLM
- from miditok import MusicTokenizer
- import torch
-
- model = BertForMaskedLM.from_pretrained("manoskary/musicbert-large")
- tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
-
- # Note: The tokenizer uses REMI+BPE encoding
- # For direct token manipulation, work with token IDs
- # The vocabulary includes compressed BPE tokens learned from REMI sequences
- ```
-
- ## Training Command (for reproducibility)
- Training was launched with the simplified MusicBERT pretraining script:
- ```bash
- python -m music_llm.train.train_pretrain_musicbert_simple \
- --model_size large \
- --output_dir ./runs/musicbert_large_gigamidi_bpe \
- --dataset_path /opt/datasets/music_llm/gigamidi_remi/final \
- --tokenizer_path /opt/datasets/music_llm/gigamidi_remi/bpe_tokenizer
- ```
 
  ## Limitations and Risks
  - Model is trained purely on symbolic data; it does not produce audio directly.