See axolotl config

axolotl version: 0.12.2

```yaml
base_model: Qwen/Qwen3-4B-Instruct-2507 #Qwen/Qwen3-4B-Instruct
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name
strict: false
#resume_from_checkpoint: /leonardo_work/EUHPC_A04_045/training/ale_outputs/pluto-8B-sft/checkpoint-4040 #
auto_resume_from_checkpoints: true
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_fused_linear_cross_entropy: true
liger_cross_entropy: false # Explicitly disabled to ensure the Fused version takes over
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
#chat_template: qwen3
datasets:
  - path: Coloss/Omnia-v5-Nesso
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
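# Illustrative note (not part of the original config): with the mapping above,
# each row of Coloss/Omnia-v5-Nesso is expected to carry a `conversations` list
# whose turns use `from`/`value` keys; axolotl remaps these to `role`/`content`
# before rendering the model's chat template. A hypothetical row might look like:
#   conversations:
#     - from: human
#       value: "..."
#     - from: gpt
#       value: "..."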
#dataset_prepared_path: ./ale_outputs/tokenized-omni-v5-v.2
dataset_prepared_path: /leonardo_work/EUHPC_A04_045/training/ale_outputs/tokenized-omnia-v5-nesso-4b
val_set_size: 0.0005
output_dir: ./ale_outputs/pluto-4B-sft
#do_bench_eval: true
#bench_dataset: /leonardo_work/EUHPC_A04_045/training/examples/qwen3/eval_mix_train.json
sequence_len: 6000
excess_length_strategy: truncate
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 3
#max_steps: 50
optimizer: adamw_torch_fused #adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 4e-5
bf16: auto
fp16: false
tf32: true
wandb_mode: "offline"
wandb_project: pluto-4b
wandb_entity: mii-llm
wandb_name: pluto-4b-sft
#gradient_checkpointing: true
#gradient_checkpointing_kwargs:
# use_reentrant: false
logging_steps: 1
sdp_attention: false
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 15
saves_per_epoch: 15
save_total_limit: 5
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_offload_optimizer: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT #SHARDED_STATE_DICT
  fsdp_activation_checkpointing: true
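# Descriptive note (not part of the original config): full_shard shards parameters,
# gradients and optimizer state across all devices, while fsdp_offload_params and
# fsdp_offload_optimizer additionally keep them in CPU memory between uses;
# TRANSFORMER_BASED_WRAP wraps each Qwen3DecoderLayer as its own FSDP unit, and
# fsdp_activation_checkpointing recomputes activations in the backward pass to
# trade compute for memory.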
special_tokens:
```

# ale_outputs/pluto-4B-sft
This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 on the Coloss/Omnia-v5-Nesso dataset. It achieves the following results on the evaluation set:

- Loss: 0.6709
- Max memory active: 18.21 GiB
- Max memory allocated: 17.82 GiB
- Device memory reserved: 22.48 GiB
## Model description

More information needed
## Intended uses & limitations

More information needed
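Pending a fuller write-up, here is a minimal inference sketch. It assumes the published checkpoint Coloss/Nesso-4B-sft-v0.1 is used as a standard chat model through transformers; the prompt and generation settings are illustrative, not taken from the training run.

```python
# Minimal inference sketch (assumptions: standard transformers chat usage,
# illustrative prompt and generation settings).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Coloss/Nesso-4B-sft-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Briefly explain what fine-tuning is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```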
## Training and evaluation data

More information needed
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- total_train_batch_size: 64
- total_eval_batch_size: 64
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 1557 (≈ 0.1 × 15579 training steps; see the sketch after this list)
- training_steps: 15579
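The derived numbers in this list follow directly from the axolotl config and the world size; a small sanity-check sketch (the values come from this card, the variable names are just for illustration):

```python
# Sanity check of the derived hyperparameters listed above (values from this card).
micro_batch_size = 2            # per-device train batch size
gradient_accumulation_steps = 1
num_devices = 32                # multi-GPU FSDP run

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
assert total_train_batch_size == 64

training_steps = 15579
warmup_ratio = 0.1              # from the axolotl config
warmup_steps = int(training_steps * warmup_ratio)
assert warmup_steps == 1557     # matches lr_scheduler_warmup_steps
```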
### Training results

| Training Loss | Epoch | Step | Validation Loss | Mem Reserved (GiB) | Mem Active (GiB) | Mem Allocated (GiB) |
|---|---|---|---|---|---|---|
| No log | 0 | 0 | 1.5396 | 11.56 | 11.39 | 11.01 |
| 0.7725 | 0.0668 | 347 | 0.7732 | 61.1 | 58.38 | 58.38 |
| 0.7052 | 0.1336 | 694 | 0.7548 | 61.1 | 58.38 | 58.38 |
| 0.7628 | 0.2005 | 1041 | 0.7460 | 61.1 | 58.38 | 58.38 |
| 0.816 | 0.2673 | 1388 | 0.7446 | 61.1 | 58.38 | 58.38 |
| 0.761 | 0.3341 | 1735 | 0.7461 | 61.1 | 58.38 | 58.38 |
| 0.7901 | 0.4009 | 2082 | 0.7395 | 61.1 | 58.38 | 58.38 |
| 0.7336 | 0.4677 | 2429 | 0.7346 | 61.1 | 58.38 | 58.38 |
| 0.7007 | 0.5346 | 2776 | 0.7295 | 61.1 | 58.38 | 58.38 |
| 0.7005 | 0.6014 | 3123 | 0.7252 | 61.1 | 58.38 | 58.38 |
| 0.7077 | 0.6682 | 3470 | 0.7193 | 61.1 | 58.38 | 58.38 |
| 0.7056 | 0.7350 | 3817 | 0.7143 | 61.1 | 58.38 | 58.38 |
| 0.7165 | 0.8018 | 4164 | 0.7108 | 61.1 | 58.38 | 58.38 |
| 0.6852 | 0.8687 | 4511 | 0.7075 | 61.1 | 58.38 | 58.38 |
| 0.6852 | 0.8687 | 4511 | 0.7075 | 18.01 | 17.82 | 18.36 |
| 0.75 | 0.9355 | 4858 | 0.7038 | 18.02 | 17.82 | 22.48 |
| 0.6958 | 1.0023 | 5205 | 0.7023 | 18.02 | 17.82 | 22.48 |
| 0.6855 | 1.0691 | 5552 | 0.6953 | 18.21 | 17.82 | 22.48 |
| 0.7263 | 1.1360 | 5899 | 0.6929 | 18.21 | 17.82 | 22.48 |
| 0.613 | 1.2028 | 6246 | 0.6917 | 18.21 | 17.82 | 22.48 |
| 0.6071 | 1.2696 | 6593 | 0.6930 | 18.21 | 17.82 | 22.48 |
| 0.6121 | 1.3364 | 6940 | 0.6947 | 18.21 | 17.82 | 22.48 |
| 0.6586 | 1.4032 | 7287 | 0.6935 | 18.21 | 17.82 | 22.48 |
| 0.578 | 1.4701 | 7634 | 0.6895 | 18.21 | 17.82 | 22.48 |
| 0.5976 | 1.5369 | 7981 | 0.6882 | 18.21 | 17.82 | 22.48 |
| 0.5904 | 1.6037 | 8328 | 0.6861 | 18.21 | 17.82 | 22.48 |
| 0.5766 | 1.6705 | 8675 | 0.6833 | 18.21 | 17.82 | 22.48 |
| 0.5685 | 1.7373 | 9022 | 0.6807 | 18.21 | 17.82 | 22.48 |
| 0.6106 | 1.8042 | 9369 | 0.6771 | 18.21 | 17.82 | 22.48 |
| 0.565 | 1.8710 | 9716 | 0.6749 | 18.21 | 17.82 | 22.48 |
| 0.5615 | 1.9378 | 10063 | 0.6731 | 18.21 | 17.82 | 22.48 |
| 0.5786 | 2.0046 | 10410 | 0.6695 | 18.21 | 17.82 | 22.48 |
| 0.5916 | 2.0714 | 10757 | 0.6705 | 18.21 | 17.82 | 22.48 |
| 0.5055 | 2.1383 | 11104 | 0.6705 | 18.21 | 17.82 | 22.48 |
| 0.4924 | 2.2051 | 11451 | 0.6762 | 18.21 | 17.82 | 22.48 |
| 0.4933 | 2.2719 | 11798 | 0.6794 | 18.21 | 17.82 | 22.48 |
| 0.5539 | 2.3387 | 12145 | 0.6809 | 18.21 | 17.82 | 22.48 |
| 0.5226 | 2.4055 | 12492 | 0.6805 | 18.21 | 17.82 | 22.48 |
| 0.4963 | 2.4724 | 12839 | 0.6780 | 18.21 | 17.82 | 22.48 |
| 0.4958 | 2.5392 | 13186 | 0.6782 | 18.21 | 17.82 | 22.48 |
| 0.547 | 2.6060 | 13533 | 0.6770 | 18.21 | 17.82 | 22.48 |
| 0.5395 | 2.6728 | 13880 | 0.6757 | 18.21 | 17.82 | 22.48 |
| 0.5267 | 2.7396 | 14227 | 0.6743 | 18.21 | 17.82 | 22.48 |
| 0.5182 | 2.8065 | 14574 | 0.6727 | 18.21 | 17.82 | 22.48 |
| 0.5336 | 2.8733 | 14921 | 0.6720 | 18.21 | 17.82 | 22.48 |
| 0.4768 | 2.9401 | 15268 | 0.6709 | 18.21 | 17.82 | 22.48 |
### Framework versions

- Transformers 4.55.2
- PyTorch 2.6.0+cu126
- Datasets 4.0.0
- Tokenizers 0.21.4
## Model tree for Coloss/Nesso-4B-sft-v0.1

Base model: Qwen/Qwen3-4B-Instruct-2507