
Built with Axolotl

The axolotl config used for this run is shown below.

axolotl version: 0.12.2

base_model: Qwen/Qwen3-4B-Instruct-2507 #Qwen/Qwen3-4B-Instruct

# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

strict: false

#resume_from_checkpoint:  /leonardo_work/EUHPC_A04_045/training/ale_outputs/pluto-8B-sft/checkpoint-4040 #
auto_resume_from_checkpoints: true


plugins:
  - axolotl.integrations.liger.LigerPlugin



liger_fused_linear_cross_entropy: true
liger_cross_entropy: false # Explicitly disabled to ensure the Fused version takes over
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true

#chat_template: qwen3
datasets:
  - path: Coloss/Omnia-v5-Nesso
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

#dataset_prepared_path: ./ale_outputs/tokenized-omni-v5-v.2 
dataset_prepared_path: /leonardo_work/EUHPC_A04_045/training/ale_outputs/tokenized-omnia-v5-nesso-4b

      
val_set_size: 0.0005
output_dir: ./ale_outputs/pluto-4B-sft

#do_bench_eval: true
#bench_dataset: /leonardo_work/EUHPC_A04_045/training/examples/qwen3/eval_mix_train.json

sequence_len: 6000
excess_length_strategy: truncate
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true


gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 3
#max_steps: 50
optimizer: adamw_torch_fused   #adamw_bnb_8bit #adamw_torch_fused
lr_scheduler: cosine
learning_rate: 4e-5

bf16: auto #auto
fp16: false

tf32: true

wandb_mode: "offline"
wandb_project: pluto-4b
wandb_entity: mii-llm
wandb_name: pluto-4b-sft

#gradient_checkpointing: true
#gradient_checkpointing_kwargs:
#  use_reentrant: false

logging_steps: 1

sdp_attention: false
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 15
saves_per_epoch: 15
save_total_limit: 5
weight_decay: 0.0


fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_offload_optimizer: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT  #SHARDED_STATE_DICT #FULL_STATE_DICT
  fsdp_activation_checkpointing: true


special_tokens:

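The datasets entry above expects ShareGPT-style records: each row carries a conversations list whose turns use from/value keys, and message_property_mappings renames those keys to the role/content pair the chat template consumes. A minimal sketch of that normalization follows; the example record is hypothetical, only the field names mirror the config.

# Hypothetical record shaped as described by the config
# (field_messages: conversations, ShareGPT-style "from"/"value" keys per turn).
record = {
    "conversations": [
        {"from": "human", "value": "Hello, how are you?"},
        {"from": "gpt", "value": "I'm doing well, thanks!"},
    ]
}

# message_property_mappings (role: from, content: value) amounts to this renaming,
# after which the tokenizer's chat template can be applied to the messages.
messages = [{"role": turn["from"], "content": turn["value"]} for turn in record["conversations"]]
print(messages)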
ale_outputs/pluto-4B-sft

This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 on the Coloss/Omnia-v5-Nesso dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6709
  • Max memory active: 18.21 GiB
  • Max memory allocated: 17.82 GiB
  • Device memory reserved: 22.48 GiB

Model description

More information needed

Intended uses & limitations

More information needed
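For a quick try-out, a minimal inference sketch using the standard Transformers chat API is given below; the repo id Coloss/Nesso-4B-sft-v0.1 and bfloat16 loading are assumptions, not requirements.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Coloss/Nesso-4B-sft-v0.1"  # assumed repo id for this fine-tune

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bf16 training setting; fp32 also works
    device_map="auto",
)

messages = [{"role": "user", "content": "Briefly explain what fine-tuning is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))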

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 4e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 32
  • total_train_batch_size: 64
  • total_eval_batch_size: 64
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 1557
  • training_steps: 15579
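
The batch-size and warmup figures above follow directly from the axolotl config; a small sanity check (plain arithmetic over the numbers listed):

# Effective batch size: per-device micro batch x gradient accumulation x number of devices.
micro_batch_size = 2
gradient_accumulation_steps = 1
num_devices = 32
print(micro_batch_size * gradient_accumulation_steps * num_devices)  # 64

# Warmup steps: warmup_ratio applied to the total number of training steps.
training_steps = 15579
warmup_ratio = 0.1
print(int(training_steps * warmup_ratio))  # 1557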

Training results

Training Loss | Epoch | Step | Validation Loss | Mem Reserved (GiB) | Mem Active (GiB) | Mem Allocated (GiB)
No log | 0 | 0 | 1.5396 | 11.56 | 11.39 | 11.01
0.7725 | 0.0668 | 347 | 0.7732 | 61.1 | 58.38 | 58.38
0.7052 | 0.1336 | 694 | 0.7548 | 61.1 | 58.38 | 58.38
0.7628 | 0.2005 | 1041 | 0.7460 | 61.1 | 58.38 | 58.38
0.816 | 0.2673 | 1388 | 0.7446 | 61.1 | 58.38 | 58.38
0.761 | 0.3341 | 1735 | 0.7461 | 61.1 | 58.38 | 58.38
0.7901 | 0.4009 | 2082 | 0.7395 | 61.1 | 58.38 | 58.38
0.7336 | 0.4677 | 2429 | 0.7346 | 61.1 | 58.38 | 58.38
0.7007 | 0.5346 | 2776 | 0.7295 | 61.1 | 58.38 | 58.38
0.7005 | 0.6014 | 3123 | 0.7252 | 61.1 | 58.38 | 58.38
0.7077 | 0.6682 | 3470 | 0.7193 | 61.1 | 58.38 | 58.38
0.7056 | 0.7350 | 3817 | 0.7143 | 61.1 | 58.38 | 58.38
0.7165 | 0.8018 | 4164 | 0.7108 | 61.1 | 58.38 | 58.38
0.6852 | 0.8687 | 4511 | 0.7075 | 61.1 | 58.38 | 58.38
0.6852 | 0.8687 | 4511 | 0.7075 | 18.01 | 17.82 | 18.36
0.75 | 0.9355 | 4858 | 0.7038 | 18.02 | 17.82 | 22.48
0.6958 | 1.0023 | 5205 | 0.7023 | 18.02 | 17.82 | 22.48
0.6855 | 1.0691 | 5552 | 0.6953 | 18.21 | 17.82 | 22.48
0.7263 | 1.1360 | 5899 | 0.6929 | 18.21 | 17.82 | 22.48
0.613 | 1.2028 | 6246 | 0.6917 | 18.21 | 17.82 | 22.48
0.6071 | 1.2696 | 6593 | 0.6930 | 18.21 | 17.82 | 22.48
0.6121 | 1.3364 | 6940 | 0.6947 | 18.21 | 17.82 | 22.48
0.6586 | 1.4032 | 7287 | 0.6935 | 18.21 | 17.82 | 22.48
0.578 | 1.4701 | 7634 | 0.6895 | 18.21 | 17.82 | 22.48
0.5976 | 1.5369 | 7981 | 0.6882 | 18.21 | 17.82 | 22.48
0.5904 | 1.6037 | 8328 | 0.6861 | 18.21 | 17.82 | 22.48
0.5766 | 1.6705 | 8675 | 0.6833 | 18.21 | 17.82 | 22.48
0.5685 | 1.7373 | 9022 | 0.6807 | 18.21 | 17.82 | 22.48
0.6106 | 1.8042 | 9369 | 0.6771 | 18.21 | 17.82 | 22.48
0.565 | 1.8710 | 9716 | 0.6749 | 18.21 | 17.82 | 22.48
0.5615 | 1.9378 | 10063 | 0.6731 | 18.21 | 17.82 | 22.48
0.5786 | 2.0046 | 10410 | 0.6695 | 18.21 | 17.82 | 22.48
0.5916 | 2.0714 | 10757 | 0.6705 | 18.21 | 17.82 | 22.48
0.5055 | 2.1383 | 11104 | 0.6705 | 18.21 | 17.82 | 22.48
0.4924 | 2.2051 | 11451 | 0.6762 | 18.21 | 17.82 | 22.48
0.4933 | 2.2719 | 11798 | 0.6794 | 18.21 | 17.82 | 22.48
0.5539 | 2.3387 | 12145 | 0.6809 | 18.21 | 17.82 | 22.48
0.5226 | 2.4055 | 12492 | 0.6805 | 18.21 | 17.82 | 22.48
0.4963 | 2.4724 | 12839 | 0.6780 | 18.21 | 17.82 | 22.48
0.4958 | 2.5392 | 13186 | 0.6782 | 18.21 | 17.82 | 22.48
0.547 | 2.6060 | 13533 | 0.6770 | 18.21 | 17.82 | 22.48
0.5395 | 2.6728 | 13880 | 0.6757 | 18.21 | 17.82 | 22.48
0.5267 | 2.7396 | 14227 | 0.6743 | 18.21 | 17.82 | 22.48
0.5182 | 2.8065 | 14574 | 0.6727 | 18.21 | 17.82 | 22.48
0.5336 | 2.8733 | 14921 | 0.6720 | 18.21 | 17.82 | 22.48
0.4768 | 2.9401 | 15268 | 0.6709 | 18.21 | 17.82 | 22.48

Framework versions

  • Transformers 4.55.2
  • Pytorch 2.6.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.21.4