Nesso-0.4B-Instruct

Nesso-0.4B-Instruct is a bilingual English/Italian Small Language Model (SLM) optimized for conversational and instruction-following use cases. It is post-trained on top of Zagreus-0.4B-ita, a foundation model trained from scratch by the mii-llm community (Made in Italy – Large Language Model) on the Seeweb HPC infrastructure.

Designed for sovereign edge inference, Nesso-0.4B-Instruct delivers competitive instruction-following performance in both Italian and English at a fraction of the compute cost of larger models.

⚠️ This model is currently at the SFT (Supervised Fine-Tuning) stage. DPO (Direct Preference Optimization) training is planned and updated results will be published upon completion.


Model Details

| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab size: 128,256) |
| Precision | BF16 |
| Languages | English, Italian |
| Base model | mii-llm/zagreus-0.4B-ita |
| Post-training framework | Axolotl + FSDP |
| Chat template | ChatML |
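
The ~400M figure can be roughly reproduced from the architecture listed above. A back-of-the-envelope sketch in Python; note that the MLP intermediate size and tied input/output embeddings are assumptions (neither is stated in this card):

```python
# Rough parameter-count estimate from the architecture table.
# intermediate_size and tied embeddings are ASSUMPTIONS, not documented values.
vocab_size = 128_256
hidden = 960
layers = 32
n_heads = 15
n_kv_heads = 5
intermediate = 2560  # assumed MLP width (not stated in the card)

head_dim = hidden // n_heads  # 64
attn = (
    hidden * hidden                          # q_proj
    + 2 * hidden * (n_kv_heads * head_dim)   # k_proj + v_proj (GQA)
    + hidden * hidden                        # o_proj
)
mlp = 3 * hidden * intermediate              # gate, up, down projections
per_layer = attn + mlp + 2 * hidden          # plus two RMSNorm weight vectors
embeddings = vocab_size * hidden             # tied with lm_head (assumed)
total = layers * per_layer + embeddings + hidden  # plus final norm
print(f"~{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate lands in the low-400M range, consistent with the table.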

Training Details

Base Model Pre-training

Nesso-0.4B-Instruct is built on Zagreus-0.4B-ita, which was pre-trained on approximately 1 trillion tokens using the following data mix:

| Dataset | Description |
|---|---|
| FineWeb (350BT sample) | ~350B tokens of English web text |
| FineWeb-2 (ita_Latn) | Italian web text |
| FinePDFs (ita_Latn) | Italian PDF documents |
| StarCoder Data | ~250B tokens of code |

  • Token distribution: ~400B English + ~400B Italian + ~200B code
  • Infrastructure: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC
  • Framework: Nanotron (mii-llm fork)

Post-training (SFT)

Post-training was performed using Axolotl with FSDP across 4 nodes (32× A100 GPUs).

The instruction dataset is a proprietary bilingual (English/Italian) corpus curated by the mii-llm team, with long-term iteration across domains including instruction following, conversational AI, and general knowledge. This dataset is considered a strategic research asset and is not released as open source.

Key hyperparameters:

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning rate | 1e-3 |
| LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
| Epochs | 3 |
| Micro batch size | 1 |
| Gradient accumulation steps | 8 |
| Sequence length | 4096 |
| Max grad norm | 1.0 |
| Precision | BF16 + Flash Attention |
| FSDP strategy | FULL_SHARD |
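
From these hyperparameters, the effective global batch size follows directly, assuming each of the 32 GPUs acts as a data-parallel worker (FSDP FULL_SHARD shards parameters and optimizer state, not the batch dimension, so every rank still processes its own micro batch):

```python
# Effective global batch size for the SFT run, assuming pure data
# parallelism across all 32 GPUs (4 nodes x 8 A100s).
micro_batch_size = 1
grad_accum_steps = 8
num_gpus = 32
seq_len = 4096

global_batch = micro_batch_size * grad_accum_steps * num_gpus
tokens_per_step = global_batch * seq_len
print(global_batch, tokens_per_step)  # 256 sequences, 1,048,576 tokens per optimizer step
```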

Chat Template

This model uses the ChatML format:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Ciao! Come stai?<|im_end|>
<|im_start|>assistant

Special tokens:

  • pad_token: <|im_end|>
  • eos_token: <|im_end|>
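
As a minimal sketch, the format above can be assembled by hand; the exact whitespace of the model's bundled template is an assumption here, and in practice `tokenizer.apply_chat_template(...)` (shown in the Usage section) should be preferred:

```python
# Hand-rolled ChatML formatting; illustrative only -- use
# tokenizer.apply_chat_template() with the real model in practice.
def to_chatml(messages, add_generation_prompt=True):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Ciao! Come stai?"},
])
print(prompt)
```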

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/nesso-0.4B-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "Sei un assistente utile e preciso."},
    {"role": "user", "content": "Spiegami cos'è un modello linguistico di grandi dimensioni."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))

Evaluation

Evaluation Commands

# Italian benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks ifeval-ita --device cuda:0 --batch_size 1

# English benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks hellaswag,arc --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks ifeval --device cuda:0 --batch_size 1

Results

English Benchmarks

| Model | IFEval EN ↑ | ARC EN ↑ | HellaSwag EN ↑ | MMLU EN ↑ | Avg EN |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.2758 | 0.3430 | 0.4742 | 0.4013 | 0.3736 |
| Nesso-0.4B-instruct | 0.3465 | 0.3003 | 0.4629 | 0.2871 | 0.3492 |
| LiquidAI/LFM2-350M | 0.1595 | 0.2457 | 0.3092 | 0.3445 | 0.2647 |

Italian Benchmarks

| Model | IFEval IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | MMLU IT ↑ | Avg IT |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3058 | 0.2729 | 0.3598 | 0.4025 | 0.3353 |
| Nesso-0.4B-instruct | 0.2962 | 0.2874 | 0.4076 | 0.2875 | 0.3197 |
| LiquidAI/LFM2-350M | 0.1427 | 0.2464 | 0.2994 | 0.3132 | 0.2504 |

Overall

| Model | Avg EN | Avg IT | Overall |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3736 | 0.3353 | 0.3545 |
| Nesso-0.4B-instruct | 0.3492 | 0.3197 | 0.3345 |
| LiquidAI/LFM2-350M | 0.2647 | 0.2504 | 0.2576 |
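
Assuming the Overall column is the unweighted mean of the two per-language averages, the tables are internally consistent; a quick check (the tolerance covers four-decimal rounding):

```python
# Consistency check: Overall should equal mean(Avg EN, Avg IT),
# up to four-decimal rounding in the published tables.
scores = {
    "Qwen/Qwen3-0.6B":     (0.3736, 0.3353, 0.3545),
    "Nesso-0.4B-instruct": (0.3492, 0.3197, 0.3345),
    "LiquidAI/LFM2-350M":  (0.2647, 0.2504, 0.2576),
}
for model, (avg_en, avg_it, overall) in scores.items():
    assert abs((avg_en + avg_it) / 2 - overall) <= 1e-4, model
print("Overall column consistent with per-language averages")
```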

Discussion

Nesso-0.4B-Instruct achieves the highest IFEval English score (0.3465) among all compared models, including the larger Qwen3-0.6B, demonstrating strong instruction-following capability. On Italian HellaSwag it also leads, with 0.4076.

Qwen3-0.6B maintains a clear advantage on MMLU in both languages. MMLU-style data is widely represented in modern pre-training corpora, which may partly explain that gap; we believe our results nonetheless demonstrate a highly competitive SLM for English/Italian edge deployment scenarios.


Related Models

| Model | Description |
|---|---|
| Zagreus-0.4B-ita | Base pre-trained model (this model's foundation) |
| Nesso-0.4B-agentic | Optimized for function calling and agentic tasks |
| Open-Zagreus-0.4B | Fully open-source SFT variant |

Citation

If you use this model in your research, please cite:

@misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}

Acknowledgements

  • Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for infrastructure sponsorship
  • The Hugging Face team for Nanotron, datatrove, FineWeb, and FineWeb-2
  • The mii-llm open-source community

License

Released under the Apache 2.0 license.

Made with ❤️ in Italy by mii-llm
