Nesso-0.4B-Instruct

Nesso-0.4B-Instruct is a bilingual English/Italian Small Language Model (SLM) optimized for conversational and instruction-following use cases. It is post-trained on top of Zagreus-0.4B-ita, a foundation model trained from scratch by the mii-llm community (Made in Italy – Large Language Model) on the Seeweb HPC infrastructure.

Designed for sovereign edge inference, Nesso-0.4B-Instruct delivers competitive instruction-following performance in both Italian and English at a fraction of the compute cost of larger models.

⚠️ This model is currently at the SFT (Supervised Fine-Tuning) stage. DPO (Direct Preference Optimization) training is planned and updated results will be published upon completion.


Model Details

| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab size: 128,256) |
| Precision | BF16 |
| Languages | English, Italian |
| Base model | mii-llm/zagreus-0.4B-ita |
| Post-training framework | Axolotl + FSDP |
| Chat template | ChatML |
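
The ~400M figure can be roughly reproduced from the architecture listed above. A back-of-the-envelope sketch in Python; note that the MLP intermediate size and tied input/output embeddings are assumptions (neither is stated in this card):

```python
# Rough parameter-count estimate from the architecture table.
# intermediate_size and tied embeddings are ASSUMPTIONS, not documented values.
vocab_size = 128_256
hidden = 960
layers = 32
n_heads = 15
n_kv_heads = 5
intermediate = 2560  # assumed MLP width (not stated in the card)

head_dim = hidden // n_heads  # 64
attn = (
    hidden * hidden                          # q_proj
    + 2 * hidden * (n_kv_heads * head_dim)   # k_proj + v_proj (GQA)
    + hidden * hidden                        # o_proj
)
mlp = 3 * hidden * intermediate              # gate, up, down projections
per_layer = attn + mlp + 2 * hidden          # plus two RMSNorm weight vectors
embeddings = vocab_size * hidden             # tied with lm_head (assumed)
total = layers * per_layer + embeddings + hidden  # plus final norm
print(f"~{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate lands in the low-400M range, consistent with the table.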

Training Details

Base Model Pre-training

Nesso-0.4B-Instruct is built on Zagreus-0.4B-ita, which was pre-trained on approximately 1 trillion tokens using the following data mix:

| Dataset | Description |
|---|---|
| FineWeb (350BT sample) | ~350B tokens of English web text |
| FineWeb-2 (ita_Latn) | Italian web text |
| FinePDFs (ita_Latn) | Italian PDF documents |
| StarCoder Data | ~250B tokens of code |

  • Token distribution: ~400B English + ~400B Italian + ~200B code
  • Infrastructure: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC
  • Framework: Nanotron (mii-llm fork)

Post-training (SFT)

Post-training was performed using Axolotl with FSDP across 4 nodes (32× A100 GPUs).

The instruction dataset is a proprietary bilingual (English/Italian) corpus curated by the mii-llm team, with long-term iteration across domains including instruction following, conversational AI, and general knowledge. This dataset is considered a strategic research asset and is not released as open source.

Key hyperparameters:

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning rate | 1e-3 |
| LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
| Epochs | 3 |
| Micro batch size | 1 |
| Gradient accumulation steps | 8 |
| Sequence length | 4096 |
| Max grad norm | 1.0 |
| Precision | BF16 + Flash Attention |
| FSDP strategy | FULL_SHARD |
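
From these hyperparameters, the effective global batch size follows directly, assuming each of the 32 GPUs acts as a data-parallel worker (FSDP FULL_SHARD shards parameters and optimizer state, not the batch dimension, so every rank still processes its own micro batch):

```python
# Effective global batch size for the SFT run, assuming pure data
# parallelism across all 32 GPUs (4 nodes x 8 A100s).
micro_batch_size = 1
grad_accum_steps = 8
num_gpus = 32
seq_len = 4096

global_batch = micro_batch_size * grad_accum_steps * num_gpus
tokens_per_step = global_batch * seq_len
print(global_batch, tokens_per_step)  # 256 sequences, 1,048,576 tokens per optimizer step
```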

Chat Template

This model uses the ChatML format:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Ciao! Come stai?<|im_end|>
<|im_start|>assistant

Special tokens:

  • pad_token: <|im_end|>
  • eos_token: <|im_end|>
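
As a minimal sketch, the format above can be assembled by hand; the exact whitespace of the model's bundled template is an assumption here, and in practice `tokenizer.apply_chat_template(...)` (shown in the Usage section) should be preferred:

```python
# Hand-rolled ChatML formatting; illustrative only -- use
# tokenizer.apply_chat_template() with the real model in practice.
def to_chatml(messages, add_generation_prompt=True):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Ciao! Come stai?"},
])
print(prompt)
```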

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/nesso-0.4B-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "Sei un assistente utile e preciso."},
    {"role": "user", "content": "Spiegami cos'è un modello linguistico di grandi dimensioni."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))

Evaluation

Evaluation Commands

# Italian benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks ifeval-ita --device cuda:0 --batch_size 1

# English benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks hellaswag,arc --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-instruct \
  --tasks ifeval --device cuda:0 --batch_size 1

Results

English Benchmarks

| Model | IFEval EN ↑ | ARC EN ↑ | HellaSwag EN ↑ | MMLU EN ↑ | Avg EN |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.2758 | 0.3430 | 0.4742 | 0.4013 | 0.3736 |
| Nesso-0.4B-instruct | 0.3465 | 0.3003 | 0.4629 | 0.2871 | 0.3492 |
| LiquidAI/LFM2-350M | 0.1595 | 0.2457 | 0.3092 | 0.3445 | 0.2647 |

Italian Benchmarks

| Model | IFEval IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | MMLU IT ↑ | Avg IT |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3058 | 0.2729 | 0.3598 | 0.4025 | 0.3353 |
| Nesso-0.4B-instruct | 0.2962 | 0.2874 | 0.4076 | 0.2875 | 0.3197 |
| LiquidAI/LFM2-350M | 0.1427 | 0.2464 | 0.2994 | 0.3132 | 0.2504 |

Overall

| Model | Avg EN | Avg IT | Overall |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3736 | 0.3353 | 0.3545 |
| Nesso-0.4B-instruct | 0.3492 | 0.3197 | 0.3345 |
| LiquidAI/LFM2-350M | 0.2647 | 0.2504 | 0.2576 |
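
Assuming the Overall column is the unweighted mean of the two per-language averages, the tables are internally consistent; a quick check (the tolerance covers four-decimal rounding):

```python
# Consistency check: Overall should equal mean(Avg EN, Avg IT),
# up to four-decimal rounding in the published tables.
scores = {
    "Qwen/Qwen3-0.6B":     (0.3736, 0.3353, 0.3545),
    "Nesso-0.4B-instruct": (0.3492, 0.3197, 0.3345),
    "LiquidAI/LFM2-350M":  (0.2647, 0.2504, 0.2576),
}
for model, (avg_en, avg_it, overall) in scores.items():
    assert abs((avg_en + avg_it) / 2 - overall) <= 1e-4, model
print("Overall column consistent with per-language averages")
```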

Discussion

Nesso-0.4B-Instruct achieves the highest IFEval English score (0.3465) among all compared models, including the larger Qwen3-0.6B, demonstrating strong instruction-following capability. On Italian HellaSwag it also leads, with 0.4076.

Qwen3-0.6B maintains a clear advantage on MMLU in both languages. MMLU-style data is widely represented in modern pre-training corpora, which may partly explain that gap; we believe our results nonetheless demonstrate a highly competitive SLM for English/Italian edge deployment scenarios.


Related Models

| Model | Description |
|---|---|
| Zagreus-0.4B-ita | Base pre-trained model (this model's foundation) |
| Nesso-0.4B-agentic | Optimized for function calling and agentic tasks |
| Open-Zagreus-0.4B | Fully open-source SFT variant |

Citation

If you use this model in your research, please cite:

@misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}

Acknowledgements

  • Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for infrastructure sponsorship
  • The Hugging Face team for Nanotron, datatrove, FineWeb, and FineWeb-2
  • The mii-llm open-source community

License

Released under the Apache 2.0 license.

Made with ❤️ in Italy by mii-llm
