AILO-152M-v2: Tiny LLM with Chat, Reasoning & Web Search ⚡

A 152M-parameter language model that runs on almost anything (laptops, old PCs, even a Raspberry Pi) yet handles instruction-following chat, step-by-step reasoning, and web search for fresh facts.

AILO (Artificial Intelligence Language Operator) is a compact, fast, from-scratch transformer. Version 2 turns the original base model into a real assistant: it answers questions, reasons before answering, and can use live web results to answer questions about things it was never trained on.

```
ollama run Alieno/ailo-152m-v2
```

🧠 Parameters	151.9M
⚡ Speed	up to 384 tok/s (GPU), runs on CPU & edge
📦 Size	97 MB (q4_k_m) – 305 MB (f16)
🌐 Web search	yes (context-following)
💭 Reasoning	yes (`<think>`)
🪶 Min RAM	~300 MB

✨ Why AILO-152M-v2?

Runs anywhere: 97 MB quantized, ~300 MB RAM. Old laptops, mini-PCs, Raspberry Pi, phones.
Fast: among the fastest in its class (see benchmarks). Real-time chat even on modest hardware.
Web-aware: trained for context-following, so it answers from fresh search results instead of stale memory.
Distilled from a bigger model: answers learned from Gemma 3 4B (knowledge distillation) give richer, better-structured replies than its size suggests.
Honest small model: strong at concise factual Q&A and conversation; pairs with tools for exact math.
Open & local: no cloud, full privacy, drop-in for Ollama.

Great for: edge/on-device AI, offline assistants, learning how LLMs work, fast prototyping, low-power servers, privacy-first chatbots.

🚀 Quick start

Ollama (recommended)

```
ollama run Alieno/ailo-152m-v2
>>> What is the capital of Italy?
The capital city of Italy is Rome.
```

Tags: :latest / :q8_0 (best quality, 156 MB) · :q4_k_m (smallest, 97 MB) · :f16 (full precision, 305 MB)
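
As a rough sanity check on the sizes above, the effective bits per weight of each tag can be estimated from the file size and the 151.9M parameter count. This is only a back-of-the-envelope sketch: it assumes "MB" means 10^6 bytes and ignores GGUF metadata overhead.

```python
# Effective bits per weight = file size in bits / parameter count.
# Sizes and parameter count come from this card; decimal megabytes and
# zero metadata overhead are assumptions.
PARAMS = 151.9e6
SIZES_MB = {"q4_k_m": 97, "q8_0": 156, "f16": 305}


def bits_per_weight(size_mb: float, params: float = PARAMS) -> float:
    """Convert a file size in MB into average bits stored per parameter."""
    return size_mb * 1e6 * 8 / params


if __name__ == "__main__":
    for tag, size in SIZES_MB.items():
        print(f"{tag}: {bits_per_weight(size):.2f} bits/weight")
```

Under these assumptions the f16 file works out to about 16 bits per weight, as expected, and q4_k_m to a little over 5.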

API

```
curl http://localhost:11434/api/chat -d '{
  "model": "Alieno/ailo-152m-v2",
  "messages": [{"role": "user", "content": "Explain what gravity is."}]
}'
```
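
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming Ollama is running locally on its default port; the helper names `build_chat_request` and `chat` are illustrative and not part of this project. It sets `"stream": false` so the server returns one JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint


def build_chat_request(question: str, model: str = "Alieno/ailo-152m-v2") -> dict:
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # one JSON object instead of streamed chunks
    }


def chat(question: str, url: str = OLLAMA_URL) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(question)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running, `print(chat("Explain what gravity is."))` prints the reply.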

🏆 Benchmarks

Evaluated through Ollama's /api/chat endpoint on factual QA, reasoning, and coherence, against comparable and larger models:

| Model | Params | Factual | Reasoning | Coherence | Speed (tok/s) |
|---|---|---|---|---|---|
| AILO-152M-v2 | 152M | 7/8 | 1–2/5 | 100% | 384 🥇 |
| SmolLM2 | 135M | 8/8 | 1/5 | 98% | 403 |
| Qwen2.5 | 500M | 8/8 | 3–4/5 | 96% | 213 |
| TinyLlama | 1.1B | 8/8 | 1–2/5 | 97% | 260 |

🥇 Top coherence (100%, virtually no repetition) and among the fastest.
Competitive on factual accuracy with models its size and larger.
Trails only larger instruction-tuned models (Qwen2.5 here) on multi-step reasoning, as expected for a from-scratch model this small.

Measured on an NVIDIA RTX 5060 Ti. The micro-suite is small (8 factual and 5 reasoning questions), so reasoning scores vary from run to run.
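
To reproduce a tok/s figure on your own hardware, one option is to read the timing fields that Ollama includes in a non-streaming response: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below is not the benchmark script used for the table, and the sample values are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response: generated tokens per second."""
    eval_count = response["eval_count"]            # number of generated tokens
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count * 1e9 / eval_duration_ns


# Made-up numbers for illustration only, not a measurement.
sample = {"eval_count": 120, "eval_duration": 500_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```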

🖥️ Hardware & performance

AILO-152M is tiny, so it is not limited to high-end GPUs: it also runs on old and low-power hardware. Approximate generation speed (q4_k_m, ~97 MB):

| Hardware | Type | Est. speed (tok/s) | Notes |
|---|---|---|---|
| RTX 5060 Ti / 4070+ | Modern GPU | 350–450 | ✅ measured: 384 (q8_0) |
| RTX 3060 / 2070 | Mid GPU | ~250–350 | smooth real-time |
| GTX 1660 / 1060 | Older GPU | ~150–220 | still real-time |
| GTX 1050 / MX150 | Old laptop GPU | ~90–140 | very usable |
| Ryzen 7 / Core i7 (recent) | Modern CPU | ~45–80 | no GPU needed |
| Core i5 ~2015 | Old CPU | ~18–30 | usable for chat |
| Raspberry Pi 5 | SBC / edge | ~10–16 | runs offline |
| Raspberry Pi 4 | Low-power SBC | ~5–9 | runs offline |
| Recent smartphone | Mobile | ~15–35 | via llama.cpp/Termux |

All figures are estimates except the measured RTX 5060 Ti result; real numbers vary with quantization, RAM bandwidth, and build flags. The takeaway: AILO runs even where larger models can't load at all.

Minimum requirements: ~300 MB RAM (q4_k_m), any x86-64 / ARM CPU. No GPU required.

💬 Chat format

Trained on this template (the tags are plain GPT-2 BPE sequences, with no vocabulary extension):

```
<|user|>
{question}
<|assistant|>
<think>{optional reasoning}</think>
{answer}<|end|>
```
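
When driving the raw model rather than Ollama, the template has to be rendered and parsed by hand. The helpers below are a minimal sketch of one way to do that; `build_prompt` and `parse_completion` are illustrative names, not functions shipped with the model.

```python
import re


def build_prompt(question: str) -> str:
    """Render a single-turn prompt that ends where the model should start writing."""
    return f"<|user|>\n{question}\n<|assistant|>\n"


def parse_completion(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    text = text.split("<|end|>", 1)[0]  # drop the end tag and anything after it
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", text.strip()  # no reasoning block present
```

Because the tags are ordinary BPE sequences rather than special tokens, stopping at `<|end|>` is done on the decoded string.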

🌐 Web search (fresh facts)

AILO v2 is trained for context-following with override: given search results, it answers from them even when they contradict its training-time knowledge, so it can use up-to-date facts. When no context is given, it falls back to what it learned in training.

A ready pipeline is included (ailo_web.py): DuckDuckGo → instant-answer + semantic re-ranking (MiniLM) with language/relevance filters → short clean context (fits the 512-token window) → AILO answers.

```
python ailo_web.py "What is the tallest mountain in the world?"
# -> "Mount Everest, at 8,848 meters."
```

This is how a 152M model can answer questions about events it never saw in training.
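
The "short clean context" step matters because the whole prompt has to fit in 512 tokens. The sketch below shows one way to pack already-ranked snippets into a token budget. It is not the code in ailo_web.py: the `Context:` / `Question:` layout, the 384-token budget, and the 4-characters-per-token estimate are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); use a real GPT-2 tokenizer for accuracy."""
    return max(1, len(text) // 4)


def pack_context(
    snippets: Iterable[str],
    question: str,
    budget: int = 384,  # hypothetical: leaves room for the answer inside 512 tokens
    count_tokens: Callable[[str], int] = rough_token_count,
) -> str:
    """Greedily keep ranked snippets (best first) until the token budget is used up."""
    used = count_tokens(question)
    kept = []
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > budget:
            break  # stop at the first snippet that does not fit
        kept.append(snippet.strip())
        used += cost
    context = "\n".join(kept)
    # Hypothetical prompt layout; the one used by ailo_web.py is not documented here.
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Passing `count_tokens=lambda s: len(tok.encode_ordinary(s))` with a tiktoken GPT-2 encoder would replace the rough estimate with exact counts.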

💭 Reasoning (thinking)

The model declares the thinking capability: set `"think": true` in the request and the reasoning trace is returned in `message.thinking`, separate from the answer (the Ollama desktop app shows it in a dedicated box). It works best on reasoning-style prompts; for exact math, pair it with a calculator tool.
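
As a sketch of what that looks like from client code, the helpers below build a request body with thinking enabled and separate the trace from the answer in the response. The function names are illustrative, and the sample response is made up to show the shape of the fields, not real model output.

```python
def build_thinking_request(question: str, model: str = "Alieno/ailo-152m-v2") -> dict:
    """Request body for Ollama's /api/chat with the reasoning trace enabled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "think": True,    # ask for the reasoning trace in message.thinking
        "stream": False,  # one JSON object instead of streamed chunks
    }


def split_reply(response: dict) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if the model produced none."""
    message = response.get("message", {})
    return message.get("thinking", ""), message.get("content", "")


# Made-up response, for illustration of the field layout only.
sample = {"message": {"role": "assistant", "thinking": "3 + 4 = 7", "content": "7"}}
thinking, answer = split_reply(sample)
print("thinking:", thinking)
print("answer:", answer)
```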

🐍 Python (Transformers)

```python
import sys

import tiktoken
import torch
from huggingface_hub import hf_hub_download

# Download the config, the custom model code, and the weights into ./ailo_v2
repo = "xxrickyxx/ailo-152m-v2"
for f in ["config.json", "configuration_ailo.py", "modeling_ailo.py", "pytorch_model.bin"]:
    hf_hub_download(repo_id=repo, filename=f, local_dir="ailo_v2")

# Make the downloaded model code importable
sys.path.insert(0, "ailo_v2")
from modeling_ailo import AILOForCausalLM
from configuration_ailo import AILOConfig

# Build the model with its default config and load the weights on CPU
model = AILOForCausalLM(AILOConfig())
model.load_state_dict(torch.load("ailo_v2/pytorch_model.bin", map_location="cpu"), strict=False)
model.eval()

# Encode a chat-format prompt with the GPT-2 BPE tokenizer, generate, and decode
tok = tiktoken.get_encoding("gpt2")
ids = torch.tensor([tok.encode_ordinary("<|user|>\nWhat is the capital of Italy?\n<|assistant|>\n")])
out = model.generate(ids, max_new_tokens=40, temperature=0.3)
print(tok.decode(out[0].tolist()))
```

📐 Model details

| Property | Value |
|---|---|
| Parameters | 151.9M |
| Architecture | Decoder-only Transformer (LayerNorm · RoPE · SwiGLU) |
| Layers / Hidden / Heads | 12 / 768 / 12 |
| Context length | 512 tokens |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Base | AILO-152M (FineWeb-Edu, 182k steps) |
| Fine-tuning | SFT + distillation from Gemma 3 4B: instruction + reasoning (GSM8K) + context-following (SQuAD) + context-override + tool-use |
| Formats | GGUF (q4_k_m, q8_0, f16) + PyTorch |
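
The parameter count can be roughly cross-checked from the dimensions in the table. The sketch below is a hedged estimate, not a description of the actual implementation: the feed-forward width of 3072, tied input/output embeddings, bias-free linear layers, and LayerNorm with bias are all assumptions that the card does not state.

```python
def estimate_params(
    vocab: int = 50_257,
    hidden: int = 768,
    layers: int = 12,
    d_ff: int = 3_072,             # assumption: not stated in the card
    tied_embeddings: bool = True,  # assumption: output head shares the embedding matrix
    norm_bias: bool = True,        # assumption: LayerNorm has weight and bias
) -> int:
    """Estimate the parameter count of a decoder-only Transformer with RoPE and SwiGLU."""
    embed = vocab * hidden
    head = 0 if tied_embeddings else vocab * hidden
    attention = 4 * hidden * hidden  # Q, K, V and output projections (RoPE adds no weights)
    mlp = 3 * hidden * d_ff          # SwiGLU: gate, up and down projections
    norm = hidden * (2 if norm_bias else 1)
    per_layer = attention + mlp + 2 * norm
    return embed + head + layers * per_layer + norm  # last term: final LayerNorm


if __name__ == "__main__":
    print(f"{estimate_params() / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at about 151.9M, in line with the figure in the table; with an untied output head it would be noticeably larger.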

⚠️ Limitations

152M params: limited world knowledge and multi-step reasoning vs larger models.
512-token context: best with short, focused prompts; not for long documents.
Web-search quality depends on search-result quality; best for well-defined factual questions.
For exact arithmetic, use the tool/agent layer (the calculator does the math).
Primarily English.
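
For the arithmetic point above, a calculator tool can be very small. The following is a minimal sketch of such a tool, not the project's own tool layer: it evaluates plain arithmetic by walking Python's AST and never calls `eval()`; the function name `calculate` is illustrative.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression (+, -, *, /, parentheses)."""

    def _eval(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)
```

An agent layer would route expressions such as `calculate("17 * 23 + 4")` to this function and hand the exact result back to the model for phrasing.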

📜 License

This project uses a dual-license model.

🆓 Non-Commercial License

Released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).

You are free to:

Use the model for research, education, and personal projects
Modify and fine-tune the model
Redistribute derivatives under the same license

You must:

Provide attribution
Keep the same license for derivative works
Not use the model for commercial purposes

💼 Commercial License

Commercial use of AILO-152M is not permitted under the free license. Commercial use includes:

Integration into paid products or services
Use in SaaS platforms, APIs, or enterprise systems
Any application that generates revenue directly or indirectly

For commercial licensing, a separate paid agreement (royalty or license fee) is required. Please contact the author.

📬 Contact

For research collaboration or commercial licensing inquiries, contact the project maintainer:

Riccardo Sparacino (LinkedIn)

📑 Citation

```
@misc{ailo152m_v2_2026,
  title  = {AILO-152M-v2: A Tiny Instruction-Tuned LLM with Reasoning and Web Search},
  author = {Sparacino, Riccardo},
  year   = {2026},
  note   = {Dual-licensed CC BY-NC-SA 4.0 / commercial}
}
```

🙏 Acknowledgments

Built with Ollama and llama.cpp. Fine-tuning data: Alpaca-cleaned, GSM8K, SQuAD. Knowledge-distillation teacher: Gemma 3 4B. Embeddings for web re-ranking: sentence-transformers MiniLM.

Keywords: small language model, tiny LLM, 152M, efficient LLM, edge AI, on-device LLM, CPU inference, Raspberry Pi LLM, Ollama model, GGUF, instruction-tuned, reasoning model, web search LLM, RAG, offline assistant, low-resource, fast inference.
