AILO-152M-v2: Tiny LLM with Chat, Reasoning & Web Search
A 152M-parameter language model that runs on almost anything (laptops, old PCs, even a Raspberry Pi) yet handles instruction-following chat, step-by-step reasoning, and web search for fresh facts.
AILO (Artificial Intelligence Language Operator) is a compact, fast, from-scratch transformer. v2 turns the original base model into a real assistant: it answers questions, thinks before answering, and can use live web results to answer questions about things it was never trained on.
ollama run Alieno/ailo-152m-v2
| Property | Value |
|---|---|
| Parameters | 151.9M |
| Speed | up to 384 tok/s (GPU), runs on CPU & edge |
| Size | 97 MB (q4_k_m) to 305 MB (f16) |
| Web search | yes (context-following) |
| Reasoning | yes (<think>) |
| Min RAM | ~300 MB |
Why AILO-152M-v2?
- Runs anywhere: 97 MB quantized, ~300 MB RAM. Old laptops, mini-PCs, Raspberry Pi, phones.
- Fast: among the fastest in its class (see benchmarks). Real-time chat even on modest hardware.
- Web-aware: trained for context-following, so it answers from fresh search results instead of stale memory.
- Distilled from a bigger model: answers learned from Gemma 3 4B via knowledge distillation, giving richer, better-structured replies than its size suggests.
- Honest small model: strong at concise factual Q&A and conversation; pairs with tools for exact math.
- Open & local: no cloud, full privacy, drop-in for Ollama.
Great for: edge/on-device AI, offline assistants, learning how LLMs work, fast prototyping, low-power servers, privacy-first chatbots.
Quick start
Ollama (recommended)
ollama run Alieno/ailo-152m-v2
>>> What is the capital of Italy?
The capital city of Italy is Rome.
Tags: :latest / :q8_0 (best quality, 156 MB) · :q4_k_m (smallest, 97 MB) · :f16 (full precision, 305 MB)
API
curl http://localhost:11434/api/chat -d '{
"model": "Alieno/ailo-152m-v2",
"messages": [{"role": "user", "content": "Explain what gravity is."}]
}'
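The same request can be sent from Python. The sketch below uses only the standard library and assumes an Ollama server is already running on the default port; the helper names `build_chat_payload` and `chat` are illustrative, not part of the project. It sets `"stream": false` so the server returns a single JSON object rather than a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl example


def build_chat_payload(question, model="Alieno/ailo-152m-v2"):
    """Build the JSON body for a single-turn chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }


def chat(question, url=OLLAMA_URL):
    """Send the question to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

With the server running, `chat("Explain what gravity is.")` returns the assistant's reply as a string.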
Benchmarks
Evaluated via Ollama /api/chat on factual QA, reasoning and coherence vs comparable and larger models:
| Model | Params | Factual | Reasoning | Coherence | Speed (tok/s) |
|---|---|---|---|---|---|
| AILO-152M-v2 | 152M | 7/8 | 1–2/5 | 100% | 384 |
| SmolLM2 | 135M | 8/8 | 1/5 | 98% | 403 |
| Qwen2.5 | 500M | 8/8 | 3–4/5 | 96% | 213 |
| TinyLlama | 1.1B | 8/8 | 1–2/5 | 97% | 260 |
- Top coherence (100%, virtually no repetition) and among the fastest.
- Competitive on factual accuracy with models its size and larger.
- Trails only bigger instruction-tuned models on multi-step reasoning, as expected for a from-scratch model of this size.
Measured on an NVIDIA RTX 5060 Ti. The suite is tiny (8 factual and 5 reasoning questions), so reasoning scores vary from run to run.
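The card does not say how the tok/s figures were computed. One plausible way to reproduce such a number, sketched here as an assumption, is to read the timing fields that Ollama includes in a non-streaming `/api/chat` response: `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds). The function name and the sample values are illustrative only.

```python
def tokens_per_second(response):
    """Generation speed computed from an Ollama chat response dict."""
    # eval_count: number of generated tokens
    # eval_duration: time spent generating them, in nanoseconds
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Hypothetical response values for illustration, not a measurement.
example = {"eval_count": 100, "eval_duration": 250_000_000}
print(f"{tokens_per_second(example):.0f} tok/s")  # 400 tok/s
```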
Hardware & performance
AILO-152M is tiny, so it is not limited to high-end GPUs: it also runs on old and low-power hardware. Approximate generation speed (q4_k_m, ~97 MB):
| Hardware | Type | Est. speed (tok/s) | Notes |
|---|---|---|---|
| RTX 5060 Ti / 4070+ | Modern GPU | 350–450 | measured: 384 (q8_0) |
| RTX 3060 / 2070 | Mid GPU | ~250–350 | smooth real-time |
| GTX 1660 / 1060 | Older GPU | ~150–220 | still real-time |
| GTX 1050 / MX150 | Old laptop GPU | ~90–140 | very usable |
| Ryzen 7 / Core i7 (recent) | Modern CPU | ~45–80 | no GPU needed |
| Core i5 ~2015 | Old CPU | ~18–30 | usable for chat |
| Raspberry Pi 5 | SBC / edge | ~10–16 | runs offline |
| Raspberry Pi 4 | Low-power SBC | ~5–9 | runs offline |
| Recent smartphone | Mobile | ~15–35 | via llama.cpp/Termux |
All figures are estimates except the measured RTX 5060 Ti result; real numbers vary with quantization, RAM bandwidth, and build flags. The takeaway: AILO runs even where larger models can't load at all.
Minimum requirements: ~300 MB RAM (q4_k_m), any x86-64 / ARM CPU. No GPU required.
Chat format
Trained on this template (tags are plain GPT-2 BPE sequences; no vocabulary extension):
<|user|>
{question}
<|assistant|>
<think>{optional reasoning}</think>
{answer}<|end|>
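As a minimal sketch of how the template can be used when driving the model directly (outside Ollama), the helpers below build a prompt and split raw generated text into the optional reasoning and the answer. The function names are illustrative; only the tag strings come from the template above.

```python
import re


def format_prompt(question):
    """Wrap a user question in the chat template, ready for generation."""
    return f"<|user|>\n{question}\n<|assistant|>\n"


def parse_reply(generated):
    """Split raw generated text into (reasoning, answer)."""
    # Everything after the end tag is discarded.
    text = generated.split("<|end|>", 1)[0]
    # The <think> block is optional, so fall back to an empty reasoning string.
    match = re.match(r"\s*<think>(.*?)</think>\s*", text, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), text[match.end():].strip()
    return "", text.strip()


print(parse_reply("<think>Italy's capital city.</think>\nRome.<|end|>"))
# ("Italy's capital city.", 'Rome.')
```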
Web search (fresh facts)
AILO v2 is trained for context-following with override: give it search results and it answers from them even when they contradict its training-time knowledge, so it can use up-to-date facts. When no context is given, it falls back to its own (true) knowledge.
A ready pipeline is included (ailo_web.py): DuckDuckGo → instant-answer + semantic re-ranking (MiniLM) with language/relevance filters → short clean context (fits the 512-token window) → AILO answers.
python ailo_web.py "What is the tallest mountain in the world?"
# -> "Mount Everest, at 8,848 meters."
This is how a 152M model can answer about events it never saw in training.
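The step that keeps the context inside the 512-token window can be illustrated with a greedy packing routine. This is a sketch under stated assumptions, not the code in ailo_web.py: the function name, the 128-token reserve for the answer, and the whitespace-based token counter are all hypothetical. A real pipeline would count GPT-2 BPE tokens instead.

```python
def pack_context(snippets, question, max_tokens=512, reserve_for_answer=128,
                 count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-ranked snippets that fit the token budget."""
    # Budget left for context after the question and the reserved answer space.
    budget = max_tokens - reserve_for_answer - count_tokens(question)
    kept = []
    for snippet in snippets:  # assumed already sorted by relevance, best first
        cost = count_tokens(snippet)
        if cost > budget:
            continue  # too long; a shorter, lower-ranked snippet may still fit
        kept.append(snippet)
        budget -= cost
    return kept
```

Passing a different `count_tokens` callable (for example one backed by a GPT-2 tokenizer) changes the accounting without touching the packing logic.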
Reasoning (thinking)
The model declares the thinking capability: set "think": true and the reasoning trace is returned in message.thinking, separate from the answer (shown in a dedicated box in the Ollama desktop app). Best on reasoning-style prompts; for exact math, pair with a calculator tool.
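A request body with thinking enabled, and a helper that separates the trace from the answer, might look like the sketch below. The function names are illustrative; the `think` flag and the `message.thinking` field are the ones described above, and the body is meant to be POSTed to the same `/api/chat` endpoint as in the API section.

```python
def build_thinking_payload(question, model="Alieno/ailo-152m-v2"):
    """JSON body for /api/chat with the reasoning trace requested."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "think": True,    # ask for the trace in message.thinking
        "stream": False,  # single JSON response
    }


def split_thinking(response):
    """Return (thinking, answer) from a parsed /api/chat response."""
    message = response.get("message", {})
    # "thinking" may be absent if the model produced no trace.
    return message.get("thinking", ""), message.get("content", "")
```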
Python (Transformers)
import sys

import tiktoken
import torch
from huggingface_hub import hf_hub_download

# Download the custom model code and weights into a local folder.
repo = "xxrickyxx/ailo-152m-v2"
for f in ["config.json", "configuration_ailo.py", "modeling_ailo.py", "pytorch_model.bin"]:
    hf_hub_download(repo_id=repo, filename=f, local_dir="ailo_v2")

# Make the downloaded modeling files importable.
sys.path.insert(0, "ailo_v2")
from modeling_ailo import AILOForCausalLM
from configuration_ailo import AILOConfig

# Build the model with its default config and load the weights on CPU.
model = AILOForCausalLM(AILOConfig())
model.load_state_dict(torch.load("ailo_v2/pytorch_model.bin", map_location="cpu"), strict=False)
model.eval()

# The chat tags are plain text, so the stock GPT-2 BPE encoding is enough.
tok = tiktoken.get_encoding("gpt2")
ids = torch.tensor([tok.encode_ordinary("<|user|>\nWhat is the capital of Italy?\n<|assistant|>\n")])
print(tok.decode(model.generate(ids, max_new_tokens=40, temperature=0.3)[0].tolist()))
Model details
| Property | Value |
|---|---|
| Parameters | 151.9M |
| Architecture | Decoder-only Transformer (LayerNorm · RoPE · SwiGLU) |
| Layers / Hidden / Heads | 12 / 768 / 12 |
| Context length | 512 tokens |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Base | AILO-152M (FineWeb-Edu, 182k steps) |
| Fine-tuning | SFT + distillation from Gemma 3 4B: instruction + reasoning (GSM8K) + context-following (SQuAD) + context-override + tool-use |
| Formats | GGUF (q4_k_m, q8_0, f16) + PyTorch |
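As a rough sanity check, the parameter count can be approximated from the table. The sketch below relies on assumptions that the card does not state: a SwiGLU hidden size of 3072 (4 × 768), input and output embeddings that share weights, and no bias terms. Normalization weights are ignored because they are negligible, and RoPE adds no parameters.

```python
def estimate_parameters(vocab=50_257, hidden=768, layers=12, ffn_hidden=3072):
    """Approximate parameter count; ffn_hidden and weight tying are assumptions."""
    embedding = vocab * hidden         # token embeddings, assumed shared with the output head
    attention = 4 * hidden * hidden    # Q, K, V and output projections
    swiglu = 3 * hidden * ffn_hidden   # gate, up and down projections
    return embedding + layers * (attention + swiglu)


total = estimate_parameters()
print(f"{total / 1e6:.1f}M parameters")                     # 151.8M parameters
print(f"{total * 2 / 1e6:.0f} MB at 2 bytes/param (f16)")   # 304 MB at 2 bytes/param (f16)
```

Under these assumptions the estimate lands close to the reported 151.9M parameters and the 305 MB f16 file, which suggests the guesses are plausible but does not confirm them.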
Limitations
- 152M params: limited world knowledge and multi-step reasoning vs larger models.
- 512-token context: best with short, focused prompts; not for long documents.
- Web-search quality depends on search-result quality; best for well-defined factual questions.
- For exact arithmetic, use the tool/agent layer (the calculator does the math).
- Primarily English.
License
This project uses a dual-license model.
Non-Commercial License
Released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).
You are free to:
- Use the model for research, education, and personal projects
- Modify and fine-tune the model
- Redistribute derivatives under the same license
You must:
- Provide attribution
- Keep the same license for derivative works
- Not use the model for commercial purposes
Commercial License
Commercial use of AILO-152M is not permitted under the free license. Commercial use includes:
- Integration into paid products or services
- Use in SaaS platforms, APIs, or enterprise systems
- Any application that generates revenue directly or indirectly
For commercial licensing, a separate paid agreement (royalty or license fee) is required. Please contact the author.
Contact
For research collaboration or commercial licensing inquiries, contact the project maintainer:
Riccardo Sparacino (LinkedIn)
Citation
@misc{ailo152m_v2_2026,
title = {AILO-152M-v2: A Tiny Instruction-Tuned LLM with Reasoning and Web Search},
author = {Sparacino, Riccardo},
year = {2026},
note = {Dual-licensed CC BY-NC-SA 4.0 / commercial}
}
Acknowledgments
Built with Ollama and llama.cpp. Fine-tuning data: Alpaca-cleaned, GSM8K, SQuAD. Knowledge-distillation teacher: Gemma 3 4B. Embeddings for web re-ranking: sentence-transformers MiniLM.
Keywords: small language model, tiny LLM, 152M, efficient LLM, edge AI, on-device LLM, CPU inference, Raspberry Pi LLM, Ollama model, GGUF, instruction-tuned, reasoning model, web search LLM, RAG, offline assistant, low-resource, fast inference.