How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sjakek/LFM-2.5-8B-1B-hermes-ft
# Run inference directly in the terminal:
llama-cli -hf sjakek/LFM-2.5-8B-1B-hermes-ft
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sjakek/LFM-2.5-8B-1B-hermes-ft
# Run inference directly in the terminal:
llama-cli -hf sjakek/LFM-2.5-8B-1B-hermes-ft
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sjakek/LFM-2.5-8B-1B-hermes-ft
# Run inference directly in the terminal:
./llama-cli -hf sjakek/LFM-2.5-8B-1B-hermes-ft
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sjakek/LFM-2.5-8B-1B-hermes-ft
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sjakek/LFM-2.5-8B-1B-hermes-ft
Use Docker
docker model run hf.co/sjakek/LFM-2.5-8B-1B-hermes-ft
Quick Links

LFM-2.5-8B-1B Hermes FT

This repo contains Hermes/tool-use fine-tuned variants of LiquidAI/LFM2.5-8B-A1B.

The current release candidate is iter13_llamacpp_chat_fixed. Earlier GGUF artifacts were withdrawn because they overfit tool routing and regressed normal chat by stopping too early. The iter13 repair adds a narrow chat-retention LoRA pass on top of the fixed Hermes tool-router model, then regenerates MLX and GGUF quants from the repaired fused checkpoint.

Runtime requirements

  • MLX releases preserve tool_parser_type: "pythonic".
  • llama.cpp GGUF releases are intended to be served with the LFM chat template, --jinja, and a 64K context smoke target.
  • Tested live server alias: sjakek/LFM-2.5-8B-1B-hermes-ft:Q6KXL.

Available artifacts

  • gguf/LFM-2.5-8B-1B-Hermes-Tuned-Q8KXL.gguf
  • gguf/LFM-2.5-8B-1B-Hermes-Tuned-Q6KXL.gguf
  • gguf/LFM-2.5-8B-1B-Hermes-Tuned-Q5KXL.gguf
  • gguf/LFM-2.5-8B-1B-Hermes-Tuned-Q4KXL.gguf
  • mlx/8bit/
  • mlx/6bit/
  • adapters/iter13_llamacpp_chat_retention_r8/
  • evals/iter13_*

Validation summary

All listed iter13 variants passed:

  • Normal-chat regression: 10/10
  • Fixed-Hermes tool-router suite: 43/43
  • Structured tool-call cases: 28/28
  • No-tool false positives: 0/10
  • Text tool-call leaks: 0

The normal-chat eval includes plain chat, tools-available no-tool chat, short factual answers, and multi-sentence explanations. The tool suite uses the fixed Hermes browser, terminal, file/search/write, no-tool, and tool-result finalization cases.

Training summary

  1. Semi-full-gradient grouped MoE expert training over the Hermes trace corpus.
  2. Fixed-Hermes contrastive router LoRA repairs for structured pythonic tool calls.
  3. iter12 chat-retention repair to fix MLX normal-chat early stopping.
  4. iter13 llama.cpp-targeted chat-retention repair after BF16 GGUF exposed shorter completions than MLX.

The GGUFs were regenerated from a dequantized fused safetensors source, then quantized from the BF16 GGUF parent. They are named Q*KXL as Hermes-tuned mixed-precision KXL targets.

Downloads last month
1,564
Safetensors
Model size
8B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sjakek/LFM-2.5-8B-1B-hermes-ft

Quantized
(44)
this model