moonshotai/Kimi-K2.7-Code optimized for running on a Mac Studio M3 Ultra.

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 3-bit MoE baseline with important always-on layers at higher precision.
  • Fits into ~460 GB memory, leaving enough room for a smaller utility model.

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Kimi-K2.7-Code-MLX-3.6bit

Benchmarks

metric this model
bpw 3.578
base memory 427.579
peak memory (1024/512) 460.444
prompt tok/s (1024) 218.851 卤 0.208
gen tok/s (512) 21.035 卤 0.049
perplexity 4.462 卤 0.037
arc_challenge 0.692 卤 0.021
hellaswag 0.780 卤 0.019

Methodology

Quantized with a mlx-lm fork. MLX quantization options differ than llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
Downloads last month
1,824
Safetensors
Model size
1T params
Tensor type
BF16
U32
F32
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for spicyneuron/Kimi-K2.7-Code-MLX-3.6bit

Quantized
(14)
this model