Prism ML Website | White Paper | Demo & Examples | Discord
bonsai-image-ternary-4B-mlx-2bit
Ternary weight (1.58-bit) text-to-image diffusion transformer deployment for Apple Silicon
1.21 GB transformer | 6.4× smaller than FP16 | 9.4 s / 512² on iPhone 17 Pro Max | ~6 s / 512² on M4 Pro | runs on Mac, iPhone, iPad
Highlights
- 1.21 GB diffusion transformer, down from 7.75 GB for the FP16 FLUX.2 Klein 4B transformer
- Ternary {−1, 0, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
- Quality-oriented Bonsai Image variant: the additional zero state improves visual quality and prompt fidelity while keeping the transformer compact
- 3.88 GB Apple Silicon deployment payload including the 4-bit text encoder and FP16 VAE; the text encoder is offloaded after prompt encoding, so the denoising loop keeps only the compact transformer and VAE resident
- 4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0; no CFG and no negative prompts needed
- MLX-native 2-bit format for Apple Silicon, the same kernel path as our ternary language-model releases
- Cross-platform companion: also available as gemlite 2-bit for NVIDIA GPUs
Resources
- White Paper: full benchmarks, kernels, and memory analysis
- Demo repo: one-command setup for Mac / Linux / Windows
- Discord: community + support
- Kernels: MLX (Apple Silicon) · mlx-swift (iOS / macOS); the 2-bit format is supported out of the box
Model Overview
| Item | Specification |
|---|---|
| Base architecture | FLUX.2 Klein 4B (MMDiT diffusion transformer) |
| Parameters | ~4.0B (transformer trunk) |
| Blocks | 25 MMDiT blocks: 5 double-stream + 20 single-stream |
| Sampler | FlowMatchEuler, 4 steps, guidance = 1.0, shift = 3.0 |
| Text encoder | Qwen3-4B at 4-bit (≈ 2.28 GB on-device, offloaded after prompt encode) |
| VAE | Flux2 32-channel latent, tiled decode (128 px tiles) |
| Native resolution | 1024×1024 (also supports 512×512 and arbitrary multiples of 32) |
| Weight format | MLX 2-bit g128, ternary values + FP16 group-wise scales |
| Transformer size | 1.21 GB (6.4× smaller than 7.75 GB FP16) |
| Total payload | 3.88 GB (4.1× smaller than the 15.97 GB FP16 transformer + text encoder + VAE) |
| Ternary coverage | All 100 matmul-heavy linears in the 25 MMDiT blocks |
| License | Apache 2.0 |
Ternary Weight Representation: 1.58-bit g128
Each ternary weight takes a value from {−1, 0, +1} with one shared FP16 scale per group of 128 weights:
w_i = scale_g * t_i, t_i in {−1, 0, +1}
Ternary values carry log₂(3) ≈ 1.585 bits of information per weight. With one FP16 scale per group of 128, the effective storage is
b_eff ≈ log₂(3) + 16/128 ≈ 1.585 + 0.125 ≈ 1.71 bits/weight
This gives an idealized 9.4× reduction relative to FP16 for the ternary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final Ternary Bonsai Image 4B diffusion transformer is 1.21 GB, a 6.4× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
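The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function names are illustrative and are not part of any Prism ML tooling.

```python
import math

def effective_bits_per_weight(levels: int = 3,
                              scale_bits: int = 16,
                              group_size: int = 128) -> float:
    """Information content of one weight plus the amortized per-group scale."""
    return math.log2(levels) + scale_bits / group_size

def idealized_reduction(bits_per_weight: float, baseline_bits: int = 16) -> float:
    """Size ratio relative to a baseline that stores every weight in baseline_bits."""
    return baseline_bits / bits_per_weight

b_eff = effective_bits_per_weight()
print(f"{b_eff:.2f} bits/weight")                    # 1.71 bits/weight
print(f"{idealized_reduction(b_eff):.1f}x vs FP16")  # 9.4x vs FP16
```

The 9.4× figure applies only to the ternary layers; the whole-transformer ratio is lower because the supporting tensors stay in FP16.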
The ternary representation is applied to the matrix-heavy transformer layers, including Q / K / V projections, output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters) such as modulation streams, embedders, output norm, and output projection remain FP16 for image quality and stability.
The MLX deployment uses a 2-bit packed format. Ternary values are stored in 2-bit slots, with the fourth code unused. The model-level Bonsai representation is 1.21 GB; the deployed MLX pack is 1.43 GB on disk due to runtime packing and alignment overhead in the current MLX path.
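As an illustration of the 2-bit packing idea, the sketch below stores four ternary values per byte and dequantizes them with one scale per group of 128. The code assignment, the slot order within a byte, and all function names are assumptions made for this example; the actual MLX layout may differ.

```python
# Hypothetical code assignment; the real MLX 2-bit layout may differ.
ENCODE = {-1: 0b00, 0: 0b01, +1: 0b10}  # 0b11 is the unused fourth code
DECODE = {code: value for value, code in ENCODE.items()}

def pack_ternary(values):
    """Pack ternary values four per byte, lowest-order 2-bit slot first."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for slot, v in enumerate(values[i:i + 4]):
            byte |= ENCODE[v] << (2 * slot)
        packed.append(byte)
    return bytes(packed)

def unpack_ternary(packed):
    """Inverse of pack_ternary; raises KeyError on the unused code 0b11."""
    values = []
    for byte in packed:
        for slot in range(4):
            values.append(DECODE[(byte >> (2 * slot)) & 0b11])
    return values

def dequantize(packed, scales, group_size=128):
    """Apply w_i = scale_g * t_i with one scale per group of weights."""
    ternary = unpack_ternary(packed)
    return [scales[i // group_size] * t for i, t in enumerate(ternary)]

group = [-1, 0, 1, 1] * 32           # one group of 128 ternary values
packed = pack_ternary(group)         # 32 bytes
weights = dequantize(packed, scales=[0.5])
```

Under this layout a group of 128 weights occupies 32 bytes of codes plus 2 bytes of FP16 scale, or 2.125 bits per weight, compared with about 1.71 bits per weight for the ideal ternary encoding. That difference is consistent with the packed deployment being larger than the model-level representation.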
Memory
| Format | Transformer size | Reduction | Ratio |
|---|---|---|---|
| FP16 FLUX.2 Klein 4B | 7.75 GB | n/a | 1.0× |
| Ternary Bonsai Image 4B | 1.21 GB | 84.4% | 6.4× |
Apple Silicon deployment:
| Component | Size |
|---|---|
| MLX 2-bit diffusion transformer | 1.43 GB |
| Compressed text encoder | 2.28 GB |
| FP16 VAE | 0.17 GB |
| Total payload | 3.88 GB |
At runtime, the text encoder is offloaded after prompt encoding. During denoising, the repeated image-generation loop is dominated by the compact ternary diffusion transformer and active image-generation components rather than the full payload.
End-to-end Mac M4 Pro mean-active memory pressure at 1024² is 2.38 GB, a 6.0× reduction vs. the stock FP16 MFLUX pipeline (14.39 GB).
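The figures in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted above; the variable and function names are illustrative.

```python
payload_gb = {
    "MLX 2-bit diffusion transformer": 1.43,
    "Compressed text encoder": 2.28,
    "FP16 VAE": 0.17,
}

total_gb = round(sum(payload_gb.values()), 2)
# Weights still needed once the text encoder has been offloaded.
resident_weights_gb = round(total_gb - payload_gb["Compressed text encoder"], 2)

def ratio(baseline_gb: float, compact_gb: float) -> float:
    return round(baseline_gb / compact_gb, 1)

def reduction_pct(baseline_gb: float, compact_gb: float) -> float:
    return round(100 * (1 - compact_gb / baseline_gb), 1)

print(total_gb)                    # 3.88
print(resident_weights_gb)         # 1.6
print(ratio(7.75, 1.21))           # 6.4  (transformer only)
print(reduction_pct(7.75, 1.21))   # 84.4
print(ratio(15.97, 3.88))          # 4.1  (full payload)
print(ratio(14.39, 2.38))          # 6.0  (mean-active memory at 1024²)
```

After offload, roughly 1.60 GB of weights remain resident. The measured 2.38 GB mean-active figure is higher, presumably because it also covers activations and runtime buffers, though the source does not break this down.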
Best Practices
- Sampler: FlowMatchEuler-discrete with 4 steps, guidance = 1.0 (no classifier-free guidance), shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts.
- Resolution: native 1024² is the design target; 512² works for quick previews.
- Aspect ratios: multiples of 32 are supported, including 832×1248 and 1248×832.
- Prompting: natural-language prompts. Negative prompts are not required.
- Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
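The sampler and resolution settings above can be sketched in plain Python. This is an illustration under stated assumptions, not the demo runtime: it assumes linearly spaced sigmas from 1 down to 1/steps and the common static time-shift formula `shift * s / (1 + (shift - 1) * s)`, and `velocity_fn` is a hypothetical stand-in for the diffusion transformer. The actual runtime may compute its schedule differently.

```python
def validate_resolution(width: int, height: int, multiple: int = 32) -> None:
    """Reject sizes that are not multiples of 32 in both dimensions."""
    if width % multiple or height % multiple:
        raise ValueError(f"{width}x{height} is not a multiple of {multiple}")

def shifted_sigmas(num_steps: int = 4, shift: float = 3.0) -> list[float]:
    """Assumed schedule: linear sigmas, static time shift, terminal 0."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    shifted = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return shifted + [0.0]

def euler_sample(velocity_fn, x, sigmas):
    """Flow-matching Euler update: x <- x + (sigma_next - sigma) * v(x, sigma).

    With guidance = 1.0 there is one conditional model call per step
    and no separate unconditional pass.
    """
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x

validate_resolution(832, 1248)
sigmas = shifted_sigmas()  # approximately [1.0, 0.9, 0.75, 0.5, 0.0]

# Toy check: a constant velocity (noise - target) carries noise to the target.
target, noise = [2.0, -1.0], [0.5, 0.5]
toy_velocity = lambda x, sigma: [n - t for n, t in zip(noise, target)]
result = euler_sample(toy_velocity, noise, sigmas)
```

With shift = 3.0 the steps are concentrated at high noise levels: under this assumed schedule the first three steps cover sigma from 1.0 down to 0.5, and the final step covers the remaining half.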
Quickstart
MLX (Python)
The simplest path is the Bonsai Image Demo repo, which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend):
git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
./scripts/download_model.sh # ternary is the default
./scripts/serve.sh
For a one-shot render without the studio frontend:
./scripts/generate.sh --prompt "A bonsai tree in a quiet ceramic studio, soft morning light"
MLX Swift (iOS / macOS)
Ternary Bonsai Image 4B runs natively on iPhone and iPad via MLX Swift. Bonsai Studio for iPhone is available on the App Store and ships ternary as the default variant.
Throughput (MLX / Apple Silicon)
Mac M4 Pro (48 GB unified memory), 4 denoising steps, fixed prompt and seed:
| Resolution | s / step | s / image (mean ± std) | vs stock MFLUX FP16 |
|---|---|---|---|
| 512 × 512 | 1.44 | 5.78 ± 0.08 s | 3.15× |
| 1024 × 1024 | 6.06 | 24.26 ± 0.24 s | 5.56× |
iPhone 17 Pro Max (A19 Pro, 12 GB unified memory), MLX Swift, same methodology:
| Resolution | s / step | s / image |
|---|---|---|
| 128 × 128 | 0.68 | 2.7 s |
| 256 × 256 | 1.00 | 4.0 s |
| 512 × 512 | 2.35 | 9.4 s |
| 1024 × 1024 | 8.50 | 34.0 s |
Stock FP16 FLUX.2 Klein 4B does not fit within iPhone 17 Pro Max's 12 GB unified memory budget; Bonsai Image 4B models do.
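The relationship between the per-step and per-image columns can be checked directly. The sketch below uses the Mac M4 Pro rows and assumes that the "vs stock MFLUX FP16" column is a ratio of seconds per image; the implied FP16 timings are derived here and are not reported measurements.

```python
def summarize(s_per_step: float, s_per_image: float, speedup: float, steps: int = 4) -> dict:
    denoise_s = steps * s_per_step
    return {
        "denoise_s": round(denoise_s, 2),             # time spent in the 4 denoising steps
        "other_s": round(s_per_image - denoise_s, 2), # everything outside the step loop
        "implied_fp16_s": round(s_per_image * speedup, 1),  # derived, not measured
    }

mac_m4_pro = {
    "512x512": summarize(1.44, 5.78, 3.15),
    "1024x1024": summarize(6.06, 24.26, 5.56),
}
for resolution, row in mac_m4_pro.items():
    print(resolution, row)
```

On these numbers the four denoising steps account for almost all of the per-image time (5.76 s of 5.78 s at 512 × 512, and 24.24 s of 24.26 s at 1024 × 1024). The iPhone rows follow the same pattern, with seconds per image equal to four times seconds per step after rounding.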
Benchmarks
Evaluated with matched generation settings across the comparison set on H100. GenEval uses the official 512x512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024x1024, while smaller-backbone rows are evaluated at their native 512x512 setting. Higher is better for all three benchmarks.
| Model | Transformer (GB) | GenEval | HPSv3 | DPG-Bench |
|---|---|---|---|---|
| Bonsai Image · Ternary 4B | 1.21 | 0.723 | 12.22 | 0.851 |
| Bonsai Image · Binary 4B | 0.93 | 0.671 | 11.15 | 0.822 |
| FLUX.2 Klein 4B | 7.75 | 0.819 | 12.84 | 0.853 |
| FLUX.1-schnell | 23.8 | 0.716 | 12.67 | 0.848 |
| SDXL | 5.14 | 0.300 | 10.05 | 0.740 |
| PixArt-Σ XL 2 | 1.20 | 0.541 | 11.93 | 0.769 |
| Stable Diffusion 1.5 | 1.72 | 0.396 | 4.20 | 0.601 |
| BK-SDM-Small | 0.98 | 0.297 | 3.05 | 0.559 |
The benchmark results show the intended quality-footprint trade-off. Ternary Bonsai Image 4B is the quality-oriented variant: at 1.21 GB, it nearly matches FLUX.2 Klein 4B on DPG-Bench (0.851 vs 0.853) and stays close on HPSv3 (12.22 vs 12.84), with a larger gap on GenEval (0.723 vs 0.819), while reducing the diffusion transformer footprint by 6.4×. The binary companion is the footprint-oriented variant, reducing the diffusion transformer below 1 GB while still scoring above SDXL and PixArt-Σ XL 2 on GenEval and DPG-Bench.
Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.
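One rough way to read the table is to express each Bonsai score as a percentage of the FLUX.2 Klein 4B score. The sketch below does this with the values reported above. It is a crude summary rather than an official metric, since the three benchmarks use different scales, and the helper name is illustrative.

```python
scores = {
    "Bonsai Image Ternary 4B": {"GenEval": 0.723, "HPSv3": 12.22, "DPG-Bench": 0.851},
    "Bonsai Image Binary 4B": {"GenEval": 0.671, "HPSv3": 11.15, "DPG-Bench": 0.822},
    "FLUX.2 Klein 4B": {"GenEval": 0.819, "HPSv3": 12.84, "DPG-Bench": 0.853},
}

def retention_pct(model: str, reference: str = "FLUX.2 Klein 4B") -> dict:
    """Score of `model` as a percentage of the reference score, per benchmark."""
    return {
        metric: round(100 * scores[model][metric] / scores[reference][metric], 1)
        for metric in scores[reference]
    }

ternary = retention_pct("Bonsai Image Ternary 4B")
binary = retention_pct("Bonsai Image Binary 4B")
print(ternary)  # {'GenEval': 88.3, 'HPSv3': 95.2, 'DPG-Bench': 99.8}
print(binary)   # {'GenEval': 81.9, 'HPSv3': 86.8, 'DPG-Bench': 96.4}
```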
Use Cases
- Local creative tooling: image generation directly on Mac, iPhone, and iPad
- Private generation: prompts and generated assets can remain local
- Rapid iteration: lower local latency and no remote queue for iterative creative workflows
- Mobile deployment: image generation on devices with unified-memory, thermal, and connectivity constraints
- Commodity-GPU serving: lower transformer footprint and reduced memory pressure through the companion CUDA deployment
- Enterprise and controlled inference: local or private environments for data residency and compliance-sensitive workflows
Limitations
- Ternary Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact ternary-weight deployment designed to deliver similar practical behavior at much smaller size.
- Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
- Current commodity inference stacks do not yet expose fully native ternary execution as a standard hardware path. This release uses practical MLX low-bit kernel paths on Apple Silicon and Gemlite low-bit GEMM on CUDA.
- After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.
Citation
@techreport{bonsaiimage4b,
title = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
author = {Prism ML},
year = {2026},
month = {May},
url = {https://prismml.com}
}
Contact
For questions, feedback, or collaboration inquiries: contact@prismml.com