bonsai-image-ternary-4B-mlx-2bit

Ternary weight (1.58-bit) text-to-image diffusion transformer deployment for Apple Silicon

1.21 GB transformer | 6.4× smaller than FP16 | 9.4 s / 512² on iPhone 17 Pro Max | ~6 s / 512² on M4 Pro | runs on Mac, iPhone, iPad

Highlights

1.21 GB diffusion transformer, down from 7.75 GB for the FP16 FLUX.2 Klein 4B transformer
Ternary {−1, 0, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
Quality-oriented Bonsai Image variant: compared with the binary variant, the additional zero state improves visual quality and prompt fidelity while keeping the transformer compact
3.88 GB Apple Silicon deployment payload including the 4-bit text encoder and FP16 VAE — text encoder is offloaded after prompt encode, so the denoising loop only keeps the compact transformer and VAE resident
4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0 — no CFG, no negative prompts needed
MLX-native 2-bit format for Apple Silicon, the same kernel path as our ternary language-model releases
Cross-platform companion: also available as gemlite 2-bit for NVIDIA GPUs

Resources

White Paper — full benchmarks, kernels, and memory analysis
Demo repo — one-command setup for Mac / Linux / Windows
Discord — community + support
Kernels: MLX (Apple Silicon) · mlx-swift (iOS / macOS) — 2-bit format is supported out of the box

Model Overview

Item	Specification
Base architecture	FLUX.2 Klein 4B (MMDiT diffusion transformer)
Parameters	~4.0B (transformer trunk)
Blocks	25 MMDiT blocks: 5 double-stream + 20 single-stream
Sampler	FlowMatchEuler, 4 steps, guidance = 1.0, shift = 3.0
Text encoder	Qwen3-4B at 4-bit (≈ 2.28 GB on-device, offloaded after prompt encode)
VAE	Flux2 32-channel latent, tiled decode (128 px tiles)
Native resolution	1024×1024 (also supports 512×512 and arbitrary multiples of 32)
Weight format	MLX 2-bit g128, ternary values + FP16 group-wise scales
Transformer size	1.21 GB (6.4× smaller than 7.75 GB FP16)
Total payload	3.88 GB (4.1× smaller than the 15.97 GB FP16 transformer + text encoder + VAE)
Ternary coverage	All 100 matmul-heavy linears in the 25 MMDiT blocks
License	Apache 2.0

Ternary Weight Representation: 1.58-bit g128

Each ternary weight takes a value from {−1, 0, +1} with one shared FP16 scale per group of 128 weights:

w_i = scale_g * t_i,    t_i in {−1, 0, +1}
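
A minimal sketch of the g128 representation above, assuming absmean scaling with round-to-nearest. The source does not describe how scales or codes are chosen, so `ternarize`, `dequantize`, and the scaling rule are illustrative assumptions, not the release's quantizer.

```python
import numpy as np

GROUP_SIZE = 128  # g128: one shared scale per group of 128 weights


def ternarize(weights: np.ndarray, group_size: int = GROUP_SIZE):
    """Map a 1-D weight vector to ternary codes plus one FP16 scale per group.

    Assumption: the scale is the mean absolute value of the group and codes
    are chosen by round-to-nearest. This is a common choice, not a documented one.
    """
    groups = weights.reshape(-1, group_size).astype(np.float32)
    scale = np.abs(groups).mean(axis=1, keepdims=True)
    scale = np.maximum(scale, 1e-8)  # guard against all-zero groups
    t = np.clip(np.rint(groups / scale), -1, 1).astype(np.int8)
    return t, scale.astype(np.float16)


def dequantize(t: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct w_i = scale_g * t_i for every group g."""
    return (t.astype(np.float32) * scale.astype(np.float32)).reshape(-1)


rng = np.random.default_rng(0)
w = rng.standard_normal(4 * GROUP_SIZE).astype(np.float32)
t, scale = ternarize(w)
w_hat = dequantize(t, scale)
print(t.shape, scale.shape, w_hat.shape)
```

Each group stores 128 ternary codes and a single FP16 scale, which is where the per-weight storage figure comes from.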

Ternary values carry log₂(3) ≈ 1.585 bits of information per weight. With one FP16 scale per group of 128, the effective storage is

b_eff ≈ log2(3) + 16/128 ≈ 1.585 + 0.125 ≈ 1.71 bits/weight

This gives an idealized 9.4× reduction relative to FP16 for the ternary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final Ternary Bonsai Image 4B diffusion transformer is 1.21 GB, a 6.4× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
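
The arithmetic behind these figures can be checked in a few lines. This only re-derives the numbers quoted above; the variable names are illustrative.

```python
import math

FP16_BITS = 16
GROUP_SIZE = 128

ternary_bits = math.log2(3)            # information per ternary weight
scale_bits = FP16_BITS / GROUP_SIZE    # one FP16 scale amortized over 128 weights
b_eff = ternary_bits + scale_bits      # effective bits per weight
ideal_ratio = FP16_BITS / b_eff        # idealized reduction for ternary layers only

# Whole-transformer figures, which include the tensors that stay in FP16.
fp16_gb, ternary_gb = 7.75, 1.21
actual_ratio = fp16_gb / ternary_gb
reduction_pct = (1 - ternary_gb / fp16_gb) * 100

print(f"b_eff = {b_eff:.2f} bits/weight")
print(f"idealized ratio = {ideal_ratio:.1f}x, whole-transformer ratio = {actual_ratio:.1f}x")
print(f"size reduction = {reduction_pct:.1f}%")
```

The gap between the idealized and whole-transformer ratios reflects the supporting tensors that remain in FP16.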

The ternary representation is applied to the matrix-heavy transformer layers: Q / K / V projections, attention output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters), such as modulation streams, embedders, the output norm, and the final output projection, remain FP16 for image quality and stability.

The MLX deployment uses a 2-bit packed format. Ternary values are stored in 2-bit slots, with the fourth code unused. The model-level Bonsai representation is 1.21 GB; the deployed MLX pack is 1.43 GB on disk due to runtime packing and alignment overhead in the current MLX path.
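
To illustrate how ternary values fit into 2-bit slots, here is a sketch that packs four values per byte. The code assignment (−1 → 0b00, 0 → 0b01, +1 → 0b10) and the helper names are hypothetical; the source only states that one of the four codes is unused, and the actual MLX layout may differ.

```python
import numpy as np

# Hypothetical code assignment: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10.
# The fourth code, 0b11, is never produced.
ENCODE_OFFSET = 1
SHIFTS = np.array([0, 2, 4, 6], dtype=np.uint8)


def pack_ternary(t: np.ndarray) -> np.ndarray:
    """Pack ternary values (length divisible by 4) into bytes, 4 values per byte."""
    codes = (t.astype(np.int16) + ENCODE_OFFSET).astype(np.uint8).reshape(-1, 4)
    return np.bitwise_or.reduce(codes << SHIFTS, axis=1).astype(np.uint8)


def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Recover the ternary values from the packed bytes."""
    codes = (packed[:, None] >> SHIFTS) & 0b11
    return (codes.astype(np.int8) - ENCODE_OFFSET).reshape(-1)


t = np.array([-1, 0, 1, 1, 0, 0, -1, 1], dtype=np.int8)
packed = pack_ternary(t)
restored = unpack_ternary(packed)
print(packed.nbytes, "bytes for", t.size, "weights")
```

Storing 2 bits per code instead of the information-theoretic 1.585 bits trades some space for simple, byte-aligned unpacking.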

Memory

Format	Transformer size	Reduction	Ratio
FP16 FLUX.2 Klein 4B	7.75 GB	—	1.0×
Ternary Bonsai Image 4B	1.21 GB	84.4%	6.4×

Apple Silicon deployment:

Component	Size
MLX 2-bit diffusion transformer	1.43 GB
Compressed text encoder	2.28 GB
FP16 VAE	0.17 GB
Total payload	3.88 GB

At runtime, the text encoder is offloaded after prompt encoding. The denoising loop therefore keeps only the compact ternary diffusion transformer and the VAE resident, plus activations, rather than the full 3.88 GB payload.

End-to-end Mac M4 Pro mean-active memory pressure at 1024² is 2.38 GB — a 6.0× reduction vs the stock FP16 MFLUX pipeline (14.39 GB).
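
The payload and reduction figures in this section can be re-derived from the table entries. The sketch below uses only numbers quoted above; `resident_weights_gb` is a derived estimate of weights kept after offload, not a measured value.

```python
# Component sizes from the Apple Silicon deployment table (GB).
payload_gb = {
    "mlx_2bit_transformer": 1.43,
    "text_encoder_4bit": 2.28,
    "vae_fp16": 0.17,
}

total_gb = sum(payload_gb.values())

# Weights that stay resident once the text encoder is offloaded (derived).
resident_weights_gb = total_gb - payload_gb["text_encoder_4bit"]

payload_ratio = 15.97 / total_gb   # vs FP16 transformer + text encoder + VAE
runtime_ratio = 14.39 / 2.38       # measured mean-active memory, FP16 MFLUX vs Bonsai

print(f"total payload = {total_gb:.2f} GB")
print(f"resident weights after offload = {resident_weights_gb:.2f} GB")
print(f"payload ratio = {payload_ratio:.1f}x, runtime ratio = {runtime_ratio:.1f}x")
```

The measured 2.38 GB sits above the derived resident-weight figure, which is expected because runtime memory also covers activations and intermediate buffers.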

Best Practices

Sampler: FlowMatchEuler-discrete with 4 steps, guidance = 1.0 (no classifier-free guidance), shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts.
Resolution: native 1024² is the design target; 512² works for quick previews.
Aspect ratios: multiples of 32 are supported, including 832×1248 and 1248×832.
Prompting: natural-language prompts. Negative prompts are not required.
Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
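
The sampler settings above can be sketched as a 4-step Euler loop over shifted noise levels. This assumes the static time-shift formula used by common flow-matching Euler schedulers and a uniform base grid from 1 to 0; `shifted_sigmas`, `euler_sample`, and the toy velocity function are hypothetical stand-ins, not the release's sampler.

```python
import numpy as np


def shifted_sigmas(num_steps: int = 4, shift: float = 3.0) -> np.ndarray:
    """Noise levels from 1.0 down to 0.0 with a static time shift.

    Assumption: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    """
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)


def euler_sample(velocity_fn, x: np.ndarray, sigmas: np.ndarray) -> np.ndarray:
    """Integrate dx/dsigma = v(x, sigma) with explicit Euler steps.

    With guidance = 1.0 there is no classifier-free guidance, so each step
    needs a single model call and no negative-prompt branch.
    """
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = x + (sigma_next - sigma) * v
    return x


# Toy check: an oracle velocity that points from a known target toward the sample.
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 4))
noise = rng.standard_normal((4, 4))


def oracle_velocity(x, sigma):
    return (x - target) / sigma


sigmas = shifted_sigmas()
result = euler_sample(oracle_velocity, noise, sigmas)
print(sigmas)
```

With shift = 3.0 the schedule spends more of its four steps at high noise levels than a uniform grid would.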

Quickstart

MLX (Python)

The simplest path is the Bonsai Image Demo repo, which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend):

git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
./scripts/download_model.sh           # ternary is the default
./scripts/serve.sh

For a one-shot render without the studio frontend:

./scripts/generate.sh --prompt "A bonsai tree in a quiet ceramic studio, soft morning light"

MLX Swift (iOS / macOS)

Ternary Bonsai Image 4B runs natively on iPhone and iPad via MLX Swift. Bonsai Studio for iPhone is available on the App Store and ships ternary as the default variant.

Throughput (MLX / Apple Silicon)

Mac M4 Pro (48 GB unified memory), 4 denoising steps, fixed prompt and seed:

Resolution	s / step	s / image (mean ± std)	vs stock MFLUX FP16
512 × 512	1.44	5.78 ± 0.08 s	3.15×
1024 × 1024	6.06	24.26 ± 0.24 s	5.56×

iPhone 17 Pro Max (A19 Pro, 12 GB unified memory), MLX Swift, same methodology:

Resolution	s / step	s / image
128 × 128	0.68	2.7 s
256 × 256	1.00	4.0 s
512 × 512	2.35	9.4 s
1024 × 1024	8.50	34.0 s

Stock FP16 FLUX.2 Klein 4B does not fit within the iPhone 17 Pro Max's 12 GB unified memory budget; the Bonsai Image 4B models do.
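
As a consistency check on the throughput tables, the per-step times can be multiplied by the 4 denoising steps and compared with the reported per-image times. The implied FP16 baseline assumes the last Mac column is a speedup multiple of the per-image time; that figure is derived here, not reported in the source.

```python
STEPS = 4

# Mac M4 Pro: resolution -> (s/step, s/image, multiple vs stock MFLUX FP16)
mac = {512: (1.44, 5.78, 3.15), 1024: (6.06, 24.26, 5.56)}

# iPhone 17 Pro Max: resolution -> (s/step, s/image)
iphone = {128: (0.68, 2.7), 256: (1.00, 4.0), 512: (2.35, 9.4), 1024: (8.50, 34.0)}


def denoise_seconds(s_per_step: float, steps: int = STEPS) -> float:
    """Time spent in the denoising loop alone."""
    return s_per_step * steps


# Per-image time not explained by the denoising loop (Mac table only).
mac_overhead = {res: s_img - denoise_seconds(s_step) for res, (s_step, s_img, _) in mac.items()}

# Implied FP16 per-image time, assuming the column is a speedup multiple.
mac_implied_fp16 = {res: s_img * mult for res, (_, s_img, mult) in mac.items()}

for res, (s_step, s_img) in iphone.items():
    print(f"iPhone {res}x{res}: {denoise_seconds(s_step):.2f} s computed, {s_img} s reported")
for res in mac:
    print(f"Mac {res}x{res}: overhead {mac_overhead[res]:.2f} s, implied FP16 {mac_implied_fp16[res]:.1f} s")
```

On both devices the reported per-image times are almost entirely accounted for by the four denoising steps.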

Benchmarks

Evaluated with matched generation settings across the comparison set on H100. GenEval uses the official 512×512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024×1024, while smaller-backbone rows are evaluated at their native 512×512 setting. Higher is better for all three benchmarks.

Model	Transformer (GB)	GenEval	HPSv3	DPG-Bench
Bonsai Image · Ternary 4B	1.21	0.723	12.22	0.851
Bonsai Image · Binary 4B	0.93	0.671	11.15	0.822
FLUX.2 Klein 4B	7.75	0.819	12.84	0.853
FLUX.1-schnell	23.8	0.716	12.67	0.848
SDXL	5.14	0.300	10.05	0.740
PixArt-Σ XL 2	1.20	0.541	11.93	0.769
Stable Diffusion 1.5	1.72	0.396	4.20	0.601
BK-SDM-Small	0.98	0.297	3.05	0.559

The benchmark results show the intended quality-footprint trade-off. Ternary Bonsai Image 4B is the quality-oriented variant: at 1.21 GB, it nearly matches FLUX.2 Klein 4B on DPG-Bench (0.851 vs 0.853) and stays close on HPSv3 (12.22 vs 12.84), with a larger gap on GenEval (0.723 vs 0.819), while reducing the diffusion transformer footprint by 6.4×. The binary companion is the footprint-oriented variant, reducing the diffusion transformer below 1 GB (0.93 GB) while still scoring above SDXL and PixArt-Σ XL 2 on GenEval and DPG-Bench.
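
One way to read the table is as score retention relative to the FP16 FLUX.2 Klein 4B baseline. The sketch below uses only the table values; "retention" is an illustrative metric chosen here, not one defined by the source.

```python
# (transformer GB, GenEval, HPSv3, DPG-Bench), copied from the benchmark table.
rows = {
    "Bonsai Image Ternary 4B": (1.21, 0.723, 12.22, 0.851),
    "Bonsai Image Binary 4B": (0.93, 0.671, 11.15, 0.822),
    "FLUX.2 Klein 4B": (7.75, 0.819, 12.84, 0.853),
}
BENCHMARKS = ("GenEval", "HPSv3", "DPG-Bench")
baseline = rows["FLUX.2 Klein 4B"]


def retention(model: str) -> dict:
    """Percentage of the baseline score retained on each benchmark."""
    scores = rows[model][1:]
    return {name: 100 * s / b for name, s, b in zip(BENCHMARKS, scores, baseline[1:])}


def size_ratio(model: str) -> float:
    """How many times smaller the transformer is than the baseline."""
    return baseline[0] / rows[model][0]


for model in ("Bonsai Image Ternary 4B", "Bonsai Image Binary 4B"):
    kept = ", ".join(f"{k} {v:.1f}%" for k, v in retention(model).items())
    print(f"{model}: {size_ratio(model):.1f}x smaller, retains {kept}")
```

Retention is a coarse summary: the three benchmarks use different scales and resolutions, so the percentages are not directly comparable with one another.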

Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.

Use Cases

Local creative tooling: image generation directly on Mac, iPhone, and iPad
Private generation: prompts and generated assets can remain local
Rapid iteration: lower local latency and no remote queue for iterative creative workflows
Mobile deployment: image generation on devices with unified-memory, thermal, and connectivity constraints
Commodity-GPU serving: lower transformer footprint and reduced memory pressure through the companion CUDA deployment
Enterprise and controlled inference: local or private environments for data residency and compliance-sensitive workflows

Limitations

Ternary Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact ternary-weight deployment designed to deliver similar practical behavior at much smaller size.
Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
Current commodity inference stacks do not yet expose fully native ternary execution as a standard hardware path. This release uses practical MLX low-bit kernel paths on Apple Silicon and Gemlite low-bit GEMM on CUDA.
After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.
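
The tiled VAE decoding mentioned above can be illustrated with a toy decoder. Everything here is a simplified assumption: `toy_decode` is a hypothetical stand-in for the real VAE, the 8× spatial scale is assumed, the 128 px tile size is taken to be measured in output pixels, and tiles do not overlap. A real VAE has a spatial receptive field, so practical tiling typically needs overlap and blending at tile seams.

```python
import numpy as np


def toy_decode(latent: np.ndarray, scale: int) -> np.ndarray:
    """Hypothetical decoder: nearest-neighbour upsample of the first 3 channels."""
    return latent[:3].repeat(scale, axis=1).repeat(scale, axis=2)


def tiled_decode(latent: np.ndarray, decode_fn, scale: int = 8, tile_px: int = 128) -> np.ndarray:
    """Decode a (C, h, w) latent tile by tile and stitch the results.

    Only one tile is decoded at a time, so peak decoder activation memory
    scales with the tile size rather than the full image size.
    """
    tile = tile_px // scale  # tile edge measured in latent cells
    _, h, w = latent.shape
    out = np.zeros((3, h * scale, w * scale), dtype=latent.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            piece = decode_fn(latent[:, y:y + tile, x:x + tile], scale)
            oy, ox = y * scale, x * scale
            out[:, oy:oy + piece.shape[1], ox:ox + piece.shape[2]] = piece
    return out


rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 64, 64)).astype(np.float32)  # 32-channel latent
image = tiled_decode(latent, toy_decode)
print(image.shape)
```

For this pointwise toy decoder the tiled result equals a full decode exactly, which makes the stitching logic easy to verify in isolation.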

Citation

@techreport{bonsaiimage4b,
    title   = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
    author  = {Prism ML},
    year    = {2026},
    month   = {May},
    url     = {https://prismml.com}
}