# Qwen 3.6 27B — MXFP4 (MLX)
Open Compute Project MXFP4 quantization of Alibaba's hybrid linear/full attention dense 27B VL model, with the vision tower preserved.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27.32 B, dense (no MoE) |
| Architecture | qwen3_5 — 64 decoder layers: 48 Gated DeltaNet (linear-attn) + 16 full-attention with swish output gate |
| Quantization | OCP MXFP4 (E2M1 + shared E8M0 scale) at block 32 |
| Package size on disk | 14 GB across 3 shards |
| Bits per weight | 4.449 |
| vs BF16 source | 52 GB → 14 GB, 3.7× compression |
| Context (position embeddings) | 262,144 native; upstream card reports up to ~1 M with YaRN scaling |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), MXFP4 quantized |
| Chat format | Qwen im_start/im_end, unified thinking toggle |
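
Concretely, assuming the standard Qwen3-series template (an assumption to verify against this bundle's tokenizer config), a one-turn exchange renders as follows; the `<think>` block appears only when thinking is enabled:

```
<|im_start|>user
Translate to French: Hello, how are you?<|im_end|>
<|im_start|>assistant
<think>
... optional reasoning, emitted only with thinking enabled ...
</think>
Bonjour, comment allez-vous ?<|im_end|>
```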
## Quantization details
| Category | Bits | Group | Notes |
|---|---|---|---|
| Dense FFN (`mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`) | 4 (MXFP4) | 32 | Bulk of parameters |
| Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | 4 (MXFP4) | 32 | `q_proj` is fused with a swish output gate (output split 50/50 queries/gate) |
| Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`) | 4 (MXFP4) | 32 | |
| Embedding (`embed_tokens`), `lm_head` | 4 (MXFP4) | 32 | |
| Vision tower | 4 (MXFP4) | 32 | |
| Norms, `A_log`, `dt_bias`, `conv1d` | bf16 passthrough | — | Kept unquantized |
MXFP4 is the open OCP Microscaling FP4 spec, distinct from NVIDIA's NVFP4. MLX exposes both as separate `--q-mode` values; this release uses `mxfp4`. There is no activation-aware calibration (no AWQ): the quantization is purely weight-driven, so vision and text inputs are treated with equal fidelity.
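
The arithmetic is simple enough to sketch. A minimal NumPy illustration of one 32-element MX block (E2M1 elements, shared E8M0 power-of-two scale, round-to-nearest); the real MLX kernels pack 4-bit codes and may differ in rounding details:

```python
import numpy as np

# The eight non-negative E2M1 magnitudes; a sign bit completes the 16 codes.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one 32-element block: shared E8M0 (power-of-two) scale
    plus per-element E2M1 values, round-to-nearest with saturation at 6."""
    assert block.size == 32
    amax = np.abs(block).max()
    # E2M1's largest magnitude is 6 = 1.5 * 2^2, so bias the shared
    # exponent by the element format's emax of 2.
    shared_exp = 0 if amax == 0 else int(np.floor(np.log2(amax))) - 2
    scaled = block / 2.0 ** shared_exp
    sign = np.sign(scaled)
    # Snap each scaled magnitude to the nearest E2M1 grid point.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return shared_exp, sign * E2M1_GRID[idx]

def mxfp4_dequantize(shared_exp, codes):
    return codes * 2.0 ** shared_exp
```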
## Architecture notes — what's new vs Qwen 3 / 3.5
- Hybrid attention stack: 48 of 64 layers use Gated DeltaNet, a linear-attention / delta-rule hybrid with a grouped `conv1d` input path and per-head `A_log`/`dt_bias` state, so memory is constant in sequence length. The other 16 layers (one every 4, given by `full_attention_interval: 4`) use full softmax attention with `attn_output_gate: true`: `q_proj` produces a fused (queries, gate) tensor, and the attention output is multiplied by `sigmoid(gate)` before `o_proj` (see the sketch after this list).
- Partial rotary embeddings: only the first 25% of the head dim rotates (`partial_rotary_factor: 0.25`), with `rope_theta = 1e7`. Position metadata for mixed text/image/video (`mrope_section`, `mrope_interleaved: true`) is preserved in `config.json`.
- Dense FFN: no MoE. Each layer has `gate_proj`/`up_proj` (5120 → 17408) and `down_proj` (17408 → 5120) with SwiGLU activation.
- Vision tower: `qwen3_vl` ViT, 27 layers, hidden 1152, patch 16, temporal patch 2. Produces video token sequences via a 3D conv patch embed (pairs of frames merge into one temporal patch).
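
Two shape-level sketches of the per-layer compute, assuming a single attention head and omitting RoPE, the KV cache, and normalization; `w_qg` stands in for the fused `q_proj` weight:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_full_attention(x, w_qg, w_k, w_v, w_o):
    """attn_output_gate layer: the fused q_proj output splits 50/50
    into queries and a gate; attention output is gated before o_proj."""
    q, gate = np.split(x @ w_qg, 2, axis=-1)        # x: (T, d); q, gate: (T, d)
    k, v = x @ w_k, x @ w_v
    attn = softmax((q @ k.T) / np.sqrt(q.shape[-1])) @ v
    return (attn * sigmoid(gate)) @ w_o             # sigmoid(gate) before o_proj

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Dense SwiGLU FFN, 5120 -> 17408 -> 5120 in this model."""
    g = x @ w_gate                                  # (T, 17408)
    return (g * sigmoid(g) * (x @ w_up)) @ w_down   # SiLU(gate) * up, then down
```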
## Usage
Load in Osaurus on Apple Silicon (macOS) — single-click deploy, local chat + vision, no Python setup. The bundle also loads in any Apple Silicon MLX runtime that supports qwen3_5 VL bundles at mxfp4 quantization.
Reasoning on/off, image inference, and video inference are all verified on this quant (see table below).
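
For scripted use outside Osaurus, a hedged sketch against mlx-vlm's published `load`/`generate` flow; the repo id matches this release, but mlx-vlm support for `qwen3_5` VL bundles, the test image path, and the exact generation kwargs are assumptions to verify:

```python
# Assumes an mlx-vlm build that recognizes qwen3_5 VL bundles.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "OsaurusAI/Qwen3.6-27B-MXFP4"
model, processor = load(model_path)
config = load_config(model_path)

images = ["red_224.png"]  # hypothetical local test image
prompt = apply_chat_template(processor, config, "What color is this image?",
                             num_images=len(images))
print(generate(model, processor, prompt, images, max_tokens=128, verbose=False))
```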
## Verified modalities
| Test | Result |
|---|---|
| Chat template (with + without thinking) | ✓ coherent |
| Text: "Translate to French: Hello, how are you?" → "Bonjour, comment allez-vous ?" | ✓ |
| Text: `def fibonacci(n):` → correct recursive continuation | ✓ |
| Math: 2 + 2 with `enable_thinking=False` → "2 + 2 = 4" direct | ✓ |
| VL image: solid red/green/blue/yellow 224×224 → correct color ID | ✓ 4/4 |
| VL video: 4-frame RGBY sequence → structurally coherent description | ✓ |
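
For reference, the continuation the `fibonacci` probe counts as correct is the textbook recursion, reproduced here as the expected answer rather than captured model output:

```python
def fibonacci(n):
    # Textbook recursive definition the probe expects the model to produce.
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```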
## MMLU-200 (10 subjects × 20 questions, reasoning OFF)
Both quants were evaluated on the same 200-question slice of MMLU with `enable_thinking=False` (direct answer, no `<think>` preamble): same prompts, same greedy decode, same answer extraction.
| Subject | MXFP4 | JANG_4M | Δ (JANG − MXFP4) |
|---|---|---|---|
| abstract_algebra | 12/20 (60.0%) | 15/20 (75.0%) | +3 |
| anatomy | 18/20 (90.0%) | 16/20 (80.0%) | -2 |
| astronomy | 20/20 (100.0%) | 19/20 (95.0%) | -1 |
| college_computer_science | 16/20 (80.0%) | 16/20 (80.0%) | 0 |
| college_physics | 15/20 (75.0%) | 15/20 (75.0%) | 0 |
| high_school_biology | 19/20 (95.0%) | 19/20 (95.0%) | 0 |
| high_school_chemistry | 16/20 (80.0%) | 15/20 (75.0%) | -1 |
| high_school_mathematics | 12/20 (60.0%) | 14/20 (70.0%) | +2 |
| logical_fallacies | 20/20 (100.0%) | 19/20 (95.0%) | -1 |
| world_religions | 19/20 (95.0%) | 17/20 (85.0%) | -2 |
| Total | 167/200 (83.5%) | 165/200 (82.5%) | −2 (−1.0 pp) |
Both quants are strong baselines on reasoning-OFF MMLU, with MXFP4 edging ahead by 1 pp overall. JANG_4M wins the harder math-heavy subjects (abstract_algebra +3, high_school_mathematics +2), plausibly because its 8-bit full-attention projections carry more signal on multi-step symbolic chains. MXFP4 wins the rote-recall subjects (anatomy, world_religions) by ~2 each; the remaining factual and scientific subjects are ties or near-ties.
Reasoning ON: not yet measured. Qwen 3.6 is a reasoning-optional model: with `enable_thinking=True` it generates a `<think>…</think>` block before answering, which typically lifts MMLU significantly. Reasoning-ON benchmarks for both quants are planned as a follow-up (the toggle itself is sketched below).
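
The toggle rides on the chat template. A minimal sketch using the `enable_thinking` kwarg that Qwen3-series tokenizers accept in `apply_chat_template`; whether this bundle's tokenizer config ships the same kwarg is an assumption to verify:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("OsaurusAI/Qwen3.6-27B-MXFP4")
messages = [{"role": "user", "content": "What is 2 + 2?"}]

# Reasoning OFF (the mode benchmarked above): no <think> preamble.
direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Reasoning ON: the template primes the model to open a <think> block.
thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
```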
## Hardware notes
14 GB weights on disk; once loaded, expect ~14–18 GB resident plus KV cache. Full-attention KV grows with sequence length; linear-attention layers contribute a bounded per-layer SSM state (independent of context).
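
A back-of-envelope way to size the cache before picking a context length; `n_kv_heads` and `head_dim` below are illustrative assumptions, not values read from this bundle's `config.json`:

```python
# KV-cache estimate for the 16 full-attention layers only; the 48
# linear-attention layers add a fixed per-layer state on top of this,
# independent of context length.
full_attn_layers = 16          # from the architecture notes above
n_kv_heads, head_dim = 8, 128  # ASSUMED for illustration; check config.json
bytes_per_elem = 2             # fp16/bf16 cache

def kv_cache_bytes(seq_len: int) -> int:
    # K and V per token, per full-attention layer.
    return 2 * full_attn_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

print(kv_cache_bytes(32_768) / 2**30)  # ~2.0 GiB at 32 k tokens
```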
| Mac | Works? | Notes |
|---|---|---|
| 16 GB unified | ⚠️ | Text-only OK with a short context; image inference leaves little headroom |
| 24 GB unified | ✅ | Text + image at short context; leave headroom for KV cache at ≤ 32 k tokens |
| 32 GB+ unified | ✅ | Comfortable: full context + VL + video |
## License
Apache 2.0 — inherits from the base model.
Packaged on Apple Silicon by Osaurus.
© 2026 Osaurus AI — osaurus.ai