# Qwen 3.6 27B — MXFP4 (MLX)
Open Compute Project MXFP4 quantization of Alibaba's hybrid linear/full attention dense 27B VL model, with the vision tower preserved.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27.32 B, dense (no MoE) |
| Architecture | qwen3_5 — 64 decoder layers: 48 Gated DeltaNet (linear-attn) + 16 full-attention with swish output gate |
| Quantization | OCP MXFP4 (E2M1 + shared E8M0 scale) at block 32 |
| Package size on disk | 14 GB across 3 shards |
| Bits per weight | 4.449 |
| vs BF16 source | 52 GB → 14 GB, 3.7× compression |
| Context (position embeddings) | 262,144 native; upstream card reports up to ~1 M with YaRN scaling |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), MXFP4 quantized |
| Chat format | Qwen im_start/im_end, unified thinking toggle |
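
Concretely, assuming the standard Qwen3-series template (an assumption to verify against this bundle's tokenizer config), a one-turn exchange renders as follows; the `<think>` block appears only when thinking is enabled:

```
<|im_start|>user
Translate to French: Hello, how are you?<|im_end|>
<|im_start|>assistant
<think>
... optional reasoning, emitted only with thinking enabled ...
</think>
Bonjour, comment allez-vous ?<|im_end|>
```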
## Quantization details
| Category | Bits | Group | Notes |
|---|---|---|---|
| Dense FFN (`mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`) | 4 (MXFP4) | 32 | Bulk of parameters |
| Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | 4 (MXFP4) | 32 | `q_proj` is fused with a swish output gate (output split 50/50 queries/gate) |
| Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`) | 4 (MXFP4) | 32 | |
| Embedding (`embed_tokens`), `lm_head` | 4 (MXFP4) | 32 | |
| Vision tower | 4 (MXFP4) | 32 | |
| Norms, `A_log`, `dt_bias`, `conv1d` | bf16 passthrough | — | Kept unquantized |
MXFP4 is the open OCP Microscaling FP4 spec, distinct from NVIDIA's NVFP4. MLX exposes both as separate `--q-mode` values; this release uses `mxfp4`. There is no activation-aware calibration (no AWQ): the quantization is purely weight-driven, so vision and text inputs are treated with equal fidelity.
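
The arithmetic is simple enough to sketch. A minimal NumPy illustration of one 32-element MX block (E2M1 elements, shared E8M0 power-of-two scale, round-to-nearest); the real MLX kernels pack 4-bit codes and may differ in rounding details:

```python
import numpy as np

# The eight non-negative E2M1 magnitudes; a sign bit completes the 16 codes.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one 32-element block: shared E8M0 (power-of-two) scale
    plus per-element E2M1 values, round-to-nearest with saturation at 6."""
    assert block.size == 32
    amax = np.abs(block).max()
    # E2M1's largest magnitude is 6 = 1.5 * 2^2, so bias the shared
    # exponent by the element format's emax of 2.
    shared_exp = 0 if amax == 0 else int(np.floor(np.log2(amax))) - 2
    scaled = block / 2.0 ** shared_exp
    sign = np.sign(scaled)
    # Snap each scaled magnitude to the nearest E2M1 grid point.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return shared_exp, sign * E2M1_GRID[idx]

def mxfp4_dequantize(shared_exp, codes):
    return codes * 2.0 ** shared_exp
```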
## Architecture notes — what's new vs Qwen 3 / 3.5
- Hybrid attention stack: 48 of 64 layers use Gated DeltaNet, a linear-attention / delta-rule hybrid with a grouped `conv1d` input path and per-head `A_log`/`dt_bias` state, so memory is constant in sequence length. The other 16 layers (one every 4, given by `full_attention_interval: 4`) use full softmax attention with `attn_output_gate: true`: `q_proj` produces a fused (queries, gate) tensor, and the attention output is multiplied by `sigmoid(gate)` before `o_proj` (see the sketch after this list).
- Partial rotary embeddings: only the first 25% of the head dim rotates (`partial_rotary_factor: 0.25`), with `rope_theta = 1e7`. Position metadata for mixed text/image/video (`mrope_section`, `mrope_interleaved: true`) is preserved in `config.json`.
- Dense FFN: no MoE. Each layer has `gate_proj`/`up_proj` (5120 → 17408) and `down_proj` (17408 → 5120) with SwiGLU activation.
- Vision tower: `qwen3_vl` ViT, 27 layers, hidden 1152, patch 16, temporal patch 2. Produces video token sequences via a 3D conv patch embed (pairs of frames merge into one temporal patch).
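
Two shape-level sketches of the per-layer compute, assuming a single attention head and omitting RoPE, the KV cache, and normalization; `w_qg` stands in for the fused `q_proj` weight:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_full_attention(x, w_qg, w_k, w_v, w_o):
    """attn_output_gate layer: the fused q_proj output splits 50/50
    into queries and a gate; attention output is gated before o_proj."""
    q, gate = np.split(x @ w_qg, 2, axis=-1)        # x: (T, d); q, gate: (T, d)
    k, v = x @ w_k, x @ w_v
    attn = softmax((q @ k.T) / np.sqrt(q.shape[-1])) @ v
    return (attn * sigmoid(gate)) @ w_o             # sigmoid(gate) before o_proj

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Dense SwiGLU FFN, 5120 -> 17408 -> 5120 in this model."""
    g = x @ w_gate                                  # (T, 17408)
    return (g * sigmoid(g) * (x @ w_up)) @ w_down   # SiLU(gate) * up, then down
```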
## Usage
Load in Osaurus on Apple Silicon (macOS) — single-click deploy, local chat + vision, no Python setup. The bundle also loads in any Apple Silicon MLX runtime that supports qwen3_5 VL bundles at mxfp4 quantization.
Reasoning on/off, image inference, and video inference are all verified on this quant (see table below).
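
For scripted use outside Osaurus, a hedged sketch against mlx-vlm's published `load`/`generate` flow; the repo id matches this release, but mlx-vlm support for `qwen3_5` VL bundles, the test image path, and the exact generation kwargs are assumptions to verify:

```python
# Assumes an mlx-vlm build that recognizes qwen3_5 VL bundles.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "OsaurusAI/Qwen3.6-27B-MXFP4"
model, processor = load(model_path)
config = load_config(model_path)

images = ["red_224.png"]  # hypothetical local test image
prompt = apply_chat_template(processor, config, "What color is this image?",
                             num_images=len(images))
print(generate(model, processor, prompt, images, max_tokens=128, verbose=False))
```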
## Verified modalities
| Test | Result |
|---|---|
| Chat template (with + without thinking) | ✓ coherent |
| Text: "Translate to French: Hello, how are you?" → "Bonjour, comment allez-vous ?" | ✓ |
| Text: `def fibonacci(n):` → correct recursive continuation | ✓ |
| Math: 2 + 2 with `enable_thinking=False` → "2 + 2 = 4" direct | ✓ |
| VL image: solid red/green/blue/yellow 224×224 → correct color ID | ✓ 4/4 |
| VL video: 4-frame RGBY sequence → structurally coherent description | ✓ |
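
For reference, the continuation the `fibonacci` probe counts as correct is the textbook recursion, reproduced here as the expected answer rather than captured model output:

```python
def fibonacci(n):
    # Textbook recursive definition the probe expects the model to produce.
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```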
## MMLU-200 (10 subjects × 20 questions, reasoning OFF)
Both quants were evaluated on the same 200-question slice of MMLU with `enable_thinking=False` (direct answer, no `<think>` preamble): same prompts, same greedy decode, same answer extraction.
| Subject | MXFP4 | JANG_4M | Δ (JANG − MXFP4) |
|---|---|---|---|
| abstract_algebra | 12/20 (60.0%) | 15/20 (75.0%) | +3 |
| anatomy | 18/20 (90.0%) | 16/20 (80.0%) | -2 |
| astronomy | 20/20 (100.0%) | 19/20 (95.0%) | -1 |
| college_computer_science | 16/20 (80.0%) | 16/20 (80.0%) | 0 |
| college_physics | 15/20 (75.0%) | 15/20 (75.0%) | 0 |
| high_school_biology | 19/20 (95.0%) | 19/20 (95.0%) | 0 |
| high_school_chemistry | 16/20 (80.0%) | 15/20 (75.0%) | -1 |
| high_school_mathematics | 12/20 (60.0%) | 14/20 (70.0%) | +2 |
| logical_fallacies | 20/20 (100.0%) | 19/20 (95.0%) | -1 |
| world_religions | 19/20 (95.0%) | 17/20 (85.0%) | -2 |
| Total | 167/200 (83.5%) | 165/200 (82.5%) | −2 (−1.0 pp) |
Both quants are strong baselines on reasoning-OFF MMLU, with MXFP4 edging ahead by 1 pp overall. JANG_4M wins the harder math-heavy subjects (abstract_algebra +3, high_school_mathematics +2), plausibly because its 8-bit full-attention projections carry more signal on multi-step symbolic chains. MXFP4 wins the rote-recall subjects (anatomy, world_religions) by ~2 each; the remaining factual and scientific subjects are ties or near-ties.
Reasoning ON: not yet measured. Qwen 3.6 is a reasoning-optional model: with `enable_thinking=True` it generates a `<think>…</think>` block before answering, which typically lifts MMLU significantly. Reasoning-ON benchmarks for both quants are planned as a follow-up (the toggle itself is sketched below).
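
The toggle rides on the chat template. A minimal sketch using the `enable_thinking` kwarg that Qwen3-series tokenizers accept in `apply_chat_template`; whether this bundle's tokenizer config ships the same kwarg is an assumption to verify:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("OsaurusAI/Qwen3.6-27B-MXFP4")
messages = [{"role": "user", "content": "What is 2 + 2?"}]

# Reasoning OFF (the mode benchmarked above): no <think> preamble.
direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Reasoning ON: the template primes the model to open a <think> block.
thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
```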
## Hardware notes
14 GB weights on disk; once loaded, expect ~14–18 GB resident plus KV cache. Full-attention KV grows with sequence length; linear-attention layers contribute a bounded per-layer SSM state (independent of context).
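
A back-of-envelope way to size the cache before picking a context length; `n_kv_heads` and `head_dim` below are illustrative assumptions, not values read from this bundle's `config.json`:

```python
# KV-cache estimate for the 16 full-attention layers only; the 48
# linear-attention layers add a fixed per-layer state on top of this,
# independent of context length.
full_attn_layers = 16          # from the architecture notes above
n_kv_heads, head_dim = 8, 128  # ASSUMED for illustration; check config.json
bytes_per_elem = 2             # fp16/bf16 cache

def kv_cache_bytes(seq_len: int) -> int:
    # K and V per token, per full-attention layer.
    return 2 * full_attn_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

print(kv_cache_bytes(32_768) / 2**30)  # ~2.0 GiB at 32 k tokens
```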
| Mac | Works? | Notes |
|---|---|---|
| 16 GB unified | ⚠️ | Text-only OK with a short context; image inference leaves little headroom |
| 24 GB unified | ✅ | Text + image at short context; leave headroom for KV cache at ≤ 32 k tokens |
| 32 GB+ unified | ✅ | Comfortable: full context + VL + video |
## License
Apache 2.0 — inherits from the base model.
Packaged on Apple Silicon by Osaurus.
© 2026 Osaurus AI — osaurus.ai