Osaurus AI

Qwen 3.6 27B — MXFP4 (MLX)

Open Compute Project (OCP) MXFP4 quantization of Alibaba's dense 27B VL model with a hybrid linear/full-attention stack, with the vision tower preserved.

Website: osaurus.ai


Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27.32 B, dense (no MoE) |
| Architecture | qwen3_5 — 64 decoder layers: 48 Gated DeltaNet (linear attention) + 16 full attention with swish output gate |
| Quantization | OCP MXFP4 (E2M1 + shared E8M0 scale) at block 32 |
| Package size on disk | 14 GB across 3 shards |
| Bits per weight | 4.449 |
| vs BF16 source | 52 GB → 14 GB, 3.7× compression |
| Context (position embeddings) | 262,144 native; upstream card reports up to ~1 M with YaRN scaling |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), MXFP4 quantized |
| Chat format | Qwen im_start/im_end, unified thinking toggle |
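
A quick back-of-the-envelope check on the size figures above (a sketch; only the 4.449 bits/weight and 27.32 B parameter counts come from this card):

```python
# Ideal MXFP4 cost per 32-weight block: 32 four-bit codes + one shared 8-bit E8M0 scale.
mxfp4_bpw = (32 * 4 + 8) / 32        # = 4.25 bits/weight for purely-MXFP4 tensors

# The reported 4.449 bits/weight includes the bf16 passthrough tensors (norms etc.).
params = 27.32e9
package_gib = params * 4.449 / 8 / 2**30
print(f"{package_gib:.1f} GiB")      # ≈ 14.2 GiB, consistent with the 14 GB on-disk figure
```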

Quantization details

| Category | Bits | Group | Notes |
|---|---|---|---|
| Dense FFN (mlp.gate_proj, mlp.up_proj, mlp.down_proj) | 4 (MXFP4) | 32 | Bulk of parameters |
| Full-attention projections (q_proj, k_proj, v_proj, o_proj) | 4 (MXFP4) | 32 | q_proj is fused with the swish output gate (output split 50/50 queries/gate) |
| Linear-attention projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj) | 4 (MXFP4) | 32 | |
| Embedding (embed_tokens), lm_head | 4 (MXFP4) | 32 | |
| Vision tower | 4 (MXFP4) | 32 | |
| Norms, A_log, dt_bias, conv1d | bf16 passthrough | — | Kept un-quantized |

MXFP4 is the open OCP Microscaling FP4 spec, distinct from NVIDIA's NVFP4. MLX exposes both as separate --q-mode values; this release uses mxfp4. There is no activation-aware calibration (no AWQ): quantization is purely weight-driven, so the vision and text paths are quantized with equal fidelity.
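
To make the format concrete, here is a minimal NumPy sketch of MXFP4 dequantization for one block. The unpacked-codes layout is an illustrative assumption (real MLX kernels pack two 4-bit codes per byte), but the E2M1 value table and E8M0 scale follow the OCP spec:

```python
import numpy as np

# The 8 non-negative FP4 E2M1 magnitudes (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_mxfp4_block(codes: np.ndarray, scale_e8m0: int) -> np.ndarray:
    """Dequantize one 32-element MXFP4 block.

    codes      -- uint8 array of 32 unpacked 4-bit codes
                  (bit 3 = sign, bits 0-2 = magnitude index)
    scale_e8m0 -- the block's shared 8-bit exponent scale, bias 127
    """
    sign = np.where(codes & 0b1000, -1.0, 1.0)
    magnitude = E2M1_MAGNITUDES[codes & 0b0111]
    return sign * magnitude * 2.0 ** (int(scale_e8m0) - 127)
```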


Architecture notes — what's new vs Qwen 3 / 3.5

  • Hybrid attention stack: 48 of 64 layers use Gated DeltaNet, a linear-attention / delta-rule hybrid with a grouped conv1d input path and per-head A_log / dt_bias state, giving constant memory in sequence length. The other 16 layers (one in every 4, per full_attention_interval: 4) use full softmax attention with attn_output_gate: true, where q_proj produces a fused (queries, gate) tensor and the attention output is multiplied by sigmoid(gate) before o_proj (see the sketch after this list).
  • Partial rotary embeddings: only the first 25% of head dim rotates (partial_rotary_factor: 0.25), rope_theta = 1e7. Position metadata for mixed text/image/video (mrope_section, mrope_interleaved: true) is preserved in config.json.
  • Dense FFN: no MoE. Each layer has gate_proj/up_proj (5120 → 17408) + down_proj (17408 → 5120) with SwiGLU activation.
  • Vision tower: qwen3_vl ViT, 27 layers, hidden 1152, patch 16, temporal_patch 2. Produces video token sequences via 3D conv patch-embed (pairs of frames merge into one temporal patch).
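
As a reference for the output-gate wiring in the full-attention layers, a single-head NumPy sketch (shapes, weight names, and the omitted causal mask are illustrative assumptions, not the MLX implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(x, w_qg, w_k, w_v, w_o, head_dim):
    """Single-head sketch of a full-attention layer with attn_output_gate: true."""
    qg = x @ w_qg                        # fused q_proj output: (seq, 2 * head_dim)
    q, gate = np.split(qg, 2, axis=-1)   # 50/50 queries/gate split, per the quant table

    k, v = x @ w_k, x @ w_v
    # (Partial RoPE would rotate only the first 25% of head_dim here,
    #  per partial_rotary_factor: 0.25 -- omitted for brevity.)

    attn = softmax(q @ k.T / np.sqrt(head_dim)) @ v   # causal mask omitted

    # Swish-style output gate: multiply by sigmoid(gate) before o_proj.
    return (attn * (1.0 / (1.0 + np.exp(-gate)))) @ w_o
```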

Usage

Load in Osaurus on Apple Silicon (macOS) — single-click deploy, local chat + vision, no Python setup. The bundle also loads in any Apple Silicon MLX runtime that supports qwen3_5 VL bundles at mxfp4 quantization.
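
Outside Osaurus, a minimal text-only sketch with mlx-lm (assumes a build recent enough to ship this architecture; the repo id is illustrative, substitute your local path if needed):

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Qwen3.6-27B-MXFP4")

messages = [{"role": "user", "content": "Translate to French: Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # the unified thinking toggle from the Qwen chat template
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```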

Reasoning on/off, image inference, and video inference are all verified on this quant (see table below).


Verified modalities

| Test | Result |
|---|---|
| Chat template (with + without thinking) | ✓ coherent |
| Text: "Translate to French: Hello, how are you?" | → "Bonjour, comment allez-vous ?" |
| Text: def fibonacci(n): | → correct recursive continuation |
| Math: 2 + 2 with enable_thinking=False | → "2 + 2 = 4", direct |
| VL image: solid red/green/blue/yellow 224×224 | ✓ correct color ID, 4/4 |
| VL video: 4-frame RGBY sequence | → structurally coherent description |


MMLU-200 (10 subjects × 20 questions, reasoning OFF)

Both quants were evaluated on the same 200-question slice of MMLU with enable_thinking=False (direct answer, no <think> preamble). Same prompts, same greedy decode, same extraction.
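
The harness itself is not published; a sketch of the per-question loop under the stated settings (generate is mlx-lm's, and both the prompt wording and the single-letter extraction are assumptions) might look like:

```python
import re
from mlx_lm import generate  # model/tokenizer loaded as in the usage example above

def ask_mmlu(model, tokenizer, question: str, choices: list[str]) -> str:
    """One MMLU question, reasoning OFF, greedy under mlx-lm's default temperature-0 sampler."""
    body = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    )
    messages = [{"role": "user", "content": body + "\nAnswer with a single letter."}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=False,   # direct answer, no <think> preamble
    )
    out = generate(model, tokenizer, prompt=prompt, max_tokens=8)
    match = re.search(r"[ABCD]", out)  # same extraction for both quants
    return match.group(0) if match else ""
```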

| Subject | MXFP4 | JANG_4M | Δ (JANG − MXFP4, questions) |
|---|---|---|---|
| abstract_algebra | 12/20 (60.0%) | 15/20 (75.0%) | +3 |
| anatomy | 18/20 (90.0%) | 16/20 (80.0%) | −2 |
| astronomy | 20/20 (100.0%) | 19/20 (95.0%) | −1 |
| college_computer_science | 16/20 (80.0%) | 16/20 (80.0%) | 0 |
| college_physics | 15/20 (75.0%) | 15/20 (75.0%) | 0 |
| high_school_biology | 19/20 (95.0%) | 19/20 (95.0%) | 0 |
| high_school_chemistry | 16/20 (80.0%) | 15/20 (75.0%) | −1 |
| high_school_mathematics | 12/20 (60.0%) | 14/20 (70.0%) | +2 |
| logical_fallacies | 20/20 (100.0%) | 19/20 (95.0%) | −1 |
| world_religions | 19/20 (95.0%) | 17/20 (85.0%) | −2 |
| Total | 167/200 (83.5%) | 165/200 (82.5%) | −2 (−1.0 pp) |

Both quants are strong baselines on reasoning-OFF MMLU. MXFP4 edges ahead by 1 pp overall. JANG_4M wins the harder math-heavy subjects (abstract_algebra +3, high_school_mathematics +2), plausibly because its 8-bit full-attention projections carry more signal through multi-step symbolic chains. MXFP4 wins the rote-recall subjects (anatomy, world_religions) by 2 questions each; the remaining factual/scientific subjects are at or near ties.

Reasoning ON: not yet measured. Qwen 3.6 is a reasoning-optional model — with enable_thinking=True the model generates a <think>…</think> block before answering, which typically lifts MMLU significantly. Reasoning-ON benchmarks for both quants are planned as a follow-up.


Hardware notes

14 GB weights on disk; once loaded, expect ~14–18 GB resident plus KV cache. Full-attention KV grows with sequence length; linear-attention layers contribute a bounded per-layer SSM state (independent of context).
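
For budgeting, a rough estimate of the full-attention KV cache (a sketch: n_full_layers = 16 comes from this card, but n_kv_heads and head_dim are placeholders; read the real values from config.json):

```python
def full_attn_kv_bytes(seq_len: int,
                       n_full_layers: int = 16,   # from this card
                       n_kv_heads: int = 8,       # placeholder -- check config.json
                       head_dim: int = 128,       # placeholder -- check config.json
                       bytes_per_elem: int = 2):  # bf16 cache
    """K and V, per full-attention layer, per KV head, per position."""
    return 2 * n_full_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# With the placeholder shapes, 32k tokens costs ~2 GiB -- hence the
# "leave headroom at <= 32 k tokens" note for 24 GB Macs in the table below.
print(full_attn_kv_bytes(32_768) / 2**30)
```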

| Mac | Works? | Notes |
|---|---|---|
| 16 GB unified | ⚠️ | Text-only OK with tight context; image inference will be tight |
| 24 GB unified | ✅ text + image, short context | Leave headroom for KV cache at ≤ 32 k tokens |
| 32 GB+ unified | ✅ comfortable | Full context + VL + video |

License

Apache 2.0 — inherits from the base model.


Packaged on Apple Silicon by Osaurus.
© 2026 Osaurus AI — osaurus.ai
