mlx-community/DeepSeek-V4-Flash-2bit-DQ

Made possible by Lambda.ai ❤️

DeepSeek-V4-Flash-2bit-DQ uses a dynamic mixed-precision quantization policy. Most routed MoE expert weights are packed to 2-bit, while sensitive layers and projections are kept at higher precision (4-bit, 6-bit, or 8-bit). Because the routed experts account for most of the weights, memory use is much lower than that of the baseline 4-bit checkpoint.
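To make the idea of a mixed-precision policy concrete, here is a minimal sketch in plain Python. It is not the policy used to build this checkpoint: the rule set in `choose_bits`, the layer path names, and the parameter counts are all hypothetical, chosen only to show how keeping a small share of weights at higher precision affects the average bits per weight. It also ignores the per-group scale and bias overhead that real quantized weights carry.

```python
def choose_bits(path: str) -> int:
    """Hypothetical policy: pick a bit width from a weight's path name."""
    if "embed" in path or "lm_head" in path:
        return 8  # assumed: embeddings and output head kept at 8-bit
    if "shared_experts" in path:
        return 4  # assumed: shared experts are treated as sensitive
    if "experts" in path:
        return 2  # routed MoE experts are packed to 2-bit
    if "attn" in path:
        return 6  # assumed: attention projections kept at 6-bit
    return 4      # assumed default for everything else


def average_bits(param_counts: dict[str, int]) -> float:
    """Parameter-weighted average bit width under choose_bits."""
    total = sum(param_counts.values())
    weighted = sum(choose_bits(p) * n for p, n in param_counts.items())
    return weighted / total


# Made-up parameter counts, for illustration only.
example = {
    "layers.0.mlp.experts.up_proj": 900,
    "layers.0.attn.q_proj": 60,
    "layers.0.mlp.shared_experts.up_proj": 30,
    "embed_tokens": 10,
}
print(f"{average_bits(example):.2f} bits per weight")  # prints "2.36 bits per weight"
```

In this toy setup the routed experts make up 90% of the weights, so the average stays close to 2 bits even though every other tensor uses 4 bits or more.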

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V4-Flash-2bit-DQ")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Downloads last month: 45,447
Safetensors model size: 284B params
Tensor types: BF16, U32, F32, I64
MLX