GLM-4.7-Flash-heretic NVFP4

NVFP4 post-training quantization of Olafangensan/GLM-4.7-Flash-heretic for long-context multi-GPU inference with vLLM.

The Hugging Face UI "Model size" badge is auto-inferred from the packed NVFP4 safetensors and may show an incorrect parameter count for this repo. Use the architecture statement below as the source of truth: GLM-4.7-Flash (30B-A3B MoE).

This release uses NVFP4 (4-bit) quantization, not 8-bit quantization.

Model Size

  • Base architecture: GLM-4.7-Flash (30B-A3B MoE)
  • Parameter count for this release: unchanged from the base model architecture
  • Note: the ~17.8GB model.safetensors file is the packed quantized checkpoint; its size does not mean the model has 18B parameters.
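For intuition, the arithmetic behind the checkpoint size can be sketched as follows. The 4-bit packing comes from this release; the 15% overhead factor is an illustrative assumption, not a figure measured from this repo:

```python
# Back-of-envelope NVFP4 checkpoint size (illustrative only; real checkpoints
# also carry per-group scales and some tensors kept at higher precision, so
# the 15% overhead factor here is an assumption, not measured from this repo).
def nvfp4_checkpoint_gb(params_billion: float, bits: float = 4.0,
                        overhead: float = 0.15) -> float:
    raw_gb = params_billion * 1e9 * bits / 8 / 1e9  # packed weight bytes
    return raw_gb * (1 + overhead)

print(f"{nvfp4_checkpoint_gb(30):.1f} GB for a 30B model")
```

A 30B model at 4 bits/weight is ~15GB of raw packed weights; scales and higher-precision tensors account for the rest, which is why ~17.8GB is consistent with 30B parameters, not 18B.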

Runtime Compatibility

Known issue on some stock vLLM 0.16.x + vllm-node setups:

  • assistant content may be null
  • output may be dumped into reasoning fields with broken formatting
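Until you are on the fixed image below, a client-side guard for the null-content symptom can look like this sketch. The `reasoning_content` field name follows vLLM's reasoning-parser output convention; adjust it if your client library exposes the field differently:

```python
# Client-side guard for the null-content symptom described above.
# "reasoning_content" matches vLLM's reasoning-parser output field name;
# this is a workaround sketch, not a substitute for the fixed image.
def extract_text(message: dict) -> str:
    content = message.get("content")
    if content:  # healthy parse: final answer lands in content
        return content
    # broken parse: answer dumped into the reasoning field
    return message.get("reasoning_content") or ""

msg = {"content": None, "reasoning_content": "hello from reasoning"}
print(extract_text(msg))
```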

Required Docker image (build this first)

This model requires a custom vLLM image. Do not use stock vllm/vllm-openai as-is.

1) Create Dockerfile

Save as Dockerfile.vllm-glm4lite:

FROM vllm/vllm-openai:latest

ARG TRANSFORMERS_COMMIT=393b4b3d28e29b4b05b19b4b7f3242a7fc893637

RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir -U "huggingface_hub==1.4.0"
RUN pip install --no-cache-dir -U --no-deps "git+https://github.com/huggingface/transformers.git@${TRANSFORMERS_COMMIT}"

2) Build image

docker build -t vllm-glm:parser-only-r2 -f Dockerfile.vllm-glm4lite .

3) Serve model with that image

docker run --rm --name glm47 \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v /path/to/hf_cache:/hf_cache \
  -v /path/to/models:/models \
  -e HF_HOME=/hf_cache \
  -e HUGGINGFACE_HUB_CACHE=/hf_cache \
  -e HF_HUB_CACHE=/hf_cache \
  -e TRANSFORMERS_CACHE=/hf_cache \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  vllm-glm:parser-only-r2 \
  --model /models/GLM-4.7-Flash-heretic-NVFP4 \
  --served-model-name glm-4.7-flash \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --generation-config vllm \
  --override-generation-config '{"temperature":0.7,"top_p":1.0}'

4) Verify

curl http://127.0.0.1:8000/v1/models

Expected: model list includes glm-4.7-flash.
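The same check can be scripted. This sketch assumes the standard OpenAI-compatible /v1/models response shape ({"data": [{"id": ...}, ...]}):

```python
# Scripted version of the verification step; assumes the standard
# OpenAI-compatible /v1/models response shape: {"data": [{"id": ...}, ...]}.
import json
import urllib.request

def has_model(payload: dict, name: str) -> bool:
    return any(m.get("id") == name for m in payload.get("data", []))

def server_serves(name: str = "glm-4.7-flash",
                  base_url: str = "http://127.0.0.1:8000") -> bool:
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return has_model(json.load(resp), name)

# server_serves() should return True once the container from step 3 is up.
```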

Performance Notes (RTX 3090 / Ampere)

These measurements were taken on a 4x RTX 3090 host with vLLM and this NVFP4 export.

Best single-session setup found so far

  • Use --optimization-level 3 (O3), not default O2.
  • Use TP2 on an NVLink-connected pair (CUDA_VISIBLE_DEVICES=0,2 + --tensor-parallel-size 2).
  • Keep --max-model-len 131072 (128k context remains supported).

Observed impact (single-session decode):

  • O3 vs O2: ~+16% tokens/sec
  • TP2 NVLink vs TP2 non-NVLink: ~+10% tokens/sec
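If the two gains compose roughly multiplicatively (an assumption; the effects may partially overlap), the combined uplift over O2 on a non-NVLink TP2 pair works out to about:

```python
# Assumes the two measured gains compose multiplicatively; they may in
# practice overlap, so treat this as an upper-bound estimate.
o3_gain = 1.16      # ~+16% tokens/sec from O3 vs O2
nvlink_gain = 1.10  # ~+10% tokens/sec from NVLink-paired TP2
print(f"~{(o3_gain * nvlink_gain - 1) * 100:.0f}% combined")  # → ~28% combined
```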

What did not help in this environment

  • MTP speculative decoding (--speculative-config {"method":"mtp",...}) was slower than baseline.
  • VLLM_MARLIN_USE_ATOMIC_ADD=1 was slightly slower (~4-5%).
  • FP8 KV cache variants were not viable on this stack:
    • --kv-cache-dtype fp8
    • --kv-cache-dtype fp8 --calculate-kv-scales
    • --kv-cache-dtype fp8_ds_mla
  All three failed due to no valid MLA attention backend on this Ampere path.

Throughput vs single-session speed

  • Increasing --max-num-seqs / --max-num-batched-tokens improved aggregate throughput significantly.
  • Those changes did not materially improve single-session latency or tokens/sec.
  • If your goal is one chat/session feeling faster, prioritize O3 + TP2 on NVLink.
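When comparing configurations yourself, measure decode rate from per-token arrival timestamps rather than wall-clock over the whole request, so time-to-first-token does not skew the comparison. A minimal helper:

```python
# Single-session decode tokens/sec from per-token arrival timestamps.
# Excludes time-to-first-token: the rate is measured from the first
# token onward, which is what the O3/NVLink numbers above refer to.
def decode_tokens_per_sec(token_times: list[float]) -> float:
    if len(token_times) < 2:
        return 0.0
    elapsed = token_times[-1] - token_times[0]
    return (len(token_times) - 1) / elapsed if elapsed > 0 else 0.0
```

Collect token_times by appending time.monotonic() for each chunk received from a streaming chat completion.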

Repository Contents

This model repo intentionally contains only serving-required artifacts:

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json
  • chat_template.jinja
  • hf_quant_config.json
  • README.md
  • QUANTIZATION.md
  • LICENSE

No training checkpoints, raw calibration corpora, or temporary files are included.

License and Provenance

  • Base model: Olafangensan/GLM-4.7-Flash-heretic
  • Upstream lineage: decensored derivative of zai-org/GLM-4.7-Flash
  • Base license: MIT (per upstream model card)
  • This repo: quantized derivative for inference; no architecture changes

Please review and comply with upstream licenses and terms for your use case.

Reproducibility

Quantization recipe, command, and environment details are documented in QUANTIZATION.md.

At a glance:

  • Quantization method: ModelOpt NVFP4 (group_size=16, lm_head excluded)
  • Calibration mix: switch_turnflow_sanitized, open_code_reasoning
  • Calibration sizes: 1536 and 512 samples respectively (sequence length 2048)
  • Export format: Hugging Face
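The calibration token budget implied by the recipe above is simple arithmetic:

```python
# Token budget implied by the calibration recipe (samples x sequence length).
samples = [1536, 512]  # switch_turnflow_sanitized, open_code_reasoning
seq_len = 2048
total_tokens = sum(samples) * seq_len
print(f"{total_tokens / 1e6:.1f}M calibration tokens")  # → 4.2M calibration tokens
```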

Quick Start (OpenAI-compatible)

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 256
  }'
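A stdlib-only Python equivalent of the curl call, with no third-party client assumed:

```python
# Stdlib-only equivalent of the curl quick start above.
import json
import urllib.request

def build_chat_body(prompt: str) -> dict:
    # Mirrors the JSON payload in the curl example.
    return {
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_body(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("hello")  # requires the server from the Docker section to be running
```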

Integrity

  • model.safetensors SHA256: 3b5aca2db60c472e9dbcb44e79ab4f69442d9a83315bbfab7a3f39ab8b004116
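A chunked SHA256 check (sketch) streams the ~17.8GB file through a fixed-size buffer instead of loading it into memory:

```python
# Chunked SHA256 so the ~17.8GB checkpoint streams through a fixed buffer.
import hashlib

EXPECTED = "3b5aca2db60c472e9dbcb44e79ab4f69442d9a83315bbfab7a3f39ab8b004116"

def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# assert sha256_file("model.safetensors") == EXPECTED
```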