# GLM-4.7-Flash-heretic NVFP4

NVFP4 post-training quantization of `Olafangensan/GLM-4.7-Flash-heretic` for long-context multi-GPU inference with vLLM.
The Hugging Face UI "Model size" badge is auto-inferred from the packed NVFP4 safetensors and may show an incorrect parameter count for this repo. Use the architecture statement below as the source of truth: GLM-4.7-Flash (30B-A3B MoE).
This release uses NVFP4 (4-bit) quantization, not 8-bit quantization.
## Model Size

- Base architecture: GLM-4.7-Flash (30B-A3B MoE)
- Parameter count for this release: unchanged from the base model architecture
- Note: the ~17.8 GB `model.safetensors` file size is the quantized checkpoint size and does not mean the model is 18B parameters.
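As a sanity check on that note, here is a back-of-envelope size estimate. It is a sketch assuming ~30B total parameters, 4-bit weights, and one 1-byte per-group scale for every 16 weights; real checkpoints also carry higher-precision embeddings and excluded layers, which accounts for the remaining gap up to 17.8 GB.

```python
# Rough checkpoint-size estimate for ~30B params in an NVFP4-style layout
# (4-bit weights + one 1-byte scale per 16-weight group).
params = 30e9
weight_bytes = params * 0.5        # 4 bits per weight
scale_bytes = params / 16          # one 1-byte scale per group of 16
total_gb = (weight_bytes + scale_bytes) / 1e9

print(f"~{total_gb:.1f} GB")  # -> ~16.9 GB, in the ballpark of the 17.8 GB file
```

The leftover ~1 GB is consistent with tensors kept at higher precision (e.g. embeddings and the excluded `lm_head`).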
## Runtime Compatibility

Known issue on some stock vLLM 0.16.x + vllm-node setups:

- assistant `content` may be `null`
- output may be dumped into reasoning fields with broken formatting
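If you hit this on an affected stack, a client-side fallback can usually recover the text. This is a sketch assuming the response uses vLLM's usual `reasoning_content` field on the message object; adapt the field name to what your server actually emits.

```python
def assistant_text(choice: dict) -> str:
    """Extract assistant text from a chat-completion choice, falling back
    to the reasoning field when `content` comes back null."""
    msg = choice.get("message", {})
    content = msg.get("content")
    if content:  # normal, well-formed path
        return content
    # Broken-parser path: answer text lands in the reasoning field instead.
    return msg.get("reasoning_content") or ""

# Example shaped like the broken case:
broken = {"message": {"content": None, "reasoning_content": "Hello!"}}
print(assistant_text(broken))  # -> Hello!
```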
### Required Docker image (build this first)

This model requires a custom vLLM image. Do not use stock `vllm/vllm-openai` as-is.
#### 1) Create Dockerfile

Save as `Dockerfile.vllm-glm4lite`:

```dockerfile
FROM vllm/vllm-openai:latest

ARG TRANSFORMERS_COMMIT=393b4b3d28e29b4b05b19b4b7f3242a7fc893637

RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir -U "huggingface_hub==1.4.0"
RUN pip install --no-cache-dir -U --no-deps "git+https://github.com/huggingface/transformers.git@${TRANSFORMERS_COMMIT}"
```
#### 2) Build image

```bash
docker build -t vllm-glm:parser-only-r2 -f Dockerfile.vllm-glm4lite .
```
#### 3) Serve model with that image

```bash
docker run --rm --name glm47 --gpus all --ipc=host -p 8000:8000 \
  -v /path/to/hf_cache:/hf_cache \
  -v /path/to/models:/models \
  -e HF_HOME=/hf_cache \
  -e HUGGINGFACE_HUB_CACHE=/hf_cache \
  -e HF_HUB_CACHE=/hf_cache \
  -e TRANSFORMERS_CACHE=/hf_cache \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  vllm-glm:parser-only-r2 \
  --model /models/GLM-4.7-Flash-heretic-NVFP4 \
  --served-model-name glm-4.7-flash \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --generation-config vllm \
  --override-generation-config '{"temperature":0.7,"top_p":1.0}'
```
#### 4) Verify

```bash
curl http://127.0.0.1:8000/v1/models
```

Expected: model list includes `glm-4.7-flash`.
## Performance Notes (RTX 3090 / Ampere)
These measurements were taken on a 4x RTX 3090 host with vLLM and this NVFP4 export.
### Best single-session setup found so far

- Use `--optimization-level 3` (O3), not the default O2.
- Use TP2 on an NVLink-connected pair (`CUDA_VISIBLE_DEVICES=0,2` + `--tensor-parallel-size 2`).
- Keep `--max-model-len 131072` (128k context remains supported).
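To find which of your GPUs are actually NVLink-connected, `nvidia-smi topo -m` prints a link matrix. The sketch below parses a made-up 4-GPU example of that output (the `topo` string is illustrative, not captured from this host; substitute your real matrix).

```python
# Parse an `nvidia-smi topo -m`-style matrix to find NVLink-connected pairs.
# Entries like NV4 mean NVLink; SYS means traversal over PCIe/system fabric.
topo = """\
        GPU0  GPU1  GPU2  GPU3
GPU0     X    SYS   NV4   SYS
GPU1    SYS    X    SYS   NV4
GPU2    NV4   SYS    X    SYS
GPU3    SYS   NV4   SYS    X
"""

rows = [line.split() for line in topo.strip().splitlines()[1:]]
pairs = []
for i, row in enumerate(rows):
    for j, link in enumerate(row[1:]):   # row[0] is the "GPUi" label
        if j > i and link.startswith("NV"):
            pairs.append((i, j))

print(pairs)  # -> [(0, 2), (1, 3)]: candidate CUDA_VISIBLE_DEVICES pairs
```

In this hypothetical topology, `CUDA_VISIBLE_DEVICES=0,2` (as used above) picks one NVLink pair.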
Observed impact (single-session decode):
- O3 vs O2: ~+16% tokens/sec
- TP2 NVLink vs TP2 non-NVLink: ~+10% tokens/sec
### What did not help in this environment

- MTP speculative decoding (`--speculative-config '{"method":"mtp",...}'`) was slower than baseline.
- `VLLM_MARLIN_USE_ATOMIC_ADD=1` was slightly slower (~4-5%).
- FP8 KV cache variants were not viable on this stack:
  - `--kv-cache-dtype fp8`
  - `--kv-cache-dtype fp8 --calculate-kv-scales`
  - `--kv-cache-dtype fp8_ds_mla`

  These failed due to no valid MLA attention backend on this Ampere path.
### Throughput vs single-session speed

- Increasing `--max-num-seqs` / `--max-num-batched-tokens` improved aggregate throughput significantly.
- Those changes did not materially improve single-session latency or tokens/sec.
- If your goal is making one chat session feel faster, prioritize O3 + TP2 on NVLink.
## Repository Contents

This model repo intentionally contains only serving-required artifacts:

- `model.safetensors`
- `config.json`
- `generation_config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `chat_template.jinja`
- `hf_quant_config.json`
- `README.md`
- `QUANTIZATION.md`
- `LICENSE`
No training checkpoints, raw calibration corpora, or temporary files are included.
## License and Provenance

- Base model: `Olafangensan/GLM-4.7-Flash-heretic`
- Upstream lineage: decensored derivative of `zai-org/GLM-4.7-Flash`
- Base license: MIT (per upstream model card)
- This repo: quantized derivative for inference; no architecture changes
Please review and comply with upstream licenses and terms for your use case.
## Reproducibility

Quantization recipe, command, and environment details are documented in `QUANTIZATION.md`.

At a glance:

- Quantization method: ModelOpt NVFP4 (`group_size=16`, `lm_head` excluded)
- Calibration mix: `switch_turnflow_sanitized`, `open_code_reasoning`
- Calibration sizes: `1536`, `512` (sequence length `2048`)
- Export format: Hugging Face
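For intuition on what `group_size=16` NVFP4 means numerically, here is a simplified fake-quantization sketch: FP4 (E2M1) magnitudes with one shared scale per 16-element group. It is illustrative only, not the ModelOpt implementation (which additionally uses FP8 group scales and a per-tensor scale).

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; the sign is a separate bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(w: np.ndarray) -> np.ndarray:
    """Fake-quantize one group of 16 weights: scale the group so its max
    magnitude maps onto FP4's max (6.0), then round each magnitude to the grid."""
    scale = max(float(np.abs(w).max()) / 6.0, 1e-12)
    mags = np.abs(w) / scale
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)  # nearest grid point
    return np.sign(w) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)  # one group_size=16 block
wq = quantize_group(w)
err = float(np.abs(w - wq).max())
print(f"max abs round-off in this group: {err:.4f}")
```

Each stored weight then needs only 4 bits plus its group's shared scale, which is where the checkpoint-size reduction comes from.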
## Quick Start (OpenAI-compatible)

```bash
curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "glm-4.7-flash",
  "messages": [{"role": "user", "content": "hello"}],
  "max_tokens": 256
}'
```
## Integrity

- `model.safetensors` SHA256: `3b5aca2db60c472e9dbcb44e79ab4f69442d9a83315bbfab7a3f39ab8b004116`
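To verify the download, stream the file through SHA-256 rather than reading it wholesale (a minimal sketch; point it at your local copy of the checkpoint):

```python
import hashlib

EXPECTED = "3b5aca2db60c472e9dbcb44e79ab4f69442d9a83315bbfab7a3f39ab8b004116"

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Hash in 1 MiB chunks so a ~17.8 GB checkpoint never has to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Usage (uncomment after downloading):
# assert sha256_of("model.safetensors") == EXPECTED
```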