---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: apache-2.0
base_model: Qwen/Qwen3-235B-A22B
---

# Qwen3-235B-A22B-NVFP4

## Model Overview
- **Model Architecture:** Qwen/Qwen3-235B-A22B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 10/29/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B). It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) to the FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-235B-A22B-NVFP4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

## Creation

This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py), as presented in the code snippet below.
```python
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modeling.prepare import replace_modules_for_calibration
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-235B-A22B"

# Load model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=None, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 1024

# --- Replace MoE modules for calibration ---
model = replace_modules_for_calibration(model, calibrate_all_experts=False)

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Quantize all Linear layers to NVFP4, keeping the lm_head, the MoE router
# gates, and the self-attention layers in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["re:.*lm_head.*", "re:.*mlp.gate$", "re:.*self_attn"],
)

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=SAVE_DIR,
    pipeline="sequential",
    sequential_targets=["Qwen3MoeDecoderLayer"],
    calibrate_moe_context=True,
)

# Save to disk in compressed-tensors format.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
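As a quick sanity check after quantization, the quantization metadata that compressed-tensors writes into the saved checkpoint's `config.json` can be inspected. Below is a minimal sketch, not part of the original recipe, assuming the `SAVE_DIR` produced by the snippet above:

```python
import json
from pathlib import Path

# Hypothetical sanity check: read the quantization metadata stored in the
# saved checkpoint's config.json (written in compressed-tensors format).
SAVE_DIR = "Qwen3-235B-A22B-NVFP4"

config = json.loads((Path(SAVE_DIR) / "config.json").read_text())
quant_config = config.get("quantization_config", {})

# Print the quantization settings (scheme, format, ignored modules).
print(json.dumps(quant_config, indent=2))
```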
## Evaluation

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, and HumanEval_64 benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness). The reasoning evaluations were run with [lighteval](https://github.com/neuralmagic/lighteval).

### Accuracy
| Category | Metric | Qwen/Qwen3-235B-A22B | RedHatAI/Qwen3-235B-A22B-NVFP4 (this model) | Recovery (%) |
|----------|--------|---------------------:|--------------------------------------------:|-------------:|
| **OpenLLM V1** | arc_challenge | 73.38 | 72.61 | 98.95 |
| | gsm8k | 85.14 | 86.43 | 101.52 |
| | hellaswag | 86.86 | 86.67 | 99.78 |
| | mmlu | 86.36 | 85.76 | 99.30 |
| | truthfulqa_mc2 | 60.58 | 60.09 | 99.19 |
| | winogrande | 80.74 | 80.90 | 100.20 |
| | **Average** | **78.84** | **78.74** | **99.87** |
| **OpenLLM V2** | BBH | 63.67 | 63.81 | 100.22 |
| | MMLU-Pro | 58.23 | 57.99 | 99.59 |
| | MuSR | 43.25 | 42.99 | 99.40 |
| | IFEval | 88.25 | 88.25 | 100.00 |
| | GPQA | 29.28 | 28.94 | 98.84 |
| | **Average** | **56.54** | **56.40** | **99.75** |
| **Reasoning** | GPQA (Diamond, 0-shot) | 72.22 | 69.19 | 95.80 |
| | Math-500 (0-shot) | 95.00 | 94.20 | 99.16 |
| | **Average** | **83.61** | **81.70** | **97.72** |
| **Coding** | HumanEval_64 pass@2 | 92.56 | 94.73 | 102.34 |
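Recovery reports the quantized model's score as a percentage of the unquantized baseline's score. A minimal sketch of the calculation, using the arc_challenge row from the table above:

```python
# Recovery = 100 * quantized_score / baseline_score
def recovery(baseline: float, quantized: float) -> float:
    return 100.0 * quantized / baseline

# Example: arc_challenge (73.38 baseline, 72.61 quantized).
print(f"{recovery(73.38, 72.61):.2f}")  # 98.95
```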
### Reproduction

The results were obtained using the following commands:

#### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-235B-A22B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

#### OpenLLM v2

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-235B-A22B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```

#### HumanEval_64

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-235B-A22B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```