Instructions for using google/gemma-3-4b-it with libraries, inference providers, notebooks, and local apps.

Libraries

How to use google/gemma-3-4b-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use google/gemma-3-4b-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-3-4b-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
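
The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names build_chat_payload and post_chat are hypothetical and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and parse the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the server to be running):
# payload = build_chat_payload(
#     "google/gemma-3-4b-it",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# reply = post_chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

For the SGLang server described further down, the same sketch should work with the base URL changed to http://localhost:30000, since both servers expose an OpenAI-compatible endpoint.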

Use Docker

docker model run hf.co/google/gemma-3-4b-it

SGLang

How to use google/gemma-3-4b-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-3-4b-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-3-4b-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use google/gemma-3-4b-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-3-4b-it
```

CUDA error: device-side assert triggered

#41

by ArshiaSoori - opened Apr 9, 2025

ArshiaSoori

Apr 9, 2025 (edited Apr 14, 2025)

I'm encountering a CUDA error when trying to quantize a model using BitsAndBytesConfig with 4-bit settings. Here's the error:

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note: I tried setting os.environ['CUDA_LAUNCH_BLOCKING'] = '1', but it made no difference.
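
One possible reason the variable appeared to do nothing (an assumption; the thread does not confirm it) is timing: CUDA_LAUNCH_BLOCKING is read when CUDA initializes, so setting it after the GPU has already been used in the same notebook session generally has no effect. A minimal sketch of setting it early, before torch is imported; the helper name enable_blocking_cuda_launches is hypothetical:

```python
import os
import sys


def enable_blocking_cuda_launches() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns True if torch had not been imported yet, which is the safe case.
    Returns False if torch was already imported, in which case CUDA may
    already be initialized and the setting may be ignored.
    """
    set_before_torch = "torch" not in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return set_before_torch


early_enough = enable_blocking_cuda_launches()
if not early_enough:
    print("torch is already imported; restart the kernel and run this cell first.")

# Import torch only after the variable is set:
# import torch
```

In a notebook this would be the first cell after a kernel restart. With blocking launches enabled, the Python traceback should point at the operation that actually triggered the assert.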

Quantization Setup

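import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; compute runs in float16
llm_quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)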

Model Loading

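from transformers import Gemma3ForConditionalGeneration

# llm_model_id and CACHE_DIR are not shown in the post
model = Gemma3ForConditionalGeneration.from_pretrained(
    llm_model_id,
    cache_dir=CACHE_DIR,
    device_map="auto",
    low_cpu_mem_usage=True,
    use_safetensors=True,
    quantization_config=llm_quantization_config,
)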

Environment Details

transformers: 4.50.3
CUDA Version: 12.4
GPU Driver Version: 550.144.03

Additional Notes

When running in CPU-only mode, the notebook cell stops executing without any visible error or traceback. It just silently halts.
I'm wondering if this might be due to a device assertion related to the model or quantization setup.

Any advice on how to debug or resolve this would be greatly appreciated!
Could this be related to the model weights / compatibility with quantization?

ArshiaSoori

Apr 14, 2025

Solved!

Do not use torch.float16 for torch_dtype; use torch.float32 instead.

    def _initialize_model(self):
        """Initialize the quantized LLM"""
        quantization_config = BitsAndBytesConfig(load_in_4bit=True)
        
        model_name = "google/gemma-3-4b-it"

        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float32,
            device_map="cuda",
            cache_dir=CACHE_DIR,
            quantization_config=quantization_config,
            # attn_implementation="flash_attention_2"
        )
        tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=CACHE_DIR)
        return model, tokenizer

shashank9830

Oct 3, 2025

Do not use torch.float16 for torch_dtype; use torch.float32 instead.

Can you explain why?

Also, without 4-bit quantization and with the default torch_dtype, inference is faster. After switching to float32 with 4-bit quantization, it became significantly slower.
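
The thread does not answer this question. One plausible explanation, offered as an assumption rather than a confirmed diagnosis: float16 can only represent finite values up to 65504, so intermediate activations larger than that overflow to inf and can then become NaN, and NaN values during sampling are a common trigger for device-side asserts. float32 avoids the overflow but uses wider arithmetic, which is consistent with the slowdown described above. The sketch below only illustrates the float16 range limit using the standard library; fits_in_float16 is a hypothetical helper and it does not inspect the model.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value: float) -> bool:
    """Return True if `value` is finite and can be stored as a finite float16."""
    if not math.isfinite(value):
        return False
    try:
        # The "e" format is IEEE 754 binary16; packing raises OverflowError
        # when the value is too large in magnitude to be represented.
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


# 65504.0 fits; 70000.0 is beyond the float16 range.
for candidate in (1.0, FLOAT16_MAX, 70000.0):
    print(candidate, fits_in_float16(candidate))
```

If the GPU supports it, torch.bfloat16 may be worth trying as the compute dtype, since it keeps the exponent range of float32 at 16-bit width. This is untested in this thread.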
