Model VRAM Usage

#2
by LyraNovaHeart

Hey Hey! I saw at the bottom of your readme that you had trouble fitting it into your GPU's VRAM.

You can fit it in 16 GB; it just takes a bit of tweaking:

    1. Use an IQ4_XS quant, and turn on Flash Attention
    2. Set the context to 16k tokens, then set it as Sliding Window Attention (SWA)
    3. Set the KV quant to 8 bit
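For llama.cpp's `llama-server`, the steps above map roughly onto the flags below. This is a sketch, not a verified recipe: the model filename is a placeholder, and flag spellings change between llama.cpp versions, so check `llama-server --help` on your build. Note that KV-cache quantization in llama.cpp requires Flash Attention, and for Gemma 3 the sliding-window attention is part of the model itself rather than a separate switch (some frontends like KoboldCpp expose it as a toggle).

```shell
# Sketch: fitting a ~27B IQ4_XS model into 16 GB of VRAM with llama.cpp.
# The model path is a placeholder; adjust for your quant.
llama-server \
  -m ./model-IQ4_XS.gguf \
  -ngl 99 \
  -fa \
  -c 16384 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
# -ngl 99            offload all layers to the GPU
# -fa                step 1: enable Flash Attention
# -c 16384           step 2: 16k-token context
# --cache-type-k/v   step 3: quantize the KV cache to 8-bit (needs -fa)
```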

This should load nearly all of the model into VRAM; only a little spills out, and it's still relatively fast on my system (25-30 T/s).

Hope this helps! Maybe I'll try the model soon too!

Good day, and thank you for the advice. I had not tried quantizing the context on Gemma models, since in my tests it negated one of the main advantages of this family: accurate attention to context. However, I will try your settings this week. Thanks again.

I run models on a CPU with 32 GB of RAM and have encountered excessive memory usage with some models, particularly Gemma 3 27B and Mistral Small. When running Q4 quants (Q4_K_M and IQ4_XS) with --mlock, memory usage filled all 32 GB. However, I noticed that the same models at Q5 use significantly less RAM. I looked at the llama.cpp log and found the following:

Q4:

```
srv load_model: loading model 'E:\LLM\GGUFs\Mars_27B_V.1.Q4_K_M.gguf'
...
load_tensors: CPU_Mapped model buffer size = 16529.63 MiB
load_tensors: CPU_REPACK model buffer size = 11694.38 MiB
```

Q5:

```
srv load_model: loading model 'E:\LLM\GGUFs\Mars_27B_V.1.Q5_K_M.gguf'
...
load_tensors: CPU_Mapped model buffer size = 19296.38 MiB
```

I discussed this with GLM-5, which said that when running Q4 on CPU, llama.cpp creates an extra repacked copy of the weights (the CPU_REPACK buffer) to speed up inference.
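Using the buffer sizes from the log above, a quick back-of-the-envelope sum shows why the Q4 run ends up heavier in RAM than Q5 under --mlock (a sketch; the real totals also include context and compute buffers not shown in the log):

```python
# Buffer sizes copied from the llama.cpp log above, in MiB.
mapped_q4 = 16529.63   # CPU_Mapped: the Q4_K_M weights themselves
repack_q4 = 11694.38   # CPU_REPACK: extra copy repacked for faster CPU inference
mapped_q5 = 19296.38   # CPU_Mapped: the Q5_K_M weights (no repack buffer logged)

total_q4 = mapped_q4 + repack_q4
total_q5 = mapped_q5

print(f"Q4 resident weight buffers: {total_q4:.2f} MiB")  # ~27.6 GiB
print(f"Q5 resident weight buffers: {total_q5:.2f} MiB")  # ~18.8 GiB
```

So even though the Q5 file is larger on disk, the repack copy makes Q4 occupy roughly 9 GiB more RAM once loaded.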

I'm not sure I understood everything correctly, but perhaps this information will be useful to someone.
