Can we get a draft model for this model please?
Title.
I could probably get around to training a 100M-parameter Eagle 2 model in a few weeks. The problem is that I need a few hundred thousand sequences from the 31B model, which I can only run with offloading to system memory, so that might take a while. The E4B model should have out-of-the-box support as a draft model, though; it's just bigger than Eagle for the same performance, and you need enough memory to run an 8B model and a 31B model at the same time.
Ok, there is no support for drafting between those two, but this has already been trained by someone else https://huggingface.co/thoughtworks/Gemma-4-31B-Eagle3
I don't think the tokenizer changed, any gemma4 model should be supported as a drafting model.
I was about to edit my reply but thought to refresh just in case, bruh
Yall were fast!
Thanks for the response.
Also the thinking blocks and use of markdown might be broken 👀
Basically, I haven't even used it for tool use yet, but I predict it won't work well with anything that uses tools.
Could you be a bit more specific here? Perhaps a screenshot of your broken output, as well as some info on how you're running your inference, would be helpful.
"Yall were fast!"
When you get bored and keep refreshing a page, you catch things pretty quickly 🤣
Wait for v2, I was using it in cline/continue flawlessly. It built me a web app, set up local Supabase, and wired everything together, frontend and backend :)
The model also breaks on long tasks: given a long enough task, it starts repeating itself mid-generation. I found that this happens when flash attention is enabled. (I am unsure if this is normal for local models, where some can use it and some can't, but this model cannot use flash attention, at least when some parts of the model are offloaded.)
Seems like your agent runner (looks like LMStudio) doesn't support the Gemma4 thinking format? Could you provide a side-by-side with the Teich model & a regular Gemma 4 model?
Oh, that's because it's not trained to emit a newline after closing the channel tag. So if you don't have reasoning parsing set up properly, markdown renderers won't know to start after the end of the <channel|> tag.
I did see this flash attention issue as well, though. v2 was working better with FA on but still trips up occasionally.
Your model
Original Gemma 4 Model
Note: I have not been able to get Gemma 4-31B it to think in LM Studio. I've seen online that it has a reasoning capability, but I have yet to see the variant I downloaded think. The original variant I have downloaded is the LM Studio Community edition one.
Mind you, I am offloading because I have a 5090 and 128GB of RAM
I don't see how offloading could cause this. You could try only using the CPU. But other than that I would recommend waiting for v2
V2 it is then. i'll keep an eye out 👍
Seems resolved enough to close.
Only took 2 hours to solve this. That might be a record. 🤣
So you are testing in LM Studio then, correct? If so, here is your fix:
- Head to the models tab
- Click the gear icon next to our model
- Select the inference tab all the way to the right and expand the Reasoning Parsing section.
- Change the Start String to <|channel>thought and the End String to <channel|>
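For anyone not on LM Studio, here is a rough sketch of what that reasoning-parsing setting amounts to: split the raw output on the Start String and End String before handing the rest to a markdown renderer. The function name `split_reasoning` and the sample string are made up for illustration; only the two delimiter strings come from the steps above, and this is not how LM Studio itself implements it.

```python
def split_reasoning(text, start="<|channel>thought", end="<channel|>"):
    """Split raw model output into (reasoning, answer).

    Hypothetical helper: the delimiters default to the Start/End strings
    quoted in this thread; everything else is an assumption.
    """
    begin = text.find(start)
    if begin == -1:
        # No thinking block at all: the whole text is the answer.
        return "", text
    close = text.find(end, begin + len(start))
    if close == -1:
        # Thinking block never closed (e.g. truncated output).
        return text[begin + len(start):].strip(), ""
    reasoning = text[begin + len(start):close].strip()
    # The model may not emit a newline after the closing tag, so the answer
    # is simply whatever follows it, with leading whitespace dropped.
    answer = (text[:begin] + text[close + len(end):]).lstrip()
    return reasoning, answer


raw = "<|channel>thought\nplan the steps<channel|># Result\nDone."
reasoning, answer = split_reasoning(raw)
print(repr(reasoning))
print(repr(answer))
```

With the thinking block stripped this way, the markdown heading sits at the start of the answer, which is what a renderer needs.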
Let me know if it works. Personally, I think you may get past that first hurdle and just be met with other issues. I will be updating these GGUFs again momentarily with the latest llama.cpp Gemma 4 fixes.
guessing it's the early-stopping/truncation issue with the old ggufs
updates going live now
please try again with the latest ggufs, they are up and tested. Confirmed working on my end (via llama.cpp chat ui)
"I don't think the tokenizer changed, any gemma4 model should be supported as a drafting model."
Well, the E models have per-layer embeddings, which the 31B does not have. They have only 131k context while the 31B has 256k, and their tokenizer works with video and audio, whereas the 31B model only works with vision.
And the Eagle is more accurate anyway because it is trained on the hidden states of the model, whereas the E4B is just theoretically trained on the same data. And of course the E4B is actually 8B, so you have to fit an 8B and a 31B model in memory vs. a 31B model and a 350M model, where the 350M is better. Though Eagle3 is not widely supported yet.
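To put rough numbers on the memory comparison, here is a back-of-the-envelope sketch. It assumes 2 bytes per parameter (bf16 weights) and counts weights only, ignoring KV cache, activations, and quantization, so real GGUF footprints will differ. The helper name `weight_gib` is made up; the parameter counts are the ones mentioned in this thread.

```python
def weight_gib(params_billions, bytes_per_param=2.0):
    """Approximate weight memory in GiB (assumes bf16 unless overridden)."""
    return params_billions * 1e9 * bytes_per_param / 2**30


target = weight_gib(31)  # the 31B target model
drafts = {
    "E4B as draft (~8B)": weight_gib(8),
    "Eagle draft (~0.35B)": weight_gib(0.35),
}

for name, draft in drafts.items():
    # Target and draft must be resident at the same time for drafting.
    print(f"{name}: {target + draft:.1f} GiB of weights in total")
```

Passing a smaller `bytes_per_param` (for example 0.5 for a 4-bit quant) gives a rough figure for quantized setups under the same weights-only assumption.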




