SauerkrautLM-Vision-Document-Retrieval
Collection
7 items
β’
Updated
β’
5
π #1 among 128-dim Models | State-of-the-Art Visual Document Retrieval
SauerkrautLM-ColQwen3-8b-v0.1 is the best-performing 128-dimensional embedding model for visual document retrieval, achieving 91.08 NDCG@5 on ViDoRe v1 - the highest score among all models with 128-dim embeddings.
Traditional OCR-based retrieval loses layout, tables, and visual context. Our visual approach:
| Benchmark | Score | Rank (128-dim) | Rank (All) |
|---|---|---|---|
| ViDoRe v1 | 91.08 | π₯ #1 | #1 |
| MTEB v1+v2 | 82.91 | #2 | #5 |
| ViDoRe v3 | 58.55 | π₯ #1 | #3 |
| Model | Params | Dim | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|---|
| SauerkrautLM-ColQwen3-8b-v0.1 β | 8.0B | 128 | 91.08 | 82.91 | 58.55 |
| EvoQwen2.5-VL-Retriever-7B-v1 | 7.0B | 128 | 90.68 | 83.41 | - |
| colnomic-embed-multimodal-7b | 7.0B | 128 | 89.72 | 81.30 | 57.64 |
Our 8B model achieves the highest ViDoRe v1 and v3 scores among ALL 128-dim models!
| Task | Score |
|---|---|
| ArxivQA | 93.80 π₯ |
| DocVQA | 64.69 |
| InfoVQA | 94.51 |
| ShiftProject | 90.41 |
| SyntheticDocQA-AI | 98.65 |
| SyntheticDocQA-Energy | 96.52 |
| SyntheticDocQA-Gov | 96.79 |
| SyntheticDocQA-Health | 99.26 |
| TabFQuAD | 92.18 |
| TATDQA | 84.04 π₯ |
| Average | 91.08 π₯ |
ViDoRe v1 Tasks:
| Task | Score |
|---|---|
| ArxivQA | 93.80 π₯ |
| DocVQA | 64.69 |
| InfoVQA | 94.51 |
| ShiftProject | 90.41 |
| SyntheticDocQA-AI | 98.65 |
| SyntheticDocQA-Energy | 96.52 |
| SyntheticDocQA-Gov | 96.79 |
| SyntheticDocQA-Health | 99.26 |
| TabFQuAD | 92.18 |
| TATDQA | 84.04 π₯ |
ViDoRe v2 Tasks (Multilingual):
| Task | Score |
|---|---|
| ViDoRe-v2-2BioMed | 63.26 |
| ViDoRe-v2-2Econ | 57.98 |
| ViDoRe-v2-2ESG-HL | 70.77 |
| ViDoRe-v2-2ESG | 57.85 |
| Combined Average | 82.91 |
| Task | Score |
|---|---|
| ViDoRe-v3-CS | 77.52 π₯ |
| ViDoRe-v3-Energy | 66.32 |
| ViDoRe-v3-FinanceEn | 55.79 |
| ViDoRe-v3-FinanceFr | 45.03 |
| ViDoRe-v3-HR | 59.96 |
| ViDoRe-v3-Industry | 50.39 |
| ViDoRe-v3-Pharma | 63.98 |
| ViDoRe-v3-Physics | 49.36 |
| Average | 58.55 |
| Model | Params | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|
| SauerkrautLM-ColQwen3-8b-v0.1 β | 8.0B | 91.08 (#1) | 82.91 (#2) | 58.55 (#1) |
| EvoQwen2.5-VL-Retriever-7B-v1 | 7.0B | 90.68 (#3) | 83.41 (#1) | - |
| SauerkrautLM-ColQwen3-4b-v0.1 | 4.0B | 90.80 (#2) | 81.97 (#4) | 56.03 (#4) |
| EvoQwen2.5-VL-Retriever-3B-v1 | 3.0B | 90.67 (#4) | 82.76 (#3) | - |
| SauerkrautLM-ColQwen3-2b-v0.1 | 2.2B | 90.24 (#5) | 81.02 (#7) | 54.32 (#5) |
| colnomic-embed-multimodal-7b | 7.0B | 89.72 (#7) | 81.30 (#5) | 57.64 (#2) |
| colnomic-embed-multimodal-3b | 3.0B | 89.86 (#6) | 80.09 (#8) | 56.40 (#3) |
| colqwen2.5-v0.2 | 3.0B | 89.54 (#8) | 81.12 (#6) | 52.44 (#6) |
| colqwen2-v1.0 | 2.2B | 89.23 (#9) | 79.74 (#9) | 44.18 (#8) |
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-8B |
| Parameters | 8.0B |
| Embedding Dimension | 128 |
| VRAM (bfloat16) | ~16 GB |
| Max Context Length | 262,144 tokens |
| Image Resolution | Dynamic (up to 1540 visual tokens) |
| Supported Languages | EN, DE, FR, ES, IT, PT |
| License | Apache 2.0 |
| Setting | Value |
|---|---|
| GPUs | 4x NVIDIA A100 SXM (80GB) |
| Effective Batch Size | 256 |
| Precision | bfloat16 |
| Optimizer | AdamW |
| Dataset | Type | Description |
|---|---|---|
| vidore/colpali_train_set | Public | ColPali training data |
| openbmb/VisRAG-Ret-Train-In-domain-data | Public | Visual RAG training data |
| llamaindex/vdr-multilingual-train | Public | Multilingual document retrieval |
| VAGO Multilingual Dataset 1 | In-house | Proprietary multilingual document-query pairs |
| VAGO Multilingual Dataset 2 | In-house | Proprietary multilingual document-query pairs |
β οΈ Important: Install our package first before loading the model:
pip install git+https://github.com/VAGOsolutions/sauerkrautlm-colpali
import torch
from PIL import Image
from sauerkrautlm_colpali.models import ColQwen3, ColQwen3Processor
model_name = "VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1"
model = ColQwen3.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="cuda:0",
).eval()
processor = ColQwen3Processor.from_pretrained(model_name)
# Process inputs
images = [Image.open("document.png")]
queries = ["What is the main topic?"]
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)
with torch.no_grad():
image_embeddings = model(**batch_images)
query_embeddings = model(**batch_queries)
scores = processor.score(query_embeddings, image_embeddings)
@misc{sauerkrautlm-colpali-2025,
title={SauerkrautLM-ColPali: Multi-Vector Vision Retrieval Models},
author={David Golchinfar},
organization={VAGO Solutions},
year={2025},
url={https://github.com/VAGOsolutions/sauerkrautlm-colpali}
}