---
base_model: unsloth/qwen2.5-vl-7b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2_5_vl
license: apache-2.0
language:
- en
datasets:
- AI4Math/MathVista
- unsloth/LaTeX_OCR
- mychen76/invoices-and-receipts_ocr_v1
- corto-ai/handwritten-text
---

# Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding

**Cernis-Thinking** is a reasoning-capable vision language model fine-tuned with reinforcement learning (GRPO/GSPO) for document understanding tasks. Built on Qwen2.5-VL-7B, it excels at mathematical reasoning, LaTeX OCR, invoice extraction, and handwriting transcription.

## Model Details

- **Base Model**: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Training Method**: Group Relative Policy Optimization (GRPO) with GSPO extensions
- **Training Data**: ~2,000 samples across 4 document understanding tasks
- **Model Size**: 7B parameters
- **License**: Apache 2.0

## Capabilities

Cernis-Thinking is trained on four distinct document understanding tasks:

1. **Mathematical Reasoning** - Solves math problems from images with step-by-step reasoning
2. **LaTeX OCR** - Converts images of mathematical notation to LaTeX code
3. **Invoice Extraction** - Extracts structured information from invoices and receipts
4. **Handwriting Transcription** - Transcribes handwritten text from images

## Training Details

### Datasets

- [AI4Math/MathVista](https://huggingface.co/datasets/AI4Math/MathVista) - Mathematical reasoning (filtered for numeric answers)
- [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) - LaTeX formula recognition
- [mychen76/invoices-and-receipts_ocr_v1](https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1) - Invoice extraction
- [corto-ai/handwritten-text](https://huggingface.co/datasets/corto-ai/handwritten-text) - Handwriting transcription
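
All four datasets are public on the Hugging Face Hub. A minimal loading sketch is shown below; the split names are assumptions, so check each dataset card before use:

```python
from datasets import load_dataset

# Split names are assumptions; consult each dataset card before relying on them.
mathvista   = load_dataset("AI4Math/MathVista", split="testmini")
latex_ocr   = load_dataset("unsloth/LaTeX_OCR", split="train")
invoices    = load_dataset("mychen76/invoices-and-receipts_ocr_v1", split="train")
handwriting = load_dataset("corto-ai/handwritten-text", split="train")
```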

### Reinforcement Learning Approach

The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions:

**1. Formatting Reward Function**
- Rewards proper use of `<REASONING>` and `<SOLUTION>` tags
- Penalizes malformed outputs (e.g., excessive "addCriterion" artifacts)
- Encourages structured, parseable responses (see the sketch below)
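
A minimal sketch of this formatting reward, assuming TRL-style reward functions over plain-text completions (the point values are illustrative, not the exact training code):

```python
def formatting_reward(completions, **kwargs):
    """Reward well-formed <REASONING>/<SOLUTION> structure, penalize artifacts."""
    rewards = []
    for text in completions:
        score = 0.0
        # Reward each correctly paired tag
        if "<REASONING>" in text and "</REASONING>" in text:
            score += 0.5
        if "<SOLUTION>" in text and "</SOLUTION>" in text:
            score += 0.5
        # Penalize degenerate decoding artifacts observed during training
        if text.count("addCriterion") > 2:
            score -= 1.0
        rewards.append(score)
    return rewards
```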

**2. Task-Specific Correctness Reward**
- **Math**: Exact numeric matching (2.0 points)
- **LaTeX/Handwriting**: String similarity with word overlap scoring (0.75-2.0 points)
- **Invoices**: Partial credit for extracting key information (1.5 points)

**3. ROUGE-like Word Overlap**
- For text-heavy tasks, rewards are based on the word overlap ratio (see the sketch after this list):
  - over 50% overlap: 1.5 points
  - over 30% overlap: 0.75 points
- Prevents wasted training on completely wrong outputs
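
A minimal sketch of this word-overlap scoring, assuming plain-text completions and a reference `answer` column passed through by the trainer (the helper and column name are illustrative):

```python
import re

def extract_solution(text: str) -> str:
    """Pull the answer out of the <SOLUTION> tags, falling back to the raw text."""
    match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def word_overlap_reward(completions, answer, **kwargs):
    """Score each completion by its word overlap with the reference answer."""
    rewards = []
    for completion, reference in zip(completions, answer):
        pred_words = set(extract_solution(completion).lower().split())
        ref_words = set(reference.lower().split())
        overlap = len(pred_words & ref_words) / len(ref_words) if ref_words else 0.0
        if overlap > 0.5:
            rewards.append(1.5)    # over 50% overlap
        elif overlap > 0.3:
            rewards.append(0.75)   # over 30% overlap
        else:
            rewards.append(0.0)    # too far off to be worth rewarding
    return rewards
```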

### Training Configuration

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    num_train_epochs = 0.5,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    num_generations = 4,            # completions sampled per prompt for the group baseline
    max_prompt_length = 1024,
    max_completion_length = 1024,

    # GSPO settings: sequence-level importance sampling with the Dr. GRPO loss
    importance_sampling_level = "sequence",
    loss_type = "dr_grpo",
)
```
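
For context, this config plugs into TRL's `GRPOTrainer` together with the reward functions above. The sketch below shows the wiring only; the variable names (`model`, `processor`, `train_dataset`) are placeholders, not the exact training script:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                    # Qwen2.5-VL model prepared for training
    processing_class=processor,     # tokenizer/processor for the VLM
    reward_funcs=[
        formatting_reward,          # reward 1: tag structure
        word_overlap_reward,        # rewards 2-3: task-specific correctness
    ],
    args=training_args,
    train_dataset=train_dataset,    # mixed multi-task dataset
)
trainer.train()
```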

## Usage

### With Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor (Qwen2.5-VL uses its own model class, not Qwen2-VL's)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "coolAI/cernis-thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("coolAI/cernis-thinking")

# Prepare image and prompt
image = Image.open("document.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the key information from this invoice. First provide your reasoning between <REASONING> and </REASONING>, then your answer between <SOLUTION> and </SOLUTION>"}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True).to(model.device)

# Generate, then decode only the newly generated tokens (skip the echoed prompt)
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```
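
Because the model is trained to wrap its output in `<REASONING>` and `<SOLUTION>` tags, you will usually want to parse them out of the decoded text. A small illustrative helper (not part of the model's API):

```python
import re

def parse_response(text: str) -> dict:
    """Split a Cernis-Thinking response into reasoning and solution parts."""
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return {
        "reasoning": reasoning.group(1).strip() if reasoning else None,
        "solution": solution.group(1).strip() if solution else text.strip(),
    }

result = parse_response(generated_text[0])
print(result["solution"])
```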

### With vLLM (Recommended for Production)

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Initialize vLLM
llm = LLM(
    model="coolAI/cernis-thinking",
    max_model_len=16384,
    gpu_memory_utilization=0.8
)

# Prepare prompt (Qwen2.5-VL chat format with an image placeholder)
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>What is the LaTeX code shown in this image? Provide your answer between <SOLUTION> and </SOLUTION><|im_end|>\n<|im_start|>assistant\n"

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_k=50,
    max_tokens=1024
)

# Generate, passing the local image via multi_modal_data
# (vllm.assets.image.ImageAsset only fetches vLLM's bundled demo assets,
# so a local file should be opened with PIL instead)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("formula.png")}
    },
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)
```

## Example Outputs

### Mathematical Reasoning
**Input**: Image of a geometry problem

**Output**:
```
<REASONING>
To solve this parallelogram problem, I need to use the properties:
1. Opposite sides are equal in a parallelogram
2. Angle bisectors create specific relationships...
</REASONING>

<SOLUTION>
42
</SOLUTION>
```

### LaTeX OCR
**Input**: Image of a mathematical formula

**Output**:
```
<SOLUTION>
\frac{2}{3} < a^{2} \alpha^{2} \leq 1
</SOLUTION>
```

### Invoice Extraction
**Input**: Invoice image

**Output**:
```
<SOLUTION>
Invoice No: 53553822
Date: 07/24/2012
Vendor: Leo Brown
Seller Address: 082 Christopher Club Apt. 771 Thomasberg, OH 42949
Seller Tax ID: 926-74-9803
Total: $247.50
</SOLUTION>
```

## Citation

```bibtex
@misc{cernis-thinking-2025,
  title={Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding},
  author={Your Name},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/coolAI/cernis-thinking}}
}
```

## Acknowledgments

- Built with [Unsloth](https://github.com/unslothai/unsloth) for efficient VLM training
- Base model: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- Training datasets: AI4Math, Unsloth, mychen76, corto-ai

## License

Apache 2.0 - Free for commercial and research use