---
base_model: unsloth/qwen2.5-vl-7b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2_5_vl
license: apache-2.0
language:
- en
datasets:
- AI4Math/MathVista
- unsloth/LaTeX_OCR
- mychen76/invoices-and-receipts_ocr_v1
- corto-ai/handwritten-text
---

# Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding

**Cernis-Thinking** is a reasoning-capable vision language model fine-tuned with reinforcement learning (GRPO/GSPO) for document understanding tasks. Built on Qwen2.5-VL-7B, it excels at mathematical reasoning, LaTeX OCR, invoice extraction, and handwriting transcription.

## Model Details

- **Base Model**: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Training Method**: Group Relative Policy Optimization (GRPO) with GSPO extensions
- **Training Data**: ~2,000 samples across 4 document understanding tasks
- **Model Size**: 7B parameters
- **License**: Apache 2.0

## Capabilities

Cernis-Thinking is trained on four distinct document understanding tasks:

1. **Mathematical Reasoning** - Solves math problems from images with step-by-step reasoning
2. **LaTeX OCR** - Converts images of mathematical notation to LaTeX code
3. **Invoice Extraction** - Extracts structured information from invoices and receipts
4. **Handwriting Transcription** - Transcribes handwritten text from images

## Training Details

### Datasets

- [AI4Math/MathVista](https://huggingface.co/datasets/AI4Math/MathVista) - Mathematical reasoning (filtered for numeric answers)
- [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) - LaTeX formula recognition
- [mychen76/invoices-and-receipts_ocr_v1](https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1) - Invoice extraction
- [corto-ai/handwritten-text](https://huggingface.co/datasets/corto-ai/handwritten-text) - Handwriting transcription
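
All four datasets are public on the Hugging Face Hub. A minimal loading sketch is shown below; the split names are assumptions, so check each dataset card before use:

```python
from datasets import load_dataset

# Split names are assumptions; consult each dataset card before relying on them.
mathvista   = load_dataset("AI4Math/MathVista", split="testmini")
latex_ocr   = load_dataset("unsloth/LaTeX_OCR", split="train")
invoices    = load_dataset("mychen76/invoices-and-receipts_ocr_v1", split="train")
handwriting = load_dataset("corto-ai/handwritten-text", split="train")
```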

### Reinforcement Learning Approach

The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions:

**1. Formatting Reward Function**
- Rewards proper use of `<REASONING>` and `<SOLUTION>` tags
- Penalizes malformed outputs (e.g., excessive "addCriterion" artifacts)
- Encourages structured, parseable responses (see the sketch below)
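
A minimal sketch of this formatting reward, assuming TRL-style reward functions over plain-text completions (the point values are illustrative, not the exact training code):

```python
def formatting_reward(completions, **kwargs):
    """Reward well-formed <REASONING>/<SOLUTION> structure, penalize artifacts."""
    rewards = []
    for text in completions:
        score = 0.0
        # Reward each correctly paired tag
        if "<REASONING>" in text and "</REASONING>" in text:
            score += 0.5
        if "<SOLUTION>" in text and "</SOLUTION>" in text:
            score += 0.5
        # Penalize degenerate decoding artifacts observed during training
        if text.count("addCriterion") > 2:
            score -= 1.0
        rewards.append(score)
    return rewards
```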

**2. Task-Specific Correctness Reward**
- **Math**: Exact numeric matching (2.0 points)
- **LaTeX/Handwriting**: String similarity with word overlap scoring (0.75-2.0 points)
- **Invoices**: Partial credit for extracting key information (1.5 points)

**3. ROUGE-like Word Overlap**
- For text-heavy tasks, rewards are based on the word overlap ratio (see the sketch after this list):
  - over 50% overlap: 1.5 points
  - over 30% overlap: 0.75 points
- Prevents wasted training on completely wrong outputs
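
A minimal sketch of this word-overlap scoring, assuming plain-text completions and a reference `answer` column passed through by the trainer (the helper and column name are illustrative):

```python
import re

def extract_solution(text: str) -> str:
    """Pull the answer out of the <SOLUTION> tags, falling back to the raw text."""
    match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def word_overlap_reward(completions, answer, **kwargs):
    """Score each completion by its word overlap with the reference answer."""
    rewards = []
    for completion, reference in zip(completions, answer):
        pred_words = set(extract_solution(completion).lower().split())
        ref_words = set(reference.lower().split())
        overlap = len(pred_words & ref_words) / len(ref_words) if ref_words else 0.0
        if overlap > 0.5:
            rewards.append(1.5)    # over 50% overlap
        elif overlap > 0.3:
            rewards.append(0.75)   # over 30% overlap
        else:
            rewards.append(0.0)    # too far off to be worth rewarding
    return rewards
```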

### Training Configuration

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    num_train_epochs = 0.5,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    num_generations = 4,            # completions sampled per prompt for the group baseline
    max_prompt_length = 1024,
    max_completion_length = 1024,

    # GSPO settings: sequence-level importance sampling with the Dr. GRPO loss
    importance_sampling_level = "sequence",
    loss_type = "dr_grpo",
)
```
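
For context, this config plugs into TRL's `GRPOTrainer` together with the reward functions above. The sketch below shows the wiring only; the variable names (`model`, `processor`, `train_dataset`) are placeholders, not the exact training script:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                    # Qwen2.5-VL model prepared for training
    processing_class=processor,     # tokenizer/processor for the VLM
    reward_funcs=[
        formatting_reward,          # reward 1: tag structure
        word_overlap_reward,        # rewards 2-3: task-specific correctness
    ],
    args=training_args,
    train_dataset=train_dataset,    # mixed multi-task dataset
)
trainer.train()
```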

## Usage

### With Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor (Qwen2.5-VL uses its own model class, not Qwen2-VL's)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "coolAI/cernis-thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("coolAI/cernis-thinking")

# Prepare image and prompt
image = Image.open("document.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the key information from this invoice. First provide your reasoning between <REASONING> and </REASONING>, then your answer between <SOLUTION> and </SOLUTION>"}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True).to(model.device)

# Generate, then decode only the newly generated tokens (skip the echoed prompt)
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```
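
Because the model is trained to wrap its output in `<REASONING>` and `<SOLUTION>` tags, you will usually want to parse them out of the decoded text. A small illustrative helper (not part of the model's API):

```python
import re

def parse_response(text: str) -> dict:
    """Split a Cernis-Thinking response into reasoning and solution parts."""
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return {
        "reasoning": reasoning.group(1).strip() if reasoning else None,
        "solution": solution.group(1).strip() if solution else text.strip(),
    }

result = parse_response(generated_text[0])
print(result["solution"])
```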

### With vLLM (Recommended for Production)

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Initialize vLLM
llm = LLM(
    model="coolAI/cernis-thinking",
    max_model_len=16384,
    gpu_memory_utilization=0.8
)

# Prepare prompt (Qwen2.5-VL chat format with an image placeholder)
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>What is the LaTeX code shown in this image? Provide your answer between <SOLUTION> and </SOLUTION><|im_end|>\n<|im_start|>assistant\n"

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_k=50,
    max_tokens=1024
)

# Generate, passing the local image via multi_modal_data
# (vllm.assets.image.ImageAsset only fetches vLLM's bundled demo assets,
# so a local file should be opened with PIL instead)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("formula.png")}
    },
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)
```

## Example Outputs

### Mathematical Reasoning
**Input**: Image of a geometry problem

**Output**:
```
<REASONING>
To solve this parallelogram problem, I need to use the properties:
1. Opposite sides are equal in a parallelogram
2. Angle bisectors create specific relationships...
</REASONING>

<SOLUTION>
42
</SOLUTION>
```

### LaTeX OCR
**Input**: Image of a mathematical formula

**Output**:
```
<SOLUTION>
\frac{2}{3} < a^{2} \alpha^{2} \leq 1
</SOLUTION>
```

### Invoice Extraction
**Input**: Invoice image

**Output**:
```
<SOLUTION>
Invoice No: 53553822
Date: 07/24/2012
Vendor: Leo Brown
Seller Address: 082 Christopher Club Apt. 771 Thomasberg, OH 42949
Seller Tax ID: 926-74-9803
Total: $247.50
</SOLUTION>
```

## Citation

```bibtex
@misc{cernis-thinking-2025,
  title={Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding},
  author={Your Name},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/coolAI/cernis-thinking}}
}
```

## Acknowledgments

- Built with [Unsloth](https://github.com/unslothai/unsloth) for efficient VLM training
- Base model: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- Training datasets: AI4Math, Unsloth, mychen76, corto-ai

## License

Apache 2.0 - Free for commercial and research use