**LLaVA-UHD-v3: PROGRESSIVE VISUAL COMPRESSION FOR EFFICIENT NATIVE-RESOLUTION ENCODING IN MLLMS**
🚀 GitHub   |   📄 arXiv
We release LLaVA-UHD-v3, a multimodal large language model (MLLM) built upon our proposed Progressive Visual Compression (PVC) for efficient native-resolution encoding.
Our model not only achieves performance comparable to advanced MLLMs like Qwen2-VL across 15 diverse benchmarks but also delivers a 1.9x reduction in Time-to-First-Token
(TTFT). Moreover, LLaVA-UHD v3 can be trained efficiently in academic settings, requiring approximately 300 hours on 32 A100 GPUs.
## Quick Start
```bash
pip install "transformers>=4.51.0"
```
Our model is integrated with `transformers`; below is an example of how to conveniently chat with **LLaVA-UHD-v3**.
#### Using 🤗 Transformers to Chat
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# default: Load the model on the available device(s)
model = AutoModelForImageTextToText.from_pretrained(
    "Sishxo/LLaVA-UHD-v3", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory
# saving, especially in multi-image and video scenarios.
# model = AutoModelForImageTextToText.from_pretrained(
#     "Sishxo/LLaVA-UHD-v3",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Sishxo/LLaVA-UHD-v3")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output, then strip the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
## Highlights

🧠 **Progressive Visual Compression (PVC)**: LLaVA-UHD v3 introduces a novel visual encoding strategy for efficient native-resolution processing in MLLMs, combining fine-grained tokenization with hierarchical compression.

💡 **Refined Patch Embedding (RPE)**: Flexibly scales patch sizes to produce detailed visual tokens while maintaining full compatibility with pretrained Vision Transformers, enabling richer visual representations.

💡 **Windowed Token Compression (WTC)**: Progressively merges local token representations within the vision encoder, reducing sequence length and computational cost without losing holistic visual context.

🏆 **Preserved Holistic Understanding**: Unlike slice-based approaches, PVC maintains full-scene semantics, preventing the fragmented interpretations common in other native-resolution encoding methods.

⚡ **Efficient and Competitive**: Achieves strong performance on a wide range of vision-language benchmarks, rivaling state-of-the-art models like Qwen2-VL while significantly lowering inference latency.
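To illustrate the RPE idea of scaling patch sizes while reusing pretrained ViT weights, here is a minimal sketch that resizes a patch-embedding convolution kernel via bilinear interpolation, in the style of FlexiViT-like kernel resizing. The function name and the choice of interpolation are illustrative assumptions, not necessarily the exact operator used inside LLaVA-UHD-v3:

```python
import torch
import torch.nn.functional as F

def resize_patch_embed(weight, new_patch):
    """Resize a ViT patch-embedding kernel to a new patch size.

    weight: (embed_dim, in_ch, p, p) conv kernel from a pretrained ViT.
    Bilinear interpolation of the kernel is one simple way to keep
    pretrained weights usable at a different patch size.
    NOTE: illustrative sketch, not LLaVA-UHD-v3's actual RPE operator.
    """
    return F.interpolate(
        weight, size=(new_patch, new_patch), mode="bilinear", align_corners=False
    )

pretrained = torch.randn(1024, 3, 14, 14)   # e.g. a 14x14-patch ViT kernel
resized = resize_patch_embed(pretrained, 8)  # finer 8x8 patches -> more tokens
print(tuple(pretrained.shape), "->", tuple(resized.shape))
```

A smaller patch size yields a denser token grid from the same image, which is why a compression stage such as WTC is needed downstream.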
This repository provides examples, usage instructions, and best practices to help developers leverage LLaVA-UHD v3 for efficient, high-fidelity vision-language tasks.
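The WTC idea above, merging local token neighborhoods to shorten the visual sequence, can be sketched as mean-pooling over non-overlapping windows of the token grid. The function name, the mean-pooling choice, and the 2x2 window are illustrative assumptions, not the model's actual compression operator:

```python
import numpy as np

def window_token_compression(tokens, grid_h, grid_w, window=2):
    """Mean-pool visual tokens within non-overlapping spatial windows.

    tokens: (grid_h * grid_w, dim) array of patch tokens in row-major order.
    Returns (grid_h // window * grid_w // window, dim) compressed tokens.
    NOTE: illustrative sketch of the WTC idea, not the model's operator.
    """
    dim = tokens.shape[-1]
    x = tokens.reshape(grid_h, grid_w, dim)
    # Split the grid into window x window neighborhoods and average each one.
    x = x.reshape(grid_h // window, window, grid_w // window, window, dim)
    x = x.mean(axis=(1, 3))
    return x.reshape(-1, dim)

tokens = np.random.randn(32 * 32, 64)  # e.g. a 32x32 grid of 64-d tokens
compressed = window_token_compression(tokens, 32, 32, window=2)
print(tokens.shape, "->", compressed.shape)  # 4x fewer tokens per 2x2 window
```

Each 2x2 window collapses to one token, so the sequence length (and hence attention cost in the LLM) drops by a factor of four per compression stage.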
## Performance
#### LLaVA-UHD-v3



#### ViT-UHD