Improve model card: add pipeline tag, library name, key results, and usage example
This PR significantly improves the model card for UniLiP by:
* Adding the `pipeline_tag: any-to-any` to accurately reflect its versatile multimodal capabilities (understanding, generation, and editing), enhancing discoverability on the Hub.
* Specifying `library_name: transformers`, as the model's `config.json` indicates compatibility with the Hugging Face Transformers library (e.g., `transformers_version`, `Qwen2Tokenizer`), enabling the "how to use" widget on the model page.
* Incorporating key sections from the GitHub README, including a detailed introduction, main results tables, and the BibTeX citation, to provide a more comprehensive overview.
* Adding a quick sample usage snippet for text-to-image generation, directly extracted from the GitHub README's "Quick Start" section, to make the model more immediately usable.
This update makes the model card more informative and user-friendly, without changing the existing arXiv paper link.
The updated model card (`README.md`):
---
base_model:
- OpenGVLab/InternVL3-1B
datasets:
- BLIP3o/BLIP3o-Pretrain-Long-Caption
- BLIP3o/BLIP3o-Pretrain-Short-Caption
- UCSC-VLAA/GPT-Image-Edit-1.5M
- BLIP3o/BLIP3o-60k
- FreedomIntelligence/ShareGPT-4o-Image
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---

# UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

This repository contains the model (1B version) presented in the paper *UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing*.

## Introduction

Previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. To overcome this, we propose UniLIP:

- **Two-Stage Self-Distillation**: A novel training scheme that teaches CLIP high-fidelity reconstruction without degrading its powerful comprehension abilities (see the conceptual sketch after this list).
- **Dual-Condition Architecture**: Enhances reasoning and edit consistency by combining rich multimodal context with learnable queries that harness the power of MLLMs.
- **State-of-the-Art Performance**: Achieves top results on GenEval (0.88/0.90), WISE (0.56/0.63), and ImgEdit (3.81/3.94) with efficient 1B/3B models, demonstrating superior instruction following and edit fidelity.

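To make the self-distillation idea concrete, here is a minimal conceptual sketch. It is not the UniLIP training code: the component names (`teacher_clip`, `student_clip`, `pixel_decoder`), the loss terms, and the weighting are illustrative assumptions; the paper describes the actual two-stage recipe.

```python
import torch
import torch.nn.functional as F

def self_distillation_step(teacher_clip, student_clip, pixel_decoder, images,
                           recon_weight=1.0, distill_weight=1.0):
    """Illustrative objective only; names and loss choices are assumptions, not the paper's recipe."""
    with torch.no_grad():
        teacher_feats = teacher_clip(images)        # frozen CLIP features: the semantics to preserve

    student_feats = student_clip(images)            # trainable CLIP branch being adapted
    recon = pixel_decoder(student_feats)            # decode features back to pixels

    recon_loss = F.mse_loss(recon, images)                      # learn high-fidelity reconstruction
    distill_loss = F.mse_loss(student_feats, teacher_feats)     # stay close to the original CLIP space
    return recon_weight * recon_loss + distill_weight * distill_loss
```

In a two-stage scheme, an early stage would typically keep the CLIP encoder frozen and train only the reconstruction side before fine-tuning under the distillation constraint; this sketch collapses both stages into a single step for brevity.
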
For more details, please refer to the original paper and the GitHub repository:

Paper: https://www.arxiv.org/abs/2507.23278

GitHub: https://github.com/nnnth/UniLIP

## 🚀 Main Results

### Image Reconstruction on ImageNet val

| Model | Res. | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| VILA-U | 256 | 16 | 1.80 | - | - |
| TokenFlow | 256 | 16 | 1.37 | 21.41 | 0.687 |
| DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
| **UniLIP** | 256 | 32 | **0.79** | **22.99** | **0.747** |
| Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
| **UniLIP** | 448 | 32 | **0.31** | **24.62** | **0.788** |

### Image Understanding

| Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
| InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
| BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
| BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
| TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
| Tar-7B | 7B | 1571 | 74.4 | 39.0 | - | 73.0 | - | - |
| **UniLIP-1B** | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
| **UniLIP-3B** | 2B | **1636** | **80.7** | **48.7** | **62.2** | **75.0** | **78.6** | **73.0** |

### Image Generation and Editing

| Model | # Params | GenEval | WISE | ImgEdit |
| :--- | :--- | :--- | :--- | :--- |
| BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
| BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
| UniWorld-V1 | 7B+12B | - | - | 3.26 |
| **UniLIP-1B** | 1B+0.6B | 0.88 | 0.56 | 3.81 |
| **UniLIP-3B** | 2B+1.6B | **0.90** | **0.63** | **3.94** |

## 🛠️ Quick Start (Text-to-Image Generation)

Here's a simple example for text-to-image generation using `FlexARInferenceSolver`:

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-3B",  # or "kanashi6/UniLIP-1B"
    precision="bf16",
    target_size=768,                  # or 512, 1024 depending on the model
)

question = "Generate an image of 768x768 according to the following prompt:\n" \
           "Image of a dog playing water, and a waterfall is in the background."

generated = inference_solver.generate(
    images=[],
    qas=[[question, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

generated_image = generated[1][0]
generated_image.save("generated_dog_waterfall.png")
print("Generated image saved as 'generated_dog_waterfall.png'")
```
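
As a rough illustration of the same interface applied to image understanding, the sketch below adapts the text-to-image snippet above. The prompt wording, decoding settings, and output indexing are assumptions rather than documented usage, so prefer the canonical examples linked below.

```python
# Illustrative only: adapted from the text-to-image example above. The prompt
# wording, decoding settings, and output indexing are assumptions, not the
# official UniLIP usage -- see the GitHub README for the canonical version.
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-1B",  # or "kanashi6/UniLIP-3B"
    precision="bf16",
    target_size=512,                  # assumed; use the size the checkpoint expects
)

image = Image.open("example.jpg")     # any local image (hypothetical path)
question = "Describe this image in detail."

generated = inference_solver.generate(
    images=[image],                   # the input image is passed as visual context
    qas=[[question, None]],
    max_gen_len=512,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=1.0, image_top_k=2000),
)

answer = generated[0]                 # assumed: text output first, generated images second
print(answer)
```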

For more detailed usage examples for image understanding, image editing, and omni-potent tasks, please refer to the [GitHub repository's inference section](https://github.com/nnnth/UniLIP#%EF%B8%8F-quick-start).

## 📘 Citation

Please consider citing our work as follows if it is helpful.

```
@article{tang2025unilip,
  title={UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
  author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
  journal={arXiv preprint arXiv:2507.23278},
  year={2025}
}
```