nielsr (HF Staff) committed · verified
Commit 48182bb · 1 parent: 67914af

Improve model card: add pipeline tag, library name, key results, and usage example


This PR significantly improves the model card for UniLiP by:

* Adding the `pipeline_tag: any-to-any` to accurately reflect its versatile multimodal capabilities (understanding, generation, and editing), enhancing discoverability on the Hub.
* Specifying `library_name: transformers`, as the model's `config.json` indicates compatibility with the Hugging Face Transformers library (e.g., `transformers_version`, `Qwen2Tokenizer`), enabling the "how to use" widget on the model page.
* Incorporating key sections from the GitHub README, including a detailed introduction, main results tables, and the BibTeX citation, to provide a more comprehensive overview.
* Adding a quick sample usage snippet for text-to-image generation, directly extracted from the GitHub README's "Quick Start" section, to make the model more immediately usable.

This update makes the model card more informative and user-friendly, without changing the existing arXiv paper link.
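
As a quick, hedged sanity check of the `library_name: transformers` point above, the snippet below sketches the loading path that metadata enables. It is not taken from the model card or the GitHub repository: the repo id `kanashi6/UniLIP-1B` is borrowed from the Quick Start snippet in the diff, a standard `config.json`/tokenizer layout is assumed, and only the config and tokenizer are loaded, since running the full model may require custom code from the authors' repository.

```python
# Minimal sketch, not verified against the actual checkpoint.
# Assumptions: repo id "kanashi6/UniLIP-1B" (taken from the Quick Start snippet
# in the diff) and standard config.json / tokenizer files in the repository.
from transformers import AutoConfig, AutoTokenizer

repo_id = "kanashi6/UniLIP-1B"

# Resolve the model configuration declared in config.json.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)

# The PR description notes the config references Qwen2Tokenizer, so the
# tokenizer should resolve through AutoTokenizer.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

print(type(config).__name__)
print(type(tokenizer).__name__)
```

If this loads cleanly, the Hub "how to use" widget should behave as the PR intends; for actual generation, the `FlexARInferenceSolver` example further down in the diff is the path documented by the authors.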

Files changed (1): README.md (+94, -5)
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
- license: apache-2.0
+ base_model:
+ - OpenGVLab/InternVL3-1B
  datasets:
  - BLIP3o/BLIP3o-Pretrain-Long-Caption
  - BLIP3o/BLIP3o-Pretrain-Short-Caption
@@ -7,15 +8,103 @@ datasets:
  - UCSC-VLAA/GPT-Image-Edit-1.5M
  - BLIP3o/BLIP3o-60k
  - FreedomIntelligence/ShareGPT-4o-Image
- base_model:
- - OpenGVLab/InternVL3-1B
+ license: apache-2.0
+ pipeline_tag: any-to-any
+ library_name: transformers
  ---
+
+ # UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
+
  This repository contains the model (1B version) presented in the paper UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing.

- UniLIP proposes a unified, CLIP-based encoder featuring both rich semantics and fine-grained image details. Through a **two-stage and self-distillation training** for reconstruction, we empower CLIP to achieve excellent reconstruction results **without compromising its original understanding abilities**. Leveraging this powerful unified representation, UniLIP excels across understanding, generation, and editing tasks.
+ ## Introduction
+ Previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. To overcome this, we propose UniLIP:
+ - **Two-Stage Self-Distillation**: A novel training scheme that teaches CLIP high-fidelity reconstruction without degrading its powerful comprehension abilities.
+ - **Dual-Condition Architecture**: Enhances reasoning and edit consistency by combining rich multimodal context with learnable queries that harness the power of MLLMs.
+ - **State-of-the-Art Performance**: Achieves top results on GenEval (0.88/0.90), WISE (0.56/0.63), and ImgEdit (3.81/3.94) with efficient 1B/3B models, demonstrating superior instruction following and edit fidelity.

  For more details, please refer to the original paper and the GitHub repository:

  Paper: https://www.arxiv.org/abs/2507.23278

- GitHub: https://github.com/nnnth/UniLIP
+ GitHub: https://github.com/nnnth/UniLIP
+
+ ## 🚀 Main Results
+
+ ### Image Reconstruction on ImageNet val
+
+ | Model | Res. | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
+ | :--- | :--- | :--- | :--- | :--- | :--- |
+ | VILA-U | 256 | 16 | 1.80 | - | - |
+ | Tokenflow | 256 | 16 | 1.37 | 21.41 | 0.687 |
+ | DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
+ | **UniLIP** | 256 | 32 | **0.79** | **22.99** | **0.747** |
+ | Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
+ | **UniLIP** | 448 | 32 | **0.31** | **24.62** | **0.788** |
+
+ ### Image Understanding
+ | Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
+ | InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
+ | BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
+ | BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
+ | TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
+ | Tar-7B | 7B | 1571 | 74.4 | 39.0 | - | 73.0 | - | - |
+ | **UniLIP-1B** | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
+ | **UniLIP-3B** | 2B | **1636** | **80.7** | **48.7** | **62.2** | **75.0** | **78.6** | **73.0** |
+
+ ### Image Generation and Editing
+ | Model | # Params | GenEval | WISE | ImgEdit |
+ | :--- | :--- | :--- | :--- | :--- |
+ | BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
+ | BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
+ | UniWorld-V1 | 7B+12B | - | - | 3.26 |
+ | **UniLIP-1B** | 1B+0.6B | 0.88 | 0.56 | 3.81 |
+ | **UniLIP-3B** | 2B+1.6B | **0.90** | **0.63** | **3.94** |
+
+ ## 🛠️ Quick Start (Text-to-Image Generation)
+
+ Here is a simple example of text-to-image generation using `FlexARInferenceSolver`:
+
+ ```python
+ from inference_solver import FlexARInferenceSolver
+ from PIL import Image
+
+ inference_solver = FlexARInferenceSolver(
+     model_path="kanashi6/UniLIP-3B",  # Or "kanashi6/UniLIP-1B"
+     precision="bf16",
+     target_size=768,  # Or 512, 1024 depending on the model
+ )
+
+ question = "Generate an image of 768x768 according to the following prompt:\n" \
+            "Image of a dog playing water, and a waterfall is in the background."
+
+ generated = inference_solver.generate(
+     images=[],
+     qas=[[question, None]],
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ generated_image = generated[1][0]
+ generated_image.save("generated_dog_waterfall.png")
+ print("Generated image saved as 'generated_dog_waterfall.png'")
+ ```
+
+ For more detailed usage examples for image understanding, image editing, and omni-potent tasks, please refer to the [GitHub repository's inference section](https://github.com/nnnth/UniLIP#%EF%B8%8F-quick-start).
+
+ ## 📘 Citation
+ Please consider citing our work as follows if it is helpful.
+ ```
+ @article{tang2025unilip,
+   title={UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
+   author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
+   journal={arXiv preprint arXiv:2507.23278},
+   year={2025}
+ }
+ ```