nielsr (HF Staff) committed · verified
Commit 48182bb · 1 parent: 67914af

Improve model card: add pipeline tag, library name, key results, and usage example


This PR significantly improves the model card for UniLiP by:

* Adding the `pipeline_tag: any-to-any` to accurately reflect its versatile multimodal capabilities (understanding, generation, and editing), enhancing discoverability on the Hub.
* Specifying `library_name: transformers`, as the model's `config.json` indicates compatibility with the Hugging Face Transformers library (e.g., `transformers_version`, `Qwen2Tokenizer`), enabling the "how to use" widget on the model page.
* Incorporating key sections from the GitHub README, including a detailed introduction, main results tables, and the BibTeX citation, to provide a more comprehensive overview.
* Adding a quick sample usage snippet for text-to-image generation, directly extracted from the GitHub README's "Quick Start" section, to make the model more immediately usable.

This update makes the model card more informative and user-friendly, without changing the existing arXiv paper link.
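
As a quick, hedged sanity check of the `library_name: transformers` point above, the snippet below sketches the loading path that metadata enables. It is not taken from the model card or the GitHub repository: the repo id `kanashi6/UniLIP-1B` is borrowed from the Quick Start snippet in the diff, a standard `config.json`/tokenizer layout is assumed, and only the config and tokenizer are loaded, since running the full model may require custom code from the authors' repository.

```python
# Minimal sketch, not verified against the actual checkpoint.
# Assumptions: repo id "kanashi6/UniLIP-1B" (taken from the Quick Start snippet
# in the diff) and standard config.json / tokenizer files in the repository.
from transformers import AutoConfig, AutoTokenizer

repo_id = "kanashi6/UniLIP-1B"

# Resolve the model configuration declared in config.json.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)

# The PR description notes the config references Qwen2Tokenizer, so the
# tokenizer should resolve through AutoTokenizer.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

print(type(config).__name__)
print(type(tokenizer).__name__)
```

If this loads cleanly, the Hub "how to use" widget should behave as the PR intends; for actual generation, the `FlexARInferenceSolver` example further down in the diff is the path documented by the authors.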

Files changed (1): README.md (+94, -5)
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
- license: apache-2.0
+ base_model:
+ - OpenGVLab/InternVL3-1B
  datasets:
  - BLIP3o/BLIP3o-Pretrain-Long-Caption
  - BLIP3o/BLIP3o-Pretrain-Short-Caption
@@ -7,15 +8,103 @@ datasets:
  - UCSC-VLAA/GPT-Image-Edit-1.5M
  - BLIP3o/BLIP3o-60k
  - FreedomIntelligence/ShareGPT-4o-Image
- base_model:
- - OpenGVLab/InternVL3-1B
+ license: apache-2.0
+ pipeline_tag: any-to-any
+ library_name: transformers
  ---
+
+ # UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
+
  This repository contains the model (1B version) presented in the paper UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing.

- UniLIP proposes a unified, CLIP-based encoder featuring both rich semantics and fine-grained image details. Through a **two-stage and self-distillation training** for reconstruction, we empower CLIP to achieve excellent reconstruction results **without compromising its original understanding abilities**. Leveraging this powerful unified representation, UniLIP excels across understanding, generation, and editing tasks.
+ ## Introduction
+ Previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. To overcome this, we propose UniLIP:
+ - **Two-Stage Self-Distillation**: A novel training scheme that teaches CLIP high-fidelity reconstruction without degrading its powerful comprehension abilities.
+ - **Dual-Condition Architecture**: Enhances reasoning and edit consistency by combining rich multimodal context with learnable queries that harness the power of MLLMs.
+ - **State-of-the-Art Performance**: Achieves top results on GenEval (0.88/0.90), WISE (0.56/0.63), and ImgEdit (3.81/3.94) with efficient 1B/3B models, demonstrating superior instruction following and edit fidelity.

  For more details, please refer to the original paper and the GitHub repository:

  Paper: https://www.arxiv.org/abs/2507.23278

- GitHub: https://github.com/nnnth/UniLIP
+ GitHub: https://github.com/nnnth/UniLIP
+
+ ## 🚀 Main Results
+
+ ### Image Reconstruction on ImageNet val
+
+ | Model | Res. | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
+ | :--- | :--- | :--- | :--- | :--- | :--- |
+ | VILA-U | 256 | 16 | 1.80 | - | - |
+ | Tokenflow | 256 | 16 | 1.37 | 21.41 | 0.687 |
+ | DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
+ | **UniLIP** | 256 | 32 | **0.79** | **22.99** | **0.747** |
+ | Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
+ | **UniLIP** | 448 | 32 | **0.31** | **24.62** | **0.788** |
+
+ ### Image Understanding
+ | Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
+ | InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
+ | BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
+ | BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
+ | TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
+ | Tar-7B | 7B | 1571 | 74.4 | 39.0 | - | 73.0 | - | - |
+ | **UniLIP-1B** | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
+ | **UniLIP-3B** | 2B | **1636** | **80.7** | **48.7** | **62.2** | **75.0** | **78.6** | **73.0** |
+
+ ### Image Generation and Editing
+ | Model | # Params | GenEval | WISE | ImgEdit |
+ | :--- | :--- | :--- | :--- | :--- |
+ | BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
+ | BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
+ | UniWorld-V1 | 7B+12B | - | - | 3.26 |
+ | **UniLIP-1B** | 1B+0.6B | 0.88 | 0.56 | 3.81 |
+ | **UniLIP-3B** | 2B+1.6B | **0.90** | **0.63** | **3.94** |
+
+ ## 🛠️ Quick Start (Text-to-Image Generation)
+
+ Here is a simple example of text-to-image generation using `FlexARInferenceSolver`:
+
+ ```python
+ from inference_solver import FlexARInferenceSolver
+ from PIL import Image
+
+ inference_solver = FlexARInferenceSolver(
+     model_path="kanashi6/UniLIP-3B",  # Or "kanashi6/UniLIP-1B"
+     precision="bf16",
+     target_size=768,  # Or 512, 1024 depending on the model
+ )
+
+ question = "Generate an image of 768x768 according to the following prompt:\n" \
+            "Image of a dog playing water, and a waterfall is in the background."
+
+ generated = inference_solver.generate(
+     images=[],
+     qas=[[question, None]],
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ generated_image = generated[1][0]
+ generated_image.save("generated_dog_waterfall.png")
+ print("Generated image saved as 'generated_dog_waterfall.png'")
+ ```
+
+ For more detailed usage examples for image understanding, image editing, and omni-potent tasks, please refer to the [GitHub repository's inference section](https://github.com/nnnth/UniLIP#%EF%B8%8F-quick-start).
+
+ ## 📘 Citation
+ Please consider citing our work as follows if it is helpful.
+ ```
+ @article{tang2025unilip,
+   title={UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
+   author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
+   journal={arXiv preprint arXiv:2507.23278},
+   year={2025}
+ }
+ ```