---
base_model:
- OpenGVLab/InternVL3-1B
datasets:
- BLIP3o/BLIP3o-Pretrain-Long-Caption
- BLIP3o/BLIP3o-Pretrain-Short-Caption
- BLIP3o/BLIP3o-Pretrain-JourneyDB
- UCSC-VLAA/GPT-Image-Edit-1.5M
- BLIP3o/BLIP3o-60k
- FreedomIntelligence/ShareGPT-4o-Image
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---

# UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

This repository contains the model (1B version) presented in the paper UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing.

## Introduction

Previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. To overcome this, we propose UniLIP:

- **Two-Stage Self-Distillation**: A novel training scheme that teaches CLIP high-fidelity reconstruction without degrading its powerful comprehension abilities.
- **Dual-Condition Architecture**: Enhances reasoning and edit consistency by combining rich multimodal context with learnable queries that harness the power of MLLMs.
- **State-of-the-Art Performance**: Achieves top results on GenEval (0.88/0.90), WISE (0.56/0.63), and ImgEdit (3.81/3.94) with efficient 1B/3B models, demonstrating superior instruction following and edit fidelity.

For more details, please refer to the original paper and the GitHub repository:

- Paper: https://www.arxiv.org/abs/2507.23278
- GitHub: https://github.com/nnnth/UniLIP

## 🚀 Main Results

### Image Reconstruction on ImageNet val

| Model | Res. | Downsample Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| VILA-U | 256 | 16 | 1.80 | - | - |
| TokenFlow | 256 | 16 | 1.37 | 21.41 | 0.687 |
| DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
| **UniLIP** | 256 | 32 | **0.79** | **22.99** | **0.747** |
| Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
| **UniLIP** | 448 | 32 | **0.31** | **24.62** | **0.788** |

### Image Understanding

| Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
| InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
| BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
| BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
| TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
| Tar-7B | 7B | 1571 | 74.4 | 39.0 | - | 73.0 | - | - |
| **UniLIP-1B** | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
| **UniLIP-3B** | 2B | **1636** | **80.7** | **48.7** | **62.2** | **75.0** | **78.6** | **73.0** |

### Image Generation and Editing

| Model | # Params | GenEval | WISE | ImgEdit |
| :--- | :--- | :--- | :--- | :--- |
| BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
| BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
| UniWorld-V1 | 7B+12B | - | - | 3.26 |
| **UniLIP-1B** | 1B+0.6B | 0.88 | 0.56 | 3.81 |
| **UniLIP-3B** | 2B+1.6B | **0.90** | **0.63** | **3.94** |

## 🛠️ Quick Start (Text-to-Image Generation)

Here's a simple example of text-to-image generation using `FlexARInferenceSolver`:

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-3B",  # or "kanashi6/UniLIP-1B"
    precision="bf16",
    target_size=768,  # or 512 / 1024, depending on the model
)

question = "Generate an image of 768x768 according to the following prompt: " \
           "Image of a dog playing in water, and a waterfall is in the background."
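
# Note: the explanations in these comments are interpretive and not part of the
# original card. `qas` appears to hold [question, answer] pairs, with None asking
# the model to produce the answer (here, an image). `cfg` is presumably the
# classifier-free guidance scale, and `image_top_k` the top-k cutoff used when
# sampling image tokens.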
generated = inference_solver.generate(
    images=[],
    qas=[[question, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

generated_image = generated[1][0]
generated_image.save("generated_dog_waterfall.png")
print("Generated image saved as 'generated_dog_waterfall.png'")
```

For more detailed usage examples covering image understanding, image editing, and other tasks, please refer to the [GitHub repository's inference section](https://github.com/nnnth/UniLIP#%EF%B8%8F-quick-start). An illustrative editing sketch is also included at the end of this card.

## 📘 Citation

If you find our work helpful, please consider citing it as follows:

```
@article{tang2025unilip,
  title={UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
  author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
  journal={arXiv preprint arXiv:2507.23278},
  year={2025}
}
```
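
## Appendix: Image Editing Sketch (Unofficial)

The snippet below is a minimal, untested sketch of how the `FlexARInferenceSolver` interface from the Quick Start might be used for instruction-based image editing. It assumes that `generate` accepts an input PIL image via `images` and returns the edited image at index `[1][0]`, mirroring the text-to-image example; the prompt wording and the file path `input.png` are placeholders, not from the original card. For the official editing usage, follow the GitHub repository's inference section.

```python
from PIL import Image

from inference_solver import FlexARInferenceSolver

# Reuse the same solver configuration as in the Quick Start example.
inference_solver = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-3B",
    precision="bf16",
    target_size=768,
)

# Hypothetical input image and editing instruction (placeholders).
source_image = Image.open("input.png").convert("RGB")
edit_question = "Edit the image according to the following instruction: " \
                "make the sky pink."

edited = inference_solver.generate(
    images=[source_image],  # assumption: the source image is passed as conditioning
    qas=[[edit_question, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

# Assumption: the edited image is returned in the same slot as in the
# text-to-image example above.
edited_image = edited[1][0]
edited_image.save("edited_sky_pink.png")
```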