MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe Paper • 2509.18154 • Published Sep 16, 2025 • 56
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Paper • 2504.16030 • Published Apr 22, 2025 • 36
RLPR: Extrapolating RLVR to General Domains without Verifiers Paper • 2506.18254 • Published Jun 23, 2025 • 33
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Paper • 2504.17192 • Published Apr 24, 2025 • 124
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models Paper • 2411.04996 • Published Nov 7, 2024 • 50
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents Paper • 2410.10594 • Published Oct 14, 2024 • 29
UI Agent Collection a collection of algorithmic agents for user interfaces/interactions, program synthesis, and robotics • 487 items • Updated about 17 hours ago • 68
GUICourse: From General Vision Language Models to Versatile GUI Agents Paper • 2406.11317 • Published Jun 17, 2024 • 2
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18, 2024 • 17
ColPali: Efficient Document Retrieval with Vision Language Models Article • Jul 5, 2024 • 317
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published Jun 26, 2024 • 30
An Analysis of Chinese LLM Censorship and Bias with Qwen 2 Instruct Article • Jun 11, 2024 • 68
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024 • 26
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation Paper • 2405.14598 • Published May 23, 2024 • 13
RoHM: Robust Human Motion Reconstruction via Diffusion Paper • 2401.08570 • Published Jan 16, 2024 • 1
MultiBooth: Towards Generating All Your Concepts in an Image from Text Paper • 2404.14239 • Published Apr 22, 2024 • 9