---
license: other
license_name: bsd-3-clause
license_link: https://github.com/TencentARC/TimeLens/blob/main/LICENSE
language:
- en
tags:
- video-grounding
- temporal-grounding
- video-understanding
- qwen2-vl
library_name: transformers
pipeline_tag: video-text-to-text
datasets:
- TencentARC/TimeLens-100K
- TencentARC/TimeLens-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# TimeLens-7B

📑 [**Paper**](https://arxiv.org/abs/2512.14698) | 💻 [**Code**](https://github.com/TencentARC/TimeLens) | 🏠 [**Project Page**](https://timelens-arc-lab.github.io/) | 🤗 [**Model & Data**](https://huggingface.co/collections/TencentARC/timelens)

## ✨ Model Description

**TimeLens-7B** is an MLLM with strong video temporal grounding (VTG) capability, fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is trained with a carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and an improved timestamp encoding strategy proposed in our [paper](https://arxiv.org/abs/2512.14698), using our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).

## 📊 Performance

TimeLens-7B achieves strong video temporal grounding performance:
<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th colspan="4">Charades-TimeLens</th>
      <th colspan="4">ActivityNet-TimeLens</th>
      <th colspan="4">QVHighlights-TimeLens</th>
    </tr>
    <tr>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Qwen2.5-VL-7B-Instruct</td>
      <td>59.7</td><td>37.8</td><td>16.6</td><td>39.3</td>
      <td>44.1</td><td>31.0</td><td>16.1</td><td>31.4</td>
      <td>41.5</td><td>27.8</td><td>15.2</td><td>31.6</td>
    </tr>
    <tr>
      <td>TimeLens-7B🚀</td>
      <td>70.5</td><td>55.6</td><td>28.4</td><td>48.8</td>
      <td>62.8</td><td>51.0</td><td>32.6</td><td>46.2</td>
      <td>74.1</td><td>62.7</td><td>43.1</td><td>56.0</td>
    </tr>
    <tr>
      <td>Qwen3-VL-8B-Instruct</td>
      <td>69.2</td><td>53.4</td><td>27.5</td><td>48.3</td>
      <td>62.1</td><td>51.2</td><td>34.4</td><td>46.8</td>
      <td>74.2</td><td>64.6</td><td>49.3</td><td>59.4</td>
    </tr>
    <tr>
      <td>TimeLens-8B🚀</td>
      <td>76.6</td><td>63.0</td><td>35.2</td><td>55.2</td>
      <td>68.9</td><td>58.4</td><td>40.6</td><td>53.2</td>
      <td>80.2</td><td>71.6</td><td>55.5</td><td>65.5</td>
    </tr>
  </tbody>
</table>
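Here, R1@m is the percentage of queries whose predicted segment reaches a temporal IoU of at least m with the ground-truth segment, and mIoU is the mean temporal IoU over all queries. The sketch below illustrates how such numbers are typically computed; it is not the official TimeLens evaluation script, and the function names are ours.

```python
# Illustrative sketch of R1@m and mIoU for temporal grounding (not the official eval code).

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(predictions, ground_truths, thresholds=(0.3, 0.5, 0.7)):
    """predictions / ground_truths: lists of (start, end) pairs, one per query."""
    ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    metrics = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    metrics["mIoU"] = 100.0 * sum(ious) / len(ious)
    return metrics

# Example: one exact prediction and one partial overlap
print(grounding_metrics([(2.0, 7.5), (0.0, 4.0)], [(2.0, 7.5), (2.0, 8.0)]))
```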
> For detailed comparison with other models, please refer to the [Leaderboard](https://timelens-arc-lab.github.io/#leaderboard).

## 🚀 Usage

Install the following packages:

```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install qwen-vl-utils[decord]==0.0.14

# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```

Using 🤗Transformers for Inference:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)

# Prepare input
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/blob/main/2Y8XQ.mp4"
GROUNDER_PROMPT = (
    "You are given a video with multiple frames. The numbers before each video frame indicate its "
    "sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', "
    "determining its starting and ending times. The format should be: 'The event happens in - seconds'."
)

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {'type': 'text', 'text': GROUNDER_PROMPT.format(query)}
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)
inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")

output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```

## Citation

If you find our work helpful for your research and applications, please cite our paper:

```bibtex
TODO
```
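The grounding answer produced by the usage example above is free-form text in the format requested by the prompt (e.g. "The event happens in 2.0 - 7.5 seconds"). Below is a minimal, hedged sketch for extracting numeric start/end timestamps from such an answer; the exact reply wording can vary, so the regex is an assumption rather than part of the official toolkit.

```python
import re

def parse_grounding_answer(answer: str):
    """Extract (start, end) seconds from a reply like 'The event happens in 2.0 - 7.5 seconds.'
    Returns None if no timestamp pair is found. The pattern is an illustrative assumption
    about the reply format, not an official parser."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)", answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if start <= end else (end, start)

print(parse_grounding_answer("The event happens in 2.0 - 7.5 seconds."))  # -> (2.0, 7.5)
```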