---
license: other
license_name: bsd-3-clause
license_link: https://github.com/TencentARC/TimeLens/blob/main/LICENSE
language:
- en
tags:
- video-grounding
- temporal-grounding
- video-understanding
- qwen2-vl
library_name: transformers
pipeline_tag: video-text-to-text
datasets:
- TencentARC/TimeLens-100K
- TencentARC/TimeLens-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# TimeLens-7B

📑 [**Paper**](https://arxiv.org/abs/2512.14698) | 💻 [**Code**](https://github.com/TencentARC/TimeLens) | 🏠 [**Project Page**](https://timelens-arc-lab.github.io/) | 🤗 [**Model & Data**](https://huggingface.co/collections/TencentARC/timelens)

## ✨ Model Description

**TimeLens-7B** is an MLLM with strong video temporal grounding (VTG) capability, fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is trained with a carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and an improved timestamp encoding strategy proposed in our [paper](https://arxiv.org/abs/2512.14698), using our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).

## 📊 Performance

TimeLens-7B achieves strong video temporal grounding performance:
<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th colspan="4">Charades-TimeLens</th>
      <th colspan="4">ActivityNet-TimeLens</th>
      <th colspan="4">QVHighlights-TimeLens</th>
    </tr>
    <tr>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Qwen2.5-VL-7B-Instruct</td>
      <td>59.7</td><td>37.8</td><td>16.6</td><td>39.3</td>
      <td>44.1</td><td>31.0</td><td>16.1</td><td>31.4</td>
      <td>41.5</td><td>27.8</td><td>15.2</td><td>31.6</td>
    </tr>
    <tr>
      <td>TimeLens-7B🚀</td>
      <td>70.5</td><td>55.6</td><td>28.4</td><td>48.8</td>
      <td>62.8</td><td>51.0</td><td>32.6</td><td>46.2</td>
      <td>74.1</td><td>62.7</td><td>43.1</td><td>56.0</td>
    </tr>
    <tr>
      <td>Qwen3-VL-8B-Instruct</td>
      <td>69.2</td><td>53.4</td><td>27.5</td><td>48.3</td>
      <td>62.1</td><td>51.2</td><td>34.4</td><td>46.8</td>
      <td>74.2</td><td>64.6</td><td>49.3</td><td>59.4</td>
    </tr>
    <tr>
      <td>TimeLens-8B🚀</td>
      <td>76.6</td><td>63.0</td><td>35.2</td><td>55.2</td>
      <td>68.9</td><td>58.4</td><td>40.6</td><td>53.2</td>
      <td>80.2</td><td>71.6</td><td>55.5</td><td>65.5</td>
    </tr>
  </tbody>
</table>
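Here, R1@m is the percentage of queries whose predicted segment reaches a temporal IoU of at least m with the ground-truth segment, and mIoU is the mean temporal IoU over all queries. The sketch below illustrates how such numbers are typically computed; it is not the official TimeLens evaluation script, and the function names are ours.

```python
# Illustrative sketch of R1@m and mIoU for temporal grounding (not the official eval code).

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(predictions, ground_truths, thresholds=(0.3, 0.5, 0.7)):
    """predictions / ground_truths: lists of (start, end) pairs, one per query."""
    ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    metrics = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    metrics["mIoU"] = 100.0 * sum(ious) / len(ious)
    return metrics

# Example: one exact prediction and one partial overlap
print(grounding_metrics([(2.0, 7.5), (0.0, 4.0)], [(2.0, 7.5), (2.0, 8.0)]))
```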
> For detailed comparison with other models, please refer to the [Leaderboard](https://timelens-arc-lab.github.io/#leaderboard).

## 🚀 Usage

Install the following packages:

```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install qwen-vl-utils[decord]==0.0.14

# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```

Using 🤗Transformers for Inference:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)

# Prepare input
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/blob/main/2Y8XQ.mp4"
GROUNDER_PROMPT = (
    "You are given a video with multiple frames. The numbers before each video frame indicate its "
    "sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', "
    "determining its starting and ending times. The format should be: 'The event happens in - seconds'."
)

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {'type': 'text', 'text': GROUNDER_PROMPT.format(query)}
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)
inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")

output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```

## Citation

If you find our work helpful for your research and applications, please cite our paper:

```bibtex
TODO
```
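The grounding answer produced by the usage example above is free-form text in the format requested by the prompt (e.g. "The event happens in 2.0 - 7.5 seconds"). Below is a minimal, hedged sketch for extracting numeric start/end timestamps from such an answer; the exact reply wording can vary, so the regex is an assumption rather than part of the official toolkit.

```python
import re

def parse_grounding_answer(answer: str):
    """Extract (start, end) seconds from a reply like 'The event happens in 2.0 - 7.5 seconds.'
    Returns None if no timestamp pair is found. The pattern is an illustrative assumption
    about the reply format, not an official parser."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)", answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if start <= end else (end, start)

print(parse_grounding_answer("The event happens in 2.0 - 7.5 seconds."))  # -> (2.0, 7.5)
```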