---
license: other
license_name: bsd-3-clause
license_link: https://github.com/TencentARC/TimeLens/blob/main/LICENSE
language:
- en
tags:
- video-grounding
- temporal-grounding
- video-understanding
- qwen2-vl
library_name: transformers
pipeline_tag: video-text-to-text
datasets:
- TencentARC/TimeLens-100K
- TencentARC/TimeLens-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
# TimeLens-7B
📑 [**Paper**](https://arxiv.org/abs/2512.14698) | 💻 [**Code**](https://github.com/TencentARC/TimeLens) | 🏠 [**Project Page**](https://timelens-arc-lab.github.io/) | 🤗 [**Model & Data**](https://huggingface.co/collections/TencentARC/timelens)
## ✨ Model Description
**TimeLens-7B** is a multimodal LLM (MLLM) with strong video temporal grounding (VTG) capability, fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and improved timestamp encoding strategy proposed in our [paper](https://arxiv.org/abs/2512.14698), using our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).
## 📊 Performance
TimeLens models achieve strong video temporal grounding performance on TimeLens-Bench (R1@{0.3, 0.5, 0.7} and mIoU, higher is better):
<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th colspan="4">Charades-TimeLens</th>
      <th colspan="4">ActivityNet-TimeLens</th>
      <th colspan="4">QVHighlights-TimeLens</th>
    </tr>
    <tr>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
      <th>R1@0.3</th><th>R1@0.5</th><th>R1@0.7</th><th>mIoU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct">Qwen2.5-VL-7B-Instruct</a></td>
      <td>59.7</td><td>37.8</td><td>16.6</td><td>39.3</td>
      <td>44.1</td><td>31.0</td><td>16.1</td><td>31.4</td>
      <td>41.5</td><td>27.8</td><td>15.2</td><td>31.6</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/TencentARC/TimeLens-7B">TimeLens-7B</a>🚀</td>
      <td>70.5</td><td>55.6</td><td>28.4</td><td>48.8</td>
      <td>62.8</td><td>51.0</td><td>32.6</td><td>46.2</td>
      <td>74.1</td><td>62.7</td><td>43.1</td><td>56.0</td>
    </tr>
    <tr>
      <td>Qwen3-VL-8B-Instruct</td>
      <td>69.2</td><td>53.4</td><td>27.5</td><td>48.3</td>
      <td>62.1</td><td>51.2</td><td>34.4</td><td>46.8</td>
      <td>74.2</td><td>64.6</td><td>49.3</td><td>59.4</td>
    </tr>
    <tr>
      <td>TimeLens-8B🚀</td>
      <td>76.6</td><td>63.0</td><td>35.2</td><td>55.2</td>
      <td>68.9</td><td>58.4</td><td>40.6</td><td>53.2</td>
      <td>80.2</td><td>71.6</td><td>55.5</td><td>65.5</td>
    </tr>
  </tbody>
</table>
> For detailed comparison with other models, please refer to the [Leaderboard](https://timelens-arc-lab.github.io/#leaderboard).
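For reference, R1@m is the percentage of queries whose top-1 predicted segment reaches a temporal IoU of at least m with the ground-truth segment, and mIoU is the mean temporal IoU over all queries. The snippet below is a minimal sketch of these standard VTG metrics for illustration only; it is not the official TimeLens-Bench evaluation code.
```python
# Minimal sketch of the standard VTG metrics reported above
# (illustrative only, not the official TimeLens-Bench evaluation script).
def temporal_iou(pred, gt):
    """IoU between a predicted [start, end] segment and the ground-truth segment (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, thresholds=(0.3, 0.5, 0.7)):
    """Compute R1@{thresholds} and mean IoU (both as percentages) over all queries."""
    ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    recalls = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    return {**recalls, "mIoU": 100.0 * sum(ious) / len(ious)}
```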
## 🚀 Usage
Install the following packages:
```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install "qwen-vl-utils[decord]==0.0.14"
# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```
Using 🤗Transformers for Inference:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)
# Prepare input
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"  # local path or URL to the raw video file
GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in - seconds'."
messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]
# Build the chat prompt and extract the sampled video frames
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)
inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")
# Generate the grounding answer (greedy decoding)
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)
# Decode only the newly generated tokens (strip the prompt)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
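The model replies in the format requested by `GROUNDER_PROMPT`, e.g. "The event happens in 3.2 - 7.5 seconds" (timestamps here are illustrative). The hypothetical helper below, which is not part of the released code, shows one way to extract the predicted start/end times from such a reply:
```python
import re

def parse_grounding_answer(answer: str):
    """Extract (start, end) in seconds from a reply such as
    'The event happens in 3.2 - 7.5 seconds'. Returns None if no span is found.
    (Hypothetical convenience helper; not part of the official repo.)"""
    match = re.search(r"([\d.]+)\s*-\s*([\d.]+)\s*seconds", answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if start <= end else (end, start)

print(parse_grounding_answer(answer))
```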
## Citation
If you find our work helpful for your research and applications, please cite our paper:
```bibtex
TODO
```