---
license: other
license_name: bsd-3-clause
license_link: https://github.com/TencentARC/TimeLens/blob/main/LICENSE
language:
- en
tags:
- video-grounding
- temporal-grounding
- video-understanding
- qwen2-vl
library_name: transformers
pipeline_tag: video-text-to-text
datasets:
- TencentARC/TimeLens-100K
- TencentARC/TimeLens-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# TimeLens-7B

📑 [**Paper**](https://arxiv.org/abs/2512.14698) | 💻 [**Code**](https://github.com/TencentARC/TimeLens) | 🏠 [**Project Page**](https://timelens-arc-lab.github.io/) | 🤗 [**Model & Data**](https://huggingface.co/collections/TencentARC/timelens)

## ✨ Model Description

**TimeLens-7B** is a multimodal large language model (MLLM) with strong video temporal grounding (VTG) capability, fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is trained with a carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and an improved timestamp-encoding strategy proposed in our [paper](https://arxiv.org/abs/2512.14698), using our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).

## 📊 Performance

TimeLens-7B achieves strong video temporal grounding performance:

<table>
<thead>
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="4" align="center">Charades-TimeLens</th>
<th colspan="4" align="center">ActivityNet-TimeLens</th>
<th colspan="4" align="center">QVHighlights-TimeLens</th>
</tr>
<tr>
<th align="center">R1<br>@0.3</th>
<th align="center">R1<br>@0.5</th>
<th align="center">R1<br>@0.7</th>
<th align="center">mIoU</th>
<th align="center">R1<br>@0.3</th>
<th align="center">R1<br>@0.5</th>
<th align="center">R1<br>@0.7</th>
<th align="center">mIoU</th>
<th align="center">R1<br>@0.3</th>
<th align="center">R1<br>@0.5</th>
<th align="center">R1<br>@0.7</th>
<th align="center">mIoU</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct">Qwen2.5-VL-7B-Instruct</a></td>
<td align="center">59.7</td>
<td align="center">37.8</td>
<td align="center">16.6</td>
<td align="center">39.3</td>
<td align="center">44.1</td>
<td align="center">31.0</td>
<td align="center">16.1</td>
<td align="center">31.4</td>
<td align="center">41.5</td>
<td align="center">27.8</td>
<td align="center">15.2</td>
<td align="center">31.6</td>
</tr>
<tr>
<td><a href="https://huggingface.co/TencentARC/TimeLens-7B"><b>TimeLens-7B</b>🚀</a></td>
<td align="center"><b>70.5</b></td>
<td align="center"><b>55.6</b></td>
<td align="center"><b>28.4</b></td>
<td align="center"><b>48.8</b></td>
<td align="center"><b>62.8</b></td>
<td align="center"><b>51.0</b></td>
<td align="center"><b>32.6</b></td>
<td align="center"><b>46.2</b></td>
<td align="center"><b>74.1</b></td>
<td align="center"><b>62.7</b></td>
<td align="center"><b>43.1</b></td>
<td align="center"><b>56.0</b></td>
</tr>
<tr>
<td><a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct">Qwen3-VL-8B-Instruct</a></td>
<td align="center">69.2</td>
<td align="center">53.4</td>
<td align="center">27.5</td>
<td align="center">48.3</td>
<td align="center">62.1</td>
<td align="center">51.2</td>
<td align="center">34.4</td>
<td align="center">46.8</td>
<td align="center">74.2</td>
<td align="center">64.6</td>
<td align="center">49.3</td>
<td align="center">59.4</td>
</tr>
<tr>
<td><a href="https://huggingface.co/TencentARC/TimeLens-8B"><b>TimeLens-8B</b>🚀</a></td>
<td align="center"><b>76.6</b></td>
<td align="center"><b>63.0</b></td>
<td align="center"><b>35.2</b></td>
<td align="center"><b>55.2</b></td>
<td align="center"><b>68.9</b></td>
<td align="center"><b>58.4</b></td>
<td align="center"><b>40.6</b></td>
<td align="center"><b>53.2</b></td>
<td align="center"><b>80.2</b></td>
<td align="center"><b>71.6</b></td>
<td align="center"><b>55.5</b></td>
<td align="center"><b>65.5</b></td>
</tr>
</tbody>
</table>

> For a detailed comparison with other models, please refer to the 🏆 [Leaderboard](https://timelens-arc-lab.github.io/#leaderboard).
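
In the table above, R1@μ is the fraction of queries whose predicted segment overlaps the ground-truth segment with a temporal IoU of at least μ, and mIoU is the mean temporal IoU over all queries. Below is a minimal sketch of how these metrics can be computed; the helper names are illustrative and not part of the TimeLens codebase.

```python
# Illustrative only: temporal IoU, Recall@1 at IoU thresholds, and mean IoU.
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, thresholds=(0.3, 0.5, 0.7)):
    """predictions / ground_truths: lists of (start, end) tuples, one per query."""
    ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    r1 = {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    miou = sum(ious) / len(ious)
    return r1, miou

# Example: a prediction of (2.1, 7.8) s against a ground truth of (2.0, 8.0) s
# gives R1@0.3/0.5/0.7 = 1.0 and mIoU ≈ 0.95.
print(evaluate([(2.1, 7.8)], [(2.0, 8.0)]))
```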

## 🚀 Usage

Install the following packages:

```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install qwen-vl-utils[decord]==0.0.14
# use FlashAttention-2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```

Use 🤗 Transformers for inference:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)

# Prepare input: the query describes the event to be localized in the video
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"

GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")

# Generate the grounding answer
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

# Decode only the newly generated tokens
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
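
The model replies in the format requested by `GROUNDER_PROMPT`, e.g. "The event happens in 3.0 - 10.5 seconds". Below is a minimal sketch for turning such a reply into numeric timestamps; the regex and helper name are illustrative, not part of the TimeLens codebase.

```python
import re

def parse_time_span(answer: str):
    """Extract (start, end) in seconds from a reply such as
    'The event happens in 3.0 - 10.5 seconds'; returns None if no span is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if start <= end else (end, start)

print(parse_time_span("The event happens in 3.0 - 10.5 seconds"))  # (3.0, 10.5)
```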

## Citation

If you find our work helpful for your research and applications, please cite our paper:

```bibtex
TODO
```