---
license: other
license_name: bsd-3-clause
license_link: https://github.com/TencentARC/TimeLens/blob/main/LICENSE
language:
- en
tags:
- video-grounding
- temporal-grounding
- video-understanding
- qwen2-vl
library_name: transformers
pipeline_tag: video-text-to-text
datasets:
- TencentARC/TimeLens-100K
- TencentARC/TimeLens-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# TimeLens-7B

📑 [**Paper**](https://arxiv.org/abs/2512.14698) | 💻 [**Code**](https://github.com/TencentARC/TimeLens) | 🏠 [**Project Page**](https://timelens-arc-lab.github.io/) | 🤗 [**Model & Data**](https://huggingface.co/collections/TencentARC/timelens)

## ✨ Model Description

**TimeLens-7B** is a multimodal large language model (MLLM) with strong video temporal grounding (VTG) capability, fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is trained with a carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and an improved timestamp-encoding strategy proposed in our [paper](https://arxiv.org/abs/2512.14698), using our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).

## 📊 Performance

TimeLens-7B achieves strong video temporal grounding performance:

<table>
<thead>
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="4" align="center">Charades-TimeLens</th>
<th colspan="4" align="center">ActivityNet-TimeLens</th>
<th colspan="4" align="center">QVHighlights-TimeLens</th>
</tr>
<tr>
<th align="center">R1<br>@0.3</th>
<th align="center">R1<br>@0.5</th>
<th align="center">R1<br>@0.7</th>
<th align="center">mIoU</th>
<th align="center">R1<br>@0.3</th>
<th align="center">R1<br>@0.5</th>
<th align="center">R1<br>@0.7</th>
<th align="center">mIoU</th>
<th align="center">R1<br>@0.3</th>
<th align="center">R1<br>@0.5</th>
<th align="center">R1<br>@0.7</th>
<th align="center">mIoU</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct">Qwen2.5-VL-7B-Instruct</a></td>
<td align="center">59.7</td>
<td align="center">37.8</td>
<td align="center">16.6</td>
<td align="center">39.3</td>
<td align="center">44.1</td>
<td align="center">31.0</td>
<td align="center">16.1</td>
<td align="center">31.4</td>
<td align="center">41.5</td>
<td align="center">27.8</td>
<td align="center">15.2</td>
<td align="center">31.6</td>
</tr>
<tr>
<td><a href="https://huggingface.co/TencentARC/TimeLens-7B"><b>TimeLens-7B</b>🚀</a></td>
<td align="center"><b>70.5</b></td>
<td align="center"><b>55.6</b></td>
<td align="center"><b>28.4</b></td>
<td align="center"><b>48.8</b></td>
<td align="center"><b>62.8</b></td>
<td align="center"><b>51.0</b></td>
<td align="center"><b>32.6</b></td>
<td align="center"><b>46.2</b></td>
<td align="center"><b>74.1</b></td>
<td align="center"><b>62.7</b></td>
<td align="center"><b>43.1</b></td>
<td align="center"><b>56.0</b></td>
</tr>
<tr>
<td><a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct">Qwen3-VL-8B-Instruct</a></td>
<td align="center">69.2</td>
<td align="center">53.4</td>
<td align="center">27.5</td>
<td align="center">48.3</td>
<td align="center">62.1</td>
<td align="center">51.2</td>
<td align="center">34.4</td>
<td align="center">46.8</td>
<td align="center">74.2</td>
<td align="center">64.6</td>
<td align="center">49.3</td>
<td align="center">59.4</td>
</tr>
<tr>
<td><a href="https://huggingface.co/TencentARC/TimeLens-8B"><b>TimeLens-8B</b>🚀</a></td>
<td align="center"><b>76.6</b></td>
<td align="center"><b>63.0</b></td>
<td align="center"><b>35.2</b></td>
<td align="center"><b>55.2</b></td>
<td align="center"><b>68.9</b></td>
<td align="center"><b>58.4</b></td>
<td align="center"><b>40.6</b></td>
<td align="center"><b>53.2</b></td>
<td align="center"><b>80.2</b></td>
<td align="center"><b>71.6</b></td>
<td align="center"><b>55.5</b></td>
<td align="center"><b>65.5</b></td>
</tr>
</tbody>
</table>

> For a detailed comparison with other models, please refer to the 🏆 [Leaderboard](https://timelens-arc-lab.github.io/#leaderboard).
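
In the table above, R1@μ is the fraction of queries whose predicted segment overlaps the ground-truth segment with a temporal IoU of at least μ, and mIoU is the mean temporal IoU over all queries. Below is a minimal sketch of how these metrics can be computed; the helper names are illustrative and not part of the TimeLens codebase.

```python
# Illustrative only: temporal IoU, Recall@1 at IoU thresholds, and mean IoU.
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, thresholds=(0.3, 0.5, 0.7)):
    """predictions / ground_truths: lists of (start, end) tuples, one per query."""
    ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    r1 = {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    miou = sum(ious) / len(ious)
    return r1, miou

# Example: a prediction of (2.1, 7.8) s against a ground truth of (2.0, 8.0) s
# gives R1@0.3/0.5/0.7 = 1.0 and mIoU ≈ 0.95.
print(evaluate([(2.1, 7.8)], [(2.0, 8.0)]))
```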

## 🚀 Usage

Install the following packages:

```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install qwen-vl-utils[decord]==0.0.14
# use FlashAttention-2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```

Use 🤗 Transformers for inference:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)

# Prepare input: the query describes the event to be localized in the video
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"

GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")

# Generate the grounding answer
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

# Decode only the newly generated tokens
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
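
The model replies in the format requested by `GROUNDER_PROMPT`, e.g. "The event happens in 3.0 - 10.5 seconds". Below is a minimal sketch for turning such a reply into numeric timestamps; the regex and helper name are illustrative, not part of the TimeLens codebase.

```python
import re

def parse_time_span(answer: str):
    """Extract (start, end) in seconds from a reply such as
    'The event happens in 3.0 - 10.5 seconds'; returns None if no span is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if start <= end else (end, start)

print(parse_time_span("The event happens in 3.0 - 10.5 seconds"))  # (3.0, 10.5)
```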

## Citation

If you find our work helpful for your research and applications, please cite our paper:

```bibtex
TODO
```