TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
📄 Paper | 💻 Code | 🌐 Project Page | 🤗 Model & Data
TimeLens-7B is an MLLM with strong video temporal grounding (VTG) capability, fine-tuned from Qwen2.5-VL-7B-Instruct. It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and the improved timestamp encoding strategy proposed in our paper, using our high-quality VTG training dataset TimeLens-100K.
TimeLens models achieve strong video temporal grounding performance:
| Model (with 🤗 HuggingFace link) | Charades-TimeLens R1@0.3 | R1@0.5 | R1@0.7 | mIoU | ActivityNet-TimeLens R1@0.3 | R1@0.5 | R1@0.7 | mIoU | QVHighlights-TimeLens R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B (ours) | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B (ours) | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |
For detailed comparison with other models, please refer to the Leaderboard.
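Here, R1@m is the percentage of queries whose top-1 predicted segment has a temporal IoU of at least m with the ground-truth segment, and mIoU is the mean temporal IoU over all queries. Below is a minimal sketch of how such numbers can be computed; the helper names are illustrative and this is not the official evaluation script.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """preds / gts: lists of (start, end) pairs, one prediction per query."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    r1 = {t: 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    miou = 100.0 * sum(ious) / len(ious)
    return r1, miou

# Toy example with two queries
print(evaluate([(2.0, 7.5), (10.0, 14.0)], [(1.8, 8.0), (12.0, 20.0)]))
```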
Install the following packages:
```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install qwen-vl-utils[decord]==0.0.14

# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```
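Optionally, you can check that the environment is set up correctly before running inference; this quick check script is not part of the official instructions.

```python
# Optional sanity check: confirm the pinned packages import and CUDA is visible.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)

try:
    import flash_attn  # only required if you pass attn_implementation="flash_attention_2"
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; use attn_implementation='sdpa' instead")
```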
Using 🤗 Transformers for Inference:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)

# Prepare input: a natural-language query and the video to ground it in
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/blob/main/2Y8XQ.mp4"
GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

# Build the chat prompt and extract the video frames
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)
inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")

# Generate the grounding answer
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

# Strip the prompt tokens and decode only the newly generated text
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
If you find our work helpful for your research and applications, please cite our paper:
TODO