TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
📄 Paper | 💻 Code | 🌐 Project Page | 🤗 Model & Data
TimeLens-7B is an MLLM with strong video temporal grounding (VTG) capability, fine-tuned from Qwen2.5-VL-7B-Instruct. It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and the improved timestamp encoding strategy proposed in our paper, using our high-quality VTG training dataset TimeLens-100K.
TimeLens models achieve strong video temporal grounding performance:
| Model (with 🤗 HuggingFace link) | Charades-TimeLens R1@0.3 | R1@0.5 | R1@0.7 | mIoU | ActivityNet-TimeLens R1@0.3 | R1@0.5 | R1@0.7 | mIoU | QVHighlights-TimeLens R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B (ours) | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B (ours) | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |
For detailed comparison with other models, please refer to the Leaderboard.
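Here, R1@m is the percentage of queries whose top-1 predicted segment has a temporal IoU of at least m with the ground-truth segment, and mIoU is the mean temporal IoU over all queries. Below is a minimal sketch of how such numbers can be computed; the helper names are illustrative and this is not the official evaluation script.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """preds / gts: lists of (start, end) pairs, one prediction per query."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    r1 = {t: 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    miou = 100.0 * sum(ious) / len(ious)
    return r1, miou

# Toy example with two queries
print(evaluate([(2.0, 7.5), (10.0, 14.0)], [(1.8, 8.0), (12.0, 20.0)]))
```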
Install the following packages:
```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install qwen-vl-utils[decord]==0.0.14

# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```
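Optionally, you can check that the environment is set up correctly before running inference; this quick check script is not part of the official instructions.

```python
# Optional sanity check: confirm the pinned packages import and CUDA is visible.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)

try:
    import flash_attn  # only required if you pass attn_implementation="flash_attention_2"
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; use attn_implementation='sdpa' instead")
```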
Using 🤗 Transformers for Inference:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)

# Prepare input: a natural-language query and the video to ground it in
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/blob/main/2Y8XQ.mp4"
GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

# Build the chat prompt and extract the video frames
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)
inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")

# Generate the grounding answer
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

# Strip the prompt tokens and decode only the newly generated text
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
If you find our work helpful for your research and applications, please cite our paper:
TODO