---
datasets:
- kinetics700
library_name: pytorch
license: apache-2.0
pipeline_tag: image-feature-extraction
tags:
- deltatok
- cvpr2026-highlight
---

# DeltaTok (Tokenizer) — Kinetics-700

DeltaTok is a video tokenizer that compresses the frame-to-frame change in vision foundation model features into a single continuous "delta" token, as introduced in [A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens](https://huggingface.co/papers/2604.04913) (CVPR 2026 Highlight). Because each frame contributes only one token, video sequences become much shorter (e.g., a 1,024x reduction in token count), which enables efficient generative world modeling.

This repository contains the ViT-B encoder and decoder trained on Kinetics-700 at 512x512 resolution.
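To make the quoted reduction concrete, the sketch below counts tokens for a single frame. It assumes a ViT patch size of 16, which is not stated in this card; the value is chosen because a 512x512 frame then yields 32 x 32 = 1,024 patch tokens, matching the 1,024x figure. The function names are illustrative and are not part of the DeltaTok codebase.

```python
def patch_tokens_per_frame(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping patch tokens a ViT produces for one frame."""
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def token_reduction(height: int, width: int, patch_size: int) -> int:
    """Ratio of patch tokens per frame to delta tokens per frame."""
    delta_tokens_per_frame = 1  # one delta token per frame, as described above
    return patch_tokens_per_frame(height, width, patch_size) // delta_tokens_per_frame


# Assumption: patch size 16 (check the backbone configuration you actually use).
PATCH_SIZE = 16

if __name__ == "__main__":
    print(patch_tokens_per_frame(512, 512, PATCH_SIZE))
    print(token_reduction(512, 512, PATCH_SIZE))
```

With a different resolution or patch size the ratio changes accordingly, so the 1,024x figure should be read as specific to this configuration.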

## Metrics

Reconstruction quality is measured by applying downstream task heads to the reconstructed features. Arrows indicate whether higher (↑) or lower (↓) values are better.

| Method | Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
|--------|---------|---------------|---------------------|----------------|
| *Present (upper bound)* | — | *58.4* | *70.5* | *2.79* |
| DeltaTok | Short (1 frame) | 58.6 | 69.6 | 2.78 |
| DeltaTok | Mid (3 frames)* | 58.5 | 67.9 | 2.86 |

\*Mid horizon: delta tokens are encoded in parallel from ground-truth frames and decoded autoregressively from previous reconstructions.
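The mid-horizon protocol in the footnote can be sketched as follows. This is one reading of the footnote, not the repository's evaluation code: `toy_encode` and `toy_decode` are hypothetical stand-ins (a plain elementwise difference and sum) that only illustrate the data flow, and unlike the real tokenizer they do not compress anything into a single token.

```python
from typing import Callable, List, Sequence

Features = List[float]  # stand-in for one frame's feature map


def toy_encode(prev: Features, curr: Features) -> Features:
    """Hypothetical encoder: the 'delta' is the elementwise change."""
    return [c - p for p, c in zip(prev, curr)]


def toy_decode(prev: Features, delta: Features) -> Features:
    """Hypothetical decoder: apply the delta to the previous features."""
    return [p + d for p, d in zip(prev, delta)]


def rollout(
    frames: Sequence[Features],
    encode: Callable[[Features, Features], Features],
    decode: Callable[[Features, Features], Features],
) -> List[Features]:
    # Parallel encoding: each delta uses only ground-truth neighbours,
    # so all deltas can be computed independently of one another.
    deltas = [encode(frames[t - 1], frames[t]) for t in range(1, len(frames))]

    # Autoregressive decoding: each step starts from the previous
    # reconstruction, so decoding errors can accumulate over the horizon.
    recon = [list(frames[0])]  # the first frame is given as context
    for delta in deltas:
        recon.append(decode(recon[-1], delta))
    return recon


if __name__ == "__main__":
    # One context frame plus a 3-frame horizon.
    clip = [[0.0, 1.0], [0.5, 1.5], [1.0, 3.0], [2.0, 2.0]]
    print(rollout(clip, toy_encode, toy_decode))
```

Because the toy encoder and decoder are lossless, the rollout reproduces the input exactly; with a learned, lossy tokenizer the later frames would depend on earlier reconstruction errors.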

## Usage

DeltaTok requires a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-B backbone. Full training and evaluation code is available in the [DeltaTok GitHub repository](https://github.com/amazon-far/deltatok). To evaluate this checkpoint with that code:

```bash
python main.py validate -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin
```

## Acknowledgements

- [DINOv3](https://github.com/facebookresearch/dinov3)
- [Kinetics-700](https://github.com/cvdfoundation/kinetics-dataset)

## Citation

```bibtex
@inproceedings{kerssies2026deltatok,
  title     = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
  author    = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```