---
datasets:
- kinetics700
library_name: pytorch
license: apache-2.0
pipeline_tag: image-feature-extraction
tags:
- deltatok
- cvpr2026-highlight
---

# DeltaTok (Tokenizer) - Kinetics-700

DeltaTok is a video tokenizer that compresses the frame-to-frame change in vision foundation model features into a single continuous "delta" token, as introduced in [A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens](https://huggingface.co/papers/2604.04913) (CVPR 2026 Highlight). This approach significantly reduces the token count in video sequences (e.g., 1,024x reduction) and enables efficient generative world modeling.

This repository contains the ViT-B encoder and decoder trained on Kinetics-700 at 512x512 resolution.

## Metrics

Reconstruction quality, measured by applying downstream task heads to the reconstructed features.

| Method | Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
|--------|---------|---------------|---------------------|----------------|
| *Present (upper bound)* | - | *58.4* | *70.5* | *2.79* |
| DeltaTok | Short (1 frame) | 58.6 | 69.6 | 2.78 |
| DeltaTok | Mid (3 frames)\* | 58.5 | 67.9 | 2.86 |

\*Parallel encoding from ground-truth frames with autoregressive decoding from previous reconstructions.

## Usage

Requires a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-B backbone. Full training and evaluation code is available in the [DeltaTok GitHub repository](https://github.com/amazon-far/deltatok).
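
The backbone's patch grid determines how many feature tokens each frame produces, and therefore the reduction factor quoted above. The following is a minimal sketch of that arithmetic, assuming a patch size of 16 (this card does not state the patch size) and ignoring any class or register tokens. The helper `patch_tokens_per_frame` is hypothetical and not part of the DeltaTok codebase.

```python
def patch_tokens_per_frame(resolution: int, patch_size: int) -> int:
    """Count the non-overlapping square patches in a square frame."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size  # patches along one edge
    return side * side


# 512x512 is the training resolution stated in this card.
RESOLUTION = 512
# Assumption: patch size of 16, not stated in this card.
PATCH_SIZE = 16

tokens = patch_tokens_per_frame(RESOLUTION, PATCH_SIZE)
# DeltaTok represents each frame-to-frame change with one delta token.
DELTA_TOKENS_PER_FRAME = 1
reduction = tokens // DELTA_TOKENS_PER_FRAME

print(f"{tokens} patch tokens per frame -> 1 delta token ({reduction}x reduction)")
```

Under these assumptions a frame yields a 32x32 grid of 1,024 patch tokens, so replacing them with one delta token gives the 1,024x reduction mentioned above.
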

To evaluate:

```bash
python main.py validate -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin
```

## Acknowledgements

- [DINOv3](https://github.com/facebookresearch/dinov3)
- [Kinetics-700](https://github.com/cvdfoundation/kinetics-dataset)

## Citation

```bibtex
@inproceedings{kerssies2026deltatok,
  title     = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
  author    = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```