Amazon-FAR/deltatok-kinetics

Image Feature Extraction

cvpr2026-highlight

tommiekerssies committed on Apr 7

Commit 74d9e65 · 1 Parent(s): a8646ee

Standardize model card

Files changed (1)

README.md +9 -8

@@ -10,19 +10,20 @@ tags:
 # DeltaTok (Tokenizer) — Kinetics-700
-This repository contains the DeltaTok weights as presented in the paper [A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens](https://huggingface.co/papers/2604.04913) (CVPR 2026).
-[**Project Page**](https://deltatok.github.io) | [**GitHub**](https://github.com/amazon-far/deltatok)
-DeltaTok is a video tokenizer that encodes the vision foundation model (VFM) feature differences between consecutive frames into a single continuous "delta" token. This approach significantly reduces the token count in video sequences (e.g., 1,024x reduction) while enabling efficient generative world modeling.
-## Model Description
-This repository contains the ViT-B encoder and decoder trained on Kinetics-700 at 512x512 resolution. The model is designed to work with a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-B backbone (not included).
 ## Usage
-Please refer to the [DeltaTok GitHub repository](https://github.com/amazon-far/deltatok) for setup, training, and evaluation instructions.
 ## Acknowledgements

 # DeltaTok (Tokenizer) — Kinetics-700
+DeltaTok is a video tokenizer that encodes the vision foundation model (VFM) feature differences between consecutive frames into a single continuous "delta" token, as introduced in [A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens](https://huggingface.co/papers/2604.04913) (CVPR 2026). This approach significantly reduces the token count in video sequences (e.g., 1,024x reduction) and enables efficient generative world modeling.
+[**Project Page**](https://deltatok.github.io) | [**GitHub**](https://github.com/amazon-far/deltatok) | [**Paper**](https://huggingface.co/papers/2604.04913)
+This repository contains the ViT-B encoder and decoder trained on Kinetics-700 at 512x512 resolution.
 ## Usage
+Requires a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-B backbone. Full training and evaluation code is available in the [DeltaTok GitHub repository](https://github.com/amazon-far/deltatok). To evaluate:
+```bash
+python main.py validate -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
+  --model.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin
+```
 ## Acknowledgements
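
Note on the "1,024x reduction" quoted in the model card: the figure can be sanity-checked with a little arithmetic. The sketch below assumes the DINOv3 ViT-B backbone uses a 16x16 patch size (an assumption, not stated in this diff) and that each frame after tokenization is represented by one delta token, as the card describes. The helper name `patch_tokens_per_frame` is hypothetical and is not part of the DeltaTok codebase.

```python
def patch_tokens_per_frame(resolution: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square frame."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size  # patches along one edge
    return side * side


RESOLUTION = 512           # from the model card: trained at 512x512
PATCH_SIZE = 16            # assumption: ViT-B/16 backbone
DELTA_TOKENS_PER_FRAME = 1  # from the model card: a single delta token

vfm_tokens = patch_tokens_per_frame(RESOLUTION, PATCH_SIZE)
reduction = vfm_tokens // DELTA_TOKENS_PER_FRAME

print(f"patch tokens per frame: {vfm_tokens}")
print(f"token reduction factor: {reduction}x")
```

Under these assumptions a 512x512 frame yields 32 x 32 patch tokens, which matches the reduction factor the card quotes.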