---
license: apache-2.0
tags:
- vision-language model
- llama
- video understanding
pipeline_tag: video-text-to-text
library_name: transformers
---

# Flash-VStream Model Card

This repository contains the Flash-VStream model presented in the paper [Flash-VStream: Efficient Real-Time Understanding for Long Video Streams](https://huggingface.co/papers/2506.23825).

## Model details

We propose Flash-VStream, a video-language model that simulates the memory mechanism of humans. The model can process extremely long video streams in real time while responding to user queries simultaneously.

## Training data

This model is trained on image data from the LLaVA-1.5 dataset and video data from the WebVid and ActivityNet datasets, following LLaMA-VID. The training mixture includes:

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following samples.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT samples.
- 232K video-caption pairs sampled from the WebVid 2.5M dataset.
- 98K videos from ActivityNet with QA pairs from Video-ChatGPT.

## Sample Usage

You can load Flash-VStream with the `transformers` library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The model can be loaded across multiple GPUs or offloaded to CPU if needed.
# This example assumes a single GPU is available.
model_path = 'IVGSZ/Flash-VStream-7b'  # Replace with the actual model ID if different

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for efficient memory usage
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# For detailed instructions on image/video preprocessing and chat interactions,
# please refer to the official GitHub repository:
# https://github.com/IVGSZ/Flash-VStream
```

A sketch of one possible frame-sampling step for video input is shown at the end of this card.

## License

This project is licensed under the [Apache-2.0 License](https://github.com/IVGSZ/Flash-VStream/blob/main/LICENSE).
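
The official repository ships its own video loaders and feature extraction, so the snippet below is only a minimal sketch of one common preprocessing approach: uniformly sampling a fixed number of frames from a video file with OpenCV before passing them to the model's processing pipeline. The `load_video` helper and the `num_frames` value are illustrative assumptions, not part of the Flash-VStream API.

```python
import cv2
import numpy as np

def load_video(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video file.

    Illustrative helper only; the official Flash-VStream repository
    provides its own video loading and preprocessing code.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes frames as BGR; convert to RGB for vision encoders.
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

# Example: frames = load_video("example.mp4", num_frames=16)
# The sampled frames would then go through the visual encoder / processor
# described in the official repository before being passed to the model.
```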