---
language: en
license: apache-2.0
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
datasets:
- daniel3303/StoryReasoning
metrics:
- precision
- recall
- bleu
- meteor
- rouge
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
- name: QwenStoryteller
  results:
  - task:
      type: visual-storytelling
      name: Visual Storytelling
    dataset:
      name: StoryReasoning
      type: daniel3303/StoryReasoning
      split: test
    metrics:
    - name: Character Precision
      type: precision
      value: 0.83
    - name: Object Precision
      type: precision
      value: 0.46
    - name: Total Precision
      type: precision
      value: 0.57
    - name: mAP
      type: mean_average_precision
      value: 0.27
    - name: Character Recall
      type: recall
      value: 0.62
    - name: Object Recall
      type: recall
      value: 0.25
    - name: Total Recall
      type: recall
      value: 0.40
    - name: METEOR
      type: meteor
      value: 0.14
    - name: ROUGE-L
      type: rouge-l
      value: 0.16
    - name: BLEU-4
      type: bleu-4
      value: 0.054
    - name: Description Accuracy
      type: accuracy
      value: 2.76
      description: "Rating on a scale of 1-5"
    - name: Average Hallucinations
      type: error_rate
      value: 3.56
      description: "Average number of hallucinations per story"
library_name: transformers
---

# QwenStoryteller

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency. It generates coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

- **Base Model:** Qwen2.5-VL 7B (`Qwen/Qwen2.5-VL-7B-Instruct`)
- **Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)
- **Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:
- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with rank 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training ran for 4 epochs with a peak learning rate of 1×10⁻⁴, a batch size of 32, learning-rate warmup over the first 3% of steps, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.

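To make the training hyperparameters easier to reason about, here is a minimal sketch that collects them in a small dataclass and derives two quantities from them. It is not the authors' training script: the class and method names are invented for illustration, the scaling assumes the conventional LoRA rule (alpha divided by rank), the step count assumes one optimizer step per batch, and the example count is a placeholder rather than the real dataset size.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the model description (names are illustrative)."""

    lora_rank: int = 2048
    lora_alpha: int = 4096
    peak_learning_rate: float = 1e-4
    batch_size: int = 32
    warmup_fraction: float = 0.03
    epochs: int = 4
    weight_decay: float = 0.01

    @property
    def lora_scaling(self) -> float:
        # Assumption: conventional LoRA multiplies the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    def total_steps(self, num_examples: int) -> int:
        # Assumption: one optimizer step per batch (no gradient accumulation).
        return math.ceil(num_examples / self.batch_size) * self.epochs

    def warmup_steps(self, num_examples: int) -> int:
        # Warmup covers the first 3% of all optimizer steps.
        return round(self.total_steps(num_examples) * self.warmup_fraction)


config = FineTuneConfig()
HYPOTHETICAL_NUM_EXAMPLES = 3200  # placeholder, not the real dataset size

print(config.lora_scaling)                             # 2.0
print(config.total_steps(HYPOTHETICAL_NUM_EXAMPLES))   # 400
print(config.warmup_steps(HYPOTHETICAL_NUM_EXAMPLES))  # 12
```

With rank 2048 and alpha 4096, the conventional scaling factor works out to 2, so the low-rank update is applied at twice its raw magnitude.
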
## System Prompt
The model was trained with the following system prompt, and we recommend using it unchanged for inference.

```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model

## Usage

```python
# qwen_vl_utils is provided by the qwen-vl-utils package (pip install qwen-vl-utils)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# System prompt used during training (see the System Prompt section)
SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg"),
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system",
        "content": SYSTEM_PROMPT,
    },
    {
        "role": "user",
        "content": image_content,
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Drop the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)
```

### Using vLLM for faster inference

For significantly faster inference, you can serve the model with vLLM. Install vLLM and run:

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller
```

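Once the server is running, `vllm serve` exposes an OpenAI-compatible HTTP API (on port 8000 by default). The sketch below shows one way a client might build a multi-image chat request using only the Python standard library. The helper names (`image_to_data_url`, `build_chat_payload`, `request_story`) are hypothetical, and the sampling values simply mirror the Transformers example. Depending on the vLLM version, the server may also need its per-prompt image limit raised (the `--limit-mm-per-prompt` option) before it accepts several images in one request.

```python
import base64
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_chat_payload(images, model="daniel3303/QwenStoryteller", max_tokens=4096):
    """Build an OpenAI-style chat payload: all images first, then the text prompt."""
    content = [
        {"type": "image_url", "image_url": {"url": image_to_data_url(data)}}
        for data in images
    ]
    content.append({"type": "text", "text": "Generate a story based on these images."})
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
    }


def request_story(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the chat completions endpoint and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example usage (requires a running server and local image files):
# frames = [open(f"image{i}.jpg", "rb").read() for i in range(1, 6)]
# print(request_story(build_chat_payload(frames)))
```

The payload construction is kept separate from the network call so that the request body can be inspected or logged before it is sent.
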
## Output Format

QwenStoryteller produces two main outputs:

1. **Chain-of-Thought Analysis (`<think></think>`):** A structured analysis containing:
   - Character tables with consistent identity references, emotions, actions, and spatial locations
   - Object tables with functions, interactions, and spatial coordinates
   - Setting tables categorizing environmental elements
   - Narrative structure tables modeling story progression

2. **Grounded Story:** A narrative with specialized XML tags linking text to visual elements:
   - `<gdi>`: Image tags for specific frames
   - `<gdo>`: Entity reference tags for character and object mentions
   - `<gda>`: Action tags for character actions
   - `<gdl>`: Location/landmark tags for background elements

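The two-part output can be separated with a small amount of post-processing. The sketch below is one possible approach, not an official parser: the function names are invented, the sample string is made up for illustration, and the attribute syntax inside the tags (for example `char1` or `image1`) is an assumption, so the regular expression accepts any attribute text.

```python
import re

GROUNDING_TAGS = ("gdi", "gdo", "gda", "gdl")


def split_reasoning(output: str) -> tuple[str, str]:
    """Return (chain-of-thought text, grounded story) from a raw model output."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat the whole output as the story.
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()


def extract_tags(story: str, tag: str) -> list[tuple[str, str]]:
    """Return (attribute text, enclosed text) pairs for one grounding tag."""
    if tag not in GROUNDING_TAGS:
        raise ValueError(f"unknown grounding tag: {tag}")
    pattern = re.compile(rf"<{tag}\b([^>]*)>(.*?)</{tag}>", flags=re.DOTALL)
    return [(attrs.strip(), text.strip()) for attrs, text in pattern.findall(story)]


# Made-up sample; the attribute format is an assumption, not taken from the model card.
sample = (
    "<think>Characters: char1 appears in frames 1 and 2.</think>\n"
    "<gdi image1><gdo char1>Sarah</gdo> <gda char1>walked</gda> toward "
    "<gdl lm1>the old bridge</gdl>.</gdi>"
)

thinking, story = split_reasoning(sample)
print(thinking)                    # Characters: char1 appears in frames 1 and 2.
print(extract_tags(story, "gdo"))  # [('char1', 'Sarah')]
print(extract_tags(story, "gdl"))  # [('lm1', 'the old bridge')]
```

Because `<gdi>` spans wrap the other tags, each tag type is extracted with its own pattern rather than with a single pass over all four.
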
## Limitations

- Re-identification relies primarily on object appearance rather than overall context, which can lead to confusion between similar-looking objects or people
- Movie-derived training data introduces biases from cinematic composition that may not generalize to candid visual sequences
- First-person pronouns have low grounding rates because they appear mainly in character dialogue
- The model may still produce hallucinations, albeit at a reduced rate compared to the base model

## Citation

```
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292},
}
```

## Contact

For questions or feedback regarding this model, please contact:
- Daniel A. P. Oliveira (daniel.oliveira@inesc-id.pt)