Improve model card: Add full paper abstract

#3
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +68 -35
README.md CHANGED
@@ -1,18 +1,18 @@
  ---
- license: mit
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternViT-6B-448px-V2_5
- - internlm/internlm2_5-20b-chat
- base_model_relation: merge
  language:
- - multilingual
  tags:
- - internvl
- - custom_code
- datasets:
- - HuggingFaceFV/finevideo
  ---

  # InternVL2_5-26B
@@ -25,11 +25,9 @@ datasets:
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
  </div>

- ## Introduction

- We are excited to introduce **InternVL 2.5**, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality.
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/5HDAGOQOZvS1EtI107Ac-.png)

  ## InternVL 2.5 Family
 
@@ -361,40 +359,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # single-image single-round conversation (单图单轮对话)
- question = '<image>\nPlease describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
- print(f'User: {question}\nAssistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
- question = '<image>\nPlease describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- question = '<image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -402,17 +410,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -420,13 +432,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
                               num_patches_list=num_patches_list,
                               questions=questions,
                               generation_config=generation_config)
  for question, response in zip(questions, responses):
-     print(f'User: {question}\nAssistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -464,17 +478,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
  ```

  #### Streaming Output
@@ -556,7 +577,9 @@ image_urls=[

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
  print(response.text)
  ```
 
@@ -676,3 +699,13 @@ If you find this project useful in your research, please consider citing:
  year={2024}
  }
  ```
  ---
  base_model:
+ - OpenGVLab/InternViT-6B-448px-V2_5
+ - internlm/internlm2_5-20b-chat
+ datasets:
+ - HuggingFaceFV/finevideo
  language:
+ - multilingual
+ library_name: transformers
+ license: mit
+ pipeline_tag: image-text-to-text
  tags:
+ - internvl
+ - custom_code
+ base_model_relation: merge
  ---

  # InternVL2_5-26B
 
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
  </div>

+ ## Abstract

+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. For the HuggingFace demo, see this https URL.

  ## InternVL 2.5 Family
 
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # single-image single-round conversation (单图单轮对话)
+ question = '<image>\nPlease describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
+ print(f'User: {question}\nAssistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
+ question = '<image>\nPlease describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+ question = '<image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()

  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()

  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
                               num_patches_list=num_patches_list,
                               questions=questions,
                               generation_config=generation_config)
  for question, response in zip(questions, responses):
+     print(f'User: {question}\nAssistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
 
  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')
  ```

  #### Streaming Output
 
  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
  print(response.text)
  ```
 
 
  year={2024}
  }
  ```
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+ ______________________________________________________________________
+
+ Scan the following QR code to join our WeChat group.
+
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>