Fix PaddleOCR-VL image size inference for ambiguous RGB shapes
## What does this PR do?
This PR fixes an edge case in PaddleOCRVLImageProcessor when processing RGB images whose converted numpy shape is ambiguous, such as (3, x, 3).
For example, a PIL image with size (width=11, height=3) is converted to a numpy array with shape (3, 11, 3). This shape admits two interpretations:
- channels-last: (height=3, width=11, channels=3) (the correct one for PIL conversions)
- channels-first: (channels=3, height=11, width=3)

If the channel dimension is inferred incorrectly, the computed `image_grid_thw` may have its height and width swapped.
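The ambiguity can be seen directly with numpy and PIL. This is a minimal sketch of the problem, not processor code: a shape-based heuristic ("the axis of size 3 is the channel axis") cannot decide between the first and last dimension here.

```python
import numpy as np
from PIL import Image

# A PIL image with size (width=11, height=3) converts to an array whose
# first and last dimensions are both 3, so the channel axis is ambiguous.
image = Image.new("RGB", (11, 3), "white")
arr = np.asarray(image)
print(arr.shape)  # (3, 11, 3)

# A naive heuristic such as "the axis of size 3 is the channel axis"
# matches both the first and the last dimension of this shape:
candidate_channel_axes = [i for i, d in enumerate(arr.shape) if d == 3]
print(candidate_channel_axes)  # [0, 2]
```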
## Fix
The image processor now preserves the correct input channel format before converting PIL images to numpy arrays, avoiding incorrect height/width inference for these ambiguous shapes.
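The idea can be sketched as follows. This is a hypothetical helper, not the actual PaddleOCRVLImageProcessor code: PIL always produces a channels-last array, and the image mode already tells us the channel count, so the format can be recorded at conversion time instead of being re-inferred from the ambiguous shape.

```python
import numpy as np
from PIL import Image

def to_channels_last(image: Image.Image) -> np.ndarray:
    # Hypothetical sketch: record the channel format up front rather than
    # guessing it later from the array shape.
    arr = np.asarray(image)
    expected_channels = len(image.getbands())  # 3 for "RGB"
    assert arr.ndim == 3 and arr.shape[-1] == expected_channels
    return arr  # shape is unambiguously (height, width, channels)

arr = to_channels_last(Image.new("RGB", (11, 3), "white"))
print(arr.shape)  # (3, 11, 3): height=3, width=11, channels=3
```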
## Reproduction
```python
from PIL import Image
from transformers import AutoProcessor

image = Image.new("RGB", (11, 3), "white")
processor = AutoProcessor.from_pretrained(
    "PaddlePaddle/PaddleOCR-VL-1.5",
    trust_remote_code=True,
    min_pixels=112896,
    max_pixels=1003520,
)
out = processor(
    images=image,
    text=processor.image_token + "\nOCR:",
    return_tensors="pt",
)
print(out.image_grid_thw)
# wrong:   tensor([[ 1, 46, 14]])
# correct: tensor([[ 1, 14, 48]])
```
Before this fix, the remote processor may produce a transposed grid for this ambiguous shape.
## Acknowledgement
Thanks to jimmyzhuu for identifying this issue. His PR was very helpful in exposing the bug and guiding the root-cause analysis.
The issue turned out to be specific to the remote PaddleOCRVLImageProcessor implementation, so this PR addresses the fix directly in the image processor.
Accuracy validation is still required before merging this PR; it is currently in progress.