Fix PaddleOCR-VL image size inference for ambiguous RGB shapes

#28
PaddlePaddle org
edited 5 days ago

What does this PR do?

This PR fixes an edge case in PaddleOCRVLImageProcessor when processing RGB images whose converted numpy shape is ambiguous, such as (3, x, 3).

For example, a PIL image with size (width=11, height=3) is converted to a numpy array with shape (3, 11, 3). From the shape alone, this can be interpreted as either:

  • channels-last: (height=3, width=11, channels=3)
  • channels-first: (channels=3, height=11, width=3)

If the channel dimension is inferred incorrectly, the computed image_grid_thw may have swapped height and width.
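The ambiguity can be seen in a few lines. The sketch below uses a hypothetical shape-only heuristic (`infer_channel_format_naive`, written for illustration; it is not the code of PaddleOCRVLImageProcessor) that checks the first axis before the last one, and shows how it reads height and width the wrong way round for this image.

```python
import numpy as np
from PIL import Image

# PIL takes size as (width, height); the numpy view is (height, width, channels).
image = Image.new("RGB", (11, 3), "white")
array = np.asarray(image)
print(array.shape)  # (3, 11, 3)


def infer_channel_format_naive(shape, num_channels=3):
    """Hypothetical heuristic: guess the layout from the shape alone.

    The first axis is checked first, so (3, x, 3) is always read as
    channels-first, even when the array came from a PIL image.
    """
    if shape[0] == num_channels:
        return "channels_first"
    if shape[-1] == num_channels:
        return "channels_last"
    raise ValueError(f"Cannot infer channel axis from shape {shape}")


def height_width(shape, channel_format):
    """Return (height, width) for a 3-D shape under the given layout."""
    if channel_format == "channels_first":
        return shape[1], shape[2]
    return shape[0], shape[1]


guessed = infer_channel_format_naive(array.shape)
print(guessed, height_width(array.shape, guessed))  # channels_first (11, 3)
print(height_width(array.shape, "channels_last"))   # (3, 11), the true size
```

Any later step that derives the patch grid from the guessed (height, width) inherits the swap.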

Fix

The image processor now preserves the correct input channel format before converting PIL images to numpy arrays, avoiding incorrect height/width inference for these ambiguous shapes.
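One way to picture this kind of fix is to record the layout while the input is still a PIL image, instead of guessing it from the array shape afterwards. The helper names below (`to_array_with_format`, `get_height_width`) are assumptions made for illustration and do not reproduce the actual change in image_processing_paddleocr_vl.py.

```python
import numpy as np
from PIL import Image


def to_array_with_format(image):
    """Hypothetical helper: convert to numpy and return the known layout.

    An RGB PIL image always converts to (height, width, channels), so the
    layout is known with certainty before any shape-based guess is needed.
    """
    if isinstance(image, Image.Image):
        return np.asarray(image.convert("RGB")), "channels_last"
    # For raw arrays the layout is not known here; the caller must supply it.
    return np.asarray(image), None


def get_height_width(array, channel_format):
    """Read (height, width) using an explicit layout, never a guess."""
    if channel_format == "channels_last":
        return array.shape[0], array.shape[1]
    if channel_format == "channels_first":
        return array.shape[1], array.shape[2]
    raise ValueError("channel_format must be given explicitly")


array, fmt = to_array_with_format(Image.new("RGB", (11, 3), "white"))
print(array.shape, fmt)              # (3, 11, 3) channels_last
print(get_height_width(array, fmt))  # (3, 11)
```

Carrying the layout alongside the array keeps the (3, x, 3) case unambiguous for PIL inputs, at the cost of requiring an explicit format for raw arrays.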

Reproduction

from PIL import Image
from transformers import AutoProcessor

image = Image.new("RGB", (11, 3), "white")

processor = AutoProcessor.from_pretrained(
    "PaddlePaddle/PaddleOCR-VL-1.5",
    trust_remote_code=True,
    min_pixels=112896,
    max_pixels=1003520,
)

out = processor(
    images=image,
    text=processor.image_token + "\nOCR:",
    return_tensors="pt",
)

print(out.image_grid_thw)
# Before the fix (wrong):
# tensor([[ 1, 46, 14]])

# After the fix (correct):
# tensor([[ 1, 14, 48]])

Before this fix, the remote processor could produce a grid with the height and width dimensions swapped for this ambiguous shape.

Acknowledgement

Thanks to jimmyzhuu for identifying this issue. His PR was very helpful in exposing the bug and guiding the root-cause analysis.

The issue turned out to be specific to the remote PaddleOCRVLImageProcessor implementation, so this PR addresses the fix directly in the image processor.

PaddlePaddle org

Accuracy validation is still required before merging this PR, and the validation is currently in progress.

xiaohei66 changed pull request title from Update image_processing_paddleocr_vl.py to Fix PaddleOCR-VL image size inference for ambiguous RGB shapes
xiaohei66 changed pull request status to merged
