Fix PaddleOCR-VL image size inference for ambiguous RGB shapes
## What does this PR do?
This PR fixes an edge case in PaddleOCRVLImageProcessor when processing RGB images whose converted numpy shape is ambiguous, such as (3, x, 3).
For example, a PIL image with size (width=11, height=3) is converted to a numpy array with shape (3, 11, 3). This shape admits two interpretations:
- channels-last: (height=3, width=11, channels=3) (the correct one for PIL conversions)
- channels-first: (channels=3, height=11, width=3)

If the channel dimension is inferred incorrectly, the computed `image_grid_thw` may have its height and width swapped.
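The ambiguity can be seen directly with numpy and PIL. This is a minimal sketch of the problem, not processor code: a shape-based heuristic ("the axis of size 3 is the channel axis") cannot decide between the first and last dimension here.

```python
import numpy as np
from PIL import Image

# A PIL image with size (width=11, height=3) converts to an array whose
# first and last dimensions are both 3, so the channel axis is ambiguous.
image = Image.new("RGB", (11, 3), "white")
arr = np.asarray(image)
print(arr.shape)  # (3, 11, 3)

# A naive heuristic such as "the axis of size 3 is the channel axis"
# matches both the first and the last dimension of this shape:
candidate_channel_axes = [i for i, d in enumerate(arr.shape) if d == 3]
print(candidate_channel_axes)  # [0, 2]
```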
## Fix
The image processor now preserves the correct input channel format before converting PIL images to numpy arrays, avoiding incorrect height/width inference for these ambiguous shapes.
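The idea can be sketched as follows. This is a hypothetical helper, not the actual PaddleOCRVLImageProcessor code: PIL always produces a channels-last array, and the image mode already tells us the channel count, so the format can be recorded at conversion time instead of being re-inferred from the ambiguous shape.

```python
import numpy as np
from PIL import Image

def to_channels_last(image: Image.Image) -> np.ndarray:
    # Hypothetical sketch: record the channel format up front rather than
    # guessing it later from the array shape.
    arr = np.asarray(image)
    expected_channels = len(image.getbands())  # 3 for "RGB"
    assert arr.ndim == 3 and arr.shape[-1] == expected_channels
    return arr  # shape is unambiguously (height, width, channels)

arr = to_channels_last(Image.new("RGB", (11, 3), "white"))
print(arr.shape)  # (3, 11, 3): height=3, width=11, channels=3
```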
## Reproduction
```python
from PIL import Image
from transformers import AutoProcessor

image = Image.new("RGB", (11, 3), "white")
processor = AutoProcessor.from_pretrained(
    "PaddlePaddle/PaddleOCR-VL-1.5",
    trust_remote_code=True,
    min_pixels=112896,
    max_pixels=1003520,
)
out = processor(
    images=image,
    text=processor.image_token + "\nOCR:",
    return_tensors="pt",
)
print(out.image_grid_thw)
# wrong:   tensor([[ 1, 46, 14]])
# correct: tensor([[ 1, 14, 48]])
```
Before this fix, the remote processor may produce a transposed grid for this ambiguous shape.
## Acknowledgement
Thanks to jimmyzhuu for identifying this issue. His PR was very helpful in exposing the bug and guiding the root-cause analysis.
The issue turned out to be specific to the remote PaddleOCRVLImageProcessor implementation, so this PR addresses the fix directly in the image processor.
Accuracy validation is still required before merging this PR; it is currently in progress.