LightOnOCR-2-1B ExecuTorch (Android-ready)

On-device OCR model converted from lightonai/LightOnOCR-2-1B to ExecuTorch .pte format for mobile/edge deployment.

A 1B-parameter end-to-end OCR model — it converts document images (PDFs, receipts, scans) to clean text, running entirely on-device with no cloud dependency.

Models

| File | Size | Description | Recommended |
|------|------|-------------|-------------|
| vision_encoder_int8.pte | 398 MB | Vision encoder, INT8 weight-only, XNNPACK | ⭐ Yes |
| text_decoder_int8.pte | 1.2 GB | Text decoder, INT8 weight-only, XNNPACK | ⭐ Yes |
| vision_encoder.pte | 1.6 GB | Vision encoder, FP32, XNNPACK | |
| text_decoder_4096.pte | 2.9 GB | Text decoder, FP32, XNNPACK | |

For on-device deployment, use the INT8 variants (1.6 GB total).

Quick Start β€” Python (ExecuTorch Runtime)

```python
from executorch.runtime import Runtime
from huggingface_hub import hf_hub_download
from PIL import Image
import numpy as np
import torch

# Download INT8 models
vis_path = hf_hub_download("acul3/LightOnOCR-2-1B-ExecuTorch", "vision_encoder_int8.pte")
dec_path = hf_hub_download("acul3/LightOnOCR-2-1B-ExecuTorch", "text_decoder_int8.pte")

# Load ExecuTorch runtime
runtime = Runtime.get()
vision = runtime.load_program(vis_path).load_method("forward")
decoder = runtime.load_program(dec_path).load_method("forward")

# Preprocess image to the fixed 1120×1540 (H×W) resolution
img = Image.open("document.jpg").convert("RGB").resize((1540, 1120))  # PIL takes (W, H)
arr = np.asarray(img, dtype=np.float32) / 255.0
pixel_values = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0)  # [1, 3, 1120, 1540]
# Apply the HF image processor's mean/std normalization here (see preprocessor_config.json)

# Run vision encoder (execute returns a list of outputs)
image_features = vision.execute([pixel_values])[0]  # [1, 2200, 1024]

# Build input tokens + autoregressive decode (see scripts/test_e2e_v2.py for a full example)
```

Quick Start β€” Android (Kotlin)

```kotlin
import org.pytorch.executorch.Module
import org.pytorch.executorch.EValue
import org.pytorch.executorch.Tensor

// Load models from assets or downloaded files
val visionModule = Module.load("vision_encoder_int8.pte")
val decoderModule = Module.load("text_decoder_int8.pte")

// Preprocess: resize image to 1120×1540, normalize (preprocessImage is app code)
val pixelTensor = preprocessImage(bitmap)  // [1, 3, 1120, 1540]

// Run vision encoder (forward returns Array<EValue>)
val imageFeatures = visionModule.forward(EValue.from(pixelTensor))[0].toTensor()

// Run text decoder autoregressively
// See the ExecuTorch LLM Demo App for the decode-loop pattern
```

Full E2E Pipeline

For a complete working example with autoregressive decoding, see scripts/test_e2e_v2.py (FP32) and scripts/test_e2e_int8.py (INT8).

The pipeline works as follows:

  1. Resize input image to exactly 1120×1540 pixels
  2. Run vision encoder → image embeddings [1, 2200, 1024]
  3. Build input tokens using the chat template with 2200 [IMG] placeholders (token id 151655)
  4. Scatter vision embeddings into the [IMG] positions in the token embedding space
  5. Prefill the text decoder with the full combined sequence (2260 tokens)
  6. Decode autoregressively until the EOS token (151645) or max length
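
Steps 5 and 6 reduce to a standard greedy loop. A minimal sketch, assuming a `step_fn` wrapper around the decoder's `execute` call (`step_fn` and its KV-cache plumbing are hypothetical here; the real version lives in scripts/test_e2e_v2.py):

```python
import torch

EOS_TOKEN_ID = 151645  # <|im_end|>

def greedy_decode(prefill_logits, step_fn, max_new_tokens=512):
    """Greedy autoregressive decode.

    prefill_logits: [1, T, vocab] logits from prefilling the combined sequence.
    step_fn(token_id, pos) -> [1, 1, vocab] logits for one decode step
    (assumed wrapper around the decoder's `execute` + KV cache update).
    """
    tokens = []
    next_id = int(torch.argmax(prefill_logits[0, -1]))
    pos = prefill_logits.shape[1]  # prefill length, e.g. 2260
    while next_id != EOS_TOKEN_ID and len(tokens) < max_new_tokens:
        tokens.append(next_id)
        logits = step_fn(next_id, pos)
        next_id = int(torch.argmax(logits[0, -1]))
        pos += 1
    return tokens
```

The generated ids are then detokenized with the Qwen2 tokenizer.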

Token Template

```
<|im_start|>user\n[IMG][IMG]...(2200 total)...[VPAD]...(39 row separators)...[IMG_END]
OCR this document. Extract all text.<|im_end|>\n<|im_start|>assistant\n
```

Special Token IDs

| Token | ID | Purpose |
|-------|-----|---------|
| [IMG] | 151655 | Image patch placeholder (2200 per image) |
| [VPAD] | 151654 | Vision row separator (39 per image) |
| [IMG_END] | 151653 | End of vision tokens |
| <\|im_start\|> | 151644 | Chat turn start |
| <\|im_end\|> | 151645 | Chat turn end / EOS |
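
Using these IDs, the vision block of the template can be built in plain Python. This sketch assumes row-major interleaving, 55 [IMG] per row with a [VPAD] between the 40 rows (1120/28 × 1540/28 patches after the 2×2 merge); check scripts/test_e2e_v2.py for the exact ordering:

```python
IMG, VPAD, IMG_END = 151655, 151654, 151653

def build_vision_tokens(rows=40, cols=55):
    """Vision token block: rows of [IMG] patches separated by [VPAD],
    closed by [IMG_END]. 40 x 55 = 2200 patches for a 1120x1540 input."""
    ids = []
    for r in range(rows):
        ids += [IMG] * cols
        if r < rows - 1:
            ids.append(VPAD)  # 39 separators between 40 rows
    ids.append(IMG_END)
    return ids

vision_ids = build_vision_tokens()  # 2200 + 39 + 1 = 2240 tokens
```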

Architecture

```
LightOnOCR-2-1B (~1B params)
├── Vision Encoder: Pixtral ViT (400M params)
│   ├── 24 transformer layers, hidden=1024, heads=16
│   ├── Patch size: 14×14, 2D RoPE
│   └── PatchMerger: 2×2 spatial merge → 4× token reduction
├── MultiModal Projector (6M params)
│   ├── RMSNorm → PatchMerger → Linear(1024→1024) → GELU → Linear
│   └── Output: [1, 2200, 1024] for 1120×1540 input
└── Text Decoder: Qwen3 (700M params)
    ├── 28 transformer layers, GQA (16 heads, 8 KV heads)
    ├── head_dim=128, intermediate=3072, hidden=1024
    ├── QK-norm (RMSNorm on Q, K per layer)
    ├── Static KV cache, max_seq_len=4096
    └── Vocab: 151936 (Qwen2 tokenizer)
```
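
The decoder's grouped-query attention (16 query heads sharing 8 KV heads, head_dim 128) can be written without SDPA dispatch. This is an illustrative re-implementation of the idea, not the exported code:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention without SDPA: each KV head serves
    n_heads // n_kv_heads query heads.
    q: [B, 16, T, 128], k/v: [B, 8, S, 128] -> [B, 16, T, 128]."""
    rep = q.shape[1] // k.shape[1]          # 16 // 8 = 2
    k = k.repeat_interleave(rep, dim=1)     # expand KV heads to match Q
    v = v.repeat_interleave(rep, dim=1)
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```

In the actual model, RMSNorm is applied to Q and K (QK-norm) before this step.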

Validation Results

E2E validated against the original HuggingFace model on receipt and synthetic document images:

| Variant | Receipt (360 tok) | Synthetic (76 tok) |
|---------|-------------------|--------------------|
| FP32 modules | ✅ Exact match (edit dist = 0) | ✅ Exact match (edit dist = 0) |
| INT8 quantized | ✅ Exact match (edit dist = 0) | ✅ Exact match (edit dist = 0) |

Zero quality degradation from INT8 weight-only quantization.
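
The "edit dist = 0" figures compare decoded text against the HF reference using Levenshtein distance. A minimal implementation of that metric (the validation scripts may use a library instead):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over prefixes,
    keeping only the previous row (O(len(b)) memory)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```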

Export Details

| Property | Value |
|----------|-------|
| ExecuTorch version | 1.1.0 |
| Backend | XNNPACK (CPU, cross-platform) |
| Quantization | torchao INT8 weight-only (per-channel, no calibration) |
| Fixed input resolution | 1120 × 1540 |
| Max sequence length | 4096 tokens |
| Source model | lightonai/LightOnOCR-2-1B |
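
Per-channel weight-only quantization means each linear weight gets one INT8 scale per output channel, derived directly from the weight with no activation calibration. An illustrative sketch of the math; the actual export uses torchao, not this code:

```python
import torch

def int8_weight_only_per_channel(w):
    """Symmetric per-output-channel INT8 quantization: one scale per
    output row, weights rounded into [-127, 127]."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # [out, 1]
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

# Round-trip check on a random linear weight
w = torch.randn(8, 16)
q, scale = int8_weight_only_per_channel(w)
w_hat = q.float() * scale  # dequantized weight used at inference
```

The per-element error is bounded by half a quantization step (scale / 2), which is why quality loss is negligible for weight-only INT8.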

Model Surgery Applied

The original HF model required several modifications for ExecuTorch compatibility:

  • PatchMerger: Rewritten to eliminate Python loops and dynamic shapes (fixed single-image)
  • KV Cache: HF DynamicCache → static tensors as explicit model inputs/outputs
  • QK-Norm: @use_kernel_forward_from_hub decorator bypassed, ops inlined
  • Attention: GQA manually implemented without SDPA dispatch (export-clean)
  • Token substitution: Vision features scattered at [IMG] positions via index operations
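
The token-substitution step can be sketched as a boolean-mask assignment, which exports cleanly because it avoids data-dependent Python loops (`scatter_vision_features` is a hypothetical helper name, not the script's actual function):

```python
import torch

IMG_TOKEN_ID = 151655

def scatter_vision_features(input_ids, token_embeds, image_features):
    """Replace embeddings at [IMG] positions with vision encoder output.
    input_ids: [1, T], token_embeds: [1, T, D], image_features: [1, N, D]
    where N equals the number of [IMG] tokens (2200 for one image)."""
    merged = token_embeds.clone()
    mask = input_ids[0] == IMG_TOKEN_ID   # [T] boolean index
    merged[0, mask] = image_features[0]   # fills masked rows in order
    return merged
```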

Scripts

| Script | Description |
|--------|-------------|
| scripts/export_vision.py | Vision encoder surgery + export |
| scripts/export_decoder.py | Text decoder surgery + export |
| scripts/quantize_wo.py | INT8 weight-only quantization |
| scripts/test_e2e_v2.py | Full E2E validation (FP32) |
| scripts/test_e2e_int8.py | Full E2E validation (INT8) |

Hardware Requirements

For on-device inference (INT8)

  • RAM: 6 GB minimum (8 GB+ recommended)
  • Storage: 1.6 GB for both model files
  • CPU: ARM64 (Android) or x86_64

Target devices

  • XNNPACK (current): any Android/iOS device (CPU-only execution)
  • QNN (planned): Snapdragon 8 Gen 2+ with Hexagon NPU

For export/quantization (development)

  • Jetson AGX Orin 64GB or similar (64GB+ unified memory recommended)
  • Python 3.10+, PyTorch 2.10+, ExecuTorch 1.1.0

Reproduce

```bash
# Clone and setup
git clone https://huggingface.co/acul3/LightOnOCR-2-1B-ExecuTorch
cd LightOnOCR-2-1B-ExecuTorch

# Install deps
pip install executorch transformers torchao pillow

# Download source model
huggingface-cli download lightonai/LightOnOCR-2-1B --local-dir models/LightOnOCR-2-1B

# Export (generates .pte files)
python scripts/export_vision.py
python scripts/export_decoder.py

# Quantize
python scripts/quantize_wo.py

# Validate
python scripts/test_e2e_int8.py
```

License

Apache 2.0 (same as source model)

Citation

```bibtex
@misc{lightonocr2_executorch_2026,
  title = {LightOnOCR-2-1B-ExecuTorch: On-Device OCR with ExecuTorch},
  author = {Samsul Rahmadani},
  year = {2026},
  url = {https://huggingface.co/acul3/LightOnOCR-2-1B-ExecuTorch},
  note = {Converted from lightonai/LightOnOCR-2-1B}
}
```
