LightOnOCR-2-1B ExecuTorch (Android-ready)

On-device OCR model converted from lightonai/LightOnOCR-2-1B to ExecuTorch .pte format for mobile/edge deployment.

A 1B-parameter end-to-end OCR model — it converts document images (PDFs, receipts, scans) to clean text, running entirely on-device with no cloud dependency.

Models

| File | Size | Description | Recommended |
|------|------|-------------|-------------|
| vision_encoder_int8.pte | 398 MB | Vision encoder, INT8 weight-only, XNNPACK | ⭐ Yes |
| text_decoder_int8.pte | 1.2 GB | Text decoder, INT8 weight-only, XNNPACK | ⭐ Yes |
| vision_encoder.pte | 1.6 GB | Vision encoder, FP32, XNNPACK | |
| text_decoder_4096.pte | 2.9 GB | Text decoder, FP32, XNNPACK | |

For on-device deployment, use the INT8 variants (1.6 GB total).

Quick Start β€” Python (ExecuTorch Runtime)

```python
from executorch.runtime import Runtime
from huggingface_hub import hf_hub_download
from PIL import Image
import numpy as np
import torch

# Download INT8 models
vis_path = hf_hub_download("acul3/LightOnOCR-2-1B-ExecuTorch", "vision_encoder_int8.pte")
dec_path = hf_hub_download("acul3/LightOnOCR-2-1B-ExecuTorch", "text_decoder_int8.pte")

# Load ExecuTorch runtime
runtime = Runtime.get()
vision = runtime.load_program(vis_path).load_method("forward")
decoder = runtime.load_program(dec_path).load_method("forward")

# Preprocess image to the fixed 1120×1540 (H×W) resolution
img = Image.open("document.jpg").convert("RGB").resize((1540, 1120))  # PIL takes (W, H)
arr = np.asarray(img, dtype=np.float32) / 255.0
pixel_values = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0)  # [1, 3, 1120, 1540]
# Apply the HF image processor's mean/std normalization here (see preprocessor_config.json)

# Run vision encoder (execute returns a list of outputs)
image_features = vision.execute([pixel_values])[0]  # [1, 2200, 1024]

# Build input tokens + autoregressive decode (see scripts/test_e2e_v2.py for a full example)
```

Quick Start β€” Android (Kotlin)

```kotlin
import org.pytorch.executorch.Module
import org.pytorch.executorch.EValue
import org.pytorch.executorch.Tensor

// Load models from assets or downloaded files
val visionModule = Module.load("vision_encoder_int8.pte")
val decoderModule = Module.load("text_decoder_int8.pte")

// Preprocess: resize image to 1120×1540, normalize (preprocessImage is app code)
val pixelTensor = preprocessImage(bitmap)  // [1, 3, 1120, 1540]

// Run vision encoder (forward returns Array<EValue>)
val imageFeatures = visionModule.forward(EValue.from(pixelTensor))[0].toTensor()

// Run text decoder autoregressively
// See the ExecuTorch LLM Demo App for the decode-loop pattern
```

Full E2E Pipeline

For a complete working example with autoregressive decoding, see scripts/test_e2e_v2.py (FP32) and scripts/test_e2e_int8.py (INT8).

The pipeline works as follows:

  1. Resize input image to exactly 1120×1540 pixels
  2. Run vision encoder → image embeddings [1, 2200, 1024]
  3. Build input tokens using the chat template with 2200 [IMG] placeholders (token id 151655)
  4. Scatter vision embeddings into the [IMG] positions in the token embedding space
  5. Prefill the text decoder with the full combined sequence (2260 tokens)
  6. Decode autoregressively until the EOS token (151645) or max length
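
Steps 5 and 6 reduce to a standard greedy loop. A minimal sketch, assuming a `step_fn` wrapper around the decoder's `execute` call (`step_fn` and its KV-cache plumbing are hypothetical here; the real version lives in scripts/test_e2e_v2.py):

```python
import torch

EOS_TOKEN_ID = 151645  # <|im_end|>

def greedy_decode(prefill_logits, step_fn, max_new_tokens=512):
    """Greedy autoregressive decode.

    prefill_logits: [1, T, vocab] logits from prefilling the combined sequence.
    step_fn(token_id, pos) -> [1, 1, vocab] logits for one decode step
    (assumed wrapper around the decoder's `execute` + KV cache update).
    """
    tokens = []
    next_id = int(torch.argmax(prefill_logits[0, -1]))
    pos = prefill_logits.shape[1]  # prefill length, e.g. 2260
    while next_id != EOS_TOKEN_ID and len(tokens) < max_new_tokens:
        tokens.append(next_id)
        logits = step_fn(next_id, pos)
        next_id = int(torch.argmax(logits[0, -1]))
        pos += 1
    return tokens
```

The generated ids are then detokenized with the Qwen2 tokenizer.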

Token Template

```
<|im_start|>user\n[IMG][IMG]...(2200 total)...[VPAD]...(39 row separators)...[IMG_END]
OCR this document. Extract all text.<|im_end|>\n<|im_start|>assistant\n
```

Special Token IDs

| Token | ID | Purpose |
|-------|-----|---------|
| [IMG] | 151655 | Image patch placeholder (2200 per image) |
| [VPAD] | 151654 | Vision row separator (39 per image) |
| [IMG_END] | 151653 | End of vision tokens |
| <\|im_start\|> | 151644 | Chat turn start |
| <\|im_end\|> | 151645 | Chat turn end / EOS |
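
Using these IDs, the vision block of the template can be built in plain Python. This sketch assumes row-major interleaving, 55 [IMG] per row with a [VPAD] between the 40 rows (1120/28 × 1540/28 patches after the 2×2 merge); check scripts/test_e2e_v2.py for the exact ordering:

```python
IMG, VPAD, IMG_END = 151655, 151654, 151653

def build_vision_tokens(rows=40, cols=55):
    """Vision token block: rows of [IMG] patches separated by [VPAD],
    closed by [IMG_END]. 40 x 55 = 2200 patches for a 1120x1540 input."""
    ids = []
    for r in range(rows):
        ids += [IMG] * cols
        if r < rows - 1:
            ids.append(VPAD)  # 39 separators between 40 rows
    ids.append(IMG_END)
    return ids

vision_ids = build_vision_tokens()  # 2200 + 39 + 1 = 2240 tokens
```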

Architecture

```
LightOnOCR-2-1B (~1B params)
├── Vision Encoder: Pixtral ViT (400M params)
│   ├── 24 transformer layers, hidden=1024, heads=16
│   ├── Patch size: 14×14, 2D RoPE
│   └── PatchMerger: 2×2 spatial merge → 4× token reduction
├── MultiModal Projector (6M params)
│   ├── RMSNorm → PatchMerger → Linear(1024→1024) → GELU → Linear
│   └── Output: [1, 2200, 1024] for 1120×1540 input
└── Text Decoder: Qwen3 (700M params)
    ├── 28 transformer layers, GQA (16 heads, 8 KV heads)
    ├── head_dim=128, intermediate=3072, hidden=1024
    ├── QK-norm (RMSNorm on Q, K per layer)
    ├── Static KV cache, max_seq_len=4096
    └── Vocab: 151936 (Qwen2 tokenizer)
```
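
The decoder's grouped-query attention (16 query heads sharing 8 KV heads, head_dim 128) can be written without SDPA dispatch. This is an illustrative re-implementation of the idea, not the exported code:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention without SDPA: each KV head serves
    n_heads // n_kv_heads query heads.
    q: [B, 16, T, 128], k/v: [B, 8, S, 128] -> [B, 16, T, 128]."""
    rep = q.shape[1] // k.shape[1]          # 16 // 8 = 2
    k = k.repeat_interleave(rep, dim=1)     # expand KV heads to match Q
    v = v.repeat_interleave(rep, dim=1)
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```

In the actual model, RMSNorm is applied to Q and K (QK-norm) before this step.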

Validation Results

E2E validated against the original HuggingFace model on receipt and synthetic document images:

| Variant | Receipt (360 tok) | Synthetic (76 tok) |
|---------|-------------------|--------------------|
| FP32 modules | ✅ Exact match (edit dist = 0) | ✅ Exact match (edit dist = 0) |
| INT8 quantized | ✅ Exact match (edit dist = 0) | ✅ Exact match (edit dist = 0) |

Zero quality degradation from INT8 weight-only quantization.
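
The "edit dist = 0" figures compare decoded text against the HF reference using Levenshtein distance. A minimal implementation of that metric (the validation scripts may use a library instead):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over prefixes,
    keeping only the previous row (O(len(b)) memory)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```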

Export Details

| Property | Value |
|----------|-------|
| ExecuTorch version | 1.1.0 |
| Backend | XNNPACK (CPU, cross-platform) |
| Quantization | torchao INT8 weight-only (per-channel, no calibration) |
| Fixed input resolution | 1120 × 1540 |
| Max sequence length | 4096 tokens |
| Source model | lightonai/LightOnOCR-2-1B |
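
Per-channel weight-only quantization means each linear weight gets one INT8 scale per output channel, derived directly from the weight with no activation calibration. An illustrative sketch of the math; the actual export uses torchao, not this code:

```python
import torch

def int8_weight_only_per_channel(w):
    """Symmetric per-output-channel INT8 quantization: one scale per
    output row, weights rounded into [-127, 127]."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # [out, 1]
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

# Round-trip check on a random linear weight
w = torch.randn(8, 16)
q, scale = int8_weight_only_per_channel(w)
w_hat = q.float() * scale  # dequantized weight used at inference
```

The per-element error is bounded by half a quantization step (scale / 2), which is why quality loss is negligible for weight-only INT8.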

Model Surgery Applied

The original HF model required several modifications for ExecuTorch compatibility:

  • PatchMerger: Rewritten to eliminate Python loops and dynamic shapes (fixed single-image)
  • KV Cache: HF DynamicCache → static tensors as explicit model inputs/outputs
  • QK-Norm: @use_kernel_forward_from_hub decorator bypassed, ops inlined
  • Attention: GQA manually implemented without SDPA dispatch (export-clean)
  • Token substitution: Vision features scattered at [IMG] positions via index operations
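
The token-substitution step can be sketched as a boolean-mask assignment, which exports cleanly because it avoids data-dependent Python loops (`scatter_vision_features` is a hypothetical helper name, not the script's actual function):

```python
import torch

IMG_TOKEN_ID = 151655

def scatter_vision_features(input_ids, token_embeds, image_features):
    """Replace embeddings at [IMG] positions with vision encoder output.
    input_ids: [1, T], token_embeds: [1, T, D], image_features: [1, N, D]
    where N equals the number of [IMG] tokens (2200 for one image)."""
    merged = token_embeds.clone()
    mask = input_ids[0] == IMG_TOKEN_ID   # [T] boolean index
    merged[0, mask] = image_features[0]   # fills masked rows in order
    return merged
```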

Scripts

| Script | Description |
|--------|-------------|
| scripts/export_vision.py | Vision encoder surgery + export |
| scripts/export_decoder.py | Text decoder surgery + export |
| scripts/quantize_wo.py | INT8 weight-only quantization |
| scripts/test_e2e_v2.py | Full E2E validation (FP32) |
| scripts/test_e2e_int8.py | Full E2E validation (INT8) |

Hardware Requirements

For on-device inference (INT8)

  • RAM: 6 GB minimum (8 GB+ recommended)
  • Storage: 1.6 GB for both model files
  • CPU: ARM64 (Android) or x86_64

Target devices

  • XNNPACK (current): any Android/iOS device (CPU-only execution)
  • QNN (planned): Snapdragon 8 Gen 2+ with Hexagon NPU

For export/quantization (development)

  • Jetson AGX Orin 64GB or similar (64GB+ unified memory recommended)
  • Python 3.10+, PyTorch 2.10+, ExecuTorch 1.1.0

Reproduce

```bash
# Clone and setup
git clone https://huggingface.co/acul3/LightOnOCR-2-1B-ExecuTorch
cd LightOnOCR-2-1B-ExecuTorch

# Install deps
pip install executorch transformers torchao pillow

# Download source model
huggingface-cli download lightonai/LightOnOCR-2-1B --local-dir models/LightOnOCR-2-1B

# Export (generates .pte files)
python scripts/export_vision.py
python scripts/export_decoder.py

# Quantize
python scripts/quantize_wo.py

# Validate
python scripts/test_e2e_int8.py
```

License

Apache 2.0 (same as source model)

Citation

```bibtex
@misc{lightonocr2_executorch_2026,
  title = {LightOnOCR-2-1B-ExecuTorch: On-Device OCR with ExecuTorch},
  author = {Samsul Rahmadani},
  year = {2026},
  url = {https://huggingface.co/acul3/LightOnOCR-2-1B-ExecuTorch},
  note = {Converted from lightonai/LightOnOCR-2-1B}
}
```
