# LightOnOCR-2-1B ExecuTorch (Android-ready)

On-device OCR model converted from `lightonai/LightOnOCR-2-1B` to ExecuTorch `.pte` format for mobile/edge deployment.

A 1B-parameter end-to-end OCR model: it converts document images (PDFs, receipts, scans) to clean text and runs entirely on-device with no cloud dependency.
## Models

| File | Size | Description | Recommended |
|---|---|---|---|
| `vision_encoder_int8.pte` | 398 MB | Vision encoder, INT8 weight-only, XNNPACK | ✅ Yes |
| `text_decoder_int8.pte` | 1.2 GB | Text decoder, INT8 weight-only, XNNPACK | ✅ Yes |
| `vision_encoder.pte` | 1.6 GB | Vision encoder, FP32, XNNPACK | |
| `text_decoder_4096.pte` | 2.9 GB | Text decoder, FP32, XNNPACK | |

For on-device deployment, use the INT8 variants (1.6 GB total).
## Quick Start: Python (ExecuTorch Runtime)

```python
from executorch.runtime import Runtime
from huggingface_hub import hf_hub_download
from PIL import Image
import torch

# Download INT8 models
vis_path = hf_hub_download("acul3/LightOnOCR-2-1B-ExecuTorch", "vision_encoder_int8.pte")
dec_path = hf_hub_download("acul3/LightOnOCR-2-1B-ExecuTorch", "text_decoder_int8.pte")

# Load ExecuTorch runtime
runtime = Runtime.get()
vision = runtime.load_program(vis_path).load_method("forward")
decoder = runtime.load_program(dec_path).load_method("forward")

# Preprocess image to the fixed 1120×1540 resolution
img = Image.open("document.jpg").convert("RGB").resize((1540, 1120))

# Normalize (use the HF processor or manual normalization)
pixel_values = torch.tensor(...)  # [1, 3, 1120, 1540] float32

# Run vision encoder
image_features = vision.execute([pixel_values])  # [1, 2200, 1024]

# Build input tokens + autoregressive decode (see scripts/test_e2e_v2.py for a full example)
```
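The decode step elided above follows the standard greedy loop: feed the growing token sequence to the decoder, take the argmax token, and stop at EOS. A minimal sketch, with a caller-supplied `step` function standing in for the real ExecuTorch decoder call (see `scripts/test_e2e_v2.py` for the actual version):

```python
# Hedged sketch of the autoregressive decode loop. `step` is a stand-in
# for the ExecuTorch decoder: it maps the current token ids to the next id.
def greedy_decode(step, prompt_ids, eos_id=151645, max_new_tokens=512):
    """Decode until EOS (151645 = <|im_end|>) or the token budget runs out."""
    ids = list(prompt_ids)
    new_tokens = []
    for _ in range(max_new_tokens):
        nxt = step(ids)
        if nxt == eos_id:
            break
        ids.append(nxt)
        new_tokens.append(nxt)
    return new_tokens

# Toy stand-in decoder that emits two tokens and then EOS.
script = iter([100, 200, 151645])
out = greedy_decode(lambda ids: next(script), prompt_ids=[1, 2, 3])
```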
## Quick Start: Android (Kotlin)

```kotlin
import org.pytorch.executorch.Module
import org.pytorch.executorch.EValue
import org.pytorch.executorch.Tensor

// Load models from assets or downloaded files
val visionModule = Module.load("vision_encoder_int8.pte")
val decoderModule = Module.load("text_decoder_int8.pte")

// Preprocess: resize the image to 1120×1540 and normalize
val pixelTensor = preprocessImage(bitmap) // [1, 3, 1120, 1540]

// Run vision encoder
val imageFeatures = visionModule.forward(EValue.from(pixelTensor))

// Run the text decoder autoregressively;
// see the ExecuTorch LLM Demo App for the decode-loop pattern.
```
## Full E2E Pipeline

For a complete working example with autoregressive decoding, see:

- `scripts/test_e2e_v2.py` — FP32 E2E validation
- `scripts/test_e2e_int8.py` — INT8 E2E validation
The pipeline works as follows:

1. Resize the input image to exactly 1120×1540 pixels
2. Run the vision encoder → image embeddings `[1, 2200, 1024]`
3. Build input tokens using the chat template with 2200 `[IMG]` placeholders (token id 151655)
4. Scatter the vision embeddings into the `[IMG]` positions in the token embedding space
5. Prefill the text decoder with the full combined sequence (2260 tokens)
6. Decode autoregressively until the EOS token (151645) or max length
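Step 3 can be sketched in a few lines. The 40×55 merged-patch grid is an assumption derived from the architecture numbers below (1120 / 14 / 2 = 40 rows, 1540 / 14 / 2 = 55 columns):

```python
# Build the vision portion of the prompt: 55 [IMG] tokens per merged row,
# a [VPAD] separator between rows, and a trailing [IMG_END].
IMG, VPAD, IMG_END = 151655, 151654, 151653
ROWS, COLS = 40, 55  # merged patch grid for a 1120×1540 input (assumption)

vision_ids = []
for r in range(ROWS):
    vision_ids += [IMG] * COLS
    if r < ROWS - 1:           # 39 separators between 40 rows
        vision_ids.append(VPAD)
vision_ids.append(IMG_END)
```

The counts line up with the template: 2200 `[IMG]`, 39 `[VPAD]`, one `[IMG_END]` (2240 vision tokens; the remaining 20 of the 2260-token prefill are the chat-template text tokens).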
## Token Template

```
<|im_start|>user\n[IMG][IMG]...(2200 total)...[VPAD]...(39 row separators)...[IMG_END]OCR this document. Extract all text.<|im_end|>\n<|im_start|>assistant\n
```
## Special Token IDs

| Token | ID | Purpose |
|---|---|---|
| `[IMG]` | 151655 | Image patch placeholder (2200 per image) |
| `[VPAD]` | 151654 | Vision row separator (39 per image) |
| `[IMG_END]` | 151653 | End of vision tokens |
| `<\|im_start\|>` | 151644 | Chat turn start |
| `<\|im_end\|>` | 151645 | Chat turn end / EOS |
## Architecture

```
LightOnOCR-2-1B (~1B params)
├── Vision Encoder: Pixtral ViT (400M params)
│   ├── 24 transformer layers, hidden=1024, heads=16
│   ├── Patch size: 14×14, 2D RoPE
│   └── PatchMerger: 2×2 spatial merge → 4× token reduction
├── MultiModal Projector (6M params)
│   ├── RMSNorm → PatchMerger → Linear(1024→1024) → GELU → Linear
│   └── Output: [1, 2200, 1024] for 1120×1540 input
└── Text Decoder: Qwen3 (700M params)
    ├── 28 transformer layers, GQA (16 heads, 8 KV heads)
    ├── head_dim=128, intermediate=3072, hidden=1024
    ├── QK-norm (RMSNorm on Q,K per layer)
    ├── Static KV cache, max_seq_len=4096
    └── Vocab: 151936 (Qwen2 tokenizer)
```
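The 2200-token figure follows directly from the numbers above: 14×14 patches plus a 2×2 merge mean each vision token covers a 28×28-pixel tile. A quick arithmetic check (not part of the export code):

```python
# Each vision token covers patch_size × merge = 28×28 pixels.
H, W = 1120, 1540
patch, merge = 14, 2

rows = H // (patch * merge)        # merged rows
cols = W // (patch * merge)        # merged columns
num_tokens = rows * cols           # total vision tokens
num_row_separators = rows - 1      # the [VPAD] count
```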
## Validation Results

E2E validated against the original HuggingFace model on receipt and synthetic document images:

| Variant | Receipt (360 tok) | Synthetic (76 tok) |
|---|---|---|
| FP32 modules | ✅ Exact match (edit dist = 0) | ✅ Exact match (edit dist = 0) |
| INT8 quantized | ✅ Exact match (edit dist = 0) | ✅ Exact match (edit dist = 0) |

Zero quality degradation from INT8 weight-only quantization.
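"Edit dist = 0" refers to the Levenshtein distance between the ExecuTorch output text and the HuggingFace reference. For clarity, a standard two-row implementation of that metric (a sketch; the repo's validation scripts may compute it differently):

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

A distance of 0 means the two outputs are byte-for-byte identical.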
## Export Details
| Property | Value |
|---|---|
| ExecuTorch version | 1.1.0 |
| Backend | XNNPACK (CPU, cross-platform) |
| Quantization | torchao INT8 weight-only (per-channel, no calibration) |
| Fixed input resolution | 1120 Γ 1540 |
| Max sequence length | 4096 tokens |
| Source model | lightonai/LightOnOCR-2-1B |
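"Per-channel, no calibration" means each output channel of a weight matrix gets its own scale from its own max-abs value, so no activation data is needed. A pure-Python sketch of the scheme (torchao's actual kernels differ in implementation detail):

```python
def quantize_per_channel_int8(weight):
    """weight: list of rows, one per output channel.
    Returns (int8 rows, per-row scales)."""
    q_rows, scales = [], []
    for row in weight:
        scale = max(abs(v) for v in row) / 127.0 or 1.0  # guard all-zero rows
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Reconstruct approximate fp weights: q * scale, row by row."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

# Toy 2×4 weight matrix: each row is scaled by its own max-abs value.
w = [[0.5, -1.0, 0.25, 0.0], [2.0, 1.0, -2.0, 0.5]]
q, s = quantize_per_channel_int8(w)
w_hat = dequantize(q, s)
```

The reconstruction error per element is bounded by half a quantization step (0.5 × the row's scale), which is why the weight-only scheme loses so little quality here.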
## Model Surgery Applied

The original HF model required several modifications for ExecuTorch compatibility:

- PatchMerger: rewritten to eliminate Python loops and dynamic shapes (fixed single-image)
- KV cache: HF `DynamicCache` → static tensors passed as explicit model inputs/outputs
- QK-norm: `@use_kernel_forward_from_hub` decorator bypassed, ops inlined
- Attention: GQA implemented manually without SDPA dispatch (export-clean)
- Token substitution: vision features scattered at `[IMG]` positions via index operations
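The token-substitution step can be pictured with a pure-Python stand-in for the index operations (the export scripts do this with tensor indexing over the embedded sequence):

```python
IMG = 151655  # [IMG] placeholder token id

def scatter_vision(token_ids, text_embeds, vision_embeds):
    """Replace the embedding at every [IMG] position with the next
    vision feature, in order; all other embeddings pass through."""
    out, k = [], 0
    for tid, emb in zip(token_ids, text_embeds):
        if tid == IMG:
            out.append(vision_embeds[k])
            k += 1
        else:
            out.append(emb)
    return out

# Toy example with strings standing in for embedding vectors:
# two [IMG] slots receive vision features v0 and v1.
ids = [151644, IMG, IMG, 42]
merged = scatter_vision(ids, ["t0", "t1", "t2", "t3"], ["v0", "v1"])
```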
## Scripts

| Script | Description |
|---|---|
| `scripts/export_vision.py` | Vision encoder surgery + export |
| `scripts/export_decoder.py` | Text decoder surgery + export |
| `scripts/quantize_wo.py` | INT8 weight-only quantization |
| `scripts/test_e2e_v2.py` | Full E2E validation (FP32) |
| `scripts/test_e2e_int8.py` | Full E2E validation (INT8) |
## Hardware Requirements

### For on-device inference (INT8)
- RAM: 6 GB minimum (8 GB+ recommended)
- Storage: 1.6 GB for both model files
- CPU: ARM64 (Android) or x86_64
### Tested target devices
- XNNPACK (current): Any Android/iOS device with CPU
- QNN (planned): Snapdragon 8 Gen 2+ with Hexagon NPU
### For export/quantization (development)
- Jetson AGX Orin 64GB or similar (64GB+ unified memory recommended)
- Python 3.10+, PyTorch 2.10+, ExecuTorch 1.1.0
## Reproduce

```bash
# Clone and set up
git clone https://huggingface.co/acul3/LightOnOCR-2-1B-ExecuTorch
cd LightOnOCR-2-1B-ExecuTorch

# Install deps
pip install executorch transformers torchao pillow

# Download the source model
huggingface-cli download lightonai/LightOnOCR-2-1B --local-dir models/LightOnOCR-2-1B

# Export (generates .pte files)
python scripts/export_vision.py
python scripts/export_decoder.py

# Quantize
python scripts/quantize_wo.py

# Validate
python scripts/test_e2e_int8.py
```
## License
Apache 2.0 (same as source model)
## Citation

```bibtex
@misc{lightonocr2_executorch_2026,
  title  = {LightOnOCR-2-1B-ExecuTorch: On-Device OCR with ExecuTorch},
  author = {Samsul Rahmadani},
  year   = {2026},
  url    = {https://huggingface.co/acul3/LightOnOCR-2-1B-ExecuTorch},
  note   = {Converted from lightonai/LightOnOCR-2-1B}
}
```
## Acknowledgments
- LightOn AI for the original LightOnOCR-2-1B model
- PyTorch ExecuTorch team for the on-device runtime
- torchao for INT8 weight-only quantization