Grid-JEPA: JEPA-Based World Model for ARC-AGI-3

A neural architecture for the ARC-AGI-3 competition combining Joint Embedding Predictive Architecture (JEPA), Recurrent State-Space Models (RSSM), and Test-Time Training (TTT) to solve novel interactive grid-world tasks.

Architecture

┌────────────────────────────────────────────────────────────────────┐
│                          ARC-AGI-3 Agent                           │
├────────────────────────────────────────────────────────────────────┤
│  Observation Grid (64x64, 16 colors)                               │
│           ↓                                                        │
│  ┌─────────────┐                                                   │
│  │ Grid-JEPA   │  ← I-JEPA adapted for discrete grid worlds        │
│  │ Encoder     │     1×1 patches, latent-space prediction          │
│  └─────────────┘                                                   │
│           ↓  Latent Representation                                 │
│  ┌─────────────┐                                                   │
│  │    RSSM     │  ← Recurrent State-Space Model (DreamerV3-style)  │
│  │ World Model │     GRU dynamics + discrete latents               │
│  └─────────────┘                                                   │
│           ↓  Hidden State (PERSISTS across levels!)                │
│  ┌─────────────┐    ┌─────────────┐                                │
│  │  Planning   │ ←→ │ Exploration │                                │
│  │ (Imagination│    │  (Novelty)  │                                │
│  │  Rollouts)  │    │             │                                │
│  └─────────────┘    └─────────────┘                                │
│           ↓                                                        │
│  ┌────────────────┐                                                │
│  │ Goal Inference │  ← Discovers objectives from terminal states   │
│  └────────────────┘                                                │
│           ↓                                                        │
│  Action (key, position) → Environment                              │
└────────────────────────────────────────────────────────────────────┘

Key Innovations

1. JEPA for Discrete Grid Worlds

  • 1×1 patch embeddings: Each grid cell is semantically meaningful (colors are categorical)
  • Latent-space prediction: Predicts transformations (rotate, fill, move) without pixel reconstruction
  • Action-conditioned predictor: Inspired by Image World Models (Meta, 2024)
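
A minimal sketch of what a 1×1 patch embedding means for a categorical grid, written in plain Python so it runs without dependencies. The names `make_color_table` and `embed_grid`, the embedding width, and the random initialisation are illustrative assumptions; this is not the repository's `GridPatchEmbed`.

```python
import random

NUM_COLORS = 16   # from the architecture diagram: 16 colors
EMBED_DIM = 8     # assumption: small width chosen for illustration only


def make_color_table(num_colors=NUM_COLORS, dim=EMBED_DIM, seed=0):
    """One vector per color; random values stand in for trained weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_colors)]


def embed_grid(grid, color_table):
    """Turn an H x W grid of color indices into H*W tokens, one per cell.

    Each token is (row, col, embedding). A 1x1 patch is a table lookup,
    so neighbouring cells are never blended into a single token.
    """
    tokens = []
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            tokens.append((r, c, color_table[color]))
    return tokens


table = make_color_table()
grid = [[0, 3, 3],
        [0, 5, 3]]
tokens = embed_grid(grid, table)
print(len(tokens))  # one token per cell
```

Under this scheme a 64x64 observation becomes 4096 tokens, and two cells with the same color share the same color vector and differ only in position.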

2. Persistent World Model State (Critical for ARC-AGI-3)

  • RSSM state persists across levels within the same environment
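  • Later levels build on earlier ones (Level 3 requires knowledge from Levels 1 and 2), so resetting the state between levels discards what the agent needs and leads to failure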
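
The persistence rule can be sketched as a loop that creates the world-model state once per environment rather than once per level. `ToyWorldModel`, `run_environment`, and the dictionary-based state are hypothetical stand-ins and not the repository's RSSM.

```python
class ToyWorldModel:
    """Hypothetical stand-in for the RSSM: the state just accumulates observations."""

    def initial_state(self):
        return {"steps": 0, "seen": set()}

    def observe(self, state, observation):
        # Return an updated state; a real RSSM would update a recurrent hidden state here.
        return {"steps": state["steps"] + 1, "seen": state["seen"] | {observation}}


def run_environment(model, levels, persist=True):
    """Run every level of one environment and return the final state."""
    state = model.initial_state()          # created once per environment
    for level in levels:
        if not persist:
            state = model.initial_state()  # the failure mode: earlier knowledge is discarded
        for observation in level:
            state = model.observe(state, observation)
    return state


levels = [["wall", "key"], ["door"], ["key", "door", "exit"]]
kept = run_environment(ToyWorldModel(), levels, persist=True)
reset = run_environment(ToyWorldModel(), levels, persist=False)
print(sorted(kept["seen"]))
print(sorted(reset["seen"]))
```

With `persist=False` the final state no longer contains anything observed only in the first level, which is the information loss the bullet above warns about.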

3. Uncertainty-Aware Prediction

  • Tracks prediction errors over a sliding window
  • Triggers hypothesis revision when errors are consistently high
  • Prevents the failure mode of latching onto an early hypothesis
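
One way to realise a sliding-window check is shown below. The class name `SlidingErrorTracker`, the window size, the threshold, and the choice of "mean error over a full window" as the meaning of "consistently high" are all assumptions made for illustration.

```python
from collections import deque


class SlidingErrorTracker:
    """Hypothetical tracker: flags revision when recent prediction errors stay high."""

    def __init__(self, window=5, threshold=0.5):
        self.errors = deque(maxlen=window)  # old errors fall out automatically
        self.threshold = threshold

    def update(self, error):
        self.errors.append(error)

    def should_revise(self):
        # Require a full window so a single early bad prediction cannot trigger revision.
        if len(self.errors) < self.errors.maxlen:
            return False
        return sum(self.errors) / len(self.errors) > self.threshold


tracker = SlidingErrorTracker(window=3, threshold=0.5)
flags = []
for err in [0.9, 0.1, 0.2, 0.8, 0.9, 0.7]:
    tracker.update(err)
    flags.append(tracker.should_revise())
print(flags)
```

In this run the isolated 0.9 at the start does not trigger revision; the flag only turns on once the last three errors average above the threshold.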

4. Test-Time Training (TTT)

  • Per-task LoRA adapters for each novel environment
  • Fine-tunes on collected demos with geometric augmentations
  • Based on TTT for ARC (arXiv:2411.07279), which reports 53% accuracy on the ARC public validation set (ARC-AGI-1)
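
The geometric augmentations mentioned above are typically the rotations and reflections of a grid. The sketch below generates the eight such variants in plain Python; the function names are hypothetical, and which augmentations the repository actually applies is not stated here.

```python
def rotate90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def dihedral_augmentations(grid):
    """Return the 8 rotation/reflection variants of a grid."""
    variants = []
    current = [list(row) for row in grid]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


grid = [[1, 2],
        [3, 4]]
variants = dihedral_augmentations(grid)
print(len(variants))
print(rotate90(grid))
```

In an interactive setting, any cell positions attached to actions would need the same transform applied so that demonstrations stay consistent; that part is omitted from this sketch.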

Repository Structure

arc-jepa/
├── src/
│   ├── models/
│   │   ├── encoder.py           # GridPatchEmbed + ViT encoders + EMA
│   │   ├── predictor.py         # Action-conditioned predictor
│   │   ├── grid_jepa.py         # Complete Grid-JEPA system
│   │   ├── rssm.py              # Recurrent State-Space Model
│   │   ├── agent.py             # Full ARC agent (JEPA + RSSM + planning)
│   │   └── ttt_adapter.py       # LoRA TTT adapter
│   ├── data/                    # Dataset loaders + augmentations
│   ├── training/                # Training scripts
│   └── utils/                   # Utilities
├── tests/                       # Unit tests
└── README.md                    # This file

Core Components

agent.py: Complete Agent (Central Module)

  • ARCAgent: Full agent loop tying together the encoder, RSSM, planning, exploration, and goal inference, with world-model state kept across levels
  • GoalInferenceModule: Discovers objectives from terminal/done states
  • ExplorationPolicy: Novelty-seeking with undo loop avoidance
  • PlanningModule: Imagination-based action selection via RSSM rollouts
  • UncertaintyTracker: Hypothesis revision when predictions fail consistently
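
As an illustration of novelty-seeking with undo-loop avoidance, the sketch below picks the least-tried action in the current state while skipping the action that would reverse the previous move. The action names, the `INVERSE` table, and the count-based novelty measure are assumptions, not the repository's `ExplorationPolicy`.

```python
from collections import Counter

# Assumption: a hypothetical action set in which some actions reverse others.
INVERSE = {"up": "down", "down": "up", "left": "right", "right": "left"}


def choose_exploratory_action(state_key, visit_counts, last_action, actions):
    """Pick the least-tried action here, skipping the one that undoes the last move."""
    candidates = [a for a in actions if INVERSE.get(last_action) != a]
    if not candidates:            # fall back if every action was filtered out
        candidates = list(actions)
    # Fewer visits to (state, action) means higher novelty; ties keep list order.
    return min(candidates, key=lambda a: visit_counts[(state_key, a)])


visit_counts = Counter()
actions = ["up", "down", "left", "right"]
visit_counts[("s0", "up")] = 3
visit_counts[("s0", "left")] = 1
choice = choose_exploratory_action("s0", visit_counts, last_action="up", actions=actions)
print(choice)
```

Here "down" is untried but excluded because it would undo the previous "up", so the untried "right" is selected instead.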

encoder.py: Grid-JEPA Encoder

  • GridPatchEmbed: 1×1 patch embeddings for color grids
  • ViTEncoder: Multi-head attention transformer blocks
  • EMATargetEncoder: EMA-updated target encoder (prevents collapse)
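
The EMA update behind a target encoder can be sketched on a flat list of parameters. The function name `ema_update` and the default momentum of 0.996 are assumptions (values close to 1 are common in this family of methods); the repository's `EMATargetEncoder` may differ.

```python
def ema_update(target_params, online_params, momentum=0.996):
    """Move each target parameter a small step toward its online counterpart."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
# A low momentum is used here only so the effect is visible in three steps.
for _ in range(3):
    target = ema_update(target, online, momentum=0.5)
print(target)
```

Because the target is never updated by gradients and only trails the online weights, the predictor cannot trivially satisfy the loss by dragging both encoders to a constant output.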

predictor.py: Action-Conditioned Predictor

  • DiscreteActionEmbed: Embeds (action_key, cell_position) pairs
  • ActionConditionedPredictor: Predicts target patches from context + action
  • GridWorldPredictor: Full predictor + decoder to color logits

rssm.py: Recurrent State-Space Model

  • observe(): Update state with new observation (posterior)
  • imagine(): Predict next state given action (prior)
  • rollout(): Imagine future trajectories for planning
  • Straight-through gradients for discrete latents
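
How `imagine()` and `rollout()` relate, and how rollouts can drive action selection, is sketched below with a deliberately tiny deterministic model. The toy dynamics, the `plan` helper, and the exhaustive search are assumptions for illustration; the stochastic discrete latents and straight-through gradients are left out.

```python
from itertools import product


def imagine(state, action):
    """Toy prior: the latent state is a position on a line, and actions shift it."""
    return state + {"left": -1, "stay": 0, "right": 1}[action]


def rollout(state, actions):
    """Apply imagine() repeatedly and return every imagined state."""
    trajectory = []
    for action in actions:
        state = imagine(state, action)
        trajectory.append(state)
    return trajectory


def plan(state, goal, horizon=3, action_set=("left", "stay", "right")):
    """Exhaustively score short action sequences by final distance to the goal."""
    best = min(product(action_set, repeat=horizon),
               key=lambda seq: abs(rollout(state, seq)[-1] - goal))
    return best[0]  # execute only the first action, then replan


print(rollout(0, ["right", "right", "left"]))
print(plan(state=0, goal=3))
```

No environment step is taken during planning: every candidate sequence is evaluated purely inside the model, which is what makes imagination-based selection cheap relative to real interaction.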

ttt_adapter.py: Test-Time Training

  • LoRALayer: Low-rank adaptation (W' = W + BA)
  • PredictorLoRAAdapter: Per-task LoRA on JEPA predictor
  • TTTTrainer: Fine-tunes on demos with augmentation voting
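
The formula W' = W + BA can be checked on a small example in plain Python. The helper names and the `scale` parameter are assumptions added for illustration; they are not the repository's `LoRALayer` interface.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, scale=1.0):
    """Return W' = W + scale * (B @ A) without modifying the frozen W."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen 3x3 base weight and a rank-1 update: B is 3x1, A is 1x3.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 0.0, -1.0]]

W_adapted = lora_effective_weight(W, B, A)
print(W_adapted)
```

Only B and A are trained per task, so a rank-1 adapter for this 3x3 layer has 6 parameters instead of 9, and setting B to zero recovers the original weights exactly.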

Key Design Decisions

| Decision | Rationale |
|---|---|
| 1×1 patches | Grid cells are semantically meaningful, unlike image pixels |
| L2 latent loss | Pixel reconstruction would force the model to capture irrelevant visual details |
| EMA target encoder | Prevents representation collapse in self-supervised learning |
| Feature conditioning | Outperforms concatenation for action conditioning |
| Straight-through latents | Enables gradient flow through discrete RSSM states |
| State persistence | ARC-AGI-3 levels build on each other |
| Uncertainty tracking | Prevents getting stuck on wrong hypotheses |
| LoRA TTT | Efficient per-task adaptation without catastrophic forgetting |
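
Written out under assumed notation (none of these symbols appear in the repository), the latent L2 objective and the EMA update would take roughly the following form. Here $f_\theta$ is the online encoder, $f_{\bar\theta}$ the EMA target encoder, $g_\phi$ the action-conditioned predictor, $x_t$ the current grid, $a_t$ the action, $m$ the EMA momentum, and $\mathrm{sg}$ a stop-gradient.

```latex
\mathcal{L}(\theta, \phi)
  = \left\| g_\phi\big(f_\theta(x_t),\, a_t\big)
      - \mathrm{sg}\big(f_{\bar\theta}(x_{t+1})\big) \right\|_2^2,
\qquad
\bar\theta \leftarrow m\,\bar\theta + (1 - m)\,\theta
```

The loss compares embeddings rather than cell colors, which is why no pixel-level decoder is needed for training the predictor in this formulation.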

Papers

  1. I-JEPA (arXiv:2301.08243): Foundation of encoder design
  2. Image World Models (arXiv:2403.00504): Action-conditioned predictor
  3. DreamerV3 (arXiv:2301.04104): RSSM dynamics architecture
  4. TTT for ARC (arXiv:2411.07279): Per-task LoRA fine-tuning
  5. ARC-AGI-3 (arXiv:2603.24621): Competition specification

License

MIT License. Open source, as required for ARC Prize eligibility.
