🥕 Introducing BetaEarth - your own Earth embedding emulator [𝐏𝐫𝐞-𝐑𝐞𝐥𝐞𝐚𝐬𝐞]
The past year has brought many notable embedding products, like AlphaEarth, TESSERA or OlmoEarth. We are entering a phase where embeddings begin to act as a substitute for real observation data.
BetaEarth is an attempt to explore how much one can learn from a model based on its embeddings alone, and whether those embeddings can serve as a useful training target for other models. Huge credit to the AlphaEarth team for releasing the embedding archive openly — it's what made this kind of community-built extension possible.
BetaEarth is a flexible (and relatively lightweight) emulator of the AlphaEarth annual product. It reproduces neither AlphaEarth's exact outputs nor the product itself, but it reaches ~0.87 cosine similarity on held-out data and retains 97% of downstream land-cover classification accuracy. Training took only 1-2 days.
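For readers unfamiliar with the metric, here is a minimal sketch of how a mean cosine similarity between emulated and reference embeddings could be computed. The function names and the toy vectors are illustrative assumptions, not part of the BetaEarth codebase, and the actual evaluation protocol may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_cosine_similarity(emulated, reference):
    """Average per-pixel cosine similarity over paired embedding vectors."""
    if len(emulated) != len(reference) or not emulated:
        raise ValueError("inputs must be non-empty and of equal length")
    sims = [cosine_similarity(e, r) for e, r in zip(emulated, reference)]
    return sum(sims) / len(sims)


# Toy 2-D example (hypothetical values; real embeddings have many more dimensions).
reference = [[1.0, 0.0], [0.0, 1.0]]
emulated = [[1.0, 0.0], [1.0, 1.0]]
score = mean_cosine_similarity(emulated, reference)
print(round(score, 3))
```

A score of 1.0 would mean the emulated vectors point in exactly the same direction as the reference ones, so a value around 0.87 indicates close but not identical embeddings.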
It can encode any combination (including multi-temporal) of:
- Sentinel-2 L1C
- Sentinel-2 L2A
- Sentinel-1 RTC
- COP-DEM 30 product
The model weights are open, just like its training data (built exclusively using Major TOM). The GitHub repository provides a script for automated generation of embeddings across any footprint. You can also try the workflow over small bounding boxes on the free Hugging Face web app!
We should really have a release date range slider on the /models page. Tired of "trending/most downloaded" being the best way to sort and still seeing models from 2023 on the first page just because they're embedded in enterprise pipelines and get downloaded repeatedly. "Recently Created/Recently Updated" don't solve the discovery problem considering the amount of noise to sift through.
Slight caveat: Trending actually does have some recency bias, but it's not strong/precise enough.