eSFM — Enzyme ↔ Substrate Specificity Foundation Model
Paper: Vibe Coding Specificity Foundation Models · doi: 10.64898/2026.06.04.730134 · All VC-SFM models: huggingface.co/SFM-BIIE-ETHZ · Code: github.com/SFM-BIIE-ETHZ/Vibe-Coding-SFMs
What it does
This SFM learns a joint embedding space for enzyme protein sequences and substrate small molecules (SMILES) via contrastive learning on reaction data. Given an enzyme, the model can rank candidate substrates (or, in the reverse direction, rank candidate enzymes for a substrate), and it can score individual enzyme–substrate pairs directly.
| Component | Model |
|---|---|
| Agent encoder | ESM-2 (protein language model) |
| Target encoder | MoLFormer-XL (SMILES) |
| Training data | ReactZyme (177,389 enzymes / 2,646 substrates; 177,442 pairs) |
| Split | Sequence-identity 100% holdout · fold 0 |
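To make the idea of contrastive learning over paired embeddings concrete, here is a minimal NumPy sketch of a generic CLIP-style symmetric InfoNCE loss. This is an assumption for illustration only: this card does not state which contrastive objective, temperature, or batching scheme eSFM uses, and the function name `symmetric_info_nce` and the default `temperature` are hypothetical.

```python
import numpy as np


def symmetric_info_nce(agent_embs, target_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's confirmed loss).

    agent_embs[i] and target_embs[i] are assumed to be a true pair;
    every other row in the batch is treated as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = agent_embs / np.linalg.norm(agent_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = (a @ t.T) / temperature

    def cross_entropy_on_diagonal(lg):
        # Row-wise log-softmax with the usual max-subtraction for stability.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the agent -> target and target -> agent directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Well-aligned pairs give a loss close to zero, while embeddings that carry no pairing information give a loss of about `log(batch_size)`. Because many enzymes share a substrate in this kind of data, a real training loop would probably need to handle in-batch duplicates of the same target, which this sketch ignores.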
Performance — pool-512 retrieval (from the paper)
Evaluation uses pool-512 retrieval: each test pair's true target is placed in a pool of 512 candidates (1 positive + 511 random negatives), and the candidates are ranked by cosine similarity to the query. Results are averaged over 100 random trials at the best-validation checkpoint. The random baseline for R@1 is 1/512 ≈ 0.2%.

The values below are the mean ± SD reported in the paper for this SFM over folds 0–3 of the 5-fold cross-validation (fold 4 was excluded for split degeneracy). The released checkpoint is the fold-0, identity-100 model trained with the identical configuration, data, and split.
| Direction | R@1 (%) | R@5 (%) | R@10 (%) |
|---|---|---|---|
| enzyme → substrate | 86.3 ± 1.5 | 95.3 ± 1.0 | 97.1 ± 0.8 |
| substrate → enzyme | 61.8 ± 0.4 | 88.2 ± 0.2 | 93.5 ± 0.5 |
The released checkpoint is fold 0. The paper reports only the cross-fold mean and does not tabulate per-fold values, so the numbers above are not specific to this checkpoint.
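The pool-512 protocol described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's evaluation script: the function names `pool_recall_at_k` and `chance_recall_at_k` are hypothetical, negatives are drawn by pair index (so it assumes each pair has a distinct target), and ties are resolved in favour of the positive.

```python
import numpy as np


def chance_recall_at_k(k, pool_size=512):
    """Recall@k (in %) of a random ranker over a pool with one positive."""
    return 100.0 * min(k, pool_size) / pool_size


def pool_recall_at_k(query_embs, target_embs, pool_size=512,
                     ks=(1, 5, 10), n_trials=100, seed=0):
    """Estimate pool retrieval recall@k (in %) for paired embeddings.

    query_embs[i] and target_embs[i] are assumed to be a true pair.
    """
    n = len(query_embs)
    if n < pool_size:
        raise ValueError("need at least pool_size pairs to draw negatives")

    rng = np.random.default_rng(seed)
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)

    hits = {k: 0 for k in ks}
    for _ in range(n_trials):
        for i in range(n):
            # Draw pool_size - 1 negatives from every index except the positive.
            others = np.delete(np.arange(n), i)
            negatives = rng.choice(others, size=pool_size - 1, replace=False)
            pool = np.concatenate(([i], negatives))  # positive sits at slot 0
            scores = t[pool] @ q[i]
            # Rank of the positive = number of candidates scoring strictly higher.
            rank = int(np.sum(scores > scores[0]))
            for k in ks:
                hits[k] += int(rank < k)

    total = n_trials * n
    return {k: 100.0 * hits[k] / total for k in ks}
```

Calling `chance_recall_at_k(1)` reproduces the roughly 0.2% random baseline quoted above. For the substrate → enzyme direction, the same function would be called with the two embedding arrays swapped.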
Quick start
```python
from huggingface_hub import hf_hub_download
import torch
import torch.nn.functional as F

ckpt_path = hf_hub_download("SFM-BIIE-ETHZ/eSFM_VC-SFM", "model.pth")

# Load with the Vibe-Coding-SFMs codebase
# (https://github.com/SFM-BIIE-ETHZ/Vibe-Coding-SFMs)
from calm.encoder.model import CALMEncoder

model = CALMEncoder.from_pretrained(ckpt_path)
model.eval()

agent_emb = model.encode_query("MKALLIVLGLVSSVSQASST...")  # enzyme protein
target_emb = model.encode_target("OC(=O)c1ccccc1")         # substrate SMILES
score = F.cosine_similarity(agent_emb, target_emb, dim=-1)
```
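To go from scoring a single pair to retrieving the most likely substrates, one option is to embed several candidate SMILES and rank them by cosine similarity to the enzyme embedding. The sketch below shows only the ranking step on plain NumPy arrays. The helper `rank_candidates` is hypothetical and not part of the Vibe-Coding-SFMs codebase, and it assumes the embeddings have already been converted to arrays (for example with `.detach().cpu().numpy()`, if the encoder returns torch tensors).

```python
import numpy as np


def rank_candidates(query_emb, candidate_embs, candidate_names, top_k=5):
    """Rank candidates by cosine similarity to one query embedding.

    query_emb:       array of shape (dim,) or (1, dim)
    candidate_embs:  array of shape (n_candidates, dim)
    candidate_names: labels (e.g. SMILES strings), one per candidate row
    Returns a list of (name, score) tuples, best match first.
    """
    q = np.asarray(query_emb, dtype=float).ravel()
    c = np.asarray(candidate_embs, dtype=float)

    # L2-normalise so that the dot product is the cosine similarity.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)

    scores = c @ q
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(candidate_names[i], float(scores[i])) for i in order]
```

A call such as `rank_candidates(enzyme_emb, substrate_embs, smiles_list, top_k=10)` would return the ten highest-scoring candidates. The variable names in that call are placeholders for arrays and lists you build yourself.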
Files in this repo
| File | Description |
|---|---|
| `model.pth` | Released checkpoint · fold 0 · identity-100 split |
| `results_train_val_test.csv` | Per-epoch training/validation/test logs (training-time batch metrics, not the pool-512 numbers above) |
Citation
```bibtex
@article{reddy2026vcsfm,
  title   = {Vibe Coding Specificity Foundation Models},
  author  = {Reddy, Sai T.},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.06.04.730134}
}
```
License
Released under the SFM Research Preview License v1.0-preview (see LICENSE.md).
Free for research use — academic, non-profit, government, and industry research. The specific
molecules disclosed in the accompanying preprints are dedicated to the public. Commercial-use
and patent-licensing terms are deferred and being arranged with ETH Zürich / BIIE; the SFM
architectures and training methods are the subject of pending patent applications.
For commercial enquiries: sai.reddy@ethz.ch