GeoPep
Geometric-aware Peptide-Protein Binding Site Prediction
GeoPep is a per-residue binding site predictor that combines the ESM3 protein foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide and a protein, it predicts which residues of the protein bind the peptide and which residues of the peptide make contact.
- Code: https://github.com/Dian0212/GeoPep
- Checkpoint: model_distanceLoss.ckpt (~16 GB)
Model Summary
| Property | Value |
|---|---|
| Backbone | ESM3 (sm-open-v1, 1.4B params) |
| Head | 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3 |
| Granularity | Per-residue (one prediction per amino acid) |
| Classes | 3: 0 = non-interface, 1 = interface (binding), 2 = padding |
| Input | Peptide (≤50 residues), separator, protein (≤500 residues) |
| Output | Logits [B, 3, 551] → softmax binding probability per residue |
| Training loss | Weighted cross-entropy + differentiable geometric distance loss |
How to Use
1. Install the GeoPep code
git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10 -y
conda activate geopep
pip install -r requirements.txt
2. Download the checkpoint from this repo
huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
--local-dir model_weights/ --local-dir-use-symlinks False
This places the file at model_weights/model_distanceLoss.ckpt.
3. Run inference on your PDBs
Put PDB files (named <PDBID>_<peptide_chain>_<protein_chain>.pdb) into a
folder, then:
cd scripts
python inference_pipeline.py \
--pdb-dir /path/to/pdb \
--checkpoint ../model_weights/model_distanceLoss.ckpt
Output: result/predictions.json with per-residue binding probabilities.
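Because the pipeline relies on the filename to tell the peptide chain from the protein chain, it can help to validate names before a run. The following is a minimal sketch and not part of the GeoPep code base: `parse_complex_filename` and `ComplexName` are hypothetical names, and the sketch assumes that neither the PDB ID nor the chain IDs contain underscores.

```python
from pathlib import Path
from typing import NamedTuple


class ComplexName(NamedTuple):
    pdb_id: str
    peptide_chain: str
    protein_chain: str


def parse_complex_filename(path: str) -> ComplexName:
    """Split '<PDBID>_<peptide_chain>_<protein_chain>.pdb' into its three parts."""
    p = Path(path)
    if p.suffix.lower() != ".pdb":
        raise ValueError(f"not a .pdb file: {p.name}")
    parts = p.stem.split("_")
    # Assumption: exactly three non-empty, underscore-free fields.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <PDBID>_<peptide_chain>_<protein_chain>.pdb, got: {p.name}"
        )
    return ComplexName(*parts)


name = parse_complex_filename("/path/to/pdb/1a1r_C_A.pdb")
print(name.pdb_id, name.peptide_chain, name.protein_chain)
```

A file that fails this check can be renamed before it is placed in the input folder.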
4. Result format
{
"1a1r_C_A": {
"peptide_chain": "GSVVIVGRIVLSGKPA",
"protein_chain": "VEGEVQIVSTATQTFLAT...",
"peptide_bindingProbability": "0.99 0.99 0.99 ...",
"protein_bindingProbability": "0.35 0.97 0.12 ..."
}
}
- peptide_chain / protein_chain: actual residue sequences (padding stripped).
- *_bindingProbability: space-separated 2-decimal floats, one per residue. Each value is the raw class-1 (interface) softmax probability.
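The result file can be post-processed with the standard library alone. The sketch below parses the probability strings and lists the 0-based residue indices at or above a cutoff. `load_interface_residues` is a hypothetical helper, the `demo_P_A` record is made-up illustration data rather than model output, and the 0.5 threshold is an arbitrary assumption that should be tuned for your use case.

```python
import json


def load_interface_residues(json_text: str, threshold: float = 0.5) -> dict:
    """Return, per complex and chain, the 0-based indices with probability >= threshold."""
    results = json.loads(json_text)
    selected = {}
    for complex_id, entry in results.items():
        per_chain = {}
        for chain in ("peptide", "protein"):
            sequence = entry[f"{chain}_chain"]
            # Probabilities are stored as one space-separated string per chain.
            probs = [float(x) for x in entry[f"{chain}_bindingProbability"].split()]
            if len(probs) != len(sequence):
                raise ValueError(f"{complex_id}: {chain} length mismatch")
            per_chain[chain] = [i for i, p in enumerate(probs) if p >= threshold]
        selected[complex_id] = per_chain
    return selected


# Made-up record in the documented format (not real predictions).
example = json.dumps({
    "demo_P_A": {
        "peptide_chain": "ACDE",
        "protein_chain": "MKT",
        "peptide_bindingProbability": "0.10 0.90 0.50 0.49",
        "protein_bindingProbability": "0.05 0.75 0.20",
    }
})
print(load_interface_residues(example))
```

For a real run, pass the contents of result/predictions.json as `json_text`.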
Architecture
input tokens [B, 553]          BOS + 551 residues + EOS
        ↓
ESM3 encoder
        ↓  embeddings[:, 1:552, :]   drop BOS / EOS
[B, 551, 1536]                 per-residue embeddings
        ↓  reshape
[B * 551, 1536]                each residue independent
        ↓
KAN_1: 1536 → 1153
KAN_2: 1153 → 770
KAN_3: 770 → 387
KAN_4: 387 → 3
KAN_5: 3 → 3
        ↓  reshape + permute
logits [B, 3, 551]
        ↓  softmax(dim=1)
binding probability = softmax[:, 1, :]
The per-residue head is what makes GeoPep different from the original CLS-token approach: every residue's embedding flows independently through the KAN stack, giving sharper position-level predictions.
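The last two steps of the diagram (softmax over the class dimension, then selecting class 1) can be illustrated without any deep-learning framework. This is a sketch for a single batch element, with plain Python lists standing in for tensors. The function names are hypothetical and the logits are made-up numbers.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def binding_probabilities(logits):
    """logits: one sample laid out as [3][L] (class-major, like [B, 3, 551] with B dropped).

    Returns the class-1 (interface) probability for each of the L positions.
    """
    n_classes = len(logits)
    length = len(logits[0])
    probs = []
    for i in range(length):
        column = [logits[c][i] for c in range(n_classes)]  # the 3 logits of residue i
        probs.append(softmax(column)[1])                   # keep class 1 only
    return probs


# Toy logits for L = 2 positions: rows are classes 0, 1, 2.
toy_logits = [
    [0.0, 0.0],
    [0.0, 2.0],
    [0.0, 0.0],
]
print(binding_probabilities(toy_logits))
```

Each position is normalized on its own, so the probability at one residue does not depend on the logits of its neighbors at this stage.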
Training
- Dataset: peptide-protein complexes from the PDB, encoded as paired complex/ (full structure) + interface/ (binding residues only) PDB files.
- Length filters: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
- Loss:
  - Per-half cross-entropy with class weights [0.2, 0.8, 0.0] (padding contributes zero gradient).
  - Differentiable distance loss L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues, which penalizes high binding probability at residues far from the true interface (distance computed from residue centers of mass).
- Backbone: ESM3 backbone is fine-tuned together with the KAN head.
- Optimizer: AdamW, learning rate 1e-4, weight decay 1e-4.
- Mixed precision: FP16.
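The two loss terms can be written out in a few lines of plain Python to make the arithmetic concrete. This is a sketch under stated assumptions, not the training code: valid residues are taken to be those whose label is not the padding class, the weighted cross-entropy is normalized by the sum of the applied weights (the convention PyTorch uses for a weighted mean), and how the peptide and protein halves and the two terms are combined is not specified above, so it is left out. All function names are hypothetical.

```python
import math

CLASS_WEIGHTS = [0.2, 0.8, 0.0]  # non-interface, interface, padding (from the text)
PADDING_CLASS = 2


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """probs: per-residue [p0, p1, p2]; labels: per-residue class ids.

    Assumption: the sum is divided by the total weight of the contributing residues.
    """
    num = 0.0
    den = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        if w == 0.0:
            continue  # zero-weight (padding) positions contribute nothing
        num += -w * math.log(p[y])
        den += w
    return num / den if den > 0 else 0.0


def distance_loss(p_binding, dist, labels):
    """L_dist = sum_i P_binding(i) * dist(i) / num_valid_residues.

    Assumption: 'valid' means the label is not the padding class.
    """
    valid = [i for i, y in enumerate(labels) if y != PADDING_CLASS]
    if not valid:
        return 0.0
    return sum(p_binding[i] * dist[i] for i in valid) / len(valid)


# Toy example: three positions, the last one is padding.
labels = [1, 0, 2]
p_binding = [0.9, 0.1, 0.5]
dist = [0.0, 10.0, 99.0]  # distance to the true interface (units assumed)
print(distance_loss(p_binding, dist, labels))
```

In the toy example the confident prediction sits on an interface residue (distance 0) and costs nothing, while the residue 10 units away is penalized in proportion to its predicted binding probability.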
Input Format
The model expects a single concatenated string of length 551 tokens:
PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|
| Position | Content |
|---|---|
| 0 .. 49 | Peptide (padded with <pad> to 50) |
| 50 | Separator (vertical bar) |
| 51 .. 550 | Protein (padded with <pad> to 500) |
The inference_pipeline.py script handles padding and tokenization for you;
you only provide raw PDB files.
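For readers who want to see the 551-slot layout explicitly, the sketch below builds it as a list of string tokens. It is an illustration only and does not reproduce the pipeline's tokenizer: `build_layout` is a hypothetical helper, the padding is assumed to follow each sequence as drawn in the layout line above, and the short protein string is a toy input.

```python
PEPTIDE_SLOTS = 50
PROTEIN_SLOTS = 500
PAD = "<pad>"
SEP = "|"


def build_layout(peptide: str, protein: str) -> list:
    """Return the 551-slot layout: peptide block, separator, protein block."""
    if len(peptide) > PEPTIDE_SLOTS:
        raise ValueError(f"peptide too long: {len(peptide)} > {PEPTIDE_SLOTS}")
    if len(protein) > PROTEIN_SLOTS:
        raise ValueError(f"protein too long: {len(protein)} > {PROTEIN_SLOTS}")
    # Assumption: padding is appended after the residues of each block.
    pep = list(peptide) + [PAD] * (PEPTIDE_SLOTS - len(peptide))
    prot = list(protein) + [PAD] * (PROTEIN_SLOTS - len(protein))
    return pep + [SEP] + prot


layout = build_layout("GSVVIVGRIVLSGKPA", "MKT")
print(len(layout), layout[50], layout[51])
```

The separator always lands at index 50 and the protein block always starts at index 51, whatever the actual sequence lengths are.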
Limitations
- Sequence-length cap: peptide must be ≤ 50 residues, protein ≤ 500 residues. Sequences longer than this are silently skipped at the preprocessing stage.
- Requires per-PDB chain annotation: filenames must follow the <PDBID>_<peptide_chain>_<protein_chain>.pdb convention so the pipeline knows which chain is the peptide.
- No uncertainty calibration: raw softmax probabilities are not calibrated; use them as relative scores rather than absolute confidences.
- Single peptide-protein pair per call: multi-peptide or multi-chain contexts are not supported in the standard pipeline.
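Because over-length inputs are skipped without a message, a small pre-flight report can show which complexes will be missing from the output. The sketch below is hypothetical and independent of the GeoPep scripts. It assumes you already have the sequences as strings and it checks only the upper bounds stated above.

```python
MAX_PEPTIDE_LEN = 50
MAX_PROTEIN_LEN = 500


def find_oversized(pairs: dict) -> dict:
    """pairs maps complex id -> (peptide_seq, protein_seq).

    Returns complex id -> list of reasons, only for entries that exceed a cap.
    """
    report = {}
    for complex_id, (peptide, protein) in pairs.items():
        reasons = []
        if len(peptide) > MAX_PEPTIDE_LEN:
            reasons.append(f"peptide has {len(peptide)} residues (cap {MAX_PEPTIDE_LEN})")
        if len(protein) > MAX_PROTEIN_LEN:
            reasons.append(f"protein has {len(protein)} residues (cap {MAX_PROTEIN_LEN})")
        if reasons:
            report[complex_id] = reasons
    return report


# Toy inputs: the second entry has a 51-residue peptide.
pairs = {
    "ok_P_A": ("ACDEFGHIKL", "M" * 120),
    "long_P_A": ("A" * 51, "M" * 120),
}
print(find_oversized(pairs))
```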
Citation
If you use GeoPep in your work, please cite:
@misc{geopep2026,
title = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
author = {Chen, Dian},
year = {2026},
howpublished = {\url{https://github.com/Dian0212/GeoPep}}
}
License
MIT