GeoPep

Geometric-aware Peptide-Protein Binding Site Prediction

GeoPep is a per-residue binding site predictor that combines the ESM3 protein foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide and a protein, it predicts which residues of the protein bind the peptide and which residues of the peptide make contact.

Model Summary

Backbone ESM3 (sm-open-v1, 1.4B params)
Head 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3
Granularity Per-residue (one prediction per amino acid)
Classes 3: 0 = non-interface, 1 = interface (binding), 2 = padding
Input Peptide (≤50 residues) and protein (≤500 residues), joined by a `|` separator
Output Logits [B, 3, 551] → softmax binding probability per residue
Training loss Weighted cross-entropy + differentiable geometric distance loss

How to Use

1. Install the GeoPep code

git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10 -y
conda activate geopep
pip install -r requirements.txt

2. Download the checkpoint from this repo

huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
    --local-dir model_weights/ --local-dir-use-symlinks False

This places the file at model_weights/model_distanceLoss.ckpt.

3. Run inference on your PDBs

Put PDB files (named <PDBID>_<peptide_chain>_<protein_chain>.pdb) into a folder, then:

cd scripts
python inference_pipeline.py \
    --pdb-dir /path/to/pdb \
    --checkpoint ../model_weights/model_distanceLoss.ckpt

Output: result/predictions.json with per-residue binding probabilities.
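
Because the pipeline relies on the file name to tell the peptide chain from the protein chain, it can help to check a folder before running inference. The following is a minimal sketch, not part of the GeoPep code: the helper names are hypothetical, and the regular expression assumes a 4-character PDB ID and single-character chain IDs, which the convention above does not state explicitly.

```python
import re
from pathlib import Path

# Assumed shape of <PDBID>_<peptide_chain>_<protein_chain>.pdb, e.g. 1a1r_C_A.pdb.
# The 4-character ID and single-character chains are assumptions of this sketch.
PDB_NAME = re.compile(
    r"^(?P<pdb_id>[0-9A-Za-z]{4})_(?P<peptide>[0-9A-Za-z])_(?P<protein>[0-9A-Za-z])\.pdb$"
)

def parse_pdb_name(filename):
    """Return (pdb_id, peptide_chain, protein_chain), or None if the name does not match."""
    match = PDB_NAME.match(Path(filename).name)
    if match is None:
        return None
    return match.group("pdb_id"), match.group("peptide"), match.group("protein")

def find_misnamed(pdb_dir):
    """List the *.pdb files in pdb_dir whose names do not follow the convention."""
    return sorted(
        path.name for path in Path(pdb_dir).glob("*.pdb")
        if parse_pdb_name(path.name) is None
    )
```

Calling find_misnamed("/path/to/pdb") before step 3 lists the files that would need renaming.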

4. Result format

{
  "1a1r_C_A": {
    "peptide_chain": "GSVVIVGRIVLSGKPA",
    "protein_chain": "VEGEVQIVSTATQTFLAT...",
    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
    "protein_bindingProbability": "0.35 0.97 0.12 ..."
  }
}
  • peptide_chain / protein_chain: actual residue sequences (padding stripped).
  • *_bindingProbability: space-separated 2-decimal floats, one per residue. Each value is the raw class-1 (interface) softmax probability.
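
The result file can be post-processed with the standard library alone. The sketch below turns the space-separated probability strings into per-residue lists and keeps the residues at or above a cutoff. The function names and the 0.5 threshold are illustrative assumptions, not part of GeoPep; since the scores are uncalibrated, the cutoff should be tuned for your data.

```python
import json

def interface_residues(entry, threshold=0.5):
    """Return {"peptide": [...], "protein": [...]} with (index, residue, probability)
    tuples for residues whose interface probability is >= threshold (0-based index)."""
    result = {}
    for part in ("peptide", "protein"):
        sequence = entry[f"{part}_chain"]
        probs = [float(x) for x in entry[f"{part}_bindingProbability"].split()]
        if len(probs) != len(sequence):
            raise ValueError(f"{part}: {len(probs)} probabilities for {len(sequence)} residues")
        result[part] = [
            (i, sequence[i], p) for i, p in enumerate(probs) if p >= threshold
        ]
    return result

def load_binding_sites(path, threshold=0.5):
    """Read a predictions.json file and map each complex ID to its predicted interface."""
    with open(path) as handle:
        predictions = json.load(handle)
    return {cid: interface_residues(entry, threshold) for cid, entry in predictions.items()}
```

For example, load_binding_sites("result/predictions.json") returns one entry per complex, keyed by IDs such as 1a1r_C_A.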

Architecture

input tokens [B, 553]                  BOS + 551 residues + EOS
       │
   ESM3 encoder
       ↓ embeddings[:, 1:552, :]       drop BOS / EOS
   [B, 551, 1536]                      per-residue embeddings
       ↓ reshape
   [B * 551, 1536]                     each residue independent
       ↓
   KAN_1: 1536 → 1153
   KAN_2: 1153 → 770
   KAN_3:  770 → 387
   KAN_4:  387 → 3
   KAN_5:    3 → 3
       ↓ reshape + permute
   logits [B, 3, 551]
       ↓ softmax(dim=1)
   binding probability = softmax[:, 1, :]

The per-residue head is what distinguishes GeoPep from the original CLS-token approach: every residue's embedding flows independently through the KAN stack, which gives sharper position-level predictions.
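
The last two steps of the diagram, softmax over the class dimension followed by selecting class 1, can be illustrated without any deep-learning dependency. This is a dependency-free sketch on nested lists shaped [B][3][L]; the function name is hypothetical, and the real pipeline presumably performs the same operation on tensors.

```python
import math

def binding_probability(logits):
    """logits: nested lists shaped [B][3][L]. Returns [B][L] class-1 probabilities,
    i.e. the equivalent of softmax over dim=1 followed by indexing [:, 1, :]."""
    batch_probs = []
    for sample in logits:                      # sample has shape [3][L]
        length = len(sample[0])
        row = []
        for pos in range(length):
            scores = [sample[cls][pos] for cls in range(3)]
            peak = max(scores)                 # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in scores]
            row.append(exps[1] / sum(exps))    # class 1 = interface
        batch_probs.append(row)
    return batch_probs
```

Because the softmax runs over the three classes at each position, every residue gets its own probability, independent of its neighbours at this stage.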

Training

  • Dataset: peptide–protein complexes from the PDB, encoded as paired complex/ (full structure) + interface/ (binding residues only) PDB files.
  • Length filters: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
  • Loss:
    • Per-half cross-entropy with class weights [0.2, 0.8, 0.0] (padding contributes zero gradient).
    • Differentiable distance loss L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues, which penalizes high binding probability at residues far from the true interface (distance computed from residue centers of mass).
  • Backbone: ESM3 backbone is fine-tuned together with the KAN head.
  • Optimizer: AdamW, learning rate 1e-4, weight decay 1e-4.
  • Mixed precision: FP16.
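
The two loss terms listed above can be sketched in plain Python for a single half (peptide or protein). This is an illustration of the formulas only, not the training code: the function names are hypothetical, normalizing the cross-entropy by the summed class weights is an assumption, and so is the convention that dist(i) is 0 for true interface residues.

```python
import math

CLASS_WEIGHTS = (0.2, 0.8, 0.0)  # non-interface, interface, padding (from the list above)

def weighted_cross_entropy(probs, targets, weights=CLASS_WEIGHTS):
    """probs: per-residue [p0, p1, p2] softmax outputs; targets: class index per residue.
    Zero-weight classes (padding) are skipped, so they contribute nothing.
    Assumption: the sum is normalized by the total weight of the counted residues."""
    total, norm = 0.0, 0.0
    for p, t in zip(probs, targets):
        w = weights[t]
        if w == 0.0:
            continue
        total += -w * math.log(p[t])
        norm += w
    return total / norm if norm > 0 else 0.0

def distance_loss(p_binding, distances, valid):
    """L_dist = sum_i P_binding(i) * dist(i) / num_valid_residues.
    valid[i] is False for padding positions, which are excluded from sum and count."""
    num_valid = sum(1 for v in valid if v)
    if num_valid == 0:
        return 0.0
    weighted = sum(p * d for p, d, v in zip(p_binding, distances, valid) if v)
    return weighted / num_valid
```

A residue far from the interface with a high predicted probability dominates the distance term, while a confident prediction on a true interface residue (distance 0 under the assumption above) adds nothing.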

Input Format

The model expects a single concatenated string of length 551 tokens:

PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|
Position    Content
0 .. 49     Peptide, then <pad> up to 50 tokens
50          Separator |
51 .. 550   Protein, then <pad> up to 500 tokens

The inference_pipeline.py script handles padding and tokenization for you; you only provide raw PDB files.
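
For readers who want to see the layout concretely, here is a minimal sketch that builds the 551-token sequence from two raw strings. It is not the pipeline's own code: the function and constant names are hypothetical, and placing the pad tokens after each sequence follows the diagram above rather than a confirmed implementation detail.

```python
PAD, SEP = "<pad>", "|"
PEPTIDE_LEN, PROTEIN_LEN = 50, 500

def build_layout(peptide, protein):
    """Return a list of 551 tokens: 50 peptide slots, one separator, 500 protein slots.
    Assumption: each sequence comes first and <pad> fills the remaining slots."""
    if len(peptide) > PEPTIDE_LEN:
        raise ValueError(f"peptide has {len(peptide)} residues, limit is {PEPTIDE_LEN}")
    if len(protein) > PROTEIN_LEN:
        raise ValueError(f"protein has {len(protein)} residues, limit is {PROTEIN_LEN}")
    peptide_slots = list(peptide) + [PAD] * (PEPTIDE_LEN - len(peptide))
    protein_slots = list(protein) + [PAD] * (PROTEIN_LEN - len(protein))
    return peptide_slots + [SEP] + protein_slots
```

The separator always lands at index 50 and the protein always starts at index 51, which is what lets the output be split back into peptide and protein halves.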

Limitations

  • Sequence-length cap: peptide must be ≤ 50 residues, protein ≤ 500 residues. Sequences longer than this are silently skipped at the preprocessing stage.
  • Requires per-PDB chain annotation: filenames must follow the <PDBID>_<peptide_chain>_<protein_chain>.pdb convention so the pipeline knows which chain is the peptide.
  • No uncertainty calibration: raw softmax probabilities are not calibrated; use them as relative scores rather than absolute confidences.
  • Single peptide-protein pair per call: multi-peptide or multi-chain contexts are not supported in the standard pipeline.

Citation

If you use GeoPep in your work, please cite:

@misc{geopep2026,
  title  = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
  author = {Chen, Dian},
  year   = {2026},
  howpublished = {\url{https://github.com/Dian0212/GeoPep}}
}

License

MIT
