---
license: cc-by-nc-sa-2.0
---
<div align='center'>
<h1> MolCRAFT Series for Drug Design: MolPilot </h1>

[![Github](https://img.shields.io/badge/Github-4d8cd8?style=for-the-badge&logo=github&logoColor=white)](https://github.com/AlgoMole/MolCRAFT/tree/master)
[![Blog](https://img.shields.io/badge/Blog-3858bf?style=for-the-badge&logo=homepage&logoColor=white)](https://MolCRAFT-GenSI.github.io/)
[![Data](https://img.shields.io/badge/Data-4d8cd8?style=for-the-badge&logo=googledrive&logoColor=white)](https://drive.google.com/drive/folders/16KiwfMGUIk4a6mNU20GnUd0ah-mjNlhC?usp=share_link)

</div>


Welcome to the official repository for the MolCRAFT series of projects! This series develops and improves deep learning models for **structure-based drug design (SBDD)** and **structure-based molecule optimization (SBMO)**. Our goal is to generate molecules with high binding affinity and plausible 3D conformations.

This repository contains the source code for the following projects:

* [**MolCRAFT**: Structure-Based Drug Design in Continuous Parameter Space](https://arxiv.org/abs/2404.12141) (ICML'24)
* [**MolJO**: Empower Structure-Based Molecule Optimization with Gradient Guided Bayesian Flow Networks](https://arxiv.org/abs/2411.13280) (ICML'25)
* [**MolPilot**: Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule](https://arxiv.org/abs/2505.07286) (ICML'25)

## 📜 Overview

The MolCRAFT series addresses critical challenges in generative models for SBDD, including modeling molecular geometries, handling hybrid continuous-discrete spaces, and optimizing molecules against protein targets. Each project introduces novel methodologies and achieves **state-of-the-art** performance on relevant benchmarks.

## 🧭 Navigation

| Folder | TL;DR | Description |
| ------ | ----- | ----------- |
| [MolCRAFT](https://huggingface.co/GenSI/MolCRAFT) | Unified Space for Molecule Generation | MolCRAFT is the first SBDD generative model based on Bayesian Flow Networks (BFN). It operates in a unified continuous parameter space for different modalities and uses a variance-reduced sampling strategy to generate high-quality samples with more than 10x speedup. |
| [MolJO](https://huggingface.co/GenSI/MolJO) | Gradient-Guided Molecule Optimization | MolJO is a gradient-based Structure-Based Molecule Optimization (SBMO) framework derived within BFN. It employs joint guidance across continuous coordinates and discrete atom types, alongside a backward correction strategy for effective optimization. |
| [MolPilot](https://huggingface.co/GenSI/MolPilot) | Optimal Scheduling | MolPilot enhances SBDD by introducing a VLB-Optimal Scheduling (VOS) strategy for the twisted multimodal probability paths. This significantly improves molecular geometries and interaction modeling, achieving a 95.9% PB-Valid rate. |


---

# 🚀 MolPilot
Official implementation of the ICML 2025 paper ["Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule"](https://arxiv.org/abs/2505.07286).

<!-- ![](https://github.com/AlgoMole/MolCRAFT/blob/master/asset/molpilot_vos.png) -->

We propose VLB-Optimal Scheduling (VOS) and demonstrate its generality on both a popular diffusion-based model (TargetDiff, with code in the `targetdiff` folder) and a BFN-based model (our MolPilot).

In fact, VOS can be easily integrated into other frameworks, with only minor changes w.r.t. training:

```python
# Example: TargetDiff molopt_score_model.py (abridged excerpt, not runnable on its own)

class ScorePosNet3D(nn.Module):
  def get_diffusion_loss(...):
    ##### Original Training Loss #####
    # Time step used to perturb the continuous atom positions.
    time_step, pt = self.sample_time(num_graphs, protein_pos.device, self.sample_time_method)
    # Xt = a.sqrt() * X0 + (1-a).sqrt() * eps
    ligand_pos_perturbed = a_pos.sqrt() * ligand_pos + (1.0 - a_pos).sqrt() * pos_noise  # pos_noise * std

    ##### VOS Generalized Loss #####
    # A second, separately sampled time step used to perturb the discrete atom types.
    time_step_v, pt = self.sample_time(num_graphs, protein_pos.device, self.sample_time_method)
    # Vt = a * V0 + (1-a) / K
    log_ligand_v0 = index_to_log_onehot(ligand_v, self.num_classes)
    ligand_v_perturbed, log_ligand_vt = self.q_v_sample(log_ligand_v0, time_step_v, batch_ligand)
    kl_v = self.compute_v_Lt(log_v_model_prob=log_v_model_prob, log_v0=log_ligand_v0,
                             log_v_true_prob=log_v_true_prob, t=time_step_v, batch=batch_ligand)
```
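
To make the two perturbation formulas in the comments concrete, here is a minimal, framework-free sketch in plain Python. It is an illustration only, not the TargetDiff or MolPilot code: the function names `perturb_pos` and `perturb_type_probs`, the toy inputs, and the values of `a_pos` and `a_v` are all assumptions made for this example. The point it shows is that positions and atom types can each be noised at their own level, which is what drawing a separate `time_step_v` makes possible.

```python
import math
import random


def perturb_pos(x0, a, rng):
    """Continuous noising: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


def perturb_type_probs(onehot, a):
    """Discrete noising: q(v_t | v_0) = a * onehot(v_0) + (1 - a) / K."""
    k = len(onehot)
    return [a * p + (1.0 - a) / k for p in onehot]


rng = random.Random(0)
ligand_pos = [1.0, -2.0, 0.5]      # one atom's xyz coordinates (toy values)
ligand_v0 = [0.0, 1.0, 0.0, 0.0]   # one-hot atom type with K = 4 classes

# Hypothetical signal levels: each modality gets its own value,
# mirroring the separate time_step / time_step_v draws above.
a_pos, a_v = 0.9, 0.4

pos_t = perturb_pos(ligand_pos, a_pos, rng)
probs_t = perturb_type_probs(ligand_v0, a_v)
print(pos_t)
print(probs_t)
```

With `a_v` smaller than `a_pos`, the atom-type distribution is pushed closer to uniform while the coordinates stay near their clean values.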

The optimal test-time noise schedule can be obtained by first storing the gridded loss surface values, and then running the dynamic programming script in `test/test_geodesic_budget.py`.
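
As a rough illustration of dynamic programming over a gridded loss surface, the sketch below finds the cheapest monotone path across a small 2D grid. It is not the repository's `test/test_geodesic_budget.py`, which may define costs differently and enforce a step budget. The function name `min_cost_monotone_path`, the axis assignment (row index for the position time, column index for the atom-type time), the allowed moves, and the toy surface are all assumptions made for this example.

```python
def min_cost_monotone_path(cost):
    """Return (total_cost, path) for the cheapest monotone path on a 2D grid.

    cost[i][j] is a hypothetical loss value at position-time index i and
    type-time index j. Each move advances i, j, or both by one.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    best = [[inf] * m for _ in range(n)]
    prev = [[None] * m for _ in range(n)]
    best[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            for di, dj in ((1, 0), (0, 1), (1, 1)):
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0 and best[pi][pj] + cost[i][j] < best[i][j]:
                    best[i][j] = best[pi][pj] + cost[i][j]
                    prev[i][j] = (pi, pj)
    # Backtrack from the end point to recover the sequence of grid cells.
    path, node = [], (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = prev[node[0]][node[1]]
    return best[n - 1][m - 1], path[::-1]


# Toy 3x3 surface in which the cheap cells lie off the diagonal.
surface = [
    [1.0, 1.0, 9.0],
    [9.0, 9.0, 1.0],
    [9.0, 9.0, 1.0],
]
total, schedule = min_cost_monotone_path(surface)
print(total, schedule)
```

The recovered path is a list of `(i, j)` index pairs. Mapping those indices back to the grid's time values would give one time pair per sampling step.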

<!-- ![](../asset/molpilot_top1_bond_len_angle.png) -->

## Environment
If a Linux server with an NVIDIA GPU is available, we highly recommend installing via Docker.

Otherwise, see the [README for env](docker/README.md) for further details on the Docker or conda setup.

### Prerequisite
Docker with `nvidia-container-runtime` enabled is required on your Linux system.

> [!TIP]
> - This repo provides an easy-to-use script to install Docker and nvidia-container-runtime: in `./docker`, run `sudo ./setup_docker_for_host.sh` to set up your host machine.
> - For details, please refer to the [install guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).


### Install via Docker
We highly recommend setting up the environment via Docker, since it only takes a single `make` command.
```bash
cd ./docker
make
```

-----
## Data
We use the same data as [TargetDiff](https://github.com/guanjq/targetdiff/tree/main?tab=readme-ov-file#data). Data used for training and evaluating the model should be placed in the `data` folder by default, and is available in the [data](https://drive.google.com/drive/folders/1j21cc7-97TedKh_El5E34yI8o5ckI7eK?usp=share_link) Google Drive folder.

To train the model from scratch, download the lmdb file and the split file into the `data` folder:
* `crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb`
* `crossdocked_pocket10_pose_split.pt`

To evaluate the model on the test set, download _and_ unzip `test_set.zip` into the `data` folder. It includes the original PDB files that will be used in Vina Docking.

```yaml
data:
  name: pl # [pl, pl_tr] where tr means offline-transformed
```
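
Before starting a training run, it can help to confirm that the two training files listed above are actually in place. The snippet below is a small optional sketch, not part of the repository. The helper name `missing_files` is hypothetical, and it assumes the default `data` folder relative to the current working directory.

```python
from pathlib import Path

# File names taken from the list above; the default folder is `data`.
TRAIN_FILES = (
    "crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb",
    "crossdocked_pocket10_pose_split.pt",
)


def missing_files(data_dir, names=TRAIN_FILES):
    """Return the entries of `names` that do not exist under `data_dir`."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_files("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All training files found.")
```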

---
## Training
```bash
python train_bfn_twisted.py --exp_name ${EXP_NAME} --revision ${REVISION} --config_file configs/crossdock_train_test.yaml  --time_decoupled
```

The remaining hyperparameters should match the following default values:
```bash
python train_bfn_twisted.py --sigma1_coord 0.05 --beta1 1.5 --beta1_bond 1.5 --lr 5e-4 --time_emb_dim 0  --self_condition --epochs 30 --batch_size 16 --max_grad_norm Q --scheduler plateau --destination_prediction True --use_discrete_t True --num_samples 10 --sampling_strategy end_back_pmf --sample_num_atoms ref --ligand_atom_mode add_aromatic
```

### Debugging
```bash
python train_bfn_twisted.py --no_wandb --debug --epochs 1
```

## Sampling
We provide the pretrained MolPilot checkpoint [here](https://drive.google.com/file/d/1c-lD3yfRx6JlbTWq-jAdirrK6sK2lGLq/view?usp=share_link). 


### Sampling for pockets in the testset
To sample for CrossDock, set the `CONFIG` to `configs/crossdock_train_test.yaml`. For PoseBusters, set it to `configs/posebusters_test.yaml`.

```bash
# Sample with time scheduler
python train_bfn_twisted.py --config_file ${CONFIG} --ckpt_path ${CKPT_PATH} --time_scheduler_path ${TIME_SCHEDULER} --test_only --exp_name ${EXP_NAME} --revision ${REVISION} --num_samples ${NUM_MOLS_PER_POCKET} --sample_steps 100 --eval_batch_size ${BATCH_SIZE}
```

### Sampling from pdb file
To sample from a whole-protein PDB file, we need the corresponding reference ligand in order to clip the protein pocket (a 10 Å region around the reference ligand position).

```bash
python sample_for_pocket.py --protein_path ${PDB_PATH} --ligand_path ${SDF_PATH} --time_scheduler_path ${TIME_SCHEDULER} --num_samples ${NUM_MOLS_PER_POCKET}
```
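
The pocket-clipping idea described above can be sketched as a simple distance filter. This is an illustration under stated assumptions, not the logic of `sample_for_pocket.py`, which may select whole residues or use a different distance criterion. The function name `clip_pocket`, the `(label, (x, y, z))` input format, and the toy coordinates are hypothetical.

```python
import math


def clip_pocket(protein_atoms, ligand_coords, radius=10.0):
    """Keep protein atoms within `radius` angstroms of any ligand atom.

    protein_atoms: list of (label, (x, y, z)) tuples (hypothetical format).
    ligand_coords: list of (x, y, z) tuples for the reference ligand.
    """
    kept = []
    for label, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= radius for lig in ligand_coords):
            kept.append((label, xyz))
    return kept


# Toy input: three protein atoms on a line and a two-atom reference ligand.
protein = [
    ("ALA1:CA", (0.0, 0.0, 0.0)),
    ("GLY2:CA", (8.0, 0.0, 0.0)),
    ("SER3:CA", (25.0, 0.0, 0.0)),
]
ligand = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
pocket = clip_pocket(protein, ligand)
print([label for label, _ in pocket])
```

In this toy input, the atom at 25 Å is more than 10 Å from every ligand atom, so it is excluded from the pocket.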

## Evaluation

### Evaluating meta files
We provide our samples as `molpilot_ref_vina_docked.pt` on CrossDock in the [sample](https://drive.google.com/drive/folders/1A3Mthm9ksbfUnMCe5T2noGsiEV1RfChH?usp=sharing) Google Drive folder.

<!-- TODO:, together with all the baseline results on PoseBusters in the [sample_posebusters]() folder. -->


## Citation

```
@article{qiu2025piloting,
  title={Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule},
  author={Qiu, Keyue and Song, Yuxuan and Fan, Zhehuan and Liu, Peidong and Zhang, Zhe and Zheng, Mingyue and Zhou, Hao and Ma, Wei-Ying},
  journal={ICML 2025},
  year={2025}
}
```