---
pipeline_tag: fill-mask
library_name: transformers
base_model: roberta
tags:
- chemistry
- Transformers
- roberta
- ChemBERTa
- PyTorch
- MLM
datasets:
- zinc
license: mit
language:
- en
metrics:
- roc_auc
---
Note: DeepBERTa is not affiliated with Microsoft's DeBERTa model. The name is inspired by BERT-based models and refers to a DeepSMILES-trained RoBERTa variant.
# DeepBERTa_zinc_base_100k_v4
A pretrained RoBERTa-style transformer model for learning molecular representations from **DeepSMILES**-encoded chemical structures.
This model was trained on a corpus of 100,000 canonicalized molecules sampled from **ZINC**, using a BPE tokenizer with a vocabulary size of 767.
It was trained similarly to ChemBERTa, but on DeepSMILES rather than SMILES.
---
## Preprocessing
All molecules were converted to DeepSMILES from SMILES before tokenization.
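DeepSMILES is a SMILES-derived line notation that avoids paired symbols: each ring closure is written as a single digit giving the ring size instead of a matched pair of digits, and branches use only closing parentheses. Removing these long-range matching constraints makes the notation better suited to NLP approaches.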
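
The conversion step can be sketched as follows. This is a minimal example that assumes the third-party `deepsmiles` package and its `Converter(rings=True, branches=True)` options; the card does not say which tool the authors used, and the helper name `to_deepsmiles` is hypothetical.

```python
def to_deepsmiles(smiles_list, encode=None):
    """Encode a list of SMILES strings as DeepSMILES.

    `encode` is any callable mapping one SMILES string to one DeepSMILES
    string. When omitted, the encoder from the `deepsmiles` package is used
    (an assumption; install it with `pip install deepsmiles`).
    """
    if encode is None:
        import deepsmiles  # imported lazily so the helper works with a custom encoder

        # Apply both DeepSMILES transformations: ring closures and branches
        encode = deepsmiles.Converter(rings=True, branches=True).encode
    return [encode(smiles) for smiles in smiles_list]
```

With the package installed, `to_deepsmiles(["c1ccccc1", "CC(=O)O"])` returns the DeepSMILES strings to pass to the tokenizer.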
## Model Overview
- **Architecture**: RoBERTa-style encoder (6 Transformer layers, 12 attention heads)
- **Objective**: Masked Language Modeling (MLM)
- **Input Format**: DeepSMILES strings (tokenized using custom BPE)
- **Training Framework**: PyTorch + Hugging Face Transformers
- **Training Duration**: ~0.35 epochs on 100k molecules
- **Final Evaluation Loss**: **1.57**
- **Model Type**: `RobertaForMaskedLM`
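
To put the evaluation loss in context, it can be converted to a perplexity. The sketch below assumes the reported 1.57 is the mean token-level cross-entropy in nats, which is what `RobertaForMaskedLM` returns as `loss`; the card does not state this explicitly.

```python
import math

eval_loss = 1.57  # final evaluation loss reported above
vocab_size = 767  # tokenizer vocabulary size reported above

# Perplexity is the exponential of the mean cross-entropy (in nats)
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.2f}")  # perplexity ≈ 4.81

# A model guessing uniformly over the vocabulary would have perplexity == vocab_size
print(f"uniform baseline = {vocab_size}")
```

Under that assumption, the model is on average about as uncertain as a uniform choice among roughly five tokens at each masked position.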
---
## Training Setup
| Parameter | Value |
|---------------------------|--------------------------------------------|
| Dataset | 100,000 DeepSMILES from ZINC |
| Tokenizer Type | Byte-Pair Encoding (BPE) |
| Vocabulary Size | 767 |
| Masking Ratio | 15% |
| Optimizer | AdamW |
| Batch Size | 8 per device |
| Training Duration | ~0.35 epochs |
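
The 15% masking ratio can be illustrated with a minimal pure-Python sketch of BERT-style corruption. The 80/10/10 split between mask token, random token, and unchanged token matches the default behaviour of `DataCollatorForLanguageModeling` in Transformers; that the authors used this exact scheme is an assumption, and `mask_tokens` and its parameters are hypothetical names.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=None):
    """Return (corrupted_ids, labels) for one sequence.

    labels hold the original id at positions selected for prediction and
    -100 elsewhere, the index that PyTorch's cross-entropy loss ignores.
    Special tokens are not excluded here, unlike in a real data collator.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return corrupted, labels
```

For example, `mask_tokens(ids, vocab_size=767, mask_id=4, seed=0)` selects roughly 15% of positions; `mask_id=4` is a placeholder, so read the real value from `tokenizer.mask_token_id`.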
## Intended Uses
- **Unsupervised molecular representation learning**
  Use the pretrained model to encode molecules into contextual embeddings. These embeddings capture structure and substructure information and can serve as input features for other models.
- **Fine-tuning on downstream property classification tasks**
  Examples include:
  - Toxicity prediction (e.g., Tox21, ToxCast)
  - Blood-brain barrier permeability (e.g., BBBP, B3DB)
  - Binding affinity or bioactivity prediction (e.g., BACE, HIV)

  To use the model for these tasks, add a classification head and fine-tune on labeled datasets.
- **Masked token prediction or augmentation**
  The MLM objective enables:
  - Predicting missing or corrupted substructures in DeepSMILES strings
  - Generating candidate perturbations for data augmentation
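
For the representation-learning use case, one common way to obtain a single vector per molecule is to mean-pool the encoder's last hidden states over non-padding tokens. The sketch below is one such approach under stated assumptions, not the authors' method: `mean_pool` and `embed` are hypothetical helpers, and the default repository id is the one used in the example on this card.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


def embed(deepsmiles_list, model_path="aakothari/DeepBERTa_zinc_base_100k_v1"):
    """Encode DeepSMILES strings into one embedding vector per molecule."""
    from transformers import RobertaModel, RobertaTokenizer  # imported lazily

    tokenizer = RobertaTokenizer.from_pretrained(model_path)
    model = RobertaModel.from_pretrained(model_path).eval()  # encoder only, no LM head
    batch = tokenizer(deepsmiles_list, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden, batch["attention_mask"])
```

The resulting tensor has one row per input string and can be fed to any downstream model. For the fine-tuning route, `RobertaForSequenceClassification.from_pretrained(model_path, num_labels=...)` attaches a randomly initialized classification head to the same encoder.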
### This model is **not** intended for:
- **De novo molecular generation**
  Masked language models are bidirectional and do not generate sequences autoregressively. For generation, use models such as:
  - RNN/LSTM-based generators
  - Variational autoencoders (e.g., Junction Tree VAE)
- **End-to-end property prediction without fine-tuning**
  This model does not come with a classification head. To predict properties, it must be fine-tuned on a task-specific dataset.
- **Direct use with raw SMILES (without converting to DeepSMILES)**
  This version is trained on **DeepSMILES**, whose syntax differs from SMILES. Convert SMILES → DeepSMILES before tokenization.
---
## Example Usage (fragment prediction)
```python
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizer,
    get_linear_schedule_with_warmup,
)

# Placeholder hyperparameters; adjust for your task
BATCH_SIZE = 8
NUM_EPOCHS = 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer from Hugging Face
model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"
model = RobertaForMaskedLM.from_pretrained(model_path).to(device)
tokenizer = RobertaTokenizer.from_pretrained(model_path)

# Preprocess data. DeepSMILESDataset, collate, and train_df are user-defined and
# not shown here; batches are expected to hold tensors the model accepts
# (e.g., input_ids, attention_mask, labels).
train_ds = DeepSMILESDataset(train_df, tokenizer)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)

# Optimizer and scheduler (linear warmup over the first 10% of steps)
total_steps = len(train_loader) * NUM_EPOCHS
optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

# Training loop
model.train()
for epoch in range(NUM_EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        # Drop the raw fragment string; the model only accepts tensor inputs
        inputs = {k: v.to(device) for k, v in batch.items() if k != "actual_fragment_deepsmiles"}
        outputs = model(**inputs)
        loss = outputs.loss  # MLM loss; requires `labels` in the batch
        loss.backward()
        optimizer.step()
        scheduler.step()
```