---
pipeline_tag: fill-mask
library_name: transformers
base_model: roberta
tags:
- chemistry
- Transformers
- roberta
- ChemBERTa
- PyTorch
- MLM
datasets:
- zinc
license: mit
language:
- en
metrics:
- roc_auc
---

Note: DeepBERTa is not affiliated with Microsoft's DeBERTa model. The name is inspired by BERT-based models and refers to a DeepSMILES-trained RoBERTa variant.


# DeepBERTa_zinc_base_100k_v4

A pretrained RoBERTa-style transformer model for learning molecular representations from **DeepSMILES**-encoded chemical structures.
This model was trained on a corpus of 100,000 canonicalized molecules sampled from **ZINC**, using a BPE tokenizer with a vocabulary size of 767.
It was trained similarly to ChemBERTa, but on DeepSMILES instead of SMILES.

---

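## Preprocessing

All molecules were converted from SMILES to DeepSMILES before tokenization.
DeepSMILES is a SMILES-derived line notation that replaces paired ring-closure digits with a single ring-size digit and paired branch parentheses with closing parentheses only. Because a string can no longer contain unmatched ring-closure digits or parentheses, the notation is better suited to NLP approaches.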
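
The conversion step can be sketched with the open-source `deepsmiles` Python package (`pip install deepsmiles`). This is a minimal example, not the exact pipeline used for pretraining: the card does not state which converter options were used, so enabling both ring and branch rewriting is an assumption, and the helper names `make_encoder` and `convert_corpus` are hypothetical. With both options on, benzene `c1ccccc1` is written as `cccccc6`.

```python
from typing import Callable, Iterable, List, Tuple


def make_encoder() -> Callable[[str], str]:
    """Return a SMILES -> DeepSMILES function backed by the `deepsmiles` package."""
    import deepsmiles  # third-party; imported here so it is only needed when used

    # Assumption: both ring closures and branches are rewritten.
    converter = deepsmiles.Converter(rings=True, branches=True)
    return converter.encode


def convert_corpus(
    smiles_list: Iterable[str], encode: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Encode each SMILES string and collect the inputs the encoder rejects."""
    converted, failed = [], []
    for smiles in smiles_list:
        try:
            converted.append(encode(smiles))
        except Exception:  # keep going when a single string cannot be encoded
            failed.append(smiles)
    return converted, failed
```

Typical use is `converted, failed = convert_corpus(smiles_list, make_encoder())`, after which `converted` can be passed to the tokenizer.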

## Model Overview

- **Architecture**: RoBERTa-style encoder (6 Transformer layers, 12 attention heads)
- **Objective**: Masked Language Modeling (MLM)
- **Input Format**: DeepSMILES strings (tokenized using custom BPE)
- **Training Framework**: PyTorch + Hugging Face Transformers
- **Training Duration**: ~0.35 epochs on 100k molecules
- **Final Evaluation Loss**: **1.57**
- **Model Type**: `RobertaForMaskedLM`

---

## Training Setup

| Parameter         | Value                        |
|-------------------|------------------------------|
| Dataset           | 100,000 DeepSMILES from ZINC |
| Tokenizer Type    | Byte-Pair Encoding (BPE)     |
| Vocabulary Size   | 767                          |
| Masking Ratio     | 15%                          |
| Optimizer         | AdamW                        |
| Batch Size        | 8 per device                 |
| Training Duration | ~0.35 epochs                 |
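
To make the 15% masking ratio concrete, here is a simplified, pure-Python illustration of BERT-style masking. The 80/10/10 split (mask token / random token / unchanged) is an assumption based on the standard BERT and RoBERTa recipe; this card only states the 15% ratio. In practice, `DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)` from Transformers performs this step on token IDs. The function name `mask_tokens` is hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK_TOKEN = "<mask>"  # RoBERTa's mask token


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_prob: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected, so it is left untouched
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels
```

Each position is selected independently, so the masked fraction of a short string varies around 15% rather than matching it exactly.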


## Intended Uses

- **Unsupervised molecular representation learning**  
  Use the pretrained model to encode molecules into contextual embeddings. These embeddings capture chemical structure and substructure information and can be used as input to other models.

- **Fine-tuning on downstream property classification tasks**  
  Examples include:
  - Toxicity prediction (e.g., Tox21, ToxCast)
  - Blood-brain barrier permeability (e.g., BBBP, B3DB)
  - Binding affinity or bioactivity prediction (e.g., BACE, HIV)
  
  To use the model for these tasks, add a classification head and fine-tune on labeled datasets.

- **Masked token prediction or augmentation**

  The MLM objective enables:
  - Predicting missing or corrupted substructures in DeepSMILES strings
  - Creating chemically valid perturbations for data augmentation
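
As an illustration of the representation-learning use case, the sketch below extracts one fixed-size embedding per molecule. Mean pooling over non-padding tokens is an assumption chosen for simplicity, since this card does not prescribe a pooling strategy. The function name `embed_deepsmiles` is hypothetical, and the default checkpoint ID is the one used in the example at the end of this card.

```python
from typing import List


def embed_deepsmiles(
    strings: List[str],
    model_path: str = "aakothari/DeepBERTa_zinc_base_100k_v1",
):
    """Return one mean-pooled embedding per DeepSMILES string as a torch.Tensor."""
    # Third-party imports are kept local so they are only needed when the function runs.
    import torch
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(model_path)
    # Load the encoder only; the MLM head and the unused pooling layer are skipped.
    model = RobertaModel.from_pretrained(model_path, add_pooling_layer=False)
    model.eval()

    batch = tokenizer(strings, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    # Average over real tokens only; the attention mask zeroes out padding.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

The inputs must already be DeepSMILES strings, for example `embed_deepsmiles(["cccccc6"])`. The resulting vectors can be fed to any downstream model, such as a logistic regression or a small MLP.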


### This model is **not** intended for:

- **De novo molecular generation**  
  Masked Language Models are bidirectional and do not generate sequences autoregressively. For generation, use models like:
  - RNN/LSTM-based generators
  - Variational Autoencoders (e.g., Junction Tree VAE)

- **End-to-end property prediction without fine-tuning**  
  This model does not come with a classification head. To predict properties, it must be fine-tuned on a task-specific dataset.

- **Direct use with raw SMILES (without converting to DeepSMILES)**  
  This version is trained on **DeepSMILES**, which differs from canonical SMILES. Preprocessing is required to convert SMILES → DeepSMILES.

---

## Example Usage (fragment prediction)

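```python
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizer,
    get_linear_schedule_with_warmup,
)

# Run on a GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder hyperparameters; tune them for your dataset
BATCH_SIZE = 8
NUM_EPOCHS = 3

# Load model and tokenizer from the Hugging Face Hub
model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"
model = RobertaForMaskedLM.from_pretrained(model_path).to(device)
tokenizer = RobertaTokenizer.from_pretrained(model_path)

# Preprocess data.
# DeepSMILESDataset, collate, and train_df are not defined in this snippet and
# must be supplied by the user: a Dataset that tokenizes DeepSMILES strings, a
# collate function that pads each batch into tensors (including `labels`, which
# the model needs in order to return a loss), and a DataFrame of examples.
train_ds = DeepSMILESDataset(train_df, tokenizer)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)

# Optimizer and linear schedule with warmup over the first 10% of steps
total_steps = len(train_loader) * NUM_EPOCHS
optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

# Training loop
model.train()
for epoch in range(NUM_EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        # Drop the raw-string field; every other entry is a tensor for the model
        inputs = {k: v.to(device) for k, v in batch.items() if k != "actual_fragment_deepsmiles"}
        outputs = model(**inputs)
        loss = outputs.loss  # MLM loss; None unless `labels` is in the batch

        loss.backward()
        optimizer.step()
        scheduler.step()
```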