| pipeline_tag: fill-mask | |
| library_name: transformers | |
| base_model: roberta | |
| tags: | |
| - chemistry | |
| - Transformers | |
| - roberta | |
| - ChemBERTa | |
| - PyTorch | |
| - MLM | |
| datasets: | |
| - zinc | |
| license: mit | |
| language: | |
| - en | |
| metrics: | |
| - roc_auc | |
| Note: DeepBERTa is not affiliated with Microsoft's DeBERTa model. The name is inspired by BERT-based models and refers to a DeepSMILES-trained RoBERTa variant. | |
| # DeepBERTa_zinc_base_100k_v4 | |
| A pretrained RoBERTa-style transformer model for learning molecular representations from **DeepSMILES**-encoded chemical structures. | |
| This model was trained on a corpus of 100,000 canonicalized molecules sampled from **ZINC**, using a BPE tokenizer with a vocabulary size of 767. | |
| It was trained similarly to ChemBERTa, but on DeepSMILES instead of SMILES. | |
| --- | |
| ## Preprocessing | |
| All molecules were converted to DeepSMILES from SMILES before tokenization. | |
| DeepSMILES is a SMILES-derived notation that replaces paired ring-closure digits with a single ring-size digit and uses only closing parentheses for branches. Because no symbols need to be matched in pairs, it is better suited to NLP approaches. | |
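The conversion step can be sketched with the open-source `deepsmiles` Python package. This is a minimal sketch rather than the author's pipeline: the model card does not say which tool or settings were used, so the `Converter(rings=True, branches=True)` configuration and the helper name `smiles_to_deepsmiles` are assumptions. The import is guarded so the snippet still loads when the package is not installed.

```python
try:
    import deepsmiles  # third-party package: pip install deepsmiles
except ImportError:    # keep the module importable without the dependency
    deepsmiles = None


def smiles_to_deepsmiles(smiles: str) -> str:
    """Convert one SMILES string to DeepSMILES (hypothetical helper)."""
    if deepsmiles is None:
        raise RuntimeError("Install the 'deepsmiles' package to convert SMILES.")
    # Assumed settings: rewrite both ring closures and branches.
    converter = deepsmiles.Converter(rings=True, branches=True)
    return converter.encode(smiles)


if __name__ == "__main__" and deepsmiles is not None:
    # Benzene: the paired ring-closure digits disappear in the output.
    print(smiles_to_deepsmiles("c1ccccc1"))
```

Every input string should pass through a conversion like this before tokenization, because the tokenizer vocabulary was built from DeepSMILES text.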
| ## Model Overview | |
| - **Architecture**: RoBERTa-style encoder (6 Transformer layers, 12 attention heads) | |
| - **Objective**: Masked Language Modeling (MLM) | |
| - **Input Format**: DeepSMILES strings (tokenized using custom BPE) | |
| - **Training Framework**: PyTorch + Hugging Face Transformers | |
| - **Training Duration**: ~0.35 epochs over 100k molecules | |
| - **Final Evaluation Loss**: **1.57** | |
| - **Model Type**: `RobertaForMaskedLM` | |
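To put "~0.35 epochs" in concrete terms, the number of optimizer steps can be estimated from the batch size of 8 per device listed under Training Setup. The sketch below assumes a single device, no gradient accumulation, and that all 100,000 molecules were used for training. None of these is stated in the model card, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, epochs, per_device_batch_size,
                    num_devices=1, grad_accum_steps=1):
    """Estimate how many optimizer updates a training run performs."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return round(epochs * steps_per_epoch)


# Assumed: 1 device, no accumulation, full 100k corpus used for training.
print(optimizer_steps(100_000, 0.35, 8))  # 12,500 steps/epoch * 0.35 = 4375
```

Under these assumptions the run corresponds to roughly 4,375 updates. More devices or gradient accumulation would reduce that number proportionally.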
| --- | |
| ## Training Setup | |
| | Parameter | Value | | |
| |---------------------------|--------------------------------------------| | |
| | Dataset | 100,000 DeepSMILES from ZINC | | |
| | Tokenizer Type | Byte-Pair Encoding (BPE) | | |
| | Vocabulary Size | 767 | | |
| | Masking Ratio | 15% | | |
| | Optimizer | AdamW | | |
| | Batch Size | 8 per device | | |
| | Training Duration | ~0.35 epochs | |
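The 15% masking ratio can be illustrated with a small, dependency-free sketch of BERT-style masking. The model card does not say how masked positions were corrupted. The 80/10/10 split below (mask token, random token, unchanged) is the standard BERT recipe and the default in Hugging Face's `DataCollatorForLanguageModeling`, so treat it as an assumption. `mask_tokens` is a hypothetical helper.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    Labels are -100 at positions that were not selected, which PyTorch's
    cross-entropy loss ignores by default.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select ~15% of the remaining positions.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id        # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place
    return inputs, labels


if __name__ == "__main__":
    ids = [0, 12, 57, 301, 44, 12, 2]  # made-up token ids for illustration
    print(mask_tokens(ids, mask_id=4, vocab_size=767, special_ids={0, 2}, seed=1))
```

The loss is computed only at the selected positions, so each batch supervises about 15% of its tokens.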
| ## Intended Uses | |
| - **Unsupervised molecular representation learning** | |
| Use the pretrained model to encode molecules into contextual embeddings. These embeddings capture chemical structure and substructure information and can be used as input to other models. | |
| - **Fine-tuning on downstream property classification tasks** | |
| Examples include: | |
| - Toxicity prediction (e.g., Tox21, ToxCast) | |
| - Blood-brain barrier permeability (e.g., BBBP, B3DB) | |
| - Binding affinity or bioactivity prediction (e.g., BACE, HIV) | |
| To use the model for these tasks, add a classification head and fine-tune on labeled datasets. | |
| - **Masked token prediction or augmentation** | |
| The MLM objective enables: | |
| - Predicting missing or corrupted substructures in DeepSMILES strings | |
| - Proposing perturbed DeepSMILES strings for data augmentation (validate each result, since predictions are not guaranteed to be chemically valid) | |
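Masked token prediction can be tried with the Transformers `fill-mask` pipeline. The sketch below is illustrative only: `mask_span` and `predict_masked` are hypothetical helpers, and the mask token is assumed to be `<mask>` (the RoBERTa default; check `tokenizer.mask_token` for this model). The `transformers` import is deferred into the function so it is only needed when predictions are requested, which also downloads the model.

```python
MODEL_ID = "aakothari/DeepBERTa_zinc_base_100k_v4"


def mask_span(text, start, end, mask_token="<mask>"):
    """Replace text[start:end] with a single mask token."""
    if not 0 <= start < end <= len(text):
        raise ValueError("span must satisfy 0 <= start < end <= len(text)")
    return text[:start] + mask_token + text[end:]


def predict_masked(masked_text, model_id=MODEL_ID, top_k=5):
    """Return (token, score) candidates for the masked position."""
    from transformers import pipeline  # deferred: heavy optional dependency

    fill = pipeline("fill-mask", model=model_id)
    return [(p["token_str"], p["score"]) for p in fill(masked_text, top_k=top_k)]


if __name__ == "__main__":
    # Mask the third character of a DeepSMILES string for benzene.
    masked = mask_span("cccccc6", 2, 3)
    print(masked)
    # print(predict_masked(masked))  # uncomment to query the model
```

Each mask is filled with one BPE token, so a completed string is not guaranteed to decode to a valid molecule. Converting it back to SMILES and validating it is a sensible follow-up step.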
| ### This model is **not** intended for: | |
| - **De novo molecular generation** | |
| Masked Language Models are bidirectional and do not generate sequences autoregressively. For generation, use models like: | |
| - RNN/LSTM-based generators | |
| - Variational Autoencoders (e.g., Junction Tree VAE) | |
| - **End-to-end property prediction without fine-tuning** | |
| This model does not come with a classification head. To predict properties, it must be fine-tuned on a task-specific dataset. | |
| - **Direct use with raw SMILES (without converting to DeepSMILES)** | |
| This version is trained on **DeepSMILES**, which differs from canonical SMILES. Preprocessing is required to convert SMILES → DeepSMILES. | |
| --- | |
| ## Example Usage (fragment prediction) | |
| ```python | |
| import torch | |
| from torch.utils.data import DataLoader | |
| from transformers import RobertaForMaskedLM, RobertaTokenizer, get_linear_schedule_with_warmup | |
| # Placeholder hyperparameters (illustrative values, not from the original run) | |
| BATCH_SIZE = 8 | |
| NUM_EPOCHS = 3 | |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") | |
| # Load model and tokenizer from the Hugging Face Hub | |
| model_path = "aakothari/DeepBERTa_zinc_base_100k_v4" | |
| model = RobertaForMaskedLM.from_pretrained(model_path).to(device) | |
| tokenizer = RobertaTokenizer.from_pretrained(model_path) | |
| # Preprocess data | |
| # DeepSMILESDataset, train_df, and collate are user-defined and not shown here | |
| train_ds = DeepSMILESDataset(train_df, tokenizer) | |
| train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate) | |
| # Optimizer and scheduler (linear decay after a 10% warmup) | |
| total_steps = len(train_loader) * NUM_EPOCHS | |
| optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5) | |
| scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps) | |
| # Training | |
| model.train() | |
| for epoch in range(NUM_EPOCHS): | |
|     for batch in train_loader: | |
|         optimizer.zero_grad() | |
|         # Drop the raw-string field; the model only accepts tensor inputs | |
|         inputs = {k: v.to(device) for k, v in batch.items() if k != "actual_fragment_deepsmiles"} | |
|         outputs = model(**inputs)  # batch must include "labels" for loss to be set | |
|         loss = outputs.loss | |
|         loss.backward() | |
|         optimizer.step() | |
|         scheduler.step() | |