---
pipeline_tag: fill-mask
library_name: transformers
base_model: roberta
tags:
- chemistry
- Transformers
- roberta
- ChemBERTa
- PyTorch
- MLM
datasets:
- zinc
license: mit
language:
- en
metrics:
- roc_auc
---

Note: DeepBERTa is not affiliated with Microsoft's DeBERTa model. The name is inspired by BERT-based models and refers to a DeepSMILES-trained RoBERTa variant.

# DeepBERTa_zinc_base_100k_v4

A pretrained RoBERTa-style transformer model for learning molecular representations from **DeepSMILES**-encoded chemical structures. This model was trained on a corpus of 100,000 canonicalized molecules sampled from **ZINC**, using a BPE tokenizer with a vocabulary size of 767. It was trained similarly to ChemBERTa, but on DeepSMILES instead of SMILES.

---

# Preprocessing

All molecules were converted from SMILES to DeepSMILES before tokenization. DeepSMILES is a molecular notation derived from SMILES that replaces paired ring-closure digits with a single ring-size symbol and paired branch parentheses with closing parentheses only. This removes the need to match symbols across a string, which makes the notation better suited to NLP approaches.

## Model Overview

- **Architecture**: RoBERTa-style encoder (6 Transformer layers, 12 attention heads)
- **Objective**: Masked Language Modeling (MLM)
- **Input Format**: DeepSMILES strings (tokenized using a custom BPE tokenizer)
- **Training Framework**: PyTorch + Hugging Face Transformers
- **Training Duration**: ~0.35 epochs on 100k molecules
- **Final Evaluation Loss**: **1.57**
- **Model Type**: `RobertaForMaskedLM`

---

## Training Setup

| Parameter         | Value                        |
|-------------------|------------------------------|
| Dataset           | 100,000 DeepSMILES from ZINC |
| Tokenizer Type    | Byte-Pair Encoding (BPE)     |
| Vocabulary Size   | 767                          |
| Masking Ratio     | 15%                          |
| Optimizer         | AdamW                        |
| Batch Size        | 8 per device                 |
| Training Duration | ~0.35 epochs                 |

## Intended Usages

- **Unsupervised molecular representation learning**
  Use the pretrained model to encode molecules into contextual embeddings.
  These embeddings capture chemical structure and substructure information and can be used as input to other models.

- **Fine-tuning on downstream property classification tasks**
  Examples include:
  - Toxicity prediction (e.g., Tox21, ToxCast)
  - Blood-brain barrier permeability (e.g., BBBP, B3DB)
  - Binding affinity or bioactivity prediction (e.g., BACE, HIV)

  To use the model for these tasks, add a classification head and fine-tune on labeled datasets.

- **Masked token prediction or augmentation**
  The MLM objective enables:
  - Predicting missing or corrupted substructures in DeepSMILES strings
  - Proposing chemically plausible perturbations for data augmentation

### This model is **not** intended for:

- **De novo molecular generation**
  Masked Language Models are bidirectional and do not generate sequences autoregressively. For generation, use models like:
  - RNN/LSTM-based generators
  - Variational Autoencoders (e.g., Junction Tree VAE)

- **End-to-end property prediction without fine-tuning**
  This model does not come with a classification head. To predict properties, it must be fine-tuned on a task-specific dataset.

- **Direct use with raw SMILES (without converting to DeepSMILES)**
  This version is trained on **DeepSMILES**, which differs from canonical SMILES. Input SMILES must first be converted (SMILES → DeepSMILES).
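For the masked-token use case listed under the intended usages, the sketch below shows how a token sequence might be corrupted at the 15% masking ratio before asking the model to fill in the gaps. It is a simplified illustration, not the procedure used to train this model: `mask_tokens` and `MASK_TOKEN` are hypothetical names, the input is an example DeepSMILES-style string split character by character rather than with the model's BPE tokenizer, and every selected position is replaced by `<mask>`. Standard MLM collators such as `DataCollatorForLanguageModeling` in Transformers operate on token IDs and also apply random-replacement and keep-original cases.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token string


def mask_tokens(tokens, ratio=0.15, seed=None):
    """Replace roughly `ratio` of the tokens with the mask token.

    Returns the corrupted token list and a dict mapping each masked
    position to its original token (the prediction target).
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so short strings still yield a target.
    n_mask = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK_TOKEN
    return corrupted, targets


# Character-level split, for illustration only; the model's BPE
# tokenizer would segment the string differently.
tokens = list("cccccC=O)Cl))c6")
corrupted, targets = mask_tokens(tokens, ratio=0.15, seed=0)
print("".join(corrupted))  # corrupted string to pass to a fill-mask model
print(targets)             # masked position -> original token
```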

---

## Example Usage (fragment prediction)

```python
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizer,
    get_linear_schedule_with_warmup,
)

# Placeholder settings (assumed values; adjust for your task and hardware)
BATCH_SIZE = 8
NUM_EPOCHS = 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer from Hugging Face
model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"
model = RobertaForMaskedLM.from_pretrained(model_path).to(device)
tokenizer = RobertaTokenizer.from_pretrained(model_path)

# Preprocess data
# DeepSMILESDataset, collate, and train_df are user-defined and not shown here.
train_ds = DeepSMILESDataset(train_df, tokenizer)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)

# Optimizer and scheduler (linear decay after a 10% warmup)
total_steps = len(train_loader) * NUM_EPOCHS
optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

# Training
model.train()
for epoch in range(NUM_EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        # Drop the reference fragment string; the model only accepts tensor inputs
        inputs = {k: v.to(device) for k, v in batch.items() if k != "actual_fragment_deepsmiles"}
        outputs = model(**inputs)
        loss = outputs.loss  # only set when the batch includes "labels"
        loss.backward()
        optimizer.step()
        scheduler.step()