	---
	pipeline_tag: fill-mask
	library_name: transformers
	base_model: roberta
	tags:
	- chemistry
	- Transformers
	- roberta
	- ChemBERTa
	- PyTorch
	- MLM
	datasets:
	- zinc
	license: mit
	language:
	- en
	metrics:
	- roc_auc
	---

	Note: DeepBERTa is not affiliated with Microsoft's DeBERTa model. The name is inspired by BERT-based models and refers to a DeepSMILES-trained RoBERTa variant.


	# DeepBERTa_zinc_base_100k_v4

	A pretrained RoBERTa-style transformer model for learning molecular representations from DeepSMILES-encoded chemical structures.
	This model was trained on a corpus of 100,000 canonicalized molecules sampled from ZINC, using a BPE tokenizer with a vocabulary size of 767.
	It was trained similarly to ChemBERTa, but on DeepSMILES instead of SMILES.

	---

	## Preprocessing
	All molecules were converted from SMILES to DeepSMILES before tokenization.
	DeepSMILES is a SMILES variant that replaces paired ring-closure digits with a single ring-size digit and paired branch parentheses with closing parentheses only. Removing these matched pairs avoids a common class of syntax errors and makes the notation better suited to NLP approaches.
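
The conversion step can be sketched with the open-source `deepsmiles` Python package (`pip install deepsmiles`). This is a minimal sketch: the card does not say which tool or options were used, so the package choice and the `rings=True, branches=True` settings are assumptions, and `smiles_to_deepsmiles` is a hypothetical helper name.

```python
def smiles_to_deepsmiles(smiles_list):
    """Encode SMILES strings as DeepSMILES.

    Assumption: both ring and branch rewriting are enabled; the original
    preprocessing may have used different options.
    """
    # Imported inside the function so the helper can be defined even when
    # the optional dependency is not installed yet.
    import deepsmiles

    converter = deepsmiles.Converter(rings=True, branches=True)
    return [converter.encode(smiles) for smiles in smiles_list]


# Example call (requires the `deepsmiles` package):
# smiles_to_deepsmiles(["c1ccccc1", "CC(=O)O"])
```

If the input SMILES are not already canonical, canonicalize them first so that the same molecule always maps to the same string.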

	## Model Overview

	- Architecture: RoBERTa-style encoder (6 Transformer layers, 12 attention heads)
	- Objective: Masked Language Modeling (MLM)
	- Input Format: DeepSMILES strings (tokenized using custom BPE)
	- Training Framework: PyTorch + Hugging Face Transformers
	- Training Duration: ~0.35 epochs on 100k molecules
	- Final Evaluation Loss: 1.57
	- Model Type: `RobertaForMaskedLM`
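
To make the MLM objective concrete, here is a minimal pure-Python sketch of token masking at the 15% ratio listed under Training Setup. It assumes the standard BERT/RoBERTa recipe (of the selected tokens, 80% become the mask token, 10% a random token, 10% stay unchanged); the card does not confirm that split. `mask_tokens`, `mask_id`, and `special_ids` are hypothetical names, and the real ids depend on the model's tokenizer.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels hold the original id at selected positions and -100 elsewhere,
    the index that PyTorch's cross-entropy loss ignores by default.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select the rest with probability p.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token
    return inputs, labels
```

In practice, `DataCollatorForLanguageModeling` from Transformers applies this kind of masking on the fly for each batch.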

	---

	## Training Setup

	| Parameter         | Value                                |
	|-------------------|--------------------------------------|
	| Dataset           | 100,000 DeepSMILES strings from ZINC |
	| Tokenizer Type    | Byte-Pair Encoding (BPE)             |
	| Vocabulary Size   | 767                                  |
	| Masking Ratio     | 15%                                  |
	| Optimizer         | AdamW                                |
	| Batch Size        | 8 per device                         |
	| Training Duration | ~0.35 epochs                         |
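
The table gives the training length in epochs, not optimizer steps. A rough conversion is sketched below. It assumes a single device, no gradient accumulation, and that all 100,000 molecules were used for training; the card states none of these, so treat the result as an order-of-magnitude estimate. `estimate_optimizer_steps` is a hypothetical helper.

```python
def estimate_optimizer_steps(num_examples, epochs, per_device_batch_size,
                             num_devices=1, grad_accum_steps=1):
    """Approximate number of optimizer updates for a (fractional) epoch count."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return round(num_examples * epochs / effective_batch)


# 100,000 molecules, ~0.35 epochs, batch size 8 on one device (assumed)
print(estimate_optimizer_steps(100_000, 0.35, 8))  # 4375
```

With more devices or gradient accumulation, the step count shrinks in proportion to the effective batch size.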


	## Intended Uses

	- Unsupervised molecular representation learning

	  Use the pretrained model to encode molecules into contextual embeddings. These embeddings capture chemical structure and substructure information and can be used as input to other models.

	- Fine-tuning on downstream property classification tasks

	  Examples include:
	  - Toxicity prediction (e.g., Tox21, ToxCast)
	  - Blood-brain barrier permeability (e.g., BBBP, B3DB)
	  - Binding affinity or bioactivity prediction (e.g., BACE, HIV)

	  To use the model for these tasks, add a classification head and fine-tune on labeled datasets.

	- Masked token prediction or augmentation

	  The MLM objective enables:
	  - Predicting missing or corrupted substructures in DeepSMILES strings
	  - Proposing candidate perturbations for data augmentation (decode and validate each one, since predictions are not guaranteed to be chemically valid)
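
For the representation-learning use case, one common way to turn per-token hidden states into a single vector per molecule is mean pooling over non-padding tokens. The sketch below assumes that pooling strategy; the card does not prescribe one. `mean_pool` and `embed` are hypothetical helpers, and `model_path` should be the Hub repo id of the checkpoint you use.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    last_hidden_state: float tensor of shape (batch, seq_len, hidden)
    attention_mask:    0/1 tensor of shape (batch, seq_len)
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against all-padding rows
    return summed / counts


def embed(deepsmiles_strings, model_path):
    """Encode DeepSMILES strings into fixed-size embeddings."""
    # Imported here so that mean_pool stays usable on its own.
    import torch
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(model_path)
    # Loads only the encoder; the MLM head weights in the checkpoint are skipped.
    model = RobertaModel.from_pretrained(model_path).eval()
    batch = tokenizer(deepsmiles_strings, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden, batch["attention_mask"])
```

The resulting tensor has one row per input string and can be fed to a downstream classifier or used for similarity search.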


	### This model is not intended for:

	- De novo molecular generation

	  Masked language models are bidirectional and do not generate sequences autoregressively. For generation, use models such as:
	  - RNN/LSTM-based generators
	  - Variational autoencoders (e.g., Junction Tree VAE)

	- End-to-end property prediction without fine-tuning

	  This model does not come with a classification head. To predict properties, it must be fine-tuned on a task-specific dataset.

	- Direct use with raw SMILES (without converting to DeepSMILES)

	  This version is trained on DeepSMILES, whose syntax differs from SMILES. Convert SMILES to DeepSMILES before tokenization.

	---
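
The card's metadata names ROC-AUC as the metric, which fits the binary classification tasks listed above once a classification head has been fine-tuned. As a reminder of what that number measures, here is a small dependency-free sketch: ROC-AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. `roc_auc` is a hypothetical helper; for real evaluations, `sklearn.metrics.roc_auc_score` is the usual choice.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC for binary labels (0/1) via pairwise comparison.

    Quadratic in the number of examples; fine for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```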

	## Example Usage (fragment prediction)

	```python
	import torch
	from torch.utils.data import DataLoader
	from transformers import (
	    RobertaForMaskedLM,
	    RobertaTokenizer,
	    get_linear_schedule_with_warmup,
	)

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Placeholder hyperparameters; set these for your task
	BATCH_SIZE = 8
	NUM_EPOCHS = 3

	# Load model and tokenizer from the Hugging Face Hub
	# (replace with the repo id of the checkpoint version you intend to use)
	model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"
	model = RobertaForMaskedLM.from_pretrained(model_path).to(device)
	tokenizer = RobertaTokenizer.from_pretrained(model_path)

	# Preprocess data
	# DeepSMILESDataset, collate, and train_df are user-defined and not shown here.
	# Each batch must contain tensors accepted by the model (including `labels`,
	# so that a loss is returned); the "actual_fragment_deepsmiles" field is
	# dropped before the forward pass.
	train_ds = DeepSMILESDataset(train_df, tokenizer)
	train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)

	# Optimizer and scheduler: linear decay with 10% warmup
	total_steps = len(train_loader) * NUM_EPOCHS
	optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5)
	scheduler = get_linear_schedule_with_warmup(
	    optimizer,
	    num_warmup_steps=int(0.1 * total_steps),
	    num_training_steps=total_steps,
	)

	# Training
	model.train()
	for epoch in range(NUM_EPOCHS):
	    for batch in train_loader:
	        optimizer.zero_grad()
	        inputs = {k: v.to(device) for k, v in batch.items() if k != "actual_fragment_deepsmiles"}
	        outputs = model(**inputs)
	        loss = outputs.loss

	        loss.backward()
	        optimizer.step()
	        scheduler.step()
	```