BERTs that chat: turn any BERT into a chatbot with dLLM
TLDR: With a small amount of open-source instruction-following data and a diffusion objective, a standard BERT can gain conversational ability. Specifically, a finetuned ModernBERT-large performs close to Qwen1.5-0.5B, a model of comparable size, on multiple benchmarks. In addition, we open-source dLLM, a unified framework for diffusion language models that supports training, inference, and evaluation, and that fully powers the reproduction of ModernBERT-Chat.
Why Diffusion LMs Need Better Tools
Despite growing interest in diffusion language models (DLMs), progress remains limited by two practical obstacles:
- Lack of an accessible, unified framework—most existing implementations are scattered, inconsistent, and hard for newcomers to run or extend; and
- High compute cost of training or reproducing many DLM pipelines, which raises the barrier for experimentation.
To address these issues, we provide two complementary solutions:
- dLLM, an open-source, all-in-one framework that standardizes training, inference, and evaluation for DLMs; and
- ModernBERT-Chat, a minimal yet fully functional “Hello World” example demonstrating that a practical DLM can be trained with modest resources.
dLLM: the tool behind and beyond BERT Chat
dLLM is the foundation behind the BERT-Chat experiments and fully supports end-to-end reproduction.
It serves as a general, open framework for building, training, and evaluating DLMs.
- Compatible with mainstream discrete DLMs (LLaDA, Dream, RND, etc.), enabling easy reproduction of experiments.
- Provides open implementations of algorithms that previously lacked public code (e.g., Edit Flows).
- Designed for extensibility, serving as a practical foundation for future DLM research.
A "Hello World" example for training DLM
ModernBERT-Chat turns a standard BERT into a chat-capable model using only supervised finetuning under the diffusion framework. The entire pipeline runs end-to-end, and training can be completed on a single GPU.
- A minimal experiment using SFT only to give BERT generative capability, with no generative pretraining required.
- ModernBERT-large-chat-v0 (0.4B) performs close to Qwen1.5-0.5B across several benchmarks.
- Shows that SFT on instruction–response pairs alone is enough for BERT to generalize to new prompts, without generative pretraining.
- Fully open-source: report, model checkpoints and ready-to-run scripts.
Why ModernBERT?
BERT is pretrained with Masked Language Modeling (MLM), where only a small fraction of tokens (typically 15–30%) are masked and predicted. This objective teaches BERT to fill in blanks but does not expose it to the full spectrum of masking patterns required for genuine text generation. In particular, an encoder trained only at low mask ratios never learns how to generate sequences from scratch or iteratively denoise a fully masked input.
To adapt BERT into a diffusion language model, we therefore need to train it across the entire range of mask rates (0–100%), enabling the model to refine heavily corrupted inputs and convert mask tokens into text step by step during inference.
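For intuition, the snippet below is a minimal sketch of what such a training step can look like under an MDLM-style objective: sample a mask ratio uniformly from (0, 1], mask that fraction of tokens, and compute cross-entropy only on the masked positions, reweighted by the mask ratio. The function and argument names (`model`, `mask_token_id`, `pad_token_id`) are illustrative placeholders, not dLLM's actual API.

```python
# Minimal sketch of an MDLM-style training step (illustrative, not dLLM's actual code).
# Assumes `model` is any HF-style masked LM (e.g. a ModernBERT) returning token logits,
# and that `mask_token_id` / `pad_token_id` come from its tokenizer.
import torch
import torch.nn.functional as F

def mdlm_training_step(model, input_ids, mask_token_id, pad_token_id):
    """One masked-diffusion training step covering the full 0-100% mask-ratio range."""
    B, L = input_ids.shape
    # Sample a mask ratio t ~ U(0, 1] per sequence (plain MLM would fix this around 0.15-0.3).
    t = torch.rand(B, 1, device=input_ids.device).clamp_min(1e-3)
    # Independently mask each non-padding token with probability t.
    is_target = torch.rand(B, L, device=input_ids.device) < t
    is_target &= input_ids != pad_token_id
    noisy_ids = torch.where(is_target, torch.full_like(input_ids, mask_token_id), input_ids)

    logits = model(input_ids=noisy_ids).logits  # (B, L, vocab)
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(B, L)
    # Cross-entropy on masked positions only, with the usual 1/t MDLM reweighting.
    loss = ((per_token * is_target) / t).sum() / is_target.sum().clamp_min(1)
    return loss
```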
As a first step, we conducted continual generative pretraining with MDLM on the Wikitext-103-v1 corpus. As shown in the panels below, ModernBERT achieved the lowest training loss among the candidate encoder backbones, indicating that its architectural improvements and extended context window make it a strong foundation for diffusion-based generation.
Do We Really Need Pretraining?
After establishing ModernBERT as a strong backbone, we extended the generative pretraining stage to a larger corpus (OpenWebText). However, the MDLM loss showed little improvement, suggesting that ModernBERT’s original MLM pretraining already equips the model with substantial linguistic and world knowledge. Continual MDLM pretraining on similar text distributions therefore yields diminishing returns.
This observation raised a natural question: Is generative pretraining even necessary for enabling diffusion-based generation in BERT? To test this, we directly applied SFT using a small instruction-following dataset (Alpaca) on three ModernBERT-large checkpoints: (1) the original untuned ModernBERT-large, (2) the version continually pretrained on Wikitext-103-v1, and (3) the version continually pretrained on OpenWebText.
Although models (2) and (3) started with slightly lower SFT loss, all three converged to nearly identical training and evaluation performance. This indicates that ModernBERT’s MLM pretraining already captures enough knowledge for diffusion SFT to activate generative capability, and that additional MDLM pretraining provides little practical benefit.
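For concreteness, the diffusion SFT step reuses the same masked-diffusion objective as pretraining, with one change that we assume follows the common LLaDA-style recipe: prompt tokens are never masked, and the loss is computed only on masked response tokens. A minimal sketch (again with placeholder names, not dLLM's exact implementation):

```python
# Minimal sketch of diffusion SFT masking (assumed LLaDA-style recipe, not dLLM's exact code).
# `prompt_mask` is a boolean tensor that is True for prompt tokens, which always stay visible.
import torch
import torch.nn.functional as F

def diffusion_sft_step(model, input_ids, prompt_mask, mask_token_id):
    B, L = input_ids.shape
    t = torch.rand(B, 1, device=input_ids.device).clamp_min(1e-3)
    # Mask response tokens with probability t; never corrupt the prompt.
    is_target = (torch.rand(B, L, device=input_ids.device) < t) & ~prompt_mask
    noisy_ids = torch.where(is_target, torch.full_like(input_ids, mask_token_id), input_ids)

    logits = model(input_ids=noisy_ids).logits
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(B, L)
    # Loss only on masked response tokens, with the same 1/t reweighting as pretraining.
    return ((per_token * is_target) / t).sum() / is_target.sum().clamp_min(1)
```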
Training Recipe for ModernBERT-Chat
To strengthen the diffusion SFT stage, we scaled up the instruction-tuning dataset by combining tulu-3-sft-mixture with smoltalk. Using this enlarged corpus, we trained both ModernBERT-base and ModernBERT-large under the unified diffusion SFT pipeline. This produced the final checkpoints introduced earlier: ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0.
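As an illustration, the two instruction datasets can be combined with the Hugging Face `datasets` library. The snippet below assumes the public `allenai/tulu-3-sft-mixture` and `HuggingFaceTB/smoltalk` repositories and a simple concatenate-and-shuffle mix; the exact recipe used for ModernBERT-Chat is defined by the released scripts.

```python
# Illustrative data mixing with Hugging Face `datasets`; dataset ids, config name, and the
# concatenate-and-shuffle strategy are assumptions, not the exact ModernBERT-Chat recipe.
from datasets import load_dataset, concatenate_datasets

tulu = load_dataset("allenai/tulu-3-sft-mixture", split="train")
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

# Both corpora store conversations in a `messages` column, so we keep just that column,
# concatenate, and shuffle before applying the chat template and tokenizing.
sft_mix = concatenate_datasets(
    [tulu.select_columns(["messages"]), smoltalk.select_columns(["messages"])]
).shuffle(seed=42)
print(f"{len(sft_mix):,} SFT conversations")
```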
Evaluation Results
We compare ModernBERT-base-chat-v0 (0.1B) and ModernBERT-large-chat-v0 (0.4B) against Qwen1.5-0.5B and Qwen1.5-0.5B-Chat.
| Model | LAMBADA | GSM8K | CEVAL-valid | BBH | Minerva-Math | MMLU | Winogrande | HellaSwag | CMMLU |
|---|---|---|---|---|---|---|---|---|---|
| ModernBERT-base-chat-v0 (evaluated) | 49.3 | 5.9 | 25.0 | 17.9 | 3.1 | 26.1 | 49.7 | 41.0 | 24.3 |
| ModernBERT-large-chat-v0 (evaluated) | 46.3 | 17.1 | 24.6 | 25.1 | 3.8 | 33.5 | 53.1 | 45.0 | 27.5 |
| Qwen1.5-0.5B (reported) | 48.6 | 22.0 | 50.5 | 18.3 | 3.1 | 39.2 | 55.0 | 48.2 | 46.6 |
| Qwen1.5-0.5B-Chat (reported) | / | 11.3 | 37.2 | / | 3.1 | 35.0 | / | / | / |
Rows marked (evaluated) were obtained with our framework; rows marked (reported) are taken from official sources: Qwen1.5-0.5B numbers come from the Qwen1.5 official blog, and Qwen1.5-0.5B-Chat numbers come from the Qwen2-0.5B-Instruct model card.
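For reference, a masked-diffusion chat model generates by iterative denoising rather than left-to-right decoding. The sketch below shows one common strategy, confidence-based unmasking as in LLaDA-style samplers: start from a fully masked response, predict every position, commit the most confident predictions, and keep the rest masked for the next step. It is illustrative only and not dLLM's generation API.

```python
# Minimal sketch of confidence-based iterative unmasking (illustrative, not dLLM's API).
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, mask_token_id, gen_len=128, steps=32):
    """prompt_ids: (1, P) tensor of prompt token ids; returns (1, gen_len) response ids."""
    device = prompt_ids.device
    # Start from the prompt followed by a fully masked response block.
    response = torch.full((1, gen_len), mask_token_id, dtype=prompt_ids.dtype, device=device)
    x = torch.cat([prompt_ids, response], dim=1)
    per_step = max(1, gen_len // steps)

    for _ in range(steps):
        still_masked = x == mask_token_id
        if not still_masked.any():
            break
        probs = model(input_ids=x).logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                # greedy prediction per position
        conf = conf.masked_fill(~still_masked, -1.0)  # only masked positions may be committed
        k = min(per_step, int(still_masked.sum()))
        top = conf.topk(k, dim=-1).indices[0]
        x[0, top] = pred[0, top]                      # unmask the k most confident positions
    return x[:, prompt_ids.size(1):]
```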
What's next
We will continue expanding the repository with new features and research directions, including:
- Transferring autoregressive models into diffusion-based ones, and
- Broader support for additional diffusion LM variants and algorithms.
We welcome contributions of any form—new model backbones, training recipes, evaluation tools, or documentation improvements. Our goal is to make dLLM an accessible, reliable platform for the entire research community, and we’re excited to build it together.


