---
language:
- tr
license: mit
tags:
- roberta
- masked-language-modeling
- turkish
- encoder
- fairseq
- huggingface
pipeline_tag: fill-mask
---

# SindBERT: Charting the Seas of Turkish NLP

**SindBERT** is a family of RoBERTa-based Turkish language models pre-trained from scratch on ~312 GB of Turkish text from mC4, OSCAR23, and Wikipedia. The models aim to provide strong downstream performance for Turkish NLP and an openly available large-scale encoder for the community.

We release two variants:

- `SindBERT-base`: 126M parameters (fp32)
- `SindBERT-large`: 357M parameters (fp32)

## Model Details

| Detail             | SindBERT-base                             | SindBERT-large            |
| ------------------ | ----------------------------------------- | ------------------------- |
| Architecture       | RoBERTa-base                              | RoBERTa-large             |
| Parameters         | ~126M                                     | ~357M                     |
| Tokenizer          | GPT-2 style byte-level BPE (52,009 vocab) | Same                      |
| Pretraining corpus | Turkish mC4, OSCAR23, Wikipedia (~312 GB) | Same                      |
| Objective          | Masked Language Modeling                  | Same                      |
| Training time      | ~29.2 hours (TPUv4-128 pod)               | ~6.0 days (TPUv4-128 pod) |
| Precision          | fp32                                      | fp32                      |
| Framework          | fairseq                                   | fairseq                   |

## Downstream Evaluation

We evaluate SindBERT on four Turkish benchmarks:

- PoS tagging (Turkish UD concat): micro-F1
- NER (WikiANN TR): micro-F1
- Offensive language detection (OffensEval-TR 2020): macro-F1
- Linguistic acceptability (TurBLiMP): average accuracy (16 phenomena)

## 🧪 Evaluation Results

**Legend**: **bold = best**, *italic = second-best* per model size.
| Model           | PoS       | NER       | OffensEval-TR 2020 | AVG core  | TurBLiMP AVG | AVG all   |
| --------------- | --------: | --------: | -----------------: | --------: | -----------: | --------: |
| **Large models**|           |           |                    |           |              |           |
| SindBERT_large  | **94.63** | *93.64*   | **82.29**          | 90.19     | 89.8         | 90.09     |
| XLM-R_large     | *94.39*   | **94.44** | *81.99*            | **90.27** | **92.7**     | **90.73** |
| EuroBERT_610M   | 93.33     | 91.85     | 75.57              | 86.92     | *90.0*       | 87.84     |
| **Base models** |           |           |                    |           |              |           |
| ELECTRA_small   | 94.28     | 91.92     | 78.17              | 88.12     | 80.6         | 86.24     |
| DistilBERTurk   | 94.01     | 91.54     | 79.19              | 88.25     | 87.2         | 87.99     |
| ConvBERTurk     | 94.41     | *94.03*   | **81.99**          | **90.14** | 60.8         | 82.81     |
| ConvBERTurk_mC4 | **94.57** | 93.56     | *81.90*            | *90.01*   | 55.5         | 81.38     |
| ELECTRA_base    | 94.29     | 93.49     | 81.54              | 89.77     | 89.9         | 89.81     |
| ELECTRA_mC4     | 94.40     | 93.43     | 81.38              | 89.74     | 89.9         | 89.78     |
| BERTurk_32k     | 93.16     | **94.38** | 81.03              | 89.52     | *93.8*       | *90.59*   |
| RoBERTurk       | 87.99     | 81.09     | 70.01              | 79.70     | -            | -         |
| SindBERT_base   | *94.47*   | 93.19     | 81.14              | 89.60     | 90.3         | 89.78     |
| mmBERT_small    | 93.75     | 92.51     | 77.28              | 87.85     | 85.1         | 87.16     |
| BERTurk_128k    | 94.44     | 93.81     | 81.77              | *90.01*   | **95.1**     | **91.28** |
| EuroBERT_210M   | 92.97     | 90.91     | 75.73              | 86.54     | 86.3         | 86.48     |
| XLM-R_base      | 94.23     | 92.90     | 79.77              | 88.97     | 89.2         | 89.03     |
| mmBERT_base     | 93.75     | 93.35     | 78.49              | 88.53     | 89.3         | 88.72     |

## Fairseq Checkpoint

Get the fairseq checkpoint [here](https://drive.proton.me/urls/RK0X3H3V8W#hyHSRSVRJzpN).

## Citation

If you use SindBERT in your research, please cite the following paper:

```bibtex
@misc{scheibleschmitt2025sindbertsailorchartingseas,
  title={SindBERT, the Sailor: Charting the Seas of Turkish NLP},
  author={Raphael Scheible-Schmitt and Stefan Schweter},
  year={2025},
  eprint={2510.21364},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.21364},
}
```

## 📜 License

MIT License
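As a sanity check on the parameter counts in the Model Details table, the figures can be reproduced from the card's vocabulary size (52,009) together with the standard RoBERTa-base/large hyperparameters (hidden size 768/1024, 12/24 layers, FFN size 3072/4096, 514 position embeddings). Note that the hyperparameters and the inclusion of the MLM head are assumptions, not statements from this card:

```python
def roberta_param_count(vocab_size, hidden, layers, ffn, max_pos=514, type_vocab=1):
    """Approximate parameter count for a RoBERTa-style MLM encoder."""
    # Embeddings: token + position + token-type tables, plus one LayerNorm.
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections, each hidden x hidden + bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: hidden -> ffn -> hidden, with biases.
    feed_forward = 2 * hidden * ffn + ffn + hidden
    # Two LayerNorms per layer (after attention and after the FFN).
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    # MLM head: dense projection + LayerNorm + output bias; the output
    # projection weight is tied to the token embeddings, so it is not re-counted.
    lm_head = hidden * hidden + hidden + 2 * hidden + vocab_size
    return embeddings + layers * per_layer + lm_head

base = roberta_param_count(vocab_size=52_009, hidden=768, layers=12, ffn=3072)
large = roberta_param_count(vocab_size=52_009, hidden=1024, layers=24, ffn=4096)
print(f"base:  ~{base / 1e6:.1f}M")   # ~126.0M, matching the card's ~126M
print(f"large: ~{large / 1e6:.1f}M")  # ~357.2M, matching the card's ~357M
```

Under these assumptions the totals round to the ~126M and ~357M reported above.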