---
language:
- tr
license: mit
tags:
- roberta
- masked-language-modeling
- turkish
- encoder
- fairseq
- huggingface
pipeline_tag: fill-mask
---

# SindBERT: Charting the Seas of Turkish NLP

**SindBERT** is a family of RoBERTa-based Turkish language models pre-trained from scratch on ~312 GB of Turkish text from mC4, OSCAR23, and Wikipedia. The models aim to provide strong downstream performance for Turkish NLP and an openly available large-scale encoder for the community.

We release two variants:

- `SindBERT-base`: 126M parameters (fp32)
- `SindBERT-large`: 357M parameters (fp32)

## Model Details

| Detail             | SindBERT-base                             | SindBERT-large            |
| ------------------ | ----------------------------------------- | ------------------------- |
| Architecture       | RoBERTa-base                              | RoBERTa-large             |
| Parameters         | ~126M                                     | ~357M                     |
| Tokenizer          | GPT-2 style byte-level BPE (52,009 vocab) | Same                      |
| Pretraining corpus | Turkish mC4, OSCAR23, Wikipedia (~312 GB) | Same                      |
| Objective          | Masked Language Modeling                  | Same                      |
| Training time      | ~29.2 hours (TPUv4-128 pod)               | ~6.0 days (TPUv4-128 pod) |
| Precision          | fp32                                      | fp32                      |
| Framework          | fairseq                                   | fairseq                   |

## Downstream Evaluation

We evaluate SindBERT on four Turkish benchmarks:

- PoS tagging (Turkish UD concat): micro-F1
- NER (WikiANN TR): micro-F1
- Offensive language detection (OffensEval-TR 2020): macro-F1
- Linguistic acceptability (TurBLiMP): average accuracy (16 phenomena)

## 🧪 Evaluation Results

**Legend**: **bold = best**, *italic = second-best* per model size.
| Model           | PoS       | NER       | OffensEval-TR 2020 | AVG core  | TurBLiMP AVG | AVG all   |
| --------------- | --------: | --------: | -----------------: | --------: | -----------: | --------: |
| **Large models**|           |           |                    |           |              |           |
| SindBERT_large  | **94.63** | *93.64*   | **82.29**          | 90.19     | 89.8         | 90.09     |
| XLM-R_large     | *94.39*   | **94.44** | *81.99*            | **90.27** | **92.7**     | **90.73** |
| EuroBERT_610M   | 93.33     | 91.85     | 75.57              | 86.92     | *90.0*       | 87.84     |
| **Base models** |           |           |                    |           |              |           |
| ELECTRA_small   | 94.28     | 91.92     | 78.17              | 88.12     | 80.6         | 86.24     |
| DistilBERTurk   | 94.01     | 91.54     | 79.19              | 88.25     | 87.2         | 87.99     |
| ConvBERTurk     | 94.41     | *94.03*   | **81.99**          | **90.14** | 60.8         | 82.81     |
| ConvBERTurk_mC4 | **94.57** | 93.56     | *81.90*            | *90.01*   | 55.5         | 81.38     |
| ELECTRA_base    | 94.29     | 93.49     | 81.54              | 89.77     | 89.9         | 89.81     |
| ELECTRA_mC4     | 94.40     | 93.43     | 81.38              | 89.74     | 89.9         | 89.78     |
| BERTurk_32k     | 93.16     | **94.38** | 81.03              | 89.52     | *93.8*       | *90.59*   |
| RoBERTurk       | 87.99     | 81.09     | 70.01              | 79.70     | -            | -         |
| SindBERT_base   | *94.47*   | 93.19     | 81.14              | 89.60     | 90.3         | 89.78     |
| mmBERT_small    | 93.75     | 92.51     | 77.28              | 87.85     | 85.1         | 87.16     |
| BERTurk_128k    | 94.44     | 93.81     | 81.77              | *90.01*   | **95.1**     | **91.28** |
| EuroBERT_210M   | 92.97     | 90.91     | 75.73              | 86.54     | 86.3         | 86.48     |
| XLM-R_base      | 94.23     | 92.90     | 79.77              | 88.97     | 89.2         | 89.03     |
| mmBERT_base     | 93.75     | 93.35     | 78.49              | 88.53     | 89.3         | 88.72     |

## Fairseq Checkpoint

Get the fairseq checkpoint [here](https://drive.proton.me/urls/RK0X3H3V8W#hyHSRSVRJzpN).

## Citation

If you use SindBERT in your research, please cite the following paper:

```bibtex
@misc{scheibleschmitt2025sindbertsailorchartingseas,
  title={SindBERT, the Sailor: Charting the Seas of Turkish NLP},
  author={Raphael Scheible-Schmitt and Stefan Schweter},
  year={2025},
  eprint={2510.21364},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.21364},
}
```

## 📜 License

MIT License
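As a sanity check on the parameter counts in the Model Details table, the figures can be reproduced from the card's vocabulary size (52,009) together with the standard RoBERTa-base/large hyperparameters (hidden size 768/1024, 12/24 layers, FFN size 3072/4096, 514 position embeddings). Note that the hyperparameters and the inclusion of the MLM head are assumptions, not statements from this card:

```python
def roberta_param_count(vocab_size, hidden, layers, ffn, max_pos=514, type_vocab=1):
    """Approximate parameter count for a RoBERTa-style MLM encoder."""
    # Embeddings: token + position + token-type tables, plus one LayerNorm.
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections, each hidden x hidden + bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: hidden -> ffn -> hidden, with biases.
    feed_forward = 2 * hidden * ffn + ffn + hidden
    # Two LayerNorms per layer (after attention and after the FFN).
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    # MLM head: dense projection + LayerNorm + output bias; the output
    # projection weight is tied to the token embeddings, so it is not re-counted.
    lm_head = hidden * hidden + hidden + 2 * hidden + vocab_size
    return embeddings + layers * per_layer + lm_head

base = roberta_param_count(vocab_size=52_009, hidden=768, layers=12, ffn=3072)
large = roberta_param_count(vocab_size=52_009, hidden=1024, layers=24, ffn=4096)
print(f"base:  ~{base / 1e6:.1f}M")   # ~126.0M, matching the card's ~126M
print(f"large: ~{large / 1e6:.1f}M")  # ~357.2M, matching the card's ~357M
```

Under these assumptions the totals round to the ~126M and ~357M reported above.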