---
license: cc-by-4.0
datasets:
  - DSL-13-SRMAP/Telugu-Dataset
language:
  - te
tags:
  - sentiment-analysis
  - text-classification
  - telugu
  - multilingual
  - xlm-roberta
  - baseline
base_model: xlm-roberta-base
pipeline_tag: text-classification
metrics:
  - accuracy
  - f1
  - auroc
---

# XLM-R_WOR

## Model Description

**XLM-R_WOR** is a Telugu sentiment classification model built on **XLM-RoBERTa (XLM-R)**, a large-scale multilingual Transformer model developed by Facebook AI. XLM-R is designed to enhance cross-lingual understanding by leveraging a substantially larger and more diverse pretraining corpus than mBERT.

The base model is pretrained on approximately **2.5 TB of filtered Common Crawl data** covering **100+ languages**, including Telugu. Unlike mBERT, XLM-R is trained **exclusively with the Masked Language Modeling (MLM) objective**, without the Next Sentence Prediction (NSP) task. This design choice enables stronger contextual representations and improved transfer learning.

The suffix **WOR** denotes **Without Rationale supervision**. This model is fine-tuned using only sentiment labels, without human-annotated rationales, and serves as a **label-only baseline**.

---

## Pretraining Details

- **Pretraining corpus:** Filtered Common Crawl (≈2.5 TB, 100+ languages)
- **Training objective:** Masked Language Modeling (MLM)
- **Next Sentence Prediction:** Not used
- **Language coverage:** Telugu included, but not exclusively targeted

---

## Training Data

- **Fine-tuning dataset:** Telugu-Dataset (`DSL-13-SRMAP/Telugu-Dataset`)
- **Task:** Sentiment classification
- **Supervision type:** Label-only (no rationale supervision)

---

## Intended Use

This model is intended for:

- Telugu sentiment classification
- Cross-lingual and multilingual NLP benchmarking
- Baseline comparisons for explainability and rationale-supervision studies
- Low-resource Telugu NLP research

Due to its large-scale multilingual pretraining, XLM-R_WOR is particularly effective in transfer-learning scenarios where Telugu-specific labeled data is limited.

---

## Performance Characteristics

XLM-R generally provides stronger contextual modeling and better downstream performance than mBERT, owing to its larger and more diverse pretraining corpus and its exclusive focus on the MLM objective.

### Strengths

- Strong cross-lingual transfer learning
- Improved contextual representations over mBERT
- Reliable baseline for multilingual sentiment analysis

### Limitations

- Not explicitly optimized for Telugu morphology or syntax
- May underperform compared to Telugu-specialized models such as MuRIL or L3Cube-Telugu-BERT
- Limited ability to capture fine-grained cultural and regional linguistic nuances

---

## Use as a Baseline

**XLM-R_WOR** serves as a robust and widely accepted baseline for:

- Comparing multilingual models against Telugu-specialized architectures
- Evaluating the impact of rationale supervision (WOR vs. WR)
- Benchmarking sentiment classification performance in low-resource Telugu settings

---

## References

- Conneau et al., 2019
- Hedderich et al., 2021
- Kulkarni et al., 2021
- Joshi, 2022
- Das et al., 2022
- Rajalakshmi et al., 2023
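
---

## Example Usage

A minimal inference sketch, assuming the checkpoint follows the standard `transformers` sequence-classification format and is published under a Hub id such as `DSL-13-SRMAP/XLM-R_WOR` (a placeholder, not confirmed by this card); the label names come from the `id2label` mapping stored in the fine-tuned configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder repository id -- replace with the actual checkpoint path.
MODEL_ID = "DSL-13-SRMAP/XLM-R_WOR"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Example Telugu sentence ("This movie is very good").
text = "ఈ సినిమా చాలా బాగుంది"

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and pick the most likely sentiment label.
probs = torch.softmax(logits, dim=-1)
pred_id = int(probs.argmax(dim=-1))
print(model.config.id2label.get(pred_id, pred_id), probs.squeeze().tolist())
```

Because XLM-R_WOR is fine-tuned with labels only, no rationale or token-level highlighting is produced at inference time; the model simply returns a sentiment class distribution.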