# Model Card for Kangri-to-Hindi-Translator

## Model Details

### Model Description
This model is a fine-tuned version of ai4bharat/indictrans2-indic-indic-1B specifically adapted for Kangri to Hindi translation using Low-Rank Adaptation (LoRA). It was created to address the low-resource nature of the Kangri dialect by leveraging transfer learning from linguistically similar languages like Dogri.
- Developed by: Lovnish Verma
- Model type: Seq2Seq (Encoder-Decoder) with LoRA adapters
- Language(s): Kangri (Source), Hindi (Target)
- License: MIT
- Finetuned from model: ai4bharat/indictrans2-indic-indic-1B
### Model Sources
- Demo Notebook: Open in Google Colab
## Uses

### Direct Use
This model is designed to translate text from Kangri, a Western Pahari language spoken in Himachal Pradesh, into standard Hindi. It is particularly useful for preserving local dialects and for enabling communication between dialect speakers and the wider Hindi-speaking community.
### Out-of-Scope Use
This model was trained on a synthetic dataset generated via a Context-Free Grammar (CFG), focused on specific domains (home, market, basic actions). It may struggle with:
- Complex, idiomatic, or highly technical Kangri text.
- Dialects significantly different from the training distribution.
## How to Get Started with the Model
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# 1. Load the Base Model
base_model_id = "ai4bharat/indictrans2-indic-indic-1B"
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 2. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    trust_remote_code=True
)

# 3. Load LoRA Adapters
adapter_path = "LovnishVerma/kangri-hindi-translator"
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# 4. Inference Function
def translate_kangri(text):
    # Note: We use 'doi_Deva' (Dogri) as a proxy tag for Kangri
    input_text = f"doi_Deva hin_Deva {text}"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        # Critical: use_cache=False is required for this specific architecture
        outputs = model.generate(**inputs, max_length=128, use_cache=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test
print(translate_kangri("कुड़ी धाम खांगी"))
# Expected Output: लड़की धाम खाएगी
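To translate several sentences at once, the helper above can be batched along the following lines. This is a sketch: the `num_beams` setting is an illustrative addition rather than part of the documented recipe, while the tagging and `use_cache=False` settings mirror the code above.

```python
def translate_kangri_batch(sentences, num_beams=4):
    # Apply the same doi_Deva -> hin_Deva proxy tagging to every sentence.
    batch = [f"doi_Deva hin_Deva {s}" for s in sentences]
    inputs = tokenizer(
        batch, return_tensors="pt", padding=True, truncation=True, max_length=128
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_length=128, num_beams=num_beams, use_cache=False
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_kangri_batch(["कुड़ी धाम खांगी"]))
```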
## Training Details

### Training Data
The model was trained on a synthetic dataset (`kangri_hindi.csv`) generated using a Context-Free Grammar (CFG) approach.
- Size: ~5,000 sentence pairs.
- Content: Sentences covering common subjects (I, you, he, she), verbs (eat, drink, go, read), and tenses (past, present, future).
- Strategy: Kangri sentences were tagged with `doi_Deva` (Dogri) as a proxy due to linguistic similarity, since Kangri is not natively supported by the base model (see the sketch below).
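The generation script itself is not published with this card; the following is a minimal sketch of the CFG-style expansion described above, reusing only the example fragments quoted elsewhere in this card (the real grammar and lexicon are much larger):

```python
import csv
import itertools

# Toy expansion of S -> SUBJECT OBJECT VERB over a parallel (Kangri, Hindi) lexicon.
subjects = [("कुड़ी", "लड़की")]   # girl
objects = [("धाम", "धाम")]        # dham (traditional meal)
verbs = [("खांगी", "खाएगी")]      # will eat

rows = [
    (f"{s_k} {o_k} {v_k}", f"{s_h} {o_h} {v_h}")
    for (s_k, s_h), (o_k, o_h), (v_k, v_h) in itertools.product(subjects, objects, verbs)
]

with open("kangri_hindi.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["kangri", "hindi"])
    writer.writerows(rows)
```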
### Training Procedure

#### Preprocessing
- Tokenizer: IndicTrans2 tokenizer.
- Source Tag: `doi_Deva` (Dogri, Devanagari script).
- Target Tag: `hin_Deva` (Hindi, Devanagari script).
- Max Length: 128 tokens.
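As a rough sketch of how these settings come together (assuming the dataset exposes `kangri` and `hindi` columns and that the tokenizer follows the standard transformers seq2seq API), each training pair can be converted into tagged, length-limited inputs like this:

```python
def preprocess(example):
    # Tag the Kangri source with the doi_Deva proxy and the hin_Deva target tag.
    source = f"doi_Deva hin_Deva {example['kangri']}"
    model_inputs = tokenizer(source, max_length=128, truncation=True)
    # Tokenize the Hindi reference as labels.
    labels = tokenizer(text_target=example["hindi"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```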
#### Training Hyperparameters
- Learning Rate: 1e-3
- Batch Size: 8 (per device) with Gradient Accumulation of 2.
- Epochs: 15
- Optimizer: Adafactor
- LoRA Rank (r): 32
- LoRA Alpha: 64
- LoRA Dropout: 0.1
- Target Modules: `q_proj`, `v_proj`, `k_proj`, `out_proj`, `fc1`, `fc2`
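These hyperparameters correspond roughly to the following PEFT/transformers configuration. This is a sketch rather than the original training script; `output_dir` is a placeholder.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

base = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-indic-1B", trust_remote_code=True
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj", "fc1", "fc2"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

training_args = Seq2SeqTrainingArguments(
    output_dir="kangri-hindi-lora",   # placeholder
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=15,
    optim="adafactor",
    fp16=True,
)
```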
## Evaluation

### Metrics
The model was evaluated using the BLEU score (SacreBLEU) on a held-out validation set (10% of the data).
- Final BLEU Score: ~93.69
- Validation Loss: ~10.13
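A minimal sketch of how such a score can be reproduced with sacrebleu (the variable names for the held-out sentences are illustrative):

```python
import sacrebleu

# val_kangri_sentences / val_hindi_references: held-out Kangri sources and gold Hindi targets.
predictions = [translate_kangri(src) for src in val_kangri_sentences]
bleu = sacrebleu.corpus_bleu(predictions, [val_hindi_references])
print(f"BLEU: {bleu.score:.2f}")
```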
## Environmental Impact
- Hardware Type: Google Colab T4 GPU (Free Tier)
- Hours used: ~0.7 hours (41 minutes)
- Precision: Mixed (FP16/BF16)