# Model Card for Kangri-to-Hindi-Translator

## Model Details

### Model Description
This model is a fine-tuned version of ai4bharat/indictrans2-indic-indic-1B specifically adapted for Kangri to Hindi translation using Low-Rank Adaptation (LoRA). It was created to address the low-resource nature of the Kangri dialect by leveraging transfer learning from linguistically similar languages like Dogri.
- Developed by: Lovnish Verma
- Model type: Seq2Seq (Encoder-Decoder) with LoRA adapters
- Language(s): Kangri (Source), Hindi (Target)
- License: MIT
- Finetuned from model: ai4bharat/indictrans2-indic-indic-1B
### Model Sources
- Demo Notebook: Open in Google Colab
## Uses

### Direct Use
This model is designed to translate text from Kangri, a Western Pahari language spoken in Himachal Pradesh, into standard Hindi. It is particularly useful for preserving local dialects and for enabling communication between dialect speakers and the wider Hindi-speaking community.
### Out-of-Scope Use
This model was trained on a synthetic dataset generated via a Context-Free Grammar (CFG), focused on specific domains (home, market, basic actions). It may struggle with:
- Complex, idiomatic, or highly technical Kangri text.
- Dialects significantly different from the training distribution.
## How to Get Started with the Model
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# 1. Load the Base Model
base_model_id = "ai4bharat/indictrans2-indic-indic-1B"
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 2. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    trust_remote_code=True
)

# 3. Load LoRA Adapters
adapter_path = "LovnishVerma/kangri-hindi-translator"
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# 4. Inference Function
def translate_kangri(text):
    # Note: We use 'doi_Deva' (Dogri) as a proxy tag for Kangri
    input_text = f"doi_Deva hin_Deva {text}"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        # Critical: use_cache=False is required for this specific architecture
        outputs = model.generate(**inputs, max_length=128, use_cache=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test
print(translate_kangri("कुड़ी धाम खांगी"))
# Expected Output: लड़की धाम खाएगी
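To translate several sentences at once, the helper above can be batched along the following lines. This is a sketch: the `num_beams` setting is an illustrative addition rather than part of the documented recipe, while the tagging and `use_cache=False` settings mirror the code above.

```python
def translate_kangri_batch(sentences, num_beams=4):
    # Apply the same doi_Deva -> hin_Deva proxy tagging to every sentence.
    batch = [f"doi_Deva hin_Deva {s}" for s in sentences]
    inputs = tokenizer(
        batch, return_tensors="pt", padding=True, truncation=True, max_length=128
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_length=128, num_beams=num_beams, use_cache=False
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_kangri_batch(["कुड़ी धाम खांगी"]))
```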
## Training Details

### Training Data
The model was trained on a synthetic dataset (`kangri_hindi.csv`) generated using a Context-Free Grammar (CFG) approach.
- Size: ~5,000 sentence pairs.
- Content: Sentences covering common subjects (I, you, he, she), verbs (eat, drink, go, read), and tenses (past, present, future).
- Strategy: Kangri sentences were tagged with `doi_Deva` (Dogri) as a proxy due to linguistic similarity, since Kangri is not natively supported by the base model (see the sketch below).
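The generation script itself is not published with this card; the following is a minimal sketch of the CFG-style expansion described above, reusing only the example fragments quoted elsewhere in this card (the real grammar and lexicon are much larger):

```python
import csv
import itertools

# Toy expansion of S -> SUBJECT OBJECT VERB over a parallel (Kangri, Hindi) lexicon.
subjects = [("कुड़ी", "लड़की")]   # girl
objects = [("धाम", "धाम")]        # dham (traditional meal)
verbs = [("खांगी", "खाएगी")]      # will eat

rows = [
    (f"{s_k} {o_k} {v_k}", f"{s_h} {o_h} {v_h}")
    for (s_k, s_h), (o_k, o_h), (v_k, v_h) in itertools.product(subjects, objects, verbs)
]

with open("kangri_hindi.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["kangri", "hindi"])
    writer.writerows(rows)
```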
### Training Procedure

#### Preprocessing
- Tokenizer: IndicTrans2 tokenizer.
- Source Tag: `doi_Deva` (Dogri, Devanagari script).
- Target Tag: `hin_Deva` (Hindi, Devanagari script).
- Max Length: 128 tokens.
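As a rough sketch of how these settings come together (assuming the dataset exposes `kangri` and `hindi` columns and that the tokenizer follows the standard transformers seq2seq API), each training pair can be converted into tagged, length-limited inputs like this:

```python
def preprocess(example):
    # Tag the Kangri source with the doi_Deva proxy and the hin_Deva target tag.
    source = f"doi_Deva hin_Deva {example['kangri']}"
    model_inputs = tokenizer(source, max_length=128, truncation=True)
    # Tokenize the Hindi reference as labels.
    labels = tokenizer(text_target=example["hindi"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```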
#### Training Hyperparameters
- Learning Rate: 1e-3
- Batch Size: 8 (per device) with Gradient Accumulation of 2.
- Epochs: 15
- Optimizer: Adafactor
- LoRA Rank (r): 32
- LoRA Alpha: 64
- LoRA Dropout: 0.1
- Target Modules: `q_proj`, `v_proj`, `k_proj`, `out_proj`, `fc1`, `fc2`
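These hyperparameters correspond roughly to the following PEFT/transformers configuration. This is a sketch rather than the original training script; `output_dir` is a placeholder.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

base = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-indic-1B", trust_remote_code=True
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj", "fc1", "fc2"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

training_args = Seq2SeqTrainingArguments(
    output_dir="kangri-hindi-lora",   # placeholder
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=15,
    optim="adafactor",
    fp16=True,
)
```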
## Evaluation

### Metrics
The model was evaluated using the BLEU score (SacreBLEU) on a held-out validation set (10% of the data).
- Final BLEU Score: ~93.69
- Validation Loss: ~10.13
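A minimal sketch of how such a score can be reproduced with sacrebleu (the variable names for the held-out sentences are illustrative):

```python
import sacrebleu

# val_kangri_sentences / val_hindi_references: held-out Kangri sources and gold Hindi targets.
predictions = [translate_kangri(src) for src in val_kangri_sentences]
bleu = sacrebleu.corpus_bleu(predictions, [val_hindi_references])
print(f"BLEU: {bleu.score:.2f}")
```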
## Environmental Impact
- Hardware Type: Google Colab T4 GPU (Free Tier)
- Hours used: ~0.7 hours (41 minutes)
- Precision: Mixed (FP16/BF16)