Kara-Kumru-v1.0-2B 🐦⬛
A 2B parameter Turkish LLM that outperforms 70B models on Turkish benchmarks.
Kara-Kumru-v1.0-2B is a fine-tuned version of vngrs-ai/Kumru-2B, optimized for Turkish language tasks including question answering, summarization, and translation. Despite having only 2 billion parameters, it achieves an average score of 37.56 on the Cetvel Turkish LLM Benchmark, surpassing Llama-3.3-70B-Instruct (36.25), a model 35x its size.
Leaderboard scores for other models are sourced from vngrs-ai/Kumru-2B. Kara-Kumru-v1.0-2B scores were evaluated using our own Cetvel pipeline.
Key Results
| Metric | Kara-Kumru-v1.0-2B | Llama-3.3-70B | Kumru-2B (baseline) | Delta vs baseline |
|---|---|---|---|---|
| Average | 37.56 | 36.25 | 31.98 | +5.58 |
| QA | 32.54 🥇 | 23.97 | 6.50 | +26.04 |
| SUM | 32.55 🥇 | 18.15 | 18.67 | +13.88 |
| MT | 10.58 | 19.99 | 7.10 | +3.48 |
| GEC | 64.96 | 30.10 | 66.34 | -1.38 |
| MCQA | 42.02 | 60.70 | 39.69 | +2.33 |
| NLI | 33.86 | 37.10 | 37.97 | -4.11 |
| TC | 46.39 | 63.73 | 47.57 | -1.18 |
🥇 Kara-Kumru-v1.0-2B achieves the highest QA and SUM scores across the entire Cetvel leaderboard, including models up to 72B parameters.
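As a sanity check, the reported overall average is consistent with an unweighted mean of the seven category scores. The aggregation rule is our assumption about how Cetvel averages categories; the per-category figures come directly from the Key Results table above:

```python
# Category scores for Kara-Kumru-v1.0-2B, from the Key Results table above.
category_scores = {
    "QA": 32.54, "SUM": 32.55, "MT": 10.58, "GEC": 64.96,
    "MCQA": 42.02, "NLI": 33.86, "TC": 46.39,
}

# Assumption: the leaderboard "Average" is the unweighted mean of the
# seven category scores. The arithmetic reproduces the reported 37.56.
average = round(sum(category_scores.values()) / len(category_scores), 2)
print(average)  # 37.56
```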
Detailed Task-Level Results
| Task | Metric | Baseline | Kara-Kumru-v1.0-2B | Delta |
|---|---|---|---|---|
| tquad | f1 | 39.38 | 50.66 | +11.27 |
| xquad_tr | f1 | 31.46 | 39.27 | +7.81 |
| wmt-tr-en-prompt | bleu | 6.17 | 10.58 | +4.42 |
| xfact_tr | acc_norm | 40.83 | 44.38 | +3.55 |
| mkqa_tr | f1 | 5.29 | 7.70 | +2.41 |
| tr-wikihow-summ | rouge1 | 25.18 | 26.84 | +1.67 |
| wiki_lingua_tr | rouge1 | 24.44 | 26.04 | +1.60 |
| mlsum_tr | rouge1 | 42.11 | 43.55 | +1.44 |
| exams_tr | acc_norm | 31.55 | 32.57 | +1.02 |
| turkish_plu | acc_norm | 47.78 | 48.13 | +0.35 |
| ironytr | acc_norm | 50.00 | 50.00 | 0.00 |
| offenseval_tr | acc_norm | 79.71 | 79.71 | 0.00 |
| sts_tr | acc_norm | 11.75 | 11.75 | 0.00 |
| trclaim19 | acc_norm | 60.10 | 60.10 | 0.00 |
| xlsum_tr | rouge1 | 34.49 | 33.78 | -0.71 |
| nli_tr | acc | 35.31 | 33.86 | -1.46 |
| xcopa_tr | acc | 63.20 | 61.60 | -1.60 |
| gecturk_generation | exact_match | 68.39 | 64.96 | -3.43 |
| belebele_tr | acc_norm | 29.22 | 25.78 | -3.44 |
| news_cat | acc_norm | 38.80 | 32.40 | -6.40 |
Cetvel Leaderboard Position
| Rank | Model | Avg. Score | Size |
|---|---|---|---|
| 1 | Kumru-7B | 41.58 | 7B |
| 2 | **Kara-Kumru-v1.0-2B** (this model) | 37.56 | 2B |
| 3 | Llama-3.3-70B-Instruct | 36.25 | 70B |
| 4 | Kumru-2B | 31.98 | 2B |
| 5 | gemma-3-27b-it | 27.73 | 27B |
| 6 | gemma-3-12b-it | 27.60 | 12B |
| 7 | Qwen2-72B-Instruct | 26.07 | 72B |
| … | | | |
Highlights
- 35x smaller, higher score: 2B params beating Llama-3.3-70B-Instruct on Turkish
- Best-in-class QA: 32.54 — highest QA score across ALL models in the Cetvel leaderboard, including 72B models
- Best-in-class SUM: 32.55 — highest summarization score across the entire leaderboard
- TQuAD breakthrough: +11.27 F1 improvement on Turkish reading comprehension
- Edge-deployable: Runs on a single consumer GPU, Mac Mini, or mobile device
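A rough back-of-the-envelope calculation supports the edge-deployment claim. The sketch below estimates weight memory only (activations, KV cache, and runtime overhead are not included), assuming ~2B parameters:

```python
# Rough weight-memory estimate for a ~2B-parameter model.
# Activations, KV cache, and runtime overhead are NOT included.
params = 2e9

def weights_gib(bits_per_param):
    """Weight storage in GiB for the given precision."""
    return params * bits_per_param / 8 / 2**30

print(round(weights_gib(16), 1))  # BF16 weights: ~3.7 GiB
print(round(weights_gib(4), 1))   # 4-bit quantized weights: ~0.9 GiB
```

At BF16 the weights fit comfortably in 8 GB of VRAM, and a 4-bit quantization brings them under 1 GiB, which is why small-footprint targets like a Mac Mini are plausible.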
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AlicanKiraz0/Kara-Kumru-v1.0-2B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Türkiye'nin en büyük gölü hangisidir ve özellikleri nelerdir?"}
]

# return_dict=True is required so `inputs` is a dict of tensors;
# with return_tensors="pt" alone, apply_chat_template returns a bare tensor
# and the .items() call below would fail.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
```
Quantized Inference (GGUF)
```bash
# For llama.cpp / Ollama users (if a GGUF version is available)
ollama run AlicanKiraz0/Kara-Kumru-v1.0-2B
```
Training Details
Base Model
- Model: vngrs-ai/Kumru-2B
- Parameters: ~2B
- Architecture: Transformer decoder-only
Fine-tuning Configuration
- Method: Full fine-tuning
- Precision: BF16
- Hardware: SnakeEye Cluster (DGX Spark + Mac Studio M3 Ultra)
What Improved & Why
The fine-tuning primarily strengthened generative capabilities (QA, summarization, translation) while showing minor regression on some discriminative tasks (classification, NLI). This is a well-known trade-off in LLM fine-tuning — the model learned to produce better free-form Turkish text at the cost of some multiple-choice and classification accuracy.
| Capability | Avg. task delta | Interpretation |
|---|---|---|
| Question Answering (QA) | ⬆️ +7.17 | Extractive QA dramatically improved |
| Translation (MT) | ⬆️ +4.42 | TR→EN translation quality increased |
| Summarization (SUM) | ⬆️ +1.00 | Abstractive summarization improved |
| Grammar Correction (GEC) | ⬇️ -3.43 | Exact-match GEC slightly regressed |
| Natural Language Inference (NLI) | ⬇️ -1.46 | Entailment classification dipped |
| Text Classification (TC) | ⬇️ -0.47 | Minor regression on classification |
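The deltas above are per-capability averages of the task-level deltas from the detailed table, which is why they differ from the category-score deltas in Key Results. A minimal sketch, using (baseline, fine-tuned) score pairs copied from the detailed table:

```python
# (baseline, fine-tuned) scores per task, from the detailed results table.
qa = {
    "tquad": (39.38, 50.66),
    "xquad_tr": (31.46, 39.27),
    "mkqa_tr": (5.29, 7.70),
}
tc = {
    "xfact_tr": (40.83, 44.38),
    "ironytr": (50.00, 50.00),
    "offenseval_tr": (79.71, 79.71),
    "sts_tr": (11.75, 11.75),
    "trclaim19": (60.10, 60.10),
    "news_cat": (38.80, 32.40),
}

def avg_delta(tasks):
    """Mean of the per-task score changes within one capability."""
    return round(sum(new - old for old, new in tasks.values()) / len(tasks), 2)

print(avg_delta(qa))  # 7.17, matching the QA row above
print(avg_delta(tc))  # ~ -0.47, matching the TC row above
```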
Evaluation
All evaluations were performed using the Cetvel Turkish LLM Benchmark framework.
Cetvel Benchmark Categories
| Category | Description |
|---|---|
| GEC | Grammatical Error Correction (gecturk_generation) |
| MCQA | Multiple Choice QA (belebele_tr, exams_tr, turkish_plu, xcopa_tr) |
| MT | Machine Translation TR→EN (wmt-tr-en-prompt) |
| NLI | Natural Language Inference (nli_tr) |
| QA | Question Answering (xquad_tr, tquad, mkqa_tr) |
| SUM | Summarization (mlsum_tr, xlsum_tr, tr-wikihow-summ, wiki_lingua_tr) |
| TC | Text Classification (ironytr, news_cat, offenseval_tr, sts_tr, trclaim19, xfact_tr) |
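Each category score appears to be the unweighted mean of its tasks' scores (our assumption; the arithmetic reproduces the Key Results figures). For example, for QA and SUM, using task scores from the detailed table:

```python
# Per-task scores for Kara-Kumru-v1.0-2B, from the detailed results table.
qa_tasks = {"tquad": 50.66, "xquad_tr": 39.27, "mkqa_tr": 7.70}
sum_tasks = {
    "mlsum_tr": 43.55,
    "xlsum_tr": 33.78,
    "tr-wikihow-summ": 26.84,
    "wiki_lingua_tr": 26.04,
}

def category_score(tasks):
    """Assumption: a category score is the unweighted mean of its task scores."""
    return round(sum(tasks.values()) / len(tasks), 2)

print(category_score(qa_tasks))   # 32.54, as in Key Results
print(category_score(sum_tasks))  # 32.55, as in Key Results
```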
Intended Use
- Turkish question answering and information extraction
- Turkish text summarization
- Turkish-to-English translation
- General Turkish language generation
- Research on efficient Turkish LLMs
Limitations
- Classification tasks: Some regression on text classification and NLI compared to baseline
- Grammar correction: GEC performance decreased by ~3.4 points
- Model size trade-offs: While competitive with much larger models on generative tasks, MCQA performance lags behind 7B+ models
- Evaluation caveat: Cross-pipeline benchmark comparison — see note above
Roadmap (Kara-Kumru-v2.0)
- Targeted GEC and NLI distillation to recover regression
- Classification-focused fine-tuning (news categorization, irony detection)
- MCQA and causal reasoning dataset expansion
- Unified evaluation pipeline for fair cross-model comparison
- GGUF quantization for edge deployment
Citation
```bibtex
@misc{kiraz2026karakumru,
  title={Kara-Kumru-v1.0-2B: A Fine-tuned 2B Turkish LLM Outperforming 70B Models},
  author={Kiraz, Alican},
  year={2026},
  url={https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```
Acknowledgments
- VNGRS AI for the Kumru base model and the Cetvel benchmark framework
- Built on the SnakeEye Cluster — a multi-node system with DGX Spark and Apple Silicon nodes
Contact
Alican Kiraz
Kara-Kumru (lit. "Dark Dove") — named after the darker variant of the Eurasian collared dove. Small but fierce.