Kara-Kumru-v1.0-2B 🐦‍⬛

A 2B-parameter Turkish LLM that outperforms 70B models on Turkish benchmarks.

Kara-Kumru-v1.0-2B is a fine-tuned version of vngrs-ai/Kumru-2B, optimized specifically for Turkish language tasks including question answering, summarization, and translation. Despite having only 2 billion parameters, it achieves an average score of 37.56 on the Cetvel Turkish LLM Benchmark, surpassing Llama-3.3-70B-Instruct (36.25), a model 35x its size.

Cetvel Turkish LLM Benchmark Leaderboard

Leaderboard scores for other models are sourced from vngrs-ai/Kumru-2B. Kara-Kumru-v1.0-2B scores were evaluated using our own Cetvel pipeline.

Key Results

| Metric | Kara-Kumru-v1.0-2B | Llama-3.3-70B | Kumru-2B (baseline) | Delta vs baseline |
|---------|-------------------:|--------------:|--------------------:|------------------:|
| Average | 37.56 | 36.25 | 31.98 | +5.58 |
| QA | 32.54 🥇 | 23.97 | 6.50 | +26.04 |
| SUM | 32.55 🥇 | 18.15 | 18.67 | +13.88 |
| MT | 10.58 | 19.99 | 7.10 | +3.48 |
| GEC | 64.96 | 30.10 | 66.34 | -1.38 |
| MCQA | 42.02 | 60.70 | 39.69 | +2.33 |
| NLI | 33.86 | 37.10 | 37.97 | -4.11 |
| TC | 46.39 | 63.73 | 47.57 | -1.18 |

🥇 Kara-Kumru-v1.0-2B achieves the highest QA and SUM scores across the entire Cetvel leaderboard, including models up to 72B parameters.
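
The leaderboard averages are consistent with an unweighted mean of the seven category scores. A quick sanity check (plain arithmetic over the figures in the table above; the unweighted-mean aggregation is an assumption that matches the published numbers):

```python
# Category scores from the Key Results table, in order: QA, SUM, MT, GEC, MCQA, NLI, TC.
kara     = [32.54, 32.55, 10.58, 64.96, 42.02, 33.86, 46.39]
llama70b = [23.97, 18.15, 19.99, 30.10, 60.70, 37.10, 63.73]
kumru2b  = [6.50, 18.67, 7.10, 66.34, 39.69, 37.97, 47.57]

def avg(scores):
    """Unweighted mean, rounded to two decimals as on the leaderboard."""
    return round(sum(scores) / len(scores), 2)

print(avg(kara))      # 37.56
print(avg(llama70b))  # 36.25
print(avg(kumru2b))   # 31.98
```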

Detailed Task-Level Results

| Task | Metric | Baseline | Kara-Kumru-v1.0-2B | Delta |
|------|--------|---------:|-------------------:|------:|
| tquad | f1 | 39.38 | 50.66 | +11.27 |
| xquad_tr | f1 | 31.46 | 39.27 | +7.81 |
| wmt-tr-en-prompt | bleu | 6.17 | 10.58 | +4.42 |
| xfact_tr | acc_norm | 40.83 | 44.38 | +3.55 |
| mkqa_tr | f1 | 5.29 | 7.70 | +2.41 |
| tr-wikihow-summ | rouge1 | 25.18 | 26.84 | +1.67 |
| wiki_lingua_tr | rouge1 | 24.44 | 26.04 | +1.60 |
| mlsum_tr | rouge1 | 42.11 | 43.55 | +1.44 |
| exams_tr | acc_norm | 31.55 | 32.57 | +1.02 |
| turkish_plu | acc_norm | 47.78 | 48.13 | +0.35 |
| ironytr | acc_norm | 50.00 | 50.00 | 0.00 |
| offenseval_tr | acc_norm | 79.71 | 79.71 | 0.00 |
| sts_tr | acc_norm | 11.75 | 11.75 | 0.00 |
| trclaim19 | acc_norm | 60.10 | 60.10 | 0.00 |
| xlsum_tr | rouge1 | 34.49 | 33.78 | -0.71 |
| nli_tr | acc | 35.31 | 33.86 | -1.46 |
| xcopa_tr | acc | 63.20 | 61.60 | -1.60 |
| gecturk_generation | exact_match | 68.39 | 64.96 | -3.43 |
| belebele_tr | acc_norm | 29.22 | 25.78 | -3.44 |
| news_cat | acc_norm | 38.80 | 32.40 | -6.40 |
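
The category scores in the Key Results table follow from these task scores as unweighted means within each category (an aggregation rule assumed here, but one that checks out numerically against the published figures):

```python
# Kara-Kumru task scores grouped by Cetvel category, taken from the task table above.
tasks = {
    "QA":   [50.66, 39.27, 7.70],                      # tquad, xquad_tr, mkqa_tr
    "SUM":  [26.84, 26.04, 43.55, 33.78],              # wikihow, wiki_lingua, mlsum, xlsum
    "MT":   [10.58],                                   # wmt-tr-en-prompt
    "GEC":  [64.96],                                   # gecturk_generation
    "MCQA": [25.78, 32.57, 48.13, 61.60],              # belebele, exams, turkish_plu, xcopa
    "NLI":  [33.86],                                   # nli_tr
    "TC":   [50.00, 32.40, 79.71, 11.75, 60.10, 44.38],
}

# Each category score is the unweighted mean of its task metrics.
category_scores = {cat: round(sum(v) / len(v), 2) for cat, v in tasks.items()}
print(category_scores)
# {'QA': 32.54, 'SUM': 32.55, 'MT': 10.58, 'GEC': 64.96,
#  'MCQA': 42.02, 'NLI': 33.86, 'TC': 46.39}
```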

Cetvel Leaderboard Position

```
#1  Kumru-7B                 41.58  (7B)
#2  Kara-Kumru-v1.0-2B       37.56  (2B) ← YOU ARE HERE
#3  Llama-3.3-70B-Instruct   36.25  (70B)
#4  Kumru-2B                 31.98  (2B)
#5  gemma-3-27b-it           27.73  (27B)
#6  gemma-3-12b-it           27.60  (12B)
#7  Qwen2-72B-Instruct       26.07  (72B)
...
```

Highlights

  • 35x smaller, higher score: 2B params beating Llama-3.3-70B-Instruct on Turkish
  • Best-in-class QA: 32.54 — highest QA score across ALL models in the Cetvel leaderboard, including 72B models
  • Best-in-class SUM: 32.55 — highest summarization score across the entire leaderboard
  • TQuAD breakthrough: +11.27 F1 improvement on Turkish reading comprehension
  • Edge-deployable: Runs on a single consumer GPU, Mac Mini, or mobile device

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AlicanKiraz0/Kara-Kumru-v1.0-2B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Türkiye'nin en büyük gölü hangisidir ve özellikleri nelerdir?"}
]

# return_dict=True is required so that `inputs` is a dict with input_ids and
# attention_mask; without it, apply_chat_template returns a bare tensor and
# the .items() call below fails.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

inputs = {k: v.to(model.device) for k, v in inputs.items()}

output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)

print(response)
```

Quantized Inference (GGUF)

```bash
# For llama.cpp / Ollama users (once a GGUF version is available; see Roadmap)
ollama run AlicanKiraz0/Kara-Kumru-v1.0-2B
```

Training Details

Base Model

vngrs-ai/Kumru-2B

Fine-tuning Configuration

  • Method: Full fine-tuning
  • Precision: BF16
  • Hardware: SnakeEye Cluster (DGX Spark + Mac Studio M3 Ultra)

What Improved & Why

The fine-tuning primarily strengthened generative capabilities (QA, summarization, translation) while showing minor regression on some discriminative tasks (classification, NLI). This is a well-known trade-off in LLM fine-tuning — the model learned to produce better free-form Turkish text at the cost of some multiple-choice and classification accuracy.

| Capability | Direction | Interpretation |
|------------|-----------|----------------|
| Question Answering (QA) | ⬆️ +7.17 | Extractive QA dramatically improved |
| Translation (MT) | ⬆️ +4.42 | TR→EN translation quality increased |
| Summarization (SUM) | ⬆️ +1.00 | Abstractive summarization improved |
| Grammar Correction (GEC) | ⬇️ -3.43 | Exact-match GEC slightly regressed |
| Natural Language Inference (NLI) | ⬇️ -1.46 | Entailment classification dipped |
| Text Classification (TC) | ⬇️ -0.47 | Minor regression on classification |

Note: the deltas in this table are unweighted averages of the per-task deltas from the detailed task-level results, which is why they differ from the category-score deltas in the Key Results table (e.g. QA +7.17 here vs. +26.04 there).

Evaluation

All evaluations were performed using the Cetvel Turkish LLM Benchmark framework.

Cetvel Benchmark Categories

| Category | Description |
|----------|-------------|
| GEC | Grammatical Error Correction (gecturk_generation) |
| MCQA | Multiple Choice QA (belebele_tr, exams_tr, turkish_plu, xcopa_tr) |
| MT | Machine Translation TR→EN (wmt-tr-en-prompt) |
| NLI | Natural Language Inference (nli_tr) |
| QA | Question Answering (xquad_tr, tquad, mkqa_tr) |
| SUM | Summarization (mlsum_tr, xlsum_tr, tr-wikihow-summ, wiki_lingua_tr) |
| TC | Text Classification (ironytr, news_cat, offenseval_tr, sts_tr, trclaim19, xfact_tr) |

Intended Use

  • Turkish question answering and information extraction
  • Turkish text summarization
  • Turkish-to-English translation
  • General Turkish language generation
  • Research on efficient Turkish LLMs

Limitations

  • Classification tasks: Some regression on text classification and NLI compared to baseline
  • Grammar correction: GEC performance decreased by ~3.4 points
  • Model size trade-offs: While competitive with much larger models on generative tasks, MCQA performance lags behind 7B+ models
  • Evaluation caveat: Cross-pipeline benchmark comparison — see note above

Roadmap (Kara-Kumru-v2.0)

  • Targeted GEC and NLI distillation to recover regression
  • Classification-focused fine-tuning (news categorization, irony detection)
  • MCQA and causal reasoning dataset expansion
  • Unified evaluation pipeline for fair cross-model comparison
  • GGUF quantization for edge deployment

Citation

```bibtex
@misc{kiraz2026karakumru,
  title={Kara-Kumru-v1.0-2B: A Fine-tuned 2B Turkish LLM Outperforming 70B Models},
  author={Kiraz, Alican},
  year={2026},
  url={https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```

Acknowledgments

  • VNGRS AI for the Kumru base model and the Cetvel benchmark framework
  • Built on the SnakeEye Cluster — a multi-node system with DGX Spark and Apple Silicon nodes

Contact

Alican Kiraz

LinkedIn · X · Medium · HuggingFace · GitHub


Kara-Kumru (lit. "Dark Dove") — named after the darker variant of the Eurasian collared dove. Small but fierce.
