Kara-Kumru-v1.0-2B 🐦‍⬛

A 2B-parameter Turkish LLM that outperforms 70B models on Turkish benchmarks.

Kara-Kumru-v1.0-2B is a fine-tuned version of vngrs-ai/Kumru-2B, optimized specifically for Turkish language tasks including question answering, summarization, and translation. Despite having only 2 billion parameters, it achieves an average score of 37.56 on the Cetvel Turkish LLM Benchmark, surpassing Llama-3.3-70B-Instruct (36.25), a model 35x its size.

Cetvel Turkish LLM Benchmark Leaderboard

Leaderboard scores for other models are sourced from vngrs-ai/Kumru-2B. Kara-Kumru-v1.0-2B scores were evaluated using our own Cetvel pipeline.

Key Results

| Metric | Kara-Kumru-v1.0-2B | Llama-3.3-70B | Kumru-2B (baseline) | Delta vs baseline |
|---------|-------------------:|--------------:|--------------------:|------------------:|
| Average | 37.56 | 36.25 | 31.98 | +5.58 |
| QA | 32.54 🥇 | 23.97 | 6.50 | +26.04 |
| SUM | 32.55 🥇 | 18.15 | 18.67 | +13.88 |
| MT | 10.58 | 19.99 | 7.10 | +3.48 |
| GEC | 64.96 | 30.10 | 66.34 | -1.38 |
| MCQA | 42.02 | 60.70 | 39.69 | +2.33 |
| NLI | 33.86 | 37.10 | 37.97 | -4.11 |
| TC | 46.39 | 63.73 | 47.57 | -1.18 |

🥇 Kara-Kumru-v1.0-2B achieves the highest QA and SUM scores across the entire Cetvel leaderboard, including models up to 72B parameters.
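
The leaderboard averages are consistent with an unweighted mean of the seven category scores. A quick sanity check (plain arithmetic over the figures in the table above; the unweighted-mean aggregation is an assumption that matches the published numbers):

```python
# Category scores from the Key Results table, in order: QA, SUM, MT, GEC, MCQA, NLI, TC.
kara     = [32.54, 32.55, 10.58, 64.96, 42.02, 33.86, 46.39]
llama70b = [23.97, 18.15, 19.99, 30.10, 60.70, 37.10, 63.73]
kumru2b  = [6.50, 18.67, 7.10, 66.34, 39.69, 37.97, 47.57]

def avg(scores):
    """Unweighted mean, rounded to two decimals as on the leaderboard."""
    return round(sum(scores) / len(scores), 2)

print(avg(kara))      # 37.56
print(avg(llama70b))  # 36.25
print(avg(kumru2b))   # 31.98
```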

Detailed Task-Level Results

| Task | Metric | Baseline | Kara-Kumru-v1.0-2B | Delta |
|------|--------|---------:|-------------------:|------:|
| tquad | f1 | 39.38 | 50.66 | +11.27 |
| xquad_tr | f1 | 31.46 | 39.27 | +7.81 |
| wmt-tr-en-prompt | bleu | 6.17 | 10.58 | +4.42 |
| xfact_tr | acc_norm | 40.83 | 44.38 | +3.55 |
| mkqa_tr | f1 | 5.29 | 7.70 | +2.41 |
| tr-wikihow-summ | rouge1 | 25.18 | 26.84 | +1.67 |
| wiki_lingua_tr | rouge1 | 24.44 | 26.04 | +1.60 |
| mlsum_tr | rouge1 | 42.11 | 43.55 | +1.44 |
| exams_tr | acc_norm | 31.55 | 32.57 | +1.02 |
| turkish_plu | acc_norm | 47.78 | 48.13 | +0.35 |
| ironytr | acc_norm | 50.00 | 50.00 | 0.00 |
| offenseval_tr | acc_norm | 79.71 | 79.71 | 0.00 |
| sts_tr | acc_norm | 11.75 | 11.75 | 0.00 |
| trclaim19 | acc_norm | 60.10 | 60.10 | 0.00 |
| xlsum_tr | rouge1 | 34.49 | 33.78 | -0.71 |
| nli_tr | acc | 35.31 | 33.86 | -1.46 |
| xcopa_tr | acc | 63.20 | 61.60 | -1.60 |
| gecturk_generation | exact_match | 68.39 | 64.96 | -3.43 |
| belebele_tr | acc_norm | 29.22 | 25.78 | -3.44 |
| news_cat | acc_norm | 38.80 | 32.40 | -6.40 |
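
The category scores in the Key Results table follow from these task scores as unweighted means within each category (an aggregation rule assumed here, but one that checks out numerically against the published figures):

```python
# Kara-Kumru task scores grouped by Cetvel category, taken from the task table above.
tasks = {
    "QA":   [50.66, 39.27, 7.70],                      # tquad, xquad_tr, mkqa_tr
    "SUM":  [26.84, 26.04, 43.55, 33.78],              # wikihow, wiki_lingua, mlsum, xlsum
    "MT":   [10.58],                                   # wmt-tr-en-prompt
    "GEC":  [64.96],                                   # gecturk_generation
    "MCQA": [25.78, 32.57, 48.13, 61.60],              # belebele, exams, turkish_plu, xcopa
    "NLI":  [33.86],                                   # nli_tr
    "TC":   [50.00, 32.40, 79.71, 11.75, 60.10, 44.38],
}

# Each category score is the unweighted mean of its task metrics.
category_scores = {cat: round(sum(v) / len(v), 2) for cat, v in tasks.items()}
print(category_scores)
# {'QA': 32.54, 'SUM': 32.55, 'MT': 10.58, 'GEC': 64.96,
#  'MCQA': 42.02, 'NLI': 33.86, 'TC': 46.39}
```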

Cetvel Leaderboard Position

```
#1  Kumru-7B                 41.58  (7B)
#2  Kara-Kumru-v1.0-2B       37.56  (2B) ← YOU ARE HERE
#3  Llama-3.3-70B-Instruct   36.25  (70B)
#4  Kumru-2B                 31.98  (2B)
#5  gemma-3-27b-it           27.73  (27B)
#6  gemma-3-12b-it           27.60  (12B)
#7  Qwen2-72B-Instruct       26.07  (72B)
...
```

Highlights

  • 35x smaller, higher score: 2B params beating Llama-3.3-70B-Instruct on Turkish
  • Best-in-class QA: 32.54 — highest QA score across ALL models in the Cetvel leaderboard, including 72B models
  • Best-in-class SUM: 32.55 — highest summarization score across the entire leaderboard
  • TQuAD breakthrough: +11.27 F1 improvement on Turkish reading comprehension
  • Edge-deployable: Runs on a single consumer GPU, Mac Mini, or mobile device

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AlicanKiraz0/Kara-Kumru-v1.0-2B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Türkiye'nin en büyük gölü hangisidir ve özellikleri nelerdir?"}
]

# return_dict=True is required so that `inputs` is a dict with input_ids and
# attention_mask; without it, apply_chat_template returns a bare tensor and
# the .items() call below fails.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

inputs = {k: v.to(model.device) for k, v in inputs.items()}

output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)

print(response)
```

Quantized Inference (GGUF)

```bash
# For llama.cpp / Ollama users (once a GGUF version is available; see Roadmap)
ollama run AlicanKiraz0/Kara-Kumru-v1.0-2B
```

Training Details

Base Model

vngrs-ai/Kumru-2B

Fine-tuning Configuration

  • Method: Full fine-tuning
  • Precision: BF16
  • Hardware: SnakeEye Cluster (DGX Spark + Mac Studio M3 Ultra)

What Improved & Why

The fine-tuning primarily strengthened generative capabilities (QA, summarization, translation) while showing minor regression on some discriminative tasks (classification, NLI). This is a well-known trade-off in LLM fine-tuning — the model learned to produce better free-form Turkish text at the cost of some multiple-choice and classification accuracy.

| Capability | Direction | Interpretation |
|------------|-----------|----------------|
| Question Answering (QA) | ⬆️ +7.17 | Extractive QA dramatically improved |
| Translation (MT) | ⬆️ +4.42 | TR→EN translation quality increased |
| Summarization (SUM) | ⬆️ +1.00 | Abstractive summarization improved |
| Grammar Correction (GEC) | ⬇️ -3.43 | Exact-match GEC slightly regressed |
| Natural Language Inference (NLI) | ⬇️ -1.46 | Entailment classification dipped |
| Text Classification (TC) | ⬇️ -0.47 | Minor regression on classification |

Note: the deltas in this table are unweighted averages of the per-task deltas from the detailed task-level results, which is why they differ from the category-score deltas in the Key Results table (e.g. QA +7.17 here vs. +26.04 there).

Evaluation

All evaluations were performed using the Cetvel Turkish LLM Benchmark framework.

Cetvel Benchmark Categories

| Category | Description |
|----------|-------------|
| GEC | Grammatical Error Correction (gecturk_generation) |
| MCQA | Multiple Choice QA (belebele_tr, exams_tr, turkish_plu, xcopa_tr) |
| MT | Machine Translation TR→EN (wmt-tr-en-prompt) |
| NLI | Natural Language Inference (nli_tr) |
| QA | Question Answering (xquad_tr, tquad, mkqa_tr) |
| SUM | Summarization (mlsum_tr, xlsum_tr, tr-wikihow-summ, wiki_lingua_tr) |
| TC | Text Classification (ironytr, news_cat, offenseval_tr, sts_tr, trclaim19, xfact_tr) |

Intended Use

  • Turkish question answering and information extraction
  • Turkish text summarization
  • Turkish-to-English translation
  • General Turkish language generation
  • Research on efficient Turkish LLMs

Limitations

  • Classification tasks: Some regression on text classification and NLI compared to baseline
  • Grammar correction: GEC performance decreased by ~3.4 points
  • Model size trade-offs: While competitive with much larger models on generative tasks, MCQA performance lags behind 7B+ models
  • Evaluation caveat: Cross-pipeline benchmark comparison — see note above

Roadmap (Kara-Kumru-v2.0)

  • Targeted GEC and NLI distillation to recover regression
  • Classification-focused fine-tuning (news categorization, irony detection)
  • MCQA and causal reasoning dataset expansion
  • Unified evaluation pipeline for fair cross-model comparison
  • GGUF quantization for edge deployment

Citation

```bibtex
@misc{kiraz2026karakumru,
  title={Kara-Kumru-v1.0-2B: A Fine-tuned 2B Turkish LLM Outperforming 70B Models},
  author={Kiraz, Alican},
  year={2026},
  url={https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```

Acknowledgments

  • VNGRS AI for the Kumru base model and the Cetvel benchmark framework
  • Built on the SnakeEye Cluster — a multi-node system with DGX Spark and Apple Silicon nodes

Contact

Alican Kiraz

LinkedIn · X · Medium · HuggingFace · GitHub


Kara-Kumru (lit. "Dark Dove") — named after the darker variant of the Eurasian collared dove. Small but fierce.
