FLAN-T5 Base Turkish OTT Query Parser

This model is a fine-tuned version of google/flan-t5-base for Turkish OTT and media search query parsing.

The model takes a Turkish natural language media search query as input and generates a structured JSON output. Its purpose is to convert user queries into machine-readable filters that can be used in search, recommendation, filtering, or vector database systems.

What This Model Does

The model reads a Turkish OTT/media search query and extracts the filters it expresses, such as:

  • genres
  • excluded genres
  • names
  • country filters
  • language filters
  • mood or theme tags
  • title or channel names
  • similar title requests
  • rating and popularity filters
  • live broadcast intent

For example, the query below (roughly "domestic drama series, but not romantic"):

yerli dram dizileri olsun ama romantik olmasın

can be converted into:

{
  "intent_type": "content_search",
  "filters": {
    "country_names": ["Türkiye"],
    "genres": ["dram"],
    "exclude_genres": ["romantik"]
  }
}

Intended Use

This model can be used as a query understanding layer in Turkish OTT/media search systems.

A typical usage flow is:

User Query
→ Query Parser Model
→ Structured JSON Filters
→ Search / Filtering / Vector Database
→ Final Results

The model does not directly search for movies, series, or channels. It only extracts structured filters from the user query.
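The last two steps of the flow are left to the surrounding system. As a minimal sketch of what that step might look like, the snippet below applies a parsed result to a small in-memory catalog. The catalog, its field names (`title`, `country`, `genres`), and the `apply_filters` helper are hypothetical and not part of this model; requiring all requested genres to match is also an assumption.

```python
# Hypothetical in-memory catalog; a real system would query a search
# index or vector database instead.
CATALOG = [
    {"title": "A", "country": "Türkiye", "genres": ["dram"]},
    {"title": "B", "country": "Türkiye", "genres": ["dram", "romantik"]},
    {"title": "C", "country": "ABD", "genres": ["dram"]},
]


def apply_filters(parsed, catalog):
    """Return the catalog items that satisfy the parsed filters."""
    filters = parsed.get("filters", {})
    countries = set(filters.get("country_names", []))
    wanted = set(filters.get("genres", []))
    excluded = set(filters.get("exclude_genres", []))

    results = []
    for item in catalog:
        item_genres = set(item["genres"])
        if countries and item["country"] not in countries:
            continue  # wrong country
        if not wanted <= item_genres:
            continue  # assumption: every requested genre must be present
        if excluded & item_genres:
            continue  # contains an excluded genre
        results.append(item)
    return results


# Parser output for "yerli dram dizileri olsun ama romantik olmasın".
parsed = {
    "intent_type": "content_search",
    "filters": {
        "country_names": ["Türkiye"],
        "genres": ["dram"],
        "exclude_genres": ["romantik"],
    },
}

matches = apply_filters(parsed, CATALOG)
print([m["title"] for m in matches])
```

Only item "A" passes all three checks here: "B" carries the excluded genre and "C" is from a different country.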

Example Inputs and Outputs

Example 1

Input ("domestic comedy films"):

yerli komedi filmleri

Output:

{
  "intent_type": "content_search",
  "filters": {
    "country_names": ["Türkiye"],
    "genres": ["komedi"]
  }
}

Example 2

Input ("TRT 1 live broadcast"):

trt 1 canlı yayın

Output:

{
  "intent_type": "live_content",
  "filters": {
    "content_tags": ["canli_yayin"],
    "names": ["TRT 1"],
    "exact_match": true
  }
}

Example 3

Input ("popular horror series"):

popüler korku dizileri

Output:

{
  "intent_type": "content_search",
  "filters": {
    "genres": ["korku"],
    "total_rating_min": ["P75"]
  }
}

How to Use

import json
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Selenaydmrp/flan-t5-base-turkish-ott-query-parser"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)

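# Task prefix (Turkish) prepended to every query. In English, roughly:
# "Extract only the media search filters present in the query as JSON.
# Do not write empty fields. Available fields: <field list>. Query: "
PREFIX = (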
    "parse_media_query: Sorgudan yalnızca bulunan medya arama filtrelerini JSON olarak çıkar. "
    "Boş alan yazma. Kullanılabilir alanlar: content_tags, genres, exclude_genres, mood_tags, "
    "names, similar_to_titles, country_names, language_names, rating_count, review_count, "
    "total_rating, release_year, exact_match. Sorgu: "
)

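def parse_query(query):
    """Parse a Turkish media search query into a dict of structured filters."""
    # Prepend the task prefix to the raw query.
    input_text = PREFIX + query

    # Tokenize and move the tensors to the same device as the model.
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=256
    ).to(device)

    # Deterministic beam search; gradients are not needed for inference.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            num_beams=4,
            do_sample=False
        )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Fall back to the raw string when the output is not valid JSON.
    try:
        return json.loads(decoded)
    except json.JSONDecodeError:
        return {"raw_output": decoded}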

query = "yerli dram dizileri olsun ama romantik olmasın"
result = parse_query(query)

print(json.dumps(result, ensure_ascii=False, indent=2))

Training Data

The model was fine-tuned on a Turkish OTT/media query parsing dataset.

The dataset contains Turkish search queries and their corresponding structured JSON outputs.

The training examples include:

  • genre-based queries
  • exclusion queries
  • live broadcast queries
  • popularity-based queries
  • rating-based queries
  • country and language filters
  • similar-title queries
  • short Turkish user queries

Evaluation

Preliminary validation results:

Metric            Value
Train Loss        0.3083
Validation Loss   0.0488
Exact Match       61.47%
Valid JSON Rate   99.78%

The Exact Match score should be interpreted carefully. In this project, Exact Match is a strict metric and can be affected by formatting differences such as extra spaces, line breaks, field ordering, or small JSON representation differences. Therefore, a prediction may be semantically correct but still counted as incorrect by Exact Match.

For this reason, the Valid JSON Rate and field-level evaluation are also important when measuring the real performance of the model.
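One way to reduce the sensitivity to whitespace and field ordering is to compare canonical JSON forms rather than raw strings. The sketch below is an illustration only, not the evaluation code used for the numbers above; `canonical` and `normalized_match` are hypothetical helper names. Note that the order of values inside a list still matters in this version.

```python
import json


def canonical(text):
    """Return a canonical JSON string, or None if text is not valid JSON."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Sorted keys and fixed separators remove ordering and spacing noise.
    return json.dumps(obj, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))


def normalized_match(prediction, reference):
    """Exact match on canonical JSON instead of on raw strings."""
    pred = canonical(prediction)
    return pred is not None and pred == canonical(reference)


reference = json.dumps(
    {
        "intent_type": "content_search",
        "filters": {"country_names": ["Türkiye"], "genres": ["komedi"]},
    },
    ensure_ascii=False,
    indent=2,
)
# Same content, different key order and no line breaks.
prediction = (
    '{"filters": {"genres": ["komedi"], "country_names": ["Türkiye"]}, '
    '"intent_type": "content_search"}'
)

print(prediction == reference)
print(normalized_match(prediction, reference))
```

The raw string comparison fails for this pair while the normalized comparison succeeds, which is the kind of gap the strict Exact Match score can hide.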

For production use, the model should also be evaluated with:

  • field-level precision
  • field-level recall
  • field-level F1 score
  • schema validity
  • intent accuracy
  • real user queries
  • downstream search quality
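The field-level metrics in this list could be computed along the following lines. This is a minimal sketch under stated assumptions: each output is flattened into (field, value) pairs, `intent_type` is counted as one more pair, and scores are micro-averaged over all examples. The helper names `to_pairs` and `field_scores` are hypothetical.

```python
import json


def to_pairs(parsed):
    """Flatten a parsed output into a set of (field, value) pairs."""
    pairs = set()
    if "intent_type" in parsed:
        pairs.add(("intent_type",
                   json.dumps(parsed["intent_type"], ensure_ascii=False)))
    for field, value in parsed.get("filters", {}).items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            # Serialize each value so booleans and strings hash uniformly.
            pairs.add((field, json.dumps(v, ensure_ascii=False, sort_keys=True)))
    return pairs


def field_scores(examples):
    """Micro-averaged precision, recall, and F1 over (predicted, gold) dicts."""
    tp = fp = fn = 0
    for predicted, gold in examples:
        p, g = to_pairs(predicted), to_pairs(gold)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {
    "intent_type": "content_search",
    "filters": {
        "country_names": ["Türkiye"],
        "genres": ["dram"],
        "exclude_genres": ["romantik"],
    },
}
# The prediction puts "romantik" under genres instead of exclude_genres.
predicted = {
    "intent_type": "content_search",
    "filters": {"country_names": ["Türkiye"], "genres": ["dram", "romantik"]},
}

scores = field_scores([(predicted, gold)])
print(scores)
```

In this example three of the four predicted pairs are correct and three of the four gold pairs are recovered, so the prediction earns partial credit where Exact Match would give none.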

Limitations

This model is task-specific and should not be used as a general chatbot.

Known limitations:

  • It may produce incorrect filters for ambiguous queries.
  • It may confuse mood tags and genre labels.
  • It may not recognize rare movie, series, actor, or channel names.
  • It may generate valid JSON with semantically incorrect fields.
  • It should be tested with real user queries before production use.
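Some of these failure modes can be caught with a lightweight check before the filters reach the search backend. The sketch below is one possible approach, not part of the model. The allowed field names are taken from the prompt prefix in the usage example; accepting `_min` and `_max` suffixes is an assumption based on `total_rating_min` in Example 3; `find_issues` and `base_field` are hypothetical helpers.

```python
# Field names listed in the prompt prefix of the usage example.
ALLOWED_FIELDS = {
    "content_tags", "genres", "exclude_genres", "mood_tags", "names",
    "similar_to_titles", "country_names", "language_names", "rating_count",
    "review_count", "total_rating", "release_year", "exact_match",
}
# Assumption: range filters append one of these suffixes to a field name.
RANGE_SUFFIXES = ("_min", "_max")


def base_field(name):
    """Strip an assumed range suffix such as _min or _max."""
    for suffix in RANGE_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def find_issues(parsed):
    """Return a list of human-readable problems found in a parsed output."""
    if not isinstance(parsed, dict):
        return ["output is not a JSON object"]
    issues = []
    if not isinstance(parsed.get("intent_type"), str):
        issues.append("missing or non-string intent_type")
    filters = parsed.get("filters", {})
    if not isinstance(filters, dict):
        return issues + ["filters is not an object"]
    for name in filters:
        if base_field(name) not in ALLOWED_FIELDS:
            issues.append(f"unknown field: {name}")
    # A genre that is both requested and excluded is contradictory.
    genres = filters.get("genres", [])
    excluded = filters.get("exclude_genres", [])
    if isinstance(genres, list) and isinstance(excluded, list):
        included = {g for g in genres if isinstance(g, str)}
        for genre in sorted(included & {g for g in excluded if isinstance(g, str)}):
            issues.append(f"genre both included and excluded: {genre}")
    return issues


example_3 = {
    "intent_type": "content_search",
    "filters": {"genres": ["korku"], "total_rating_min": ["P75"]},
}
suspicious = {
    "intent_type": "content_search",
    "filters": {"genres": ["dram"], "exclude_genres": ["dram"], "director": ["x"]},
}

print(find_issues(example_3))
print(find_issues(suspicious))
```

A check like this cannot tell whether a well-formed filter matches what the user meant, so it complements testing on real queries rather than replacing it.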

License

This model is released under the Apache 2.0 license.

Author

Developed by Selenaydmrp for Turkish OTT/media query understanding and structured search experiments.
