A fine-tuned DeBERTa-v3-base model for identifying duplicate question pairs, achieving 97.59% ROC AUC on the Quora Question Pairs dataset.
This model is a fine-tuned version of microsoft/deberta-v3-base on the Quora Question Pairs dataset. It uses a cross-encoder architecture to determine whether two questions are semantically equivalent.
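To make the cross-encoder idea concrete: both questions are packed into a single input sequence, so the model can attend across them, rather than being embedded separately. The sketch below is a toy illustration only. It assumes the usual `[CLS] A [SEP] B [SEP]` pair layout and uses whitespace splitting as a stand-in for the real subword tokenizer; `pack_pair` is a hypothetical helper, not part of this repository or of Transformers.

```python
def pack_pair(question1, question2, max_length=128):
    """Toy sketch of cross-encoder input packing.

    Whitespace tokens stand in for the model's real subword tokens.
    """
    a = question1.split()
    b = question2.split()
    budget = max_length - 3  # room left after one [CLS] and two [SEP] tokens
    # Trim the longer question one token at a time until the pair fits
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]


tokens = pack_pair("How do I learn Python programming?",
                   "What's the best way to learn Python?")
print(" ".join(tokens))
```

In practice the tokenizer call `tokenizer(question1, question2, ...)` handles this packing and truncation for you.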
Key Features:
| Metric | Value |
|---|---|
| ROC AUC | 97.59% |
| Training Loss | 0.116 |
| Validation Loss | 0.214 |
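For readers less familiar with the headline metric: ROC AUC can be read as the probability that a randomly chosen duplicate pair receives a higher score than a randomly chosen non-duplicate pair. The following is a minimal pure-Python sketch of that definition on made-up toy data (the labels and scores are illustrative, not outputs of this model); it is not the evaluation code used for this card.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = duplicate, 0 = not duplicate (illustrative values only)
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc = roc_auc(labels, scores)
print(f"ROC AUC: {auc:.2f}")  # prints "ROC AUC: 0.75"
```

This pairwise form is quadratic in the number of examples, so for a full validation set a library routine such as `sklearn.metrics.roc_auc_score` is the more practical choice.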
Primary Use Cases:
Out-of-Scope Use:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "fatihburakkaragoz/quora-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example usage
question1 = "How do I learn Python programming?"
question2 = "What's the best way to learn Python?"
# Tokenize and predict
inputs = tokenizer(question1, question2,
                   truncation=True, padding=True,
                   max_length=128, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
probability = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"Duplicate probability: {probability:.3f}")
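A note on how the two logits become a probability. The example above applies a softmax over both logits, while the calibrated example later in this card applies `torch.sigmoid` to the class-1 logit alone. These are not the same quantity: for a two-logit head, the softmax probability of class 1 equals the sigmoid of the *difference* between the logits. The sketch below checks this with plain Python; the logit values are hypothetical, and it assumes index 1 is the duplicate class, as the indexing in the examples suggests.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical [not-duplicate, duplicate] logits, for illustration only
logits = [-1.2, 2.3]

p_softmax = softmax(logits)[1]           # what the basic example computes
p_diff = sigmoid(logits[1] - logits[0])  # identical to the softmax value
p_single = sigmoid(logits[1])            # what the calibrated example feeds in

print(f"softmax[1]          = {p_softmax:.4f}")
print(f"sigmoid(l1 - l0)    = {p_diff:.4f}")
print(f"sigmoid(l1) alone   = {p_single:.4f}")
```

The practical consequence is that raw scores from the two examples are not interchangeable, so thresholds tuned on one should not be reused on the other.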
For the most accurate confidence estimates, use the included calibrator:
import joblib
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model, tokenizer, and calibrator
model_name = "fatihburakkaragoz/quora-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Note: Download the calibrator separately from the model repository
calibrator = joblib.load("deberta_cal.pkl")
def predict_duplicate(question1, question2):
    # Get raw prediction
    inputs = tokenizer(question1, question2, truncation=True,
                       padding=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Raw score: sigmoid of the class-1 logit (note that the basic example
    # above uses a softmax over both logits instead)
    raw_prob = torch.sigmoid(logits[0, 1]).item()
    # Apply calibration for better confidence estimates
    calibrated_prob = calibrator.predict_proba([[raw_prob]])[0, 1]
    return calibrated_prob
# Example
prob = predict_duplicate("How to cook pasta?", "What's the best pasta recipe?")
print(f"Calibrated duplicate probability: {prob:.3f}")
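If you have labeled question pairs of your own, one way to check whether calibration actually helps is to compare a calibration metric on the raw and calibrated probabilities. The sketch below computes a simple binned expected calibration error (ECE) for the positive class in pure Python. It is an illustrative assumption about how one might evaluate calibration, not the procedure used to build `deberta_cal.pkl`, and the toy numbers are made up.

```python
def expected_calibration_error(labels, probs, n_bins=10):
    """Binned ECE: weighted mean |observed positive rate - mean predicted prob|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        observed = sum(y for y, _ in bucket) / len(bucket)   # duplicate rate
        predicted = sum(p for _, p in bucket) / len(bucket)  # mean confidence
        ece += (len(bucket) / total) * abs(observed - predicted)
    return ece


# Toy data (illustrative only): four pairs all scored 0.9, three are duplicates
toy_labels = [1, 1, 0, 1]
toy_probs = [0.9, 0.9, 0.9, 0.9]
toy_ece = expected_calibration_error(toy_labels, toy_probs)
print(f"ECE: {toy_ece:.3f}")
```

A lower value for the calibrated probabilities than for the raw ones, on held-out data, would suggest the calibrator is doing useful work for your distribution.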
| Epoch | Training Loss | Validation Loss | ROC AUC |
|---|---|---|---|
| 1 | 0.219 | 0.211 | 0.972 |
| 2 | 0.171 | 0.198 | 0.976 |
| 3 | 0.116 | 0.214 | 0.976 |
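Reading the table above: training loss keeps falling through epoch 3, while validation loss is lowest at epoch 2 and rises again at epoch 3, with ROC AUC unchanged between the two. The short sketch below, using only the numbers from the table, picks the epoch with the lowest validation loss and shows the widening gap between validation and training loss. The card does not state which epoch's checkpoint was published, so this is only a way of inspecting the reported history.

```python
# Values copied from the training table above
history = [
    {"epoch": 1, "train_loss": 0.219, "val_loss": 0.211, "roc_auc": 0.972},
    {"epoch": 2, "train_loss": 0.171, "val_loss": 0.198, "roc_auc": 0.976},
    {"epoch": 3, "train_loss": 0.116, "val_loss": 0.214, "roc_auc": 0.976},
]

# Epoch with the lowest validation loss
best = min(history, key=lambda row: row["val_loss"])

# Validation-minus-training loss per epoch; a growing gap hints at overfitting
gaps = {row["epoch"]: round(row["val_loss"] - row["train_loss"], 3)
        for row in history}

print(f"Lowest validation loss at epoch {best['epoch']}")
for epoch, gap in gaps.items():
    print(f"epoch {epoch}: val - train = {gap:+.3f}")
```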
This model includes a calibration component that improves probability estimates:
deberta_cal.pkl (included in repository)
Limitations:
Bias Considerations:
If you use this model, please cite:
@misc{deberta-v3-quora-question-pairs,
title={DeBERTa-v3 for Quora Question Pairs Duplicate Detection},
author={Fatih Burak Karagöz},
year={2025},
url={https://huggingface.co/fatihburakkaragoz/quora-cross-encoder}
}