# Thread Reranker
A cross-encoder reranker that scores how relevant a conversation thread is to a new user message. Designed for unified conversation architectures where a single chat stream replaces explicit thread management — the model determines which internal thread a message belongs to so the right context can be retrieved automatically.
## How It Works
In a unified conversation system, users interact through a single continuous chat. Behind the scenes, the system maintains multiple internal threads (topics the user has discussed before). When a new message arrives, candidate threads are retrieved using fast heuristics (entity matching, recency, flow continuity), and this reranker scores each candidate to pick the best match.
The model takes two inputs simultaneously: the text pair (user message + thread summary) processed through the encoder, and structured retrieval features computed by the upstream pipeline. It fuses both signals to produce a relevance score.
## Architecture
```
User Message + Thread Summary ──► MiniLM-L6 (frozen + LoRA r=8) ──► CLS token ──┐
                                                                                ├──► MLP Head ──► Score
Step 3 Structured Features ────► Feature Projection (Linear→ReLU→Linear) ──────┘
```
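The fusion step can be sketched in NumPy. All dimensions here except the 384-dim CLS vector and the 5 features are assumptions (projection width, hidden width, and the random stand-in weights are illustrative, not the trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Assumed dimensions: 384-dim CLS embedding (MiniLM-L6-H384), 5 structured
# features projected to 64 dims, 128-dim MLP hidden layer.
D_CLS, D_FEAT, D_PROJ, D_HIDDEN = 384, 5, 64, 128

# Small random weights stand in for the trained parameters.
W_proj1 = rng.normal(scale=0.05, size=(D_FEAT, D_PROJ))
W_proj2 = rng.normal(scale=0.05, size=(D_PROJ, D_PROJ))
W_h = rng.normal(scale=0.05, size=(D_CLS + D_PROJ, D_HIDDEN))
W_out = rng.normal(scale=0.05, size=(D_HIDDEN, 1))

def fuse(cls_vec, features):
    """Feature projection (Linear -> ReLU -> Linear), concat with CLS, MLP head -> logit."""
    proj = relu(features @ W_proj1) @ W_proj2
    fused = np.concatenate([cls_vec, proj], axis=-1)
    return relu(fused @ W_h) @ W_out  # raw relevance logit

cls_vec = rng.normal(size=(1, D_CLS))           # stand-in for the encoder's CLS token
features = np.array([[1.0, 1.0, 1.0, 0.92, 2.0]])
logit = fuse(cls_vec, features)
score = 1 / (1 + np.exp(-logit))                # sigmoid -> relevance score in (0, 1)
print(score.shape)
```

The key design point is that the structured features bypass the encoder entirely and are only mixed in after the text pair has been embedded, so the cheap retrieval signals cannot be drowned out during tokenization.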
Base model: nreimers/MiniLM-L6-H384-uncased (22M parameters, encoder-only)
LoRA configuration: Rank 8, alpha 16, applied to query and value projections, dropout 0.1
Structured features (5 inputs):
- `entity_overlap` — count of thread entities found in the user message
- `keyword_matches` — keyword overlap between message and thread content
- `flow_continuity` — 1.0 if this thread was the most recently active, 0.0 otherwise
- `recency_score` — exponential decay score based on hours since thread was last active
- `hours_since_active` — raw hours since thread was last active
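A minimal sketch of how the upstream pipeline might compute this feature vector. The decay constant (`half_life_hours`) and the exact matching rules are assumptions; the model card does not specify them:

```python
import math

def compute_features(message_tokens, thread_entities, thread_keywords,
                     is_most_recent, hours_since_active, half_life_hours=24.0):
    """Build the 5-element structured feature vector for one candidate thread.

    half_life_hours is an assumed decay constant; the actual value used by
    the pipeline is not documented.
    """
    msg = set(message_tokens)
    entity_overlap = sum(1 for e in thread_entities if e in msg)
    keyword_matches = sum(1 for k in thread_keywords if k in msg)
    flow_continuity = 1.0 if is_most_recent else 0.0
    recency_score = math.exp(-math.log(2) * hours_since_active / half_life_hours)
    return [float(entity_overlap), float(keyword_matches),
            flow_continuity, recency_score, float(hours_since_active)]

features = compute_features(
    message_tokens=["fix", "the", "chart", "rendering", "issue"],
    thread_entities=["chart", "dashboard"],
    thread_keywords=["rendering", "mobile"],
    is_most_recent=True,
    hours_since_active=2.0,
)
print(features)
```

Note that `recency_score` and `hours_since_active` are deliberately redundant: the decayed score is bounded and well-scaled for the network, while the raw hours let the model learn its own cutoff.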
## Intended Use
This model is one component in a 7-step unified conversation pipeline:
1. User sends message — single chat stream, no thread selector
2. Entity & signal extraction — lightweight NER and pattern matching (no ML)
3. Layered context retrieval — database queries using entity match, recency, flow continuity
4. Reranker (this model) — scores candidate threads from Step 3
5. Confidence threshold — auto-select if confident, ask user if ambiguous
6. LLM responds — with the correct thread context injected
7. Update thread store — extract new entities and facts, write back to database
The model only fires when the deterministic heuristics in Step 3 produce multiple plausible candidates. Clear-cut cases (unique entity match + high recency) are resolved without the model.
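The confidence gate (Step 5) might look like the following sketch. The 0.8 threshold and 0.15 margin are illustrative assumptions, not documented values:

```python
def route_message(candidates, scores, high=0.8, margin=0.15):
    """Decide whether to auto-select a thread, ask the user, or start a new one.

    candidates: list of thread ids; scores: reranker sigmoid scores.
    The `high` threshold and `margin` are assumed values for illustration.
    """
    if not candidates:
        return ("new_thread", None)          # cold start: nothing to rank
    ranked = sorted(zip(scores, candidates), reverse=True)
    best_score, best = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if best_score >= high and best_score - runner_up >= margin:
        return ("auto_select", best)         # confident: inject thread context
    return ("ask_user", [t for _, t in ranked[:2]])  # ambiguous: clarify with user

print(route_message(["t1", "t2"], [0.93, 0.41]))  # -> ('auto_select', 't1')
print(route_message(["t1", "t2"], [0.62, 0.58]))  # -> ('ask_user', ['t1', 't2'])
```

Requiring both an absolute threshold and a margin over the runner-up prevents auto-selecting between two near-tied threads, which is exactly the case the model is most likely to get wrong.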
## Performance
Evaluated on synthetic test data with three difficulty tiers:
| Difficulty | Hit Rate @ 1 | Description |
|---|---|---|
| Easy | 100.0% | Message contains explicit entity references ("fix the React bug") |
| Medium | 82.1% | Indirect references ("that bug we were debugging") |
| Hard | 84.1% | No entity signal, relies on recency and flow ("let's keep going") |
| Overall | 90.5% | Weighted across all tiers |
Note: In the hybrid pipeline, easy cases are handled by deterministic heuristics without calling the model. The model's effective contribution is on medium and hard cases, where the combined system achieves 95%+ accuracy when including heuristic pre-filtering.
## Training
Dataset: Algokruti/thread-reranker-data — 50,543 synthetic examples (12,500 positive, 38,043 negative) generated from 500 simulated user profiles across 12 topic types in 5 domains.
Training strategy: Curriculum learning — epochs 1-2 trained on easy examples only, epochs 3-5 on all difficulty tiers. Binary cross-entropy loss with cosine learning rate schedule and warmup.
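The curriculum schedule reduces to a simple per-epoch data-selection rule. A sketch, assuming each example carries a difficulty label matching the dataset's easy/medium/hard tiers:

```python
def curriculum_subset(dataset, epoch, curriculum_epochs=2):
    """Epochs 1-2: easy examples only; epochs 3-5: all difficulty tiers."""
    if epoch <= curriculum_epochs:
        return [ex for ex in dataset if ex["difficulty"] == "easy"]
    return list(dataset)

dataset = [
    {"id": 1, "difficulty": "easy"},
    {"id": 2, "difficulty": "medium"},
    {"id": 3, "difficulty": "hard"},
]
print(len(curriculum_subset(dataset, epoch=1)))  # -> 1
print(len(curriculum_subset(dataset, epoch=4)))  # -> 3
```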
Hyperparameters:
- Batch size: 64
- Learning rate: 2e-4
- Epochs: 5 (2 curriculum + 3 full)
- Max sequence length: 256
- LoRA rank: 8, alpha: 16
- Optimizer: AdamW with weight decay 0.01
- Gradient clipping: max norm 1.0
Training domains covered:
- Web Development (React Dashboard, Authentication, CSS Grid)
- Backend Development (Python API, Docker Deployment)
- Personal (Meal Planning, Job Search, Fitness)
- Data Science (ML Training, Data Pipeline)
- Mobile Development (iOS/Swift, Android/Kotlin)
## Limitations
- Trained on synthetic data only. Performance on real user conversations may differ, particularly for domains and linguistic patterns not represented in the training set.
- Limited domain coverage. 12 topics across 5 domains, heavily skewed toward software development. Non-technical topics (travel, health, education, finance, creative writing) are underrepresented.
- English only. Not tested on multilingual conversations.
- Cold start. With no conversation history, the model has nothing to rank. The system falls back to treating each message as a new thread.
- Ambiguity resolution. On genuinely ambiguous messages with no entity, recency, or flow signal, the model may select incorrectly. The confidence threshold mechanism is designed to catch these cases and ask the user instead.
## How to Use
### PyTorch Inference

```python
import torch
from transformers import AutoTokenizer

# Load the tokenizer for the base encoder
tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")

# Load the full ThreadReranker (see the training notebook for the class
# definition; the checkpoint already contains base + merged LoRA + head,
# so no separate peft loading step is needed)
model = ThreadReranker()
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()

# Score a message against a candidate thread
message = "can you fix that chart rendering issue"
thread_text = "Building a metrics dashboard with Chart.js | the bar chart overflows on mobile | React, Chart.js"

encoding = tokenizer(message, thread_text, max_length=256,
                     padding="max_length", truncation=True, return_tensors="pt")
features = torch.tensor([[1.0, 1.0, 1.0, 0.92, 2.0]])  # Step 3 structured features

with torch.no_grad():
    score = torch.sigmoid(model(encoding["input_ids"], encoding["attention_mask"], features))
print(f"Relevance score: {score.item():.4f}")
```
### ONNX Inference (On-Device)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("thread_reranker.onnx")

# Prepare inputs (tokenized text + structured features) as numpy arrays
result = session.run(None, {
    "input_ids": input_ids_np,
    "attention_mask": attention_mask_np,
    "structured_features": features_np,
})
score = 1 / (1 + np.exp(-result[0]))  # sigmoid over the raw logit
```
## Files

| File | Description |
|---|---|
| `model.pt` | PyTorch model weights (base + LoRA merged + classification head) |
| `thread_reranker.onnx` | ONNX export for on-device inference |
| `config.json` | Model configuration and feature definitions |
| `training_history.json` | Per-epoch training and validation metrics |
| `tokenizer.json` | Tokenizer files |
## Citation

If you use this model, please cite:

```bibtex
@misc{thread-reranker-2026,
  title={Thread Reranker: Cross-Encoder for Unified Conversation Thread Matching},
  author={Algokruti},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Algokruti/thread-reranker}
}
```