# Thread Reranker
A cross-encoder reranker that scores how relevant a conversation thread is to a new user message. Designed for unified conversation architectures where a single chat stream replaces explicit thread management — the model determines which internal thread a message belongs to so the right context can be retrieved automatically.
## How It Works
In a unified conversation system, users interact through a single continuous chat. Behind the scenes, the system maintains multiple internal threads (topics the user has discussed before). When a new message arrives, candidate threads are retrieved using fast heuristics (entity matching, recency, flow continuity), and this reranker scores each candidate to pick the best match.
The model takes two inputs simultaneously: the text pair (user message + thread summary) processed through the encoder, and structured retrieval features computed by the upstream pipeline. It fuses both signals to produce a relevance score.
## Architecture
```
User Message + Thread Summary ──► MiniLM-L6 (frozen + LoRA r=8) ──► CLS token ──┐
                                                                                ├──► MLP Head ──► Score
Step 3 Structured Features ────► Feature Projection (Linear→ReLU→Linear) ──────┘
```
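The fusion step can be sketched in NumPy. All dimensions here except the 384-dim CLS vector and the 5 features are assumptions (projection width, hidden width, and the random stand-in weights are illustrative, not the trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Assumed dimensions: 384-dim CLS embedding (MiniLM-L6-H384), 5 structured
# features projected to 64 dims, 128-dim MLP hidden layer.
D_CLS, D_FEAT, D_PROJ, D_HIDDEN = 384, 5, 64, 128

# Small random weights stand in for the trained parameters.
W_proj1 = rng.normal(scale=0.05, size=(D_FEAT, D_PROJ))
W_proj2 = rng.normal(scale=0.05, size=(D_PROJ, D_PROJ))
W_h = rng.normal(scale=0.05, size=(D_CLS + D_PROJ, D_HIDDEN))
W_out = rng.normal(scale=0.05, size=(D_HIDDEN, 1))

def fuse(cls_vec, features):
    """Feature projection (Linear -> ReLU -> Linear), concat with CLS, MLP head -> logit."""
    proj = relu(features @ W_proj1) @ W_proj2
    fused = np.concatenate([cls_vec, proj], axis=-1)
    return relu(fused @ W_h) @ W_out  # raw relevance logit

cls_vec = rng.normal(size=(1, D_CLS))           # stand-in for the encoder's CLS token
features = np.array([[1.0, 1.0, 1.0, 0.92, 2.0]])
logit = fuse(cls_vec, features)
score = 1 / (1 + np.exp(-logit))                # sigmoid -> relevance score in (0, 1)
print(score.shape)
```

The key design point is that the structured features bypass the encoder entirely and are only mixed in after the text pair has been embedded, so the cheap retrieval signals cannot be drowned out during tokenization.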
Base model: nreimers/MiniLM-L6-H384-uncased (22M parameters, encoder-only)
LoRA configuration: Rank 8, alpha 16, applied to query and value projections, dropout 0.1
Structured features (5 inputs):
- `entity_overlap` — count of thread entities found in the user message
- `keyword_matches` — keyword overlap between message and thread content
- `flow_continuity` — 1.0 if this thread was the most recently active, 0.0 otherwise
- `recency_score` — exponential decay score based on hours since thread was last active
- `hours_since_active` — raw hours since thread was last active
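A minimal sketch of how the upstream pipeline might compute this feature vector. The decay constant (`half_life_hours`) and the exact matching rules are assumptions; the model card does not specify them:

```python
import math

def compute_features(message_tokens, thread_entities, thread_keywords,
                     is_most_recent, hours_since_active, half_life_hours=24.0):
    """Build the 5-element structured feature vector for one candidate thread.

    half_life_hours is an assumed decay constant; the actual value used by
    the pipeline is not documented.
    """
    msg = set(message_tokens)
    entity_overlap = sum(1 for e in thread_entities if e in msg)
    keyword_matches = sum(1 for k in thread_keywords if k in msg)
    flow_continuity = 1.0 if is_most_recent else 0.0
    recency_score = math.exp(-math.log(2) * hours_since_active / half_life_hours)
    return [float(entity_overlap), float(keyword_matches),
            flow_continuity, recency_score, float(hours_since_active)]

features = compute_features(
    message_tokens=["fix", "the", "chart", "rendering", "issue"],
    thread_entities=["chart", "dashboard"],
    thread_keywords=["rendering", "mobile"],
    is_most_recent=True,
    hours_since_active=2.0,
)
print(features)
```

Note that `recency_score` and `hours_since_active` are deliberately redundant: the decayed score is bounded and well-scaled for the network, while the raw hours let the model learn its own cutoff.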
## Intended Use
This model is one component in a 7-step unified conversation pipeline:
1. User sends message — single chat stream, no thread selector
2. Entity & signal extraction — lightweight NER and pattern matching (no ML)
3. Layered context retrieval — database queries using entity match, recency, flow continuity
4. Reranker (this model) — scores candidate threads from Step 3
5. Confidence threshold — auto-select if confident, ask user if ambiguous
6. LLM responds — with the correct thread context injected
7. Update thread store — extract new entities and facts, write back to database
The model only fires when the deterministic heuristics in Step 3 produce multiple plausible candidates. Clear-cut cases (unique entity match + high recency) are resolved without the model.
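The confidence gate (Step 5) might look like the following sketch. The 0.8 threshold and 0.15 margin are illustrative assumptions, not documented values:

```python
def route_message(candidates, scores, high=0.8, margin=0.15):
    """Decide whether to auto-select a thread, ask the user, or start a new one.

    candidates: list of thread ids; scores: reranker sigmoid scores.
    The `high` threshold and `margin` are assumed values for illustration.
    """
    if not candidates:
        return ("new_thread", None)          # cold start: nothing to rank
    ranked = sorted(zip(scores, candidates), reverse=True)
    best_score, best = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if best_score >= high and best_score - runner_up >= margin:
        return ("auto_select", best)         # confident: inject thread context
    return ("ask_user", [t for _, t in ranked[:2]])  # ambiguous: clarify with user

print(route_message(["t1", "t2"], [0.93, 0.41]))  # -> ('auto_select', 't1')
print(route_message(["t1", "t2"], [0.62, 0.58]))  # -> ('ask_user', ['t1', 't2'])
```

Requiring both an absolute threshold and a margin over the runner-up prevents auto-selecting between two near-tied threads, which is exactly the case the model is most likely to get wrong.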
## Performance
Evaluated on synthetic test data with three difficulty tiers:
| Difficulty | Hit Rate @ 1 | Description |
|---|---|---|
| Easy | 100.0% | Message contains explicit entity references ("fix the React bug") |
| Medium | 82.1% | Indirect references ("that bug we were debugging") |
| Hard | 84.1% | No entity signal, relies on recency and flow ("let's keep going") |
| Overall | 90.5% | Weighted across all tiers |
Note: In the hybrid pipeline, easy cases are handled by deterministic heuristics without calling the model. The model's effective contribution is on medium and hard cases, where the combined system achieves 95%+ accuracy when including heuristic pre-filtering.
## Training
Dataset: Algokruti/thread-reranker-data — 50,543 synthetic examples (12,500 positive, 38,043 negative) generated from 500 simulated user profiles across 12 topic types in 5 domains.
Training strategy: Curriculum learning — epochs 1-2 trained on easy examples only, epochs 3-5 on all difficulty tiers. Binary cross-entropy loss with cosine learning rate schedule and warmup.
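The curriculum schedule reduces to a simple per-epoch data-selection rule. A sketch, assuming each example carries a difficulty label matching the dataset's easy/medium/hard tiers:

```python
def curriculum_subset(dataset, epoch, curriculum_epochs=2):
    """Epochs 1-2: easy examples only; epochs 3-5: all difficulty tiers."""
    if epoch <= curriculum_epochs:
        return [ex for ex in dataset if ex["difficulty"] == "easy"]
    return list(dataset)

dataset = [
    {"id": 1, "difficulty": "easy"},
    {"id": 2, "difficulty": "medium"},
    {"id": 3, "difficulty": "hard"},
]
print(len(curriculum_subset(dataset, epoch=1)))  # -> 1
print(len(curriculum_subset(dataset, epoch=4)))  # -> 3
```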
Hyperparameters:
- Batch size: 64
- Learning rate: 2e-4
- Epochs: 5 (2 curriculum + 3 full)
- Max sequence length: 256
- LoRA rank: 8, alpha: 16
- Optimizer: AdamW with weight decay 0.01
- Gradient clipping: max norm 1.0
Training domains covered:
- Web Development (React Dashboard, Authentication, CSS Grid)
- Backend Development (Python API, Docker Deployment)
- Personal (Meal Planning, Job Search, Fitness)
- Data Science (ML Training, Data Pipeline)
- Mobile Development (iOS/Swift, Android/Kotlin)
## Limitations
- Trained on synthetic data only. Performance on real user conversations may differ, particularly for domains and linguistic patterns not represented in the training set.
- Limited domain coverage. 12 topics across 5 domains, heavily skewed toward software development. Non-technical topics (travel, health, education, finance, creative writing) are underrepresented.
- English only. Not tested on multilingual conversations.
- Cold start. With no conversation history, the model has nothing to rank. The system falls back to treating each message as a new thread.
- Ambiguity resolution. On genuinely ambiguous messages with no entity, recency, or flow signal, the model may select incorrectly. The confidence threshold mechanism is designed to catch these cases and ask the user instead.
## How to Use
### PyTorch Inference

```python
import torch
from transformers import AutoTokenizer

# Load the tokenizer for the base encoder
tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")

# Load the full ThreadReranker (see the training notebook for the class
# definition; the checkpoint already contains base + merged LoRA + head,
# so no separate peft loading step is needed)
model = ThreadReranker()
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()

# Score a message against a candidate thread
message = "can you fix that chart rendering issue"
thread_text = "Building a metrics dashboard with Chart.js | the bar chart overflows on mobile | React, Chart.js"

encoding = tokenizer(message, thread_text, max_length=256,
                     padding="max_length", truncation=True, return_tensors="pt")
features = torch.tensor([[1.0, 1.0, 1.0, 0.92, 2.0]])  # Step 3 structured features

with torch.no_grad():
    score = torch.sigmoid(model(encoding["input_ids"], encoding["attention_mask"], features))
print(f"Relevance score: {score.item():.4f}")
```
### ONNX Inference (On-Device)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("thread_reranker.onnx")

# Prepare inputs (tokenized text + structured features) as numpy arrays
result = session.run(None, {
    "input_ids": input_ids_np,
    "attention_mask": attention_mask_np,
    "structured_features": features_np,
})
score = 1 / (1 + np.exp(-result[0]))  # sigmoid over the raw logit
```
## Files

| File | Description |
|---|---|
| `model.pt` | PyTorch model weights (base + LoRA merged + classification head) |
| `thread_reranker.onnx` | ONNX export for on-device inference |
| `config.json` | Model configuration and feature definitions |
| `training_history.json` | Per-epoch training and validation metrics |
| `tokenizer.json` | Tokenizer files |
## Citation

If you use this model, please cite:

```bibtex
@misc{thread-reranker-2026,
  title={Thread Reranker: Cross-Encoder for Unified Conversation Thread Matching},
  author={Algokruti},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Algokruti/thread-reranker}
}
```