Abstract
A two-stage prompt routing architecture uses graph-based clustering and mixture-of-experts learning to efficiently select optimal language models for queries, achieving superior performance at reduced computational cost.
Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, and monolithic routers struggle to discriminate among models whose capabilities differ only subtly across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. The first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to the discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads to produce specialized quality estimates. At inference time, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
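The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: k-means stands in for the graph-based clustering of stage one, ridge regression stands in for the mixture-of-experts quality heads of stage two, and the blending weight `alpha`, the function names, and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: discover latent task types by clustering prompt embeddings.
# (The paper uses graph-based clustering; plain k-means is a stand-in.)
def discover_tasks(embeddings, n_tasks, n_iter=20):
    centers = embeddings[rng.choice(len(embeddings), n_tasks, replace=False)].copy()
    for _ in range(n_iter):
        # assign each prompt to its nearest task center
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_tasks):
            if (labels == k).any():
                centers[k] = embeddings[labels == k].mean(axis=0)
    return centers, labels

# --- Stage 2: one quality head per discovered task, predicting a quality
# score per candidate model. (Ridge regression stands in for the MoE heads.)
def fit_heads(embeddings, labels, quality, n_tasks, lam=1e-2):
    d = embeddings.shape[1]
    heads = {}
    for k in range(n_tasks):
        X, y = embeddings[labels == k], quality[labels == k]
        A = X.T @ X + lam * np.eye(d)
        heads[k] = np.linalg.solve(A, X.T @ y)   # (d, n_models) weights
    return heads

# --- Inference: blend the task-level mean quality (stable) with the
# prompt-specific head prediction (adaptive), then pick the best model.
def route(x, centers, heads, task_means, alpha=0.5):
    k = int(np.linalg.norm(centers - x, axis=1).argmin())
    scores = alpha * task_means[k] + (1 - alpha) * (x @ heads[k])
    return int(scores.argmax())
```

A usage pass would cluster historical prompt embeddings, fit one head per task against observed per-model quality scores, and then call `route` on each incoming prompt; `alpha` trades off the stable task-level prior against the prompt-specific estimate, mirroring the aggregation step in the abstract.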
Community
Code and models will be released soon.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents (2026)
- Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey (2026)
- Effective LoRA Adapter Routing using Task Representations (2026)
- Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems (2026)
- Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation (2026)
- CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging (2026)
- MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing (2026)