Detection Heads

Detection head architectures on frozen EUPE-ViT-B features. The backbone is frozen throughout; only the head trains. The premise behind the repository is that a strong modern ViT encoder already carries enough semantic and spatial structure to support detection with a lightweight decoder on top, and that the detection head's architecture — not the backbone — is where the interesting design trade-offs live. Every result here is obtained from features extracted once and cached; training iterates over those cached tensors, which makes it possible to run architecture studies in minutes rather than the days required for end-to-end backbone training.

Contents

The reference detection head is a standard FCOS detector wrapped around a ViTDet-style simple feature pyramid: 16.14M parameters, 41.1 mAP on COCO val2017. It is the detection head used in phanerozoic/argus, a multi-task perception model that attaches classification, segmentation, depth, correspondence, and detection heads to a single frozen EUPE-ViT-B backbone. Anyone building a similar multi-task system who wants a clean detection baseline, or anyone who wants to see how far a standard FCOS head gets on frozen ViT features, can take it off the shelf.

Alongside the reference head sits a library of seventeen experimental architectural scaffolds under heads/. Each is a single head.py file implementing a forward pass with no trained checkpoint — hypothesis-stage designs that formulate detection in unconventional ways: wavelet decomposition in place of FPN, optimal-transport label assignment, compositional patch assembly, tropical-semiring operations, corner-pair relational reasoning, and others. A driver script (arena.py) instantiates and trains any of them against the cached backbone features, so trying a new formulation is a one-command operation. The scaffolds are recorded so alternative formulations are not lost, and so readers looking for detection ideas have somewhere to browse.

Finally, a scaling curve of parameter-efficient detection heads is included from a parallel research program described in the next section. The current picker reaches 41.27 mAP on COCO val2017 at 2.98 million learnable parameters, exceeding the FCOS baseline (41.10) at 18.4 percent of its head parameter budget. The head replaces FCOS's learned feature pyramid network with a cofiber decomposition of the backbone patch features — a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map to produce frequency-separated scale bands. Four such bands are computed (at spatial strides of 16, 32, 64, and 128 pixels) and augmented with a finer-resolution level (stride 8) synthesized by a single transposed convolution applied to the stride-16 band; these five levels match the scale coverage of the FCOS simple feature pyramid. Each level is processed by separate classification and regression convolutional towers — five standard 3×3 convolutions followed by four depthwise residual blocks at 192 hidden channels, with shared weights across levels — and top-down lateral connections pass information from coarser bands into finer ones before the towers run. The head is included in this repository as a direct architectural comparison against the FCOS baseline on the same frozen backbone. The full parameter-versus-mAP scaling curve for the cofiber line of work, from 105-parameter minimal circuits up through variants at four million parameters, is hosted in the separate repository linked below.

Reference Results (COCO val2017)

| Head | Parameters | mAP | mAP@0.5 | mAP@0.75 |
|---|---|---|---|---|
| Baseline FCOS (simple feature pyramid) | 16,138,074 | 41.1 | 64.9 | 43.4 |
| Split-tower picker (160h, 5 std + 4 dw, 16 ep, step 100,000 live) | 2,975,067 | 41.3 | 64.0 | 44.0 |
| Split-tower 160h 5+4 + full stack, 8 ep EMA (prior picker) | 2,975,067 | 41.0 | 63.9 | 43.2 |
| Split-tower 160h 5+4 + ATSS + flip aug, 8 ep | 2,934,107 | 40.1 | 63.0 | 42.0 |
| Split-tower 192h 5+4 (legacy picker, 640px) | 4,164,699 | 37.3 | 62.1 | 39.7 |
| Split-tower 160h 5+4 plain (prior picker, FCOS assignment, no aug) | 2,934,107 | 36.9 | 61.6 | 38.6 |
| Split-tower deep-medium (128h, 5 std + 4 dw), Pareto-best ≤2.0M | 1,918,555 | 36.2 | 61.0 | 38.0 |
| Split-tower deep-medium 128h 5+4, picker recipe control | 1,918,555 | 36.1 | 60.7 | 37.7 |
| Split-tower medium (128h, 5 std + 1 dw), Pareto-best ≤1.9M | 1,810,267 | 35.3 | 59.8 | 36.3 |
| Split-tower slim (96h, 5 std + 1 dw), Pareto-best ≤1.1M | 1,055,259 | 33.9 | 58.4 | 34.8 |

All measurements come from the same evaluation pipeline: per-class soft NMS (linear decay, IoU threshold 0.5), score threshold 0.05, top-100 detections per image, pycocotools on COCO val2017 (5,000 images). The current picker reaches 100.4% of FCOS's mAP at 18.4% of its head parameter budget, a 5.4× improvement in mAP per million parameters (13.88 vs 2.55).

The FCOS baseline uses a simple feature pyramid that synthesizes five scale levels — P3 at stride 8 through P7 at stride 128, each with 256 channels and GroupNorm — from the backbone's stride-16 spatial features, followed by two shared four-layer convolutional towers (one for classification, one for box regression plus centerness) and three 1×1 prediction heads. It trains in eight epochs at 640-pixel input.
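The five-level construction can be sketched as follows. This is a minimal illustration of the ViTDet-style pattern described above, not the repository's exact code: the choice of a transposed convolution for the stride-8 level and max-pooling for the coarser levels is an assumption.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Synthesize P3 (stride 8) through P7 (stride 128), each at 256
    channels with GroupNorm, from a single stride-16 backbone map."""
    def __init__(self, in_ch: int = 768, out_ch: int = 256):
        super().__init__()
        self.scale_ops = nn.ModuleList([
            nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2),  # stride 16 -> 8
            nn.Identity(),                                  # stride 16
            nn.MaxPool2d(2),                                # stride 32
            nn.MaxPool2d(4),                                # stride 64
            nn.MaxPool2d(8),                                # stride 128
        ])
        self.out_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1),
                nn.GroupNorm(32, out_ch),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.GroupNorm(32, out_ch),
            )
            for _ in range(5)
        ])

    def forward(self, x):
        # All five levels derive from the same stride-16 input.
        return [conv(op(x)) for op, conv in zip(self.scale_ops, self.out_convs)]

fp = SimpleFeaturePyramid()
levels = fp(torch.randn(1, 768, 40, 40))  # 40x40 = stride-16 grid at 640px
```

At 640-pixel input the five outputs have spatial sizes 80, 40, 20, 10, and 5, matching strides 8 through 128.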

The split-tower head differs from the FCOS baseline in two architectural respects: the multi-scale structure comes from a zero-parameter cofiber decomposition of the backbone features rather than from a learned FPN, and the classification and regression towers use separate weights throughout rather than sharing a tower. The learned classification weights are replaced with a linear projection into CLIP text-embedding space; classification logits are the cosine similarity of the projected features against frozen CLIP class-name embeddings. The current picker uses CLIP ViT-L/14 embeddings of shape (80, 768) averaged across 8 prompt variants as class prototypes, with the projection Linear(hidden, 768) without bias and a learned scalar temperature plus per-class bias.
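The prototype-classification step described above can be sketched as a small module. The names here are illustrative, and the random `text_embeds` stands in for the real (80, 768) prompt-averaged CLIP ViT-L/14 class embeddings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    """Cosine-similarity classification against frozen text prototypes:
    a bias-free linear projection into embedding space, a learned scalar
    temperature, and a per-class bias."""
    def __init__(self, hidden: int, text_embeds: torch.Tensor):
        super().__init__()
        n_cls, embed_dim = text_embeds.shape
        self.project = nn.Linear(hidden, embed_dim, bias=False)
        self.register_buffer("prototypes", F.normalize(text_embeds, dim=1))
        self.log_temp = nn.Parameter(torch.zeros(()))   # learned temperature
        self.class_bias = nn.Parameter(torch.zeros(n_cls))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, hidden) -> logits: (N, n_cls)
        z = F.normalize(self.project(feats), dim=1)
        cos = z @ self.prototypes.t()                   # cosine similarity
        return cos * self.log_temp.exp() + self.class_bias

clf = PrototypeClassifier(160, torch.randn(80, 768))
logits = clf(torch.randn(4, 160))
```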

Evaluation is at 640-pixel input with letterbox padding. Aggregate mAP is at parity between 640- and 768-pixel inputs across this architecture family.

The 2.98M-parameter current picker is the step-100,000 checkpoint of a 16-epoch (117,264-step) training run on the architecture and training stack of the prior 8-epoch picker, with the schedule extended from 8 to 16 epochs. Live soft mAP on COCO val2017 is 41.27, exceeding the 16.14M FCOS baseline (41.10) at 18.4% of its head-parameter budget. Reported weights are the unmodified SGD state at step 100,000, with no exponential moving average applied and no test-time augmentation. Checkpoint selection followed the convention used for the original picker: late-training checkpoints were evaluated at 4,000-step intervals from step 80,000 through the final step, and the step with the highest live soft mAP on COCO val2017 was retained. Saved as split_tower_5scale_160h_5std_4dw_ema_l14_16ep_step100000.pth under heads/cofiber_threshold/split_tower_5scale_160h_5std_4dw_ema_l14_16ep/.

The 2.98M-parameter prior picker adds three changes on top of the plain 160h 5+4 and trains for 8 epochs, lifting soft mAP from 36.88 to 40.96. First, the projection cls_project is initialized from the singular value decomposition of the frozen text-embedding matrix: the first 80 columns are the top-80 right singular vectors of the text embeddings, and the remaining columns are a random orthogonal complement normalized to unit column norm. Second, the CLIP ViT-B/32 single-prompt text embeddings of shape (80, 512) are replaced with CLIP ViT-L/14 embeddings of shape (80, 768), averaged across 8 prompt variants; this raises cls_project from Linear(160, 512) to Linear(160, 768) and adds approximately 41,000 parameters. Third, an exponential moving average of the trainable parameters with decay 0.9998 is maintained during training and used at evaluation. Non-trainable buffers are merged from the live head at save time so the EMA file is a standalone loadable state dict. Saved as split_tower_5scale_160h_5std_4dw_ema_l14_8ep.pth under heads/cofiber_threshold/split_tower_5scale_160h_5std_4dw_ema_l14/.
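The SVD initialization of the projection can be sketched as below. Where the description is ambiguous (whether the complement columns are mutually orthogonalized), this sketch only orthogonalizes them against the singular-vector span, which is an assumption:

```python
import torch

def svd_init_projection(text_embeds: torch.Tensor, hidden: int) -> torch.Tensor:
    """Build an (embed_dim, hidden) weight: the first n_cls columns are
    the top right singular vectors of the text-embedding matrix, the
    rest a random complement with unit column norm."""
    n_cls, dim = text_embeds.shape
    # E = U S V^T; columns of V are the right singular vectors in R^dim.
    _, _, vh = torch.linalg.svd(text_embeds, full_matrices=False)
    v = vh.t()                                  # (dim, n_cls), orthonormal
    extra = torch.randn(dim, hidden - n_cls)
    extra = extra - v @ (v.t() @ extra)         # remove the span of V
    extra = extra / extra.norm(dim=0, keepdim=True)
    return torch.cat([v, extra], dim=1)         # (dim, hidden)

w = svd_init_projection(torch.randn(80, 768), hidden=160)
```

For a `Linear(160, 768)` module, whose weight is stored as (768, 160), `w` could be copied in directly as the initial weight.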

The 2.93M-parameter variant at width 160, 5 ConvGN + 4 DWRes depth, with ATSS assignment and horizontal-flip augmentation reaches 40.06 soft mAP over 8 epochs of training. Two changes distinguish it from the plain 160h 5+4 variant. First, horizontal-flip augmentation is applied to the input features and to per-location targets across all five pyramid levels, with the target permutation offset per level so each pyramid level's targets flip within its own index range. Second, the FCOS assignment rule (center-sampling plus size-range plus minimum-area) is replaced with ATSS (Zhang et al. 2020): for each ground-truth box, candidates are the top-9 locations per level by L2 distance to the GT center, and positives are those whose stride-scaled pseudo-IoU exceeds the per-GT mean + std of candidate pseudo-IoUs. ATSS depends only on GT boxes and precomputes into the shards under _atss-suffixed keys. Trained with the picker recipe (batch 16, lr 5e-4, 8 epochs, AdamW, cosine schedule with 3% warmup). Saved as split_tower_5scale_160h_5std_4dw_atss_aug_8ep.pth under heads/cofiber_threshold/split_tower_5scale_160h_5std_4dw_atss_aug/.
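The ATSS selection rule can be sketched for a single ground-truth box as follows. The pseudo-anchor of size 8× the stride at each location follows the usual anchor-free ATSS convention; the repository's exact pseudo-IoU is an assumption here:

```python
import numpy as np

def atss_assign(points, strides, gt_box, topk=9):
    """ATSS positives for one GT box (sketch).
    points: (N, 2) location centers; strides: (N,); gt_box: xyxy."""
    cx, cy = (gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2
    dists = np.hypot(points[:, 0] - cx, points[:, 1] - cy)

    # Candidates: top-k locations per stride level by distance to center.
    cand = []
    for s in np.unique(strides):
        idx = np.where(strides == s)[0]
        cand.extend(idx[np.argsort(dists[idx])[:topk]])
    cand = np.array(cand)

    # Pseudo-IoU of each candidate's stride-scaled anchor with the GT.
    half = 4.0 * strides[cand]                    # anchor side = 8 * stride
    ax1, ay1 = points[cand, 0] - half, points[cand, 1] - half
    ax2, ay2 = points[cand, 0] + half, points[cand, 1] + half
    ix = np.clip(np.minimum(ax2, gt_box[2]) - np.maximum(ax1, gt_box[0]), 0, None)
    iy = np.clip(np.minimum(ay2, gt_box[3]) - np.maximum(ay1, gt_box[1]), 0, None)
    inter = ix * iy
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    iou = inter / (area_a + area_g - inter)

    thresh = iou.mean() + iou.std()               # per-GT adaptive threshold
    inside = ((points[cand, 0] > gt_box[0]) & (points[cand, 0] < gt_box[2]) &
              (points[cand, 1] > gt_box[1]) & (points[cand, 1] < gt_box[3]))
    return cand[(iou >= thresh) & inside]

xs = np.arange(8.0, 640.0, 16.0)                  # stride-16 grid at 640px
pts = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
pos = atss_assign(pts, np.full(len(pts), 16.0),
                  np.array([100.0, 100.0, 200.0, 200.0]))
```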

The 2.93M-parameter plain 160h 5+4 variant (FCOS assignment, no augmentation, picker recipe) reaches 36.88 soft mAP. The 128→160→192 width sweep at constant depth under FCOS assignment shows diminishing returns on width: 128→160 gains 0.75 mAP for +1.01M parameters (0.74 mAP per extra Mparam); 160→192 gains 0.42 mAP for +1.23M parameters (0.34 mAP per extra Mparam). Saved as split_tower_5scale_160h_5std_4dw_8ep.pth under heads/cofiber_threshold/split_tower_5scale_160h_5std_4dw/.

The 1.06M-parameter slim variant is the Pareto-best head at ≤1.1M parameters on this backbone. Its architecture was derived from per-layer linear probing of the 4.16M picker: passing cached stride-16 positive features through the picker's stem + scale_norms[1] + lateral path into each cls_tower and reg_tower layer, then fitting a linear ridge probe at each depth against one-hot class labels and LTRB distance targets. Both towers' probe curves plateaued at the 5th ConvGN block; the subsequent 4 DWRes blocks contributed <1% classification accuracy and <0.01 regression R² beyond layer 5. Effective width at the plateau layer, measured as the rank at 95% activation variance, was 74 channels for the cls tower and 49 for the reg tower. The slim head (width 96, 5 ConvGN + 1 DWRes per tower) sits a small safety margin above those probed floors, trains for 8 epochs at batch 32, lr 1e-3, AdamW with cosine schedule, and reaches 91.2% of the picker's mAP at 25.3% of its parameters — 3.5× the mAP-per-million-parameters of the picker (32.1 vs 8.97). Saved as split_tower_5scale_96h_5std_1dw_8ep.pth under heads/cofiber_threshold/split_tower_5scale_slim/.
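The per-depth probing idea can be sketched with a classification ridge probe on synthetic data. This illustrates the measurement, not the repository's probing script; in the real study the features are cached tower activations at each depth:

```python
import numpy as np

def ridge_probe_accuracy(feats, labels, lam=1e-2):
    """Fit a closed-form ridge regressor from activations to one-hot
    labels and report argmax accuracy. Running this at each tower depth
    locates where linear readability plateaus."""
    n_cls = labels.max() + 1
    y = np.eye(n_cls)[labels]                                   # one-hot
    x = np.concatenate([feats, np.ones((len(feats), 1))], axis=1)  # + bias
    w = np.linalg.solve(x.T @ x + lam * np.eye(x.shape[1]), x.T @ y)
    return ((x @ w).argmax(axis=1) == labels).mean()

# Two well-separated synthetic "classes" probe to near-perfect accuracy.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0.0, 0.1, (50, 8)),
                        rng.normal(3.0, 0.1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)
acc = ridge_probe_accuracy(feats, labels)
```

The regression probe is analogous, with LTRB distance targets in place of one-hot labels and R² in place of accuracy.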

The 1.92M-parameter deep-medium variant (width 128 at the picker's 5 ConvGN + 4 DWRes depth) sits just above medium on the scaling curve and closes 69% of the slim→picker gap at 46% of the picker's parameter budget. Adding the 3 extra DWRes blocks to the medium architecture cost only 108K parameters (DWRes is ~9× cheaper per layer than a standard Conv+GN block because the 3×3 kernel is depthwise) and gained +0.9 mAP, confirming that the picker's extra depth via DWRes is load-bearing even after the cls/reg towers have "saturated" in the linear-probe sense. The probe's apparent depth saturation at layer 5 is a ceiling on linear readability, not on non-linear refinement that the residual blocks add before reaching the output. The remaining 1.1 mAP gap between deep-medium and the picker is width-specific (128→192 at the same depth). Saved as split_tower_5scale_128h_5std_4dw_8ep.pth under heads/cofiber_threshold/split_tower_5scale_128h_5std_4dw/.

A control variant at the same 128h 5+4 architecture trained with the picker recipe (batch 16, lr 5e-4, 8 epochs) in place of the fast recipe (batch 32, lr 1e-3, 8 epochs) reaches 36.13 soft mAP, within 0.1 mAP of the fast-recipe variant at 36.23. The picker-to-deep-medium gap at constant depth is therefore attributable to width rather than to training schedule. Saved as split_tower_5scale_128h_5std_4dw_b16_8ep.pth in the same directory.

The 1.81M-parameter medium variant (width 128 at the same 5 ConvGN + 1 DWRes depth as slim) sits between slim and the picker on the scaling curve. It tests whether width beyond the probe's rank-95% floor of 74 channels is load-bearing through non-linear channel-mixing in the residual blocks, which the linear probe does not capture. 128h recovers 1.4 mAP over slim (35.3 vs 33.9 soft NMS), closing 41% of the slim-to-picker gap at 43% of the picker's parameter budget. Training recipe is identical to slim (batch 32, 8 epochs, lr 1e-3). Best step was 24000 of 29312 via late-training sweep. Saved as split_tower_5scale_128h_5std_1dw_8ep.pth under heads/cofiber_threshold/split_tower_5scale_128h_5std_1dw/.

Eval protocol: per-class hard NMS (IoU 0.5), score threshold 0.05, top-100 per image, pycocotools COCO val2017. Canonical script: eval_coco_map.py. A --soft-nms flag runs an additional linear-decay soft-NMS pass for comparison.
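The linear-decay soft-NMS pass used for the reported numbers can be sketched as follows; this is a reference implementation of the standard algorithm, run per class in practice, not the repository's exact code:

```python
import numpy as np

def soft_nms_linear(boxes, scores, iou_thresh=0.5, score_thresh=0.05):
    """Linear-decay soft NMS: instead of suppressing an overlapping box,
    scale its score by (1 - IoU). boxes: (N, 4) xyxy.
    Returns kept indices in selection order and the decayed scores."""
    boxes = boxes.astype(np.float64)
    scores = scores.astype(np.float64).copy()
    idxs = list(range(len(boxes)))
    keep = []
    while idxs:
        i = max(idxs, key=lambda j: scores[j])
        idxs.remove(i)
        if scores[i] < score_thresh:      # everything left is below threshold
            break
        keep.append(i)
        for j in idxs:
            x1 = max(boxes[i, 0], boxes[j, 0]); y1 = max(boxes[i, 1], boxes[j, 1])
            x2 = min(boxes[i, 2], boxes[j, 2]); y2 = min(boxes[i, 3], boxes[j, 3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            a_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            a_j = (boxes[j, 2] - boxes[j, 0]) * (boxes[j, 3] - boxes[j, 1])
            iou = inter / (a_i + a_j - inter)
            if iou > iou_thresh:
                scores[j] *= 1.0 - iou    # linear decay instead of removal
    return keep, scores

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
keep, decayed = soft_nms_linear(boxes, np.array([0.9, 0.8, 0.7]))
```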

Analytical single-class person detector

A probe of how far a detection head can go with zero gradient descent on the same frozen backbone. The best closed-form configuration reaches 2.33 AP / 10.3 AP@0.5 on COCO val2017 for the person class at ~105K learnable parameters, all derived from simple statistics of the cached training features (no SGD, no training loop):

  • Fisher LDA classifier — 768-dim direction computed in closed form from pooled covariance of person-positive vs. background features across 85K positives and 51M negatives. Bayes-calibrated bias incorporates the ~1:600 class-imbalance prior of person-positive patches in COCO.
  • 15-anchor aspect grid — 5 scale quantile bins × 3 fixed aspect ratios (0.4, 1.0, 2.5) guarantees landscape-orientation coverage even though COCO person training data skews portrait. A per-scale ridge-regressed 15-way classifier selects the anchor at each patch location.
  • Per-anchor fine refinement — 15 independent 770-dim ridge regressors each predict residuals (dx, dy, d_log_w, d_log_h) relative to their assigned anchor shape. Splitting the regression into per-anchor subproblems avoids the averaging failure that broke earlier global-regression variants.
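The closed-form LDA classifier in the first bullet can be sketched on synthetic data; the shrinkage constant and variable names are illustrative:

```python
import numpy as np

def fisher_lda(pos, neg, prior_pos):
    """Fisher LDA direction from pooled within-class covariance, with a
    Bayes-calibrated bias incorporating the class prior (sketch).
    pos: (Np, D) positive features; neg: (Nn, D) negatives."""
    mu_p, mu_n = pos.mean(0), neg.mean(0)
    cov = (np.cov(pos, rowvar=False) * (len(pos) - 1)
           + np.cov(neg, rowvar=False) * (len(neg) - 1)) / (len(pos) + len(neg) - 2)
    cov += 1e-3 * np.eye(cov.shape[0])          # small shrinkage for stability
    w = np.linalg.solve(cov, mu_p - mu_n)       # closed-form direction
    # Bias: score midpoint shifted by the log prior odds (Bayes calibration).
    b = -0.5 * w @ (mu_p + mu_n) + np.log(prior_pos / (1 - prior_pos))
    return w, b

# Synthetic demo: two Gaussian classes; score = x @ w + b, positive => person.
rng = np.random.default_rng(0)
pos_f = rng.normal(1.0, 1.0, (500, 4))
neg_f = rng.normal(-1.0, 1.0, (500, 4))
w, b = fisher_lda(pos_f, neg_f, prior_pos=0.5)
```

In the real head the prior would be the ~1:600 person-positive patch frequency rather than 0.5.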

AP@0.75 is 0.2: the 15 discrete anchor shapes plus linear residual correction produce boxes precise enough for loose localization but not for strict IoU thresholds. Generative-classifier variants (per-class Gaussian mixtures via Mahalanobis distance) and soft anchor weighting were evaluated and both underperformed the discriminative classifier with argmax assignment at this parameter scale.

Checkpoint at heads/cofiber_threshold/analytical_person_v3j/head_v3j_canonical.pth; build-and-evaluate script at build_analytical_person.py. Reproduces end-to-end quickly: statistics pass + closed-form solves + full val eval. The 80-class analytical scaling of this approach is in the separate cofiber-detection repository referenced below.

Cofiber decomposition (separate repository)

A large parallel line of work on this backbone replaces the FCOS feature pyramid with cofiber decomposition: a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map to produce frequency-separated scale bands. Given backbone features f, the decomposition produces f − upsample(avgpool(f)) as a high-frequency band, then recurses on the low-frequency remainder. The three bands that result are pairwise orthogonal in frequency content, which lets a lightweight head attend to them separately without needing a learned FPN to mediate between scales.
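The recursion above can be sketched in a few lines. This is a minimal illustration of the operation as described, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def cofiber_decompose(feat: torch.Tensor, num_bands: int = 3):
    """Zero-parameter multi-scale split: each step subtracts the
    downsample-then-upsample component, keeping the high-frequency
    residual, and recurses on the low-frequency remainder.
    Assumes NCHW input divisible by 2**(num_bands - 1)."""
    bands = []
    low = feat
    for _ in range(num_bands - 1):
        pooled = F.avg_pool2d(low, kernel_size=2)
        recon = F.interpolate(pooled, scale_factor=2, mode="bilinear",
                              align_corners=False)
        bands.append(low - recon)      # high-frequency band at this scale
        low = pooled                   # recurse on the remainder
    bands.append(low)                  # coarsest remainder
    return bands

x = torch.randn(2, 8, 40, 40)
bands = cofiber_decompose(x, num_bands=3)
# Exactness: because bilinear upsampling is linear, upsampling each band
# back to full resolution and summing recovers the input.
recon = bands[0].clone()
for i, band in enumerate(bands[1:], start=1):
    for _ in range(i):
        band = F.interpolate(band, scale_factor=2, mode="bilinear",
                             align_corners=False)
    recon = recon + band
```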

The decomposition has been machine-checked in Rocq/HoTT. The proof frames average pooling and bilinear upsampling as an adjoint pair whose counit gives a short exact sequence in a semi-additive category; the cofiber bands are the kernels of the resulting projections, and any input is uniquely expressible as a sum of its bands. The practical consequence is that the multi-scale construction which typically costs 11M parameters in an FPN is free — no parameters, no training, no FLOPs beyond pooling and interpolation.

The sibling repository phanerozoic/cofiber-detection contains the full line of work:

  • Closed-form analytical heads derived from least-squares regression on the decomposed features (zero training, 70K parameters, 1.6 mAP)
  • Trained neural heads at multiple scales, from 70K parameters (5.2 mAP) up to the 3.85M-parameter split-tower included here (20.3 mAP) and 4.27M-parameter variants with top-down lateral connections (19.9 mAP)
  • INT8 threshold-logic heads using Heaviside step functions (92K parameters, 5.9 mAP) and their pruned variants down to 46K nonzero parameters
  • Synthesizable Verilog circuit implementations, including a 93-parameter person image classifier that achieves 99.8% recall on COCO val images
  • Evolutionary dimension search (GPU-batched, 200 generations per second) that identifies informative feature subsets; the evolved 10-dimension person-detection circuit outperforms a greedy 100-dimension circuit at 10× fewer gates
  • Sheaf-cohomology regression features (directional feature differences at spatial boundaries, interpretable as Čech 1-cocycles)
  • The Rocq/HoTT proof of decomposition exactness
  • Full training and evaluation scripts

That repository is the canonical host for the parameter-efficiency, circuit, and formalized-decomposition material. This repository retains the top-performing head as a baseline comparison.

Experimental Architectural Scaffolds

The seventeen scaffolds under heads/ each encode a distinct hypothesis about how detection can be formulated on frozen ViT features. They are recorded as architectural options and are not individually benchmarked here:

| Scaffold | Core idea |
|---|---|
| adaptive_query | Learned query vectors that adaptively attend to backbone patches, producing detections as attention outputs rather than per-location predictions |
| cascade_pool | A cascade of pooling operations to synthesize multi-scale features without an FPN |
| centernet | CenterNet-style keypoint detection: predict object centers as a heatmap, then regress size and offset at center locations |
| compression | Head operating on compressed backbone features, testing how aggressively features can be reduced before detection quality collapses |
| curvature | Object boundaries as high curvature of the feature manifold; detection as curvature estimation |
| depth_fusion | Joint detection and depth estimation sharing features, exploiting the alignment between depth gradients and object boundaries |
| feature_graph | Detection as a graph problem over patch tokens, with edges indicating co-occurrence or spatial adjacency |
| mutual_attention | Cross-attention between patch features to model inter-object relationships before detection |
| optimal_transport | Label assignment via optimal transport between ground-truth boxes and prediction locations |
| patch_assembly | Compositional object construction from overlapping patch detections, inspired by self-assembly tile systems |
| prototype | Replacing learned classification weights with fixed class prototypes (text-embedded or image-mean) |
| relational_corners | Corner-pair detection: predict corner heatmaps, then pair them via learned embeddings |
| scale_classify | Explicit scale classification as an auxiliary head to produce scale-consistent boxes |
| sparse_query | Sparse query mechanism to limit prediction locations and reduce background noise |
| threshold_prototype | Prototype classification with a threshold gate |
| tropical | Tropical-semiring (min-plus) operations in the head |
| wavelet | Wavelet-based multi-scale decomposition as an alternative to FPN and to cofiber decomposition |

Each scaffold is a one-file architectural commitment. The arena script trains any of them in a single command against the cached features.

FCOS Variants

Three FCOS-family heads are present as controlled variants on the baseline, each differing from the reference in one isolated way:

  • heads/baseline_fcos/ — the full simple-feature-pyramid FCOS head at 16.14M parameters, the 41.1-mAP reference.
  • heads/slim_fcos/ — a reduced-parameter variant with narrower towers and fewer FPN channels, for measuring which components of the baseline's capacity are load-bearing.
  • heads/hook_fcos/ — an FCOS head that reads intermediate backbone features via forward hooks rather than only the final spatial output. Follows the DPT depth decoder pattern used in the multi-task Argus model, where intermediate ViT blocks (2, 5, 8, 11) carry features complementary to the final block.
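The hook mechanism can be sketched with a stand-in backbone; a timm-style `vit.blocks` list is assumed, and the tiny module below only demonstrates the mechanics:

```python
import torch
import torch.nn as nn

def capture_block_outputs(vit: nn.Module, block_indices=(2, 5, 8, 11)):
    """Register forward hooks that stash intermediate block outputs,
    the DPT-style tap pattern described above."""
    captured = {}
    handles = []
    for i in block_indices:
        def hook(_mod, _inp, out, i=i):   # bind i per hook via default arg
            captured[i] = out
        handles.append(vit.blocks[i].register_forward_hook(hook))
    return captured, handles

class TinyViT(nn.Module):
    """Stand-in backbone with 12 indexable blocks (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])
    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

vit = TinyViT()
captured, handles = capture_block_outputs(vit)
_ = vit(torch.randn(2, 8))               # forward pass fills `captured`
for h in handles:
    h.remove()                           # detach hooks when done
```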

Arena

arena.py and multi_domain_arena.py are lightweight driver scripts that instantiate any head from the heads/ directory, train it for a configurable number of epochs on cached COCO features, and evaluate with pycocotools. The multi-domain version extends evaluation across the twenty RF100-VL cross-domain datasets in addition to COCO, to test whether a head's performance generalizes beyond the benchmark it was tuned on.

Shared Infrastructure

  • losses/ contains the FCOS-style focal loss for classification, GIoU loss for boxes, BCE for centerness, and a CenterNet-style focal-loss variant for keypoint heads.
  • utils/ contains the feature caching code (shard format, manifest, shard loader), the decoder that converts per-location predictions to COCO-format boxes, and the evaluation wrapper around pycocotools.
  • eval_coco_map.py is the standalone evaluation entry point; it can be pointed at any saved checkpoint.
  • train_split_tower_5scale.py trains the split-tower head; eval_coco_map.py reproduces the COCO mAP numbers above.
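The classification loss in losses/ follows the standard sigmoid focal-loss form (Lin et al. 2017); the sketch below is that reference form, not necessarily the repository's exact code:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over (N, C) logits and one-hot targets
    (all-zero rows = background). Returns the unnormalized sum."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

loss = sigmoid_focal_loss(torch.zeros(2, 80), torch.zeros(2, 80))
```

FCOS-style training normalizes this sum by the number of positive locations in the batch.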

Backbone

All results use EUPE-ViT-B at 640-pixel input with letterbox padding. The backbone produces 768-channel spatial features at stride 16 (40×40 patches). It has 86M parameters and is frozen; features are cached once before any head training begins. The framework is backbone-agnostic: any ViT-scale encoder producing a consistent spatial grid can be substituted by replacing the feature extraction step, and the head library will run without modification. EUPE-ViT-B is the default because its multi-teacher distillation produces features that work well across detection, segmentation, depth, and correspondence, which makes it a sensible proving ground for heads that might eventually share a backbone with other tasks.

References

  • Tian, Shen, Chen, He. FCOS: Fully Convolutional One-Stage Object Detection. ICCV 2019.
  • Li, Mao, Girshick, He. Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
  • Zhou, Wang, Krähenbühl. Objects as Points. arXiv:1904.07850, 2019.
  • Zhang, Chi, Yao, Lei, Li. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. CVPR 2020.

License

Apache 2.0. Users are responsible for complying with the license terms of whatever backbone they substitute for EUPE-ViT-B.
