
# Future Considerations

## Response Size Optimization for Research Attention Analysis

### The Problem

The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes:

Current data captured per generation step:

- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
- Layer metrics, head patterns, token alternatives

Response sizes observed (a rough back-of-envelope estimate follows the table):

| Model | Tokens generated | Response size |
|---|---|---|
| CodeGen 350M (20L, 16H) | 8 | ~357 MB |
| Devstral Small (40L, 32H) | 8 | ~300-500 MB (estimated) |
| Devstral Small (40L, 32H) | 50+ | potentially 2-5 GB+ |
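
As a sanity check on these figures, the sketch below estimates payload size from the tensor shapes alone. The prompt length, `head_dim`, and bytes-per-JSON-value figures are assumptions for illustration, not measurements from the service:

```python
# Back-of-envelope estimate of the JSON payload size per request.
# Assumptions (not taken from the codebase): each float serialized as ~10 bytes
# of JSON text, a ~50-token prompt, and head_dim = 64.

def estimate_response_mb(n_layers: int, n_heads: int, prompt_len: int,
                         gen_tokens: int, head_dim: int = 64,
                         bytes_per_value: int = 10) -> float:
    total_values = 0
    for step in range(gen_tokens):
        seq_len = prompt_len + step + 1
        # Full attention matrices: [n_layers, n_heads, seq_len, seq_len]
        total_values += n_layers * n_heads * seq_len * seq_len
        # Q/K/V matrices: 3 x [n_layers, n_heads, seq_len, head_dim]
        total_values += 3 * n_layers * n_heads * seq_len * head_dim
    return total_values * bytes_per_value / 1e6

print(estimate_response_mb(20, 16, prompt_len=50, gen_tokens=8))   # ~340 MB, close to the observed CodeGen row
print(estimate_response_mb(40, 32, prompt_len=50, gen_tokens=50))  # multiple GB for the Devstral row
```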

### Why This Matters

For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:

  1. Memory pressure - Browser may crash parsing huge JSON
  2. Transfer time - Minutes to download over typical connections
  3. Storage costs - Saving analysis runs becomes expensive
  4. GPU costs - Long-running requests keep the expensive GPU Space active for longer

### Potential Solutions (To Be Evaluated)

#### 1. Binary Format (Zarr/HDF5)

Store tensors in an efficient binary format server-side and stream them on demand (a sketch follows the list):

- Backend saves to Zarr (a `ZarrStorage` class already exists)
- Return metadata plus URLs to tensor chunks
- Frontend fetches only what is needed for the current visualization
- Pros: 10x+ size reduction, lazy loading
- Cons: More complex architecture, requires persistent storage
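
A minimal sketch of the server side, using the zarr-python v2 `create_dataset` API. The run/step layout, the float16 downcast, and the `/tensors/...` URLs are illustrative choices, not the interface of the existing `ZarrStorage` class:

```python
# Persist per-step tensors to Zarr and return only lightweight metadata.
import numpy as np
import zarr

def store_step_tensors(run_id: str, step: int, attn: np.ndarray, qkv: np.ndarray) -> dict:
    """attn: [n_layers, n_heads, seq, seq]; qkv: [3, n_layers, n_heads, seq, head_dim]."""
    root = zarr.open_group(f"runs/{run_id}.zarr", mode="a")
    grp = root.require_group(f"step_{step}")
    # Chunk per (layer, head) so the frontend can fetch one head at a time.
    grp.create_dataset("attention", data=attn.astype(np.float16),
                       chunks=(1, 1, attn.shape[2], attn.shape[3]), overwrite=True)
    grp.create_dataset("qkv", data=qkv.astype(np.float16),
                       chunks=(1, 1, 1, qkv.shape[3], qkv.shape[4]), overwrite=True)
    # Thin JSON payload: where the tensors live, not the tensors themselves.
    return {
        "run_id": run_id,
        "step": step,
        "attention_url": f"/tensors/{run_id}/step_{step}/attention",   # hypothetical routes
        "qkv_url": f"/tensors/{run_id}/step_{step}/qkv",
        "shape": {"attention": list(attn.shape), "qkv": list(qkv.shape)},
    }
```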

#### 2. Selective Detail Levels

Offer analysis modes with different detail levels (a response-filtering sketch follows the pros and cons):

```python
detail_level = request.get("detail_level", "full")
# "summary"        - metrics only, no matrices (~1 MB)
# "attention_only" - attention matrices, no Q/K/V (~100 MB)
# "top_heads"      - only the top-5 most interesting heads per layer (~50 MB)
# "full"           - everything (current behavior)
```

- Pros: User controls the trade-off
- Cons: May miss important patterns in the "interesting" head selection
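
One possible shape of the filtering, operating on a plain dict per generation step. The key names (`attention`, `qkv`, `head_metrics`) and the entropy-based ranking of "interesting" heads are assumptions for illustration, not the service's actual schema:

```python
# Trim the per-step response according to the requested detail level.

def apply_detail_level(step_result: dict, detail_level: str = "full") -> dict:
    if detail_level == "full":
        return step_result                      # current behavior: everything
    slim = dict(step_result)
    slim.pop("qkv", None)                       # Q/K/V only ships in "full" mode
    if detail_level == "summary":
        slim.pop("attention", None)             # metrics only, no matrices
    elif detail_level == "top_heads":
        metrics = slim.get("head_metrics", {})
        attention = slim.get("attention", {})
        # Placeholder ranking: lowest-entropy heads treated as most "interesting"
        keep = sorted(metrics, key=lambda h: metrics[h].get("entropy", 0.0))[:5]
        slim["attention"] = {h: attention[h] for h in keep if h in attention}
    # "attention_only" falls through: attention matrices kept, Q/K/V dropped
    return slim
```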

#### 3. Streaming Tensor Data

Instead of returning one giant JSON blob, stream tensor chunks (sketched below):

- Send metadata and metrics immediately
- Stream attention matrices layer-by-layer
- Frontend renders progressively as data arrives
- Pros: Immediate feedback, can cancel mid-stream
- Cons: Complex state management, partial data handling
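
A sketch of progressive delivery as newline-delimited JSON, assuming a FastAPI backend. The route path and `run_analysis()` are placeholders, not existing code:

```python
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def run_analysis(request: dict) -> dict:
    """Placeholder for the existing analysis; returns metrics plus per-layer attention."""
    raise NotImplementedError

@app.post("/analyze/research/attention/ndjson")    # hypothetical streaming variant
async def stream_attention(request: dict):
    def generate():
        result = run_analysis(request)
        # Metadata and lightweight metrics go out first for immediate feedback.
        yield json.dumps({"type": "metadata", "metrics": result["metrics"]}) + "\n"
        # Then one line per layer, so the frontend can render progressively
        # and abort the fetch() to cancel mid-stream.
        for layer_idx, layer_attn in enumerate(result["attention"]):
            yield json.dumps({"type": "layer", "layer": layer_idx,
                              "attention": layer_attn.tolist()}) + "\n"
    return StreamingResponse(generate(), media_type="application/x-ndjson")
```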

#### 4. Compression

Apply compression to reduce transfer size (a quantization sketch follows the list):

- gzip the JSON response (typically 70-80% reduction)
- Quantize floats to float16 or int8 with scale factors
- Round to 3-4 decimal places (30-40% reduction)
- Pros: Simple to implement
- Cons: Still large, some precision loss
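
A small sketch of int8 quantization with a per-tensor scale factor plus gzip; only NumPy and the standard library are assumed. Attention weights live in [0, 1], so max-abs scaling loses little precision for visualization, though exact research use may still want float32:

```python
import gzip
import json
import numpy as np

def quantize_int8(tensor: np.ndarray) -> dict:
    scale = float(np.abs(tensor).max()) or 1.0          # avoid divide-by-zero on all-zero tensors
    q = np.round(tensor / scale * 127).astype(np.int8)
    return {"scale": scale, "dtype": "int8",
            "shape": list(tensor.shape), "data": q.ravel().tolist()}

def dequantize(payload: dict) -> np.ndarray:
    q = np.array(payload["data"], dtype=np.int8).reshape(payload["shape"])
    return q.astype(np.float32) * payload["scale"] / 127

attn = np.random.rand(16, 32, 32).astype(np.float32)    # toy [n_heads, seq, seq] tensor
body = gzip.compress(json.dumps(quantize_int8(attn)).encode("utf-8"))
print(f"{attn.nbytes / 1e3:.0f} KB raw float32 -> {len(body) / 1e3:.0f} KB gzipped int8 JSON")
```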

#### 5. Server-Side Analysis with Thin Results

Run the analysis server-side and return only the insights (a summary-metrics sketch follows the list):

- Compute attention patterns, anomalies, and statistics on the backend
- Return summary metrics plus flagged interesting patterns
- Download full tensors only when the user drills down
- Pros: Massive size reduction for typical use
- Cons: Loses raw data for novel research questions
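
A sketch of what the thin result could look like: collapse the full attention tensor into per-head summary statistics. The metric names mirror the ones mentioned above; the "pattern" heuristic (diagonal vs. distributed) is purely illustrative:

```python
import numpy as np

def summarize_heads(attn: np.ndarray, eps: float = 1e-9) -> list[dict]:
    """attn: [n_layers, n_heads, seq_len, seq_len], each row summing to ~1."""
    n_layers, n_heads, seq_len, _ = attn.shape
    summaries = []
    for layer in range(n_layers):
        for head in range(n_heads):
            a = attn[layer, head]
            entropy = float(-(a * np.log(a + eps)).sum(axis=-1).mean())
            diag_mass = float(np.trace(a) / seq_len)
            summaries.append({
                "layer": layer,
                "head": head,
                "entropy": entropy,
                "max_weight": float(a.max()),
                "pattern": "diagonal" if diag_mass > 0.5 else "distributed",
            })
    return summaries   # a few KB of JSON instead of hundreds of MB of matrices
```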

### Recommended Approach (Future)

A hybrid approach combining:

1. Default: Compressed summary (~10-50MB)
   - Attention metrics per head (entropy, max_weight, pattern type)
   - Layer-level aggregates
   - Token alternatives and probabilities
2. On-demand: Full tensor access
   - Store full tensors in Zarr on the backend
   - User can request specific layer/head/step data (sketched after this list)
   - Paginated/chunked downloads
3. Research mode: Bulk export
   - Async job that packages everything
   - Downloads as a `.zarr` or `.h5` file
   - For offline analysis in Python/Jupyter
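
A sketch of the on-demand piece: a hypothetical endpoint that returns a single layer/head slice for one generation step from the Zarr store written during analysis (the layout matches the storage sketch above; the route and parameter names are illustrative, assuming a FastAPI backend):

```python
import zarr
from fastapi import FastAPI

app = FastAPI()

@app.get("/analyze/research/attention/{run_id}/tensor")   # hypothetical route
async def get_attention_slice(run_id: str, step: int, layer: int, head: int):
    root = zarr.open_group(f"runs/{run_id}.zarr", mode="r")
    attn = root[f"step_{step}/attention"][layer, head]     # one [seq_len, seq_len] slice
    return {"run_id": run_id, "step": step, "layer": layer, "head": head,
            "attention": attn.tolist()}
```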

### Related Files

- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
- `/backend/storage.py` - the `ZarrStorage` class already exists here
- `/components/research/VerticalPipeline.tsx` - consumes the data

### Notes

- The current implementation prioritizes completeness for PhD research
- Any optimization must not lose data needed for research questions
- Consider making optimization opt-in rather than the default