# Future Considerations

## Response Size Optimization for Research Attention Analysis

### The Problem

The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes.

**Current data captured per generation step:**

- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
- Layer metrics, head patterns, token alternatives

**Response sizes observed:**

| Model | Tokens | Response Size |
|-------|--------|---------------|
| CodeGen 350M (20L, 16H) | 8 tokens | ~357MB |
| Devstral Small (40L, 32H) | 8 tokens | ~300-500MB estimated |
| Devstral Small (40L, 32H) | 50+ tokens | **Potentially 2-5GB+** |

### Why This Matters

For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:

1. **Memory pressure** - the browser may crash while parsing huge JSON
2. **Transfer time** - minutes to download over typical connections
3. **Storage costs** - saving analysis runs becomes expensive
4. **GPU Space costs** - long-running requests keep an expensive GPU active

### Potential Solutions (To Be Evaluated)

Illustrative code sketches for each option follow the list of options.

#### 1. Binary Format (Zarr/HDF5)

Store tensors in an efficient binary format server-side and stream them on demand:

- Backend saves to Zarr (a `ZarrStorage` class already exists)
- Return metadata plus URLs to tensor chunks
- Frontend fetches only what is needed for the current visualization
- **Pros**: 10x+ size reduction, lazy loading
- **Cons**: more complex architecture, requires persistent storage

#### 2. Selective Detail Levels

Offer analysis modes with different detail levels:

```python
detail_level = request.get("detail_level", "full")
# "summary"        - metrics only, no matrices (~1MB)
# "attention_only" - attention matrices, no Q/K/V (~100MB)
# "top_heads"      - only top-5 most interesting heads per layer (~50MB)
# "full"           - everything (current behavior)
```

- **Pros**: the user controls the trade-off
- **Cons**: the "interesting" head selection may miss important patterns

#### 3. Streaming Tensor Data

Instead of one giant JSON response, stream tensor chunks:

- Send metadata and metrics immediately
- Stream attention matrices layer-by-layer
- Frontend renders progressively as data arrives
- **Pros**: immediate feedback, can cancel mid-stream
- **Cons**: complex state management, partial-data handling

#### 4. Compression

Apply compression to reduce transfer size:

- gzip the JSON response (typically 70-80% reduction)
- Quantize floats to float16 or int8 with scale factors
- Round to 3-4 decimal places (30-40% reduction)
- **Pros**: simple to implement
- **Cons**: responses stay large, and quantization loses some precision

#### 5. Server-Side Analysis with Thin Results

Run the analysis server-side and return only the insights:

- Compute attention patterns, anomalies, and statistics on the backend
- Return summary metrics plus flagged interesting patterns
- Download full tensors only when the user drills down
- **Pros**: massive size reduction for typical use
- **Cons**: loses raw data for novel research questions
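A minimal sketch of option 1, assuming a zarr v2-style API. The function name `save_attention_step` and the store layout are illustrative, not the existing `ZarrStorage` interface:

```python
import numpy as np
import zarr

def save_attention_step(root: zarr.Group, step: int, attn: np.ndarray) -> None:
    """Persist one step's attention: shape [n_layers, n_heads, seq_len, seq_len]."""
    grp = root.create_group(f"step_{step}")
    grp.create_dataset(
        "attention",
        data=attn.astype(np.float16),                 # half precision on disk
        chunks=(1, 1, attn.shape[2], attn.shape[3]),  # one chunk per (layer, head)
    )

root = zarr.open_group("runs/run_001.zarr", mode="w")
attn = np.random.rand(20, 16, 8, 8).astype(np.float32)  # CodeGen-350M-like, 8 tokens
save_attention_step(root, step=0, attn=attn)
# The endpoint then returns metadata such as {"run_id": "run_001", "steps": 1}
# and the frontend fetches only the (layer, head) chunks it visualizes.
```

Chunking by (layer, head) is what enables the lazy loading: a heatmap for one head pulls a single small chunk instead of the whole run.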
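A minimal sketch of option 2: prune a fully-built result dict down to the requested detail level. The field names (`metrics`, `attention`, `token_alternatives`) and the per-head `interest` score are assumptions about the response shape, not the actual schema:

```python
def apply_detail_level(result: dict, detail_level: str = "full") -> dict:
    if detail_level == "full":
        return result                                   # current behavior
    slim = {
        "metrics": result["metrics"],
        "token_alternatives": result["token_alternatives"],
    }
    if detail_level in ("attention_only", "top_heads"):
        slim["attention"] = result["attention"]         # drop Q/K/V either way
    if detail_level == "top_heads":
        # keep only the five highest-scoring heads per layer; assumes each
        # head dict carries a precomputed "interest" score
        slim["attention"] = [
            sorted(layer, key=lambda h: h["interest"], reverse=True)[:5]
            for layer in slim["attention"]
        ]
    return slim
```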
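A minimal sketch of option 3, assuming a FastAPI backend (the actual service may be wired differently): emit NDJSON, one attention layer per line, so the frontend renders progressively and can cancel mid-stream:

```python
import json

from fastapi.responses import StreamingResponse

def stream_research_attention(metadata: dict, attention_layers: list) -> StreamingResponse:
    def generate():
        # metadata and metrics first, so the UI can render immediately
        yield json.dumps({"type": "metadata", **metadata}) + "\n"
        for i, layer in enumerate(attention_layers):
            yield json.dumps({"type": "attention_layer", "layer": i, "data": layer}) + "\n"
        yield json.dumps({"type": "done"}) + "\n"
    return StreamingResponse(generate(), media_type="application/x-ndjson")
```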
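A minimal sketch of option 4: int8-quantize attention weights with a per-tensor scale factor, then gzip the JSON payload. Attention rows are softmax-normalized, so a single scale recovers values to within about half the quantization step:

```python
import gzip
import json

import numpy as np

def quantize(attn: np.ndarray) -> dict:
    scale = float(attn.max()) / 255.0 or 1.0  # guard the all-zero case
    values = np.round(attn / scale).astype(np.uint8)
    return {"scale": scale, "values": values.tolist()}

def compress_response(payload: dict) -> bytes:
    return gzip.compress(json.dumps(payload).encode("utf-8"))

attn = np.random.rand(8, 8)
attn /= attn.sum(axis=-1, keepdims=True)      # row-normalize like softmax
blob = compress_response({"attention": quantize(attn)})
# Client side: np.array(values, dtype=np.float32) * scale reconstructs the matrix.
```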
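A minimal sketch of option 5, which doubles as the "compressed summary" default recommended below: reduce one head's attention matrix to scalar metrics server-side. The pattern-heuristic thresholds are illustrative only:

```python
import numpy as np

def head_summary(attn: np.ndarray) -> dict:
    """attn has shape [seq_len, seq_len]; rows are softmax-normalized."""
    eps = 1e-12
    entropy = float(-(attn * np.log(attn + eps)).sum(axis=-1).mean())
    max_weight = float(attn.max())
    diag_mass = float(np.trace(attn)) / attn.shape[0]  # self-attention mass
    pattern = "diagonal" if diag_mass > 0.5 else "focused" if max_weight > 0.9 else "diffuse"
    return {"entropy": entropy, "max_weight": max_weight, "pattern": pattern}
```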
### Recommended Approach (Future)

A hybrid approach combining:

1. **Default: Compressed summary** (~10-50MB)
   - Attention metrics per head (entropy, max_weight, pattern type)
   - Layer-level aggregates
   - Token alternatives and probabilities
2. **On-demand: Full tensor access**
   - Store full tensors in Zarr on the backend
   - User can request specific layer/head/step data
   - Paginated/chunked downloads
3. **Research mode: Bulk export**
   - Async job that packages everything
   - Downloads as a `.zarr` or `.h5` file
   - For offline analysis in Python/Jupyter

### Related Files

- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
- `/backend/storage.py` - `ZarrStorage` class already exists
- `/components/research/VerticalPipeline.tsx` - consumes the data

### Notes

- The current implementation prioritizes completeness for PhD research
- Any optimization must not lose data needed for the research questions
- Consider making optimization opt-in rather than default
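Following the last note, a sketch of what opt-in could look like at the API boundary, assuming a pydantic request model (all field names are illustrative): defaults preserve today's full-fidelity behavior, and callers must explicitly ask for a reduced response.

```python
from pydantic import BaseModel

class ResearchAttentionRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 8
    detail_level: str = "full"   # "summary" | "attention_only" | "top_heads" | "full"
    compress: bool = False       # quantize + gzip only when explicitly requested
    store_tensors: bool = True   # keep raw tensors server-side for later drill-down
```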