# Future Considerations

## Response Size Optimization for Research Attention Analysis

### The Problem

The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes.

**Current data captured per generation step:**

- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
- Layer metrics, head patterns, token alternatives
**Response sizes observed:**

| Model | Tokens | Response Size |
|-------|--------|---------------|
| CodeGen 350M (20L, 16H) | 8 tokens | ~357MB |
| Devstral Small (40L, 32H) | 8 tokens | ~300-500MB (estimated) |
| Devstral Small (40L, 32H) | 50+ tokens | **Potentially 2-5GB+** |
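For intuition on where these numbers come from, here is a rough back-of-envelope estimator. It is a sketch only: the prompt length of 32 tokens and the ~15 bytes per JSON-serialized float are assumptions, not measured values.

```python
def estimate_json_size_mb(n_layers: int, n_heads: int, head_dim: int,
                          prompt_len: int, gen_tokens: int,
                          bytes_per_float: int = 15) -> float:
    """Very rough upper bound on the research-attention JSON payload size."""
    total_floats = 0
    for step in range(1, gen_tokens + 1):
        seq_len = prompt_len + step
        attn = n_layers * n_heads * seq_len * seq_len        # attention matrices
        qkv = 3 * n_layers * n_heads * seq_len * head_dim    # Q/K/V matrices
        total_floats += attn + qkv
    return total_floats * bytes_per_float / 1e6

# CodeGen-350M-like dims (20 layers, 16 heads, head_dim 64), 8 generated tokens:
print(f"{estimate_json_size_mb(20, 16, 64, prompt_len=32, gen_tokens=8):.0f} MB")
# -> lands in the same order of magnitude as the ~357MB observed above
```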
### Why This Matters

For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:

1. **Memory pressure** - the browser may crash while parsing huge JSON
2. **Transfer time** - minutes to download over typical connections
3. **Storage costs** - saving analysis runs becomes expensive
4. **GPU Space costs** - long-running requests keep the expensive GPU Space active
### Potential Solutions (To Be Evaluated)

#### 1. Binary Format (Zarr/HDF5)

Store tensors in an efficient binary format server-side and stream them on demand (see the sketch below):

- Backend saves to Zarr (a `ZarrStorage` class already exists)
- Return metadata plus URLs to tensor chunks
- Frontend fetches only what is needed for the current visualization
- **Pros**: 10x+ size reduction, lazy loading
- **Cons**: more complex architecture, requires persistent storage
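A minimal sketch of the server-side write path, assuming the `zarr` package and a local `attn_store/` directory. The store layout and array names are illustrative, and this does not go through the existing `ZarrStorage` class:

```python
import numpy as np
import zarr

def save_attention_step(run_id: str, step: int, attn: np.ndarray) -> str:
    """attn: [n_layers, n_heads, seq_len, seq_len] for one generation step."""
    root = zarr.open_group(f"attn_store/{run_id}.zarr", mode="a")
    n_layers, n_heads, seq_len, _ = attn.shape
    arr = root.create_dataset(
        f"attention/step_{step:03d}",
        shape=attn.shape,
        chunks=(1, 1, seq_len, seq_len),   # one chunk per (layer, head) matrix
        dtype="f2",                        # float16 halves the on-disk size
        overwrite=True,
    )
    arr[:] = attn.astype(np.float16)
    # the API response would then carry only metadata plus a pointer to this store
    return f"attn_store/{run_id}.zarr/attention/step_{step:03d}"
```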
#### 2. Selective Detail Levels

Offer analysis modes with different detail levels:

```python
detail_level = request.get("detail_level", "full")
# "summary"        - metrics only, no matrices (~1MB)
# "attention_only" - attention matrices, no Q/K/V (~100MB)
# "top_heads"      - only top-5 most interesting heads per layer (~50MB)
# "full"           - everything (current behavior)
```

- **Pros**: user controls the trade-off
- **Cons**: may miss important patterns in the "interesting" head selection
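One way this could be wired into response assembly. The `build_payload` helper, its field names, and the "first 5 heads" stand-in for "most interesting" are all assumptions for illustration:

```python
import numpy as np

def build_payload(step_data: dict, detail_level: str = "full") -> dict:
    """Assemble one generation step's payload according to the requested detail level."""
    payload = {"metrics": step_data["metrics"]}              # always included ("summary")
    if detail_level in ("attention_only", "top_heads", "full"):
        attn = step_data["attention"]                        # [layers, heads, seq, seq]
        if detail_level == "top_heads":
            # stand-in for "top-5 most interesting": here simply the first 5 heads
            attn = attn[:, :5]
        payload["attention"] = attn.astype(np.float16).tolist()
    if detail_level == "full":
        payload["qkv"] = {k: v.astype(np.float16).tolist()
                          for k, v in step_data["qkv"].items()}
    return payload
```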
#### 3. Streaming Tensor Data

Instead of one giant JSON document, stream tensor chunks (see the NDJSON sketch below):

- Send metadata and metrics immediately
- Stream attention matrices layer by layer
- Frontend renders progressively as data arrives
- **Pros**: immediate feedback, can cancel mid-stream
- **Cons**: complex state management, partial-data handling
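A minimal newline-delimited-JSON sketch, assuming a FastAPI backend. The route name and chunk schema are illustrative, not the existing `analyze_research_attention_stream()` contract:

```python
import json
import numpy as np
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def _stream_analysis(attn: np.ndarray, metrics: dict):
    """attn: [n_layers, n_heads, seq_len, seq_len] for one generation step."""
    # 1) lightweight metrics first, so the UI can render something immediately
    yield json.dumps({"type": "metrics", "data": metrics}) + "\n"
    # 2) then one chunk per layer, so the client can draw progressively or cancel
    for layer_idx in range(attn.shape[0]):
        chunk = {
            "type": "attention_layer",
            "layer": layer_idx,
            "weights": attn[layer_idx].astype(np.float16).tolist(),
        }
        yield json.dumps(chunk) + "\n"

@app.post("/analyze/research/attention/stream-chunks")
def analyze_stream():
    attn = np.random.rand(4, 2, 8, 8)            # dummy tensors for illustration
    metrics = {"n_layers": 4, "n_heads": 2}
    return StreamingResponse(_stream_analysis(attn, metrics),
                             media_type="application/x-ndjson")
```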
#### 4. Compression

Apply compression to reduce transfer size (sketched below):

- gzip the JSON response (typically 70-80% reduction)
- Quantize floats to float16 or int8 with scale factors
- Round to 3-4 decimal places (30-40% reduction)
- **Pros**: simple to implement
- **Cons**: still large, some precision loss
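A sketch combining the two cheapest options, int8 quantization with a per-tensor scale plus gzip over the serialized JSON. The function names are hypothetical; in practice the gzip step could also be handled transparently by middleware:

```python
import gzip
import json
import numpy as np

def quantize_int8(t: np.ndarray) -> dict:
    """Symmetric int8 quantization with a single per-tensor scale factor."""
    scale = float(np.abs(t).max()) or 1.0          # avoid division by zero
    q = np.round(t / scale * 127).astype(np.int8)
    return {"scale": scale, "dtype": "int8", "shape": list(t.shape), "data": q.tolist()}

def dequantize_int8(payload: dict) -> np.ndarray:
    return np.array(payload["data"], dtype=np.int8) * payload["scale"] / 127.0

def gzip_json(payload: dict) -> bytes:
    """gzip the serialized JSON; typically shaves 70-80% off text-heavy payloads."""
    return gzip.compress(json.dumps(payload).encode("utf-8"))
```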
#### 5. Server-Side Analysis with Thin Results

Run the analysis server-side and return only the insights (as sketched below):

- Compute attention patterns, anomalies, and statistics on the backend
- Return summary metrics plus flagged interesting patterns
- Download full tensors only when the user drills down
- **Pros**: massive size reduction for typical use
- **Cons**: loses raw data for novel research questions
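A sketch of the kind of per-head summary the backend could return instead of raw matrices. The metric names are illustrative, not the existing response schema:

```python
import numpy as np

def head_summary(attn: np.ndarray) -> dict:
    """attn: [seq_len, seq_len] attention weights for a single head."""
    probs = attn / attn.sum(axis=-1, keepdims=True).clip(min=1e-9)
    # low mean row entropy = sharply focused head, high = diffuse attention
    entropy = float(-(probs * np.log(probs + 1e-9)).sum(axis=-1).mean())
    return {
        "entropy": entropy,
        "max_weight": float(attn.max()),
        "diag_mass": float(np.trace(attn) / attn.shape[0]),   # self-attention tendency
        "prev_token_mass": float(np.diag(attn, k=-1).mean()) if attn.shape[0] > 1 else 0.0,
    }
```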
### Recommended Approach (Future)

A hybrid approach combining:

1. **Default: compressed summary** (~10-50MB)
   - Attention metrics per head (entropy, max_weight, pattern type)
   - Layer-level aggregates
   - Token alternatives and probabilities
2. **On-demand: full tensor access** (see the drill-down sketch below)
   - Store full tensors in Zarr on the backend
   - User can request specific layer/head/step data
   - Paginated/chunked downloads
3. **Research mode: bulk export**
   - Async job that packages everything
   - Downloads as a .zarr or .h5 file
   - For offline analysis in Python/Jupyter
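A sketch of the on-demand drill-down path (item 2), assuming the Zarr layout from the earlier write-path sketch; the route and store paths are hypothetical:

```python
import zarr
from fastapi import FastAPI

app = FastAPI()

@app.get("/analyze/research/attention/{run_id}/step/{step}/layer/{layer}/head/{head}")
def get_head_attention(run_id: str, step: int, layer: int, head: int) -> dict:
    """Return one head's attention matrix for one step, instead of the full run."""
    root = zarr.open_group(f"attn_store/{run_id}.zarr", mode="r")
    weights = root[f"attention/step_{step:03d}"][layer, head]
    return {
        "run_id": run_id, "step": step, "layer": layer, "head": head,
        "weights": weights.astype("float32").tolist(),
    }
```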
### Related Files

- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
- `/backend/storage.py` - `ZarrStorage` class already exists
- `/components/research/VerticalPipeline.tsx` - consumes the data
### Notes

- The current implementation prioritizes completeness for PhD research
- Any optimization must not lose data needed for the research questions
- Consider making optimization opt-in rather than the default