gary-boon (Claude Opus 4.5) committed · 3e67ea2
Parent: 15a862b
Add future considerations doc for response size optimization
Document the scalability challenge with research attention analysis:
- 8 tokens currently produces ~357MB response
- Full coding task (50+ tokens) could be 2-5GB+
- Outlines potential solutions: binary format, selective detail,
streaming, compression, server-side analysis
This is tracked for future work - current implementation prioritizes
completeness for PhD research.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
FUTURE_CONSIDERATIONS.md (ADDED, +102 -0)
# Future Considerations

## Response Size Optimization for Research Attention Analysis

### The Problem

The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes:

**Current data captured per generation step:**
- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
- Layer metrics, head patterns, token alternatives

**Response sizes observed:**

| Model | Tokens | Response Size |
|-------|--------|---------------|
| CodeGen 350M (20L, 16H) | 8 tokens | ~357MB |
| Devstral Small (40L, 32H) | 8 tokens | ~300-500MB estimated |
| Devstral Small (40L, 32H) | 50+ tokens | **Potentially 2-5GB+** |
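For intuition, the observed sizes can be roughly reproduced from the tensor shapes above (a back-of-envelope sketch; the prompt length and ~18 bytes per JSON-encoded float are assumptions, not measurements):

```python
# Back-of-envelope payload estimate from the shapes listed above.
# Assumptions: a JSON-encoded float costs ~18 bytes with separators;
# at generation step N, seq_len = prompt_len + N.

def estimate_json_mb(n_layers, n_heads, head_dim, prompt_len, gen_tokens,
                     bytes_per_float=18):
    total_floats = 0
    for step in range(1, gen_tokens + 1):
        seq_len = prompt_len + step
        total_floats += n_layers * n_heads * seq_len * seq_len       # attention
        total_floats += 3 * n_layers * n_heads * seq_len * head_dim  # Q/K/V
    return total_floats * bytes_per_float / 1e6

# CodeGen 350M: 20 layers, 16 heads, head_dim 64
print(estimate_json_mb(20, 16, 64, prompt_len=12, gen_tokens=8))
# ~159 MB under these assumptions; same order as the observed ~357MB
```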
### Why This Matters

For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:

1. **Memory pressure** - Browser may crash parsing huge JSON
2. **Transfer time** - Minutes to download over typical connections
3. **Storage costs** - Saving analysis runs becomes expensive
4. **GPU Space costs** - Long-running requests keep the expensive GPU active

### Potential Solutions (To Be Evaluated)

#### 1. Binary Format (Zarr/HDF5)
Store tensors in an efficient binary format server-side, stream on demand:
- Backend saves to Zarr (already have `ZarrStorage` class)
- Return metadata + URLs to tensor chunks
- Frontend fetches only what's needed for the current visualization
- **Pros**: 10x+ size reduction, lazy loading
- **Cons**: More complex architecture, requires persistent storage
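A minimal sketch of the server-side write and the thin metadata response (uses the zarr v2 API directly rather than the project's `ZarrStorage` wrapper; the paths, dataset keys, and chunk URL are hypothetical):

```python
import numpy as np
import zarr  # assumes the zarr v2 API

def save_step_tensors(run_id: str, step: int, attn: np.ndarray) -> dict:
    """Write one step's attention tensor [layers, heads, seq, seq] to Zarr;
    return only the metadata the frontend needs to fetch chunks later."""
    root = zarr.open_group(f"/data/runs/{run_id}.zarr", mode="a")
    arr = root.create_dataset(
        f"step_{step}/attention",
        data=attn.astype(np.float16),    # halve the size up front
        chunks=(1, 1) + attn.shape[2:],  # one chunk per (layer, head)
    )
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "url": f"/tensors/{run_id}/step_{step}/attention",  # hypothetical route
    }
```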
#### 2. Selective Detail Levels
Offer analysis modes with different detail levels:
```python
detail_level = request.get("detail_level", "full")
# "summary" - metrics only, no matrices (~1MB)
# "attention_only" - attention matrices, no Q/K/V (~100MB)
# "top_heads" - only top-5 most interesting heads per layer (~50MB)
# "full" - everything (current behavior)
```
- **Pros**: User controls trade-off
- **Cons**: May miss important patterns in "interesting" head selection
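In the response builder, this could be a small pruning step keyed off `detail_level` (a sketch; the payload keys `qkv_matrices` and `attention_matrices` are assumptions about the current response shape):

```python
def prune_response(response: dict, detail_level: str) -> dict:
    """Drop heavy tensor fields according to the requested detail level."""
    if detail_level == "full":
        return response                         # current behavior
    out = dict(response)
    out.pop("qkv_matrices", None)               # assumed key for Q/K/V data
    if detail_level == "summary":
        out.pop("attention_matrices", None)     # keep metrics only (~1MB)
    elif detail_level == "top_heads":
        # keep only the top-5 heads per layer; assumes heads are already
        # ranked by an "interestingness" score computed elsewhere
        out["attention_matrices"] = {
            layer: heads[:5]
            for layer, heads in out["attention_matrices"].items()
        }
    return out
```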
#### 3. Streaming Tensor Data
Instead of one giant JSON, stream tensor chunks:
- Send metadata and metrics immediately
- Stream attention matrices layer-by-layer
- Frontend renders progressively as data arrives
- **Pros**: Immediate feedback, can cancel mid-stream
- **Cons**: Complex state management, partial data handling
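Assuming a FastAPI backend (an assumption; the framework isn't shown here), this could be newline-delimited JSON, one record per layer (a sketch; the record fields and `layer_results` structure are assumptions):

```python
import json
from fastapi.responses import StreamingResponse

def stream_analysis(metadata: dict, layer_results: list) -> StreamingResponse:
    def gen():
        # metadata and lightweight metrics go out immediately
        yield json.dumps({"type": "metadata", **metadata}) + "\n"
        # then one NDJSON record per layer; the client renders as records
        # arrive and can simply close the connection to cancel mid-stream
        for i, layer in enumerate(layer_results):
            yield json.dumps({"type": "layer", "index": i, "data": layer}) + "\n"

    return StreamingResponse(gen(), media_type="application/x-ndjson")
```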
#### 4. Compression
Apply compression to reduce transfer size:
- gzip the JSON response (typically 70-80% reduction)
- Quantize floats to float16 or int8 with scale factors
- Round to 3-4 decimal places (30-40% reduction)
- **Pros**: Simple to implement
- **Cons**: Still large, some precision loss
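The quantization option, sketched with NumPy (symmetric int8 with a per-tensor scale factor; the stand-in tensor and payload keys are illustrative only):

```python
import gzip
import json
import numpy as np

def quantize_int8(arr: np.ndarray):
    """Symmetric int8 quantization: 4x smaller than float32 before gzip.
    Reconstruct with q.astype(np.float32) * scale."""
    scale = float(np.abs(arr).max()) / 127 or 1.0  # avoid a zero scale
    q = np.round(arr / scale).astype(np.int8)
    return q, scale

attn = np.random.rand(16, 16).astype(np.float32)   # stand-in tensor
q, scale = quantize_int8(attn)
payload = json.dumps({"attention_q8": q.tolist(), "scale": scale})
compressed = gzip.compress(payload.encode())        # typically 70-80% smaller
```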
#### 5. Server-Side Analysis with Thin Results
Run analysis server-side, return only insights:
- Compute attention patterns, anomalies, statistics on backend
- Return summary metrics + flagged interesting patterns
- Download full tensors only when user drills down
- **Pros**: Massive size reduction for typical use
- **Cons**: Loses raw data for novel research questions
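For example, a head's full `[seq, seq]` matrix can be reduced server-side to a few floats plus an "interesting" flag (a sketch; the metrics mirror those named in the recommended approach below, but the cutoff values are assumptions):

```python
import numpy as np

def summarize_head(attn: np.ndarray) -> dict:
    """attn: [seq, seq] attention weights for one head (rows sum to 1)."""
    entropy = float(-(attn * np.log(attn + 1e-12)).sum(axis=-1).mean())
    max_weight = float(attn.max())
    diag_mass = float(np.trace(attn) / attn.shape[0])  # self-attention share
    return {
        "entropy": entropy,
        "max_weight": max_weight,
        "diag_mass": diag_mass,
        # flag heads worth drilling into; cutoffs are assumptions
        "interesting": entropy < 0.5 or max_weight > 0.9,
    }
```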
### Recommended Approach (Future)

A hybrid approach combining:

1. **Default: Compressed summary** (~10-50MB)
   - Attention metrics per head (entropy, max_weight, pattern type)
   - Layer-level aggregates
   - Token alternatives and probabilities

2. **On-demand: Full tensor access**
   - Store full tensors in Zarr on backend
   - User can request specific layer/head/step data (see the route sketch after this list)
   - Paginated/chunked downloads

3. **Research mode: Bulk export**
   - Async job that packages everything
   - Downloads as .zarr or .h5 file
   - For offline analysis in Python/Jupyter
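The on-demand path in item 2 could be a thin route that slices the stored Zarr array, so a drill-down fetches kilobytes instead of gigabytes (a sketch; the route shape and store layout are hypothetical and assume a FastAPI backend):

```python
import zarr
from fastapi import FastAPI

app = FastAPI()

@app.get("/tensors/{run_id}/{step}/{layer}/{head}")
def get_head_attention(run_id: str, step: int, layer: int, head: int) -> dict:
    """Return one [seq, seq] attention matrix instead of the whole payload."""
    root = zarr.open_group(f"/data/runs/{run_id}.zarr", mode="r")
    # with (1, 1, seq, seq) chunking, this reads only the chunks it needs
    attn = root[f"step_{step}/attention"][layer, head]
    return {"layer": layer, "head": head, "attention": attn.tolist()}
```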
### Related Files
- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
- `/backend/storage.py` - `ZarrStorage` class already exists
- `/components/research/VerticalPipeline.tsx` - consumes the data

### Notes
- Current implementation prioritizes completeness for PhD research
- Any optimization must not lose data needed for research questions
- Consider making optimization opt-in rather than default