gary-boon Claude Opus 4.5 committed on
Commit 3e67ea2 · 1 Parent(s): 15a862b

Add future considerations doc for response size optimization


Document the scalability challenge with research attention analysis:
- 8 tokens currently produces a ~357MB response
- A full coding task (50+ tokens) could be 2-5GB+
- Outlines potential solutions: binary format, selective detail,
streaming, compression, server-side analysis

This is tracked for future work - current implementation prioritizes
completeness for PhD research.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Files changed (1)
  1. FUTURE_CONSIDERATIONS.md +102 -0
FUTURE_CONSIDERATIONS.md ADDED
@@ -0,0 +1,102 @@
# Future Considerations

## Response Size Optimization for Research Attention Analysis

### The Problem

The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes:

**Current data captured per generation step:**
- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
- Layer metrics, head patterns, token alternatives

**Response sizes observed:**

| Model | Tokens | Response Size |
|-------|--------|---------------|
| CodeGen 350M (20L, 16H) | 8 tokens | ~357MB |
| Devstral Small (40L, 32H) | 8 tokens | ~300-500MB (estimated) |
| Devstral Small (40L, 32H) | 50+ tokens | **Potentially 2-5GB+** |

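For intuition, these figures can be approximated from the tensor shapes alone. A rough estimator sketch (the ~12 bytes per JSON-serialized value, the 40-token prompt, and CodeGen's head_dim of 64 are assumptions, not measurements; real responses also carry keys, metrics, and token alternatives):

```python
# Back-of-envelope size estimate for the full-capture JSON response.
# Assumes each value serializes to ~12 bytes of JSON text and that seq_len at
# generation step t is prompt_len + t. Treat the result as a floor.

def estimate_response_bytes(n_layers: int, n_heads: int, head_dim: int,
                            prompt_len: int, gen_tokens: int,
                            bytes_per_value: int = 12) -> int:
    total_values = 0
    for t in range(1, gen_tokens + 1):
        seq_len = prompt_len + t
        attention = n_layers * n_heads * seq_len * seq_len     # full attention matrices
        qkv = 3 * n_layers * n_heads * seq_len * head_dim      # Q, K and V
        total_values += attention + qkv
    return total_values * bytes_per_value


if __name__ == "__main__":
    # CodeGen-350M-style config (20 layers, 16 heads, head_dim 64 assumed),
    # with a hypothetical 40-token prompt and 8 generated tokens.
    est = estimate_response_bytes(n_layers=20, n_heads=16, head_dim=64,
                                  prompt_len=40, gen_tokens=8)
    print(f"~{est / 1e6:.0f} MB")
```

The quadratic `seq_len × seq_len` attention term is what pushes longer generations toward the multi-gigabyte range.
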
### Why This Matters

For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:

1. **Memory pressure** - The browser may crash parsing huge JSON
2. **Transfer time** - Minutes to download over typical connections
3. **Storage costs** - Saving analysis runs becomes expensive
4. **GPU Space costs** - Long-running requests keep an expensive GPU active

### Potential Solutions (To Be Evaluated)

#### 1. Binary Format (Zarr/HDF5)
Store tensors in an efficient binary format server-side and stream them on demand (see the sketch after this list):
- Backend saves to Zarr (a `ZarrStorage` class already exists)
- Return metadata + URLs to tensor chunks
- Frontend fetches only what's needed for the current visualization
- **Pros**: 10x+ size reduction, lazy loading
- **Cons**: More complex architecture, requires persistent storage

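A minimal sketch of what this could look like, shown with h5py for brevity; the existing `ZarrStorage` class would be the natural home for it, and the `save_step_tensors` helper, on-disk layout, and manifest fields below are illustrative assumptions rather than the current API:

```python
# Illustrative only: write per-step tensors to a binary file server-side and
# return a thin JSON manifest (shapes + a pointer) instead of raw values.
import os
import h5py
import numpy as np

def save_step_tensors(run_id: str, step: int,
                      attn: np.ndarray, qkv: np.ndarray) -> dict:
    """attn: [n_layers, n_heads, seq_len, seq_len]; qkv: [3, n_layers, n_heads, seq_len, head_dim]."""
    os.makedirs("runs", exist_ok=True)
    with h5py.File(f"runs/{run_id}.h5", "a") as f:
        grp = f.require_group(f"step_{step}")
        grp.create_dataset("attention", data=attn.astype(np.float16), compression="gzip")
        grp.create_dataset("qkv", data=qkv.astype(np.float16), compression="gzip")
    return {
        "step": step,
        "tensor_uri": f"/research/tensors/{run_id}/step_{step}",   # hypothetical fetch endpoint
        "attention_shape": list(attn.shape),
        "qkv_shape": list(qkv.shape),
    }
```

float16 plus gzip alone already shrinks the on-disk footprint well below the JSON text size, and the frontend only downloads the slices it actually renders.
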
#### 2. Selective Detail Levels
Offer analysis modes with different levels of detail (a pruning sketch follows below):
```python
detail_level = request.get("detail_level", "full")
# "summary"        - metrics only, no matrices (~1MB)
# "attention_only" - attention matrices, no Q/K/V (~100MB)
# "top_heads"      - only the top-5 most interesting heads per layer (~50MB)
# "full"           - everything (current behavior)
```
- **Pros**: User controls the trade-off
- **Cons**: May miss important patterns in the "interesting" head selection

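One way the pruning behind these levels might look; the payload field names (`attention`, `qkv`, `top_heads`) and the per-head keying are assumptions about the response schema, not the current implementation:

```python
# Drop the heavy tensor fields unless the requested detail level needs them.
HEAVY_FIELDS = ("attention", "qkv")

def prune_step(step_payload: dict, detail_level: str = "full") -> dict:
    if detail_level == "full":
        return step_payload
    pruned = {k: v for k, v in step_payload.items() if k not in HEAVY_FIELDS}
    if detail_level == "attention_only":
        pruned["attention"] = step_payload["attention"]
    elif detail_level == "top_heads":
        # keep only the heads the backend's own metrics flagged as interesting
        keep = set(step_payload.get("top_heads", []))
        pruned["attention"] = {h: m for h, m in step_payload["attention"].items()
                               if h in keep}
    return pruned  # "summary" falls through with metrics only
```
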
#### 3. Streaming Tensor Data
Instead of one giant JSON response, stream tensor chunks (see the sketch after this list):
- Send metadata and metrics immediately
- Stream attention matrices layer-by-layer
- Frontend renders progressively as data arrives
- **Pros**: Immediate feedback, can cancel mid-stream
- **Cons**: Complex state management, partial-data handling

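A sketch of this as newline-delimited JSON, assuming a FastAPI-style backend (the framework choice and the `compute_layer_payload` helper are assumptions):

```python
# Metadata goes out immediately, then one NDJSON line per layer, so the
# frontend can render progressively and the client can cancel mid-stream.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def compute_layer_payload(prompt: str, layer: int) -> dict:
    # Placeholder for per-layer attention capture / retrieval.
    return {"layer": layer, "attention": [], "metrics": {}}

@app.post("/analyze/research/attention/stream")
def stream_attention(prompt: str, n_layers: int = 40):
    def generate():
        yield json.dumps({"type": "metadata", "n_layers": n_layers}) + "\n"
        for layer in range(n_layers):
            payload = {"type": "layer", "layer": layer,
                       "data": compute_layer_payload(prompt, layer)}
            yield json.dumps(payload) + "\n"
    return StreamingResponse(generate(), media_type="application/x-ndjson")
```
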
#### 4. Compression
Apply compression to reduce transfer size (see the quantization sketch after this list):
- gzip the JSON response (typically 70-80% reduction)
- Quantize floats to float16 or int8 with scale factors
- Round to 3-4 decimal places (30-40% reduction)
- **Pros**: Simple to implement
- **Cons**: Responses stay large, and there is some precision loss

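A rough illustration of the quantization idea: symmetric per-tensor int8 with a scale factor, then gzip over the serialized result (the actual compression ratio will vary with the data):

```python
# int8 quantization with a per-tensor scale factor, plus gzip, as a rough
# illustration of option 4. Precision loss is fine for visualization but
# should be validated against the research questions.
import gzip
import json
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[list, float]:
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0
    q = np.round(x / scale).astype(np.int8)
    return q.tolist(), scale

attn = np.random.rand(16, 24, 24).astype(np.float32)    # toy [heads, seq, seq]
q, scale = quantize_int8(attn)
payload = json.dumps({"scale": scale, "values": q}).encode()
print(len(payload), len(gzip.compress(payload)))          # raw vs gzipped bytes

# client-side reconstruction: values * scale
recovered = np.array(q, dtype=np.int8).astype(np.float32) * scale
print(float(np.abs(recovered - attn).max()))              # worst-case error
```
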
#### 5. Server-Side Analysis with Thin Results
Run the analysis server-side and return only the insights (see the sketch after this list):
- Compute attention patterns, anomalies, and statistics on the backend
- Return summary metrics + flagged interesting patterns
- Download full tensors only when the user drills down
- **Pros**: Massive size reduction for typical use
- **Cons**: Loses raw data for novel research questions

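A sketch of the kind of per-head summary this option would return instead of raw matrices; the specific metrics (mean entropy, max weight, diagonal mass) are illustrative and mirror the summary fields mentioned under the recommended approach below:

```python
# Per-head summary statistics computed server-side; only these scalars are
# returned, not the [seq_len x seq_len] matrices themselves.
import numpy as np

def summarize_heads(attn: np.ndarray) -> list[dict]:
    """attn: [n_layers, n_heads, seq_len, seq_len] attention weights."""
    n_layers, n_heads, seq_len, _ = attn.shape
    eps = 1e-12
    summaries = []
    for layer in range(n_layers):
        for head in range(n_heads):
            w = attn[layer, head]                                  # [seq_len, seq_len]
            row_entropy = -np.sum(w * np.log(w + eps), axis=-1)    # entropy per query position
            summaries.append({
                "layer": layer,
                "head": head,
                "mean_entropy": float(row_entropy.mean()),
                "max_weight": float(w.max()),
                "diag_mass": float(np.trace(w) / seq_len),   # crude "attends to current token" signal
            })
    return summaries
```
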
### Recommended Approach (Future)

A hybrid approach combining:
1. **Default: Compressed summary** (~10-50MB)
   - Attention metrics per head (entropy, max_weight, pattern type)
   - Layer-level aggregates
   - Token alternatives and probabilities

2. **On-demand: Full tensor access** (see the endpoint sketch after this list)
   - Store full tensors in Zarr on the backend
   - User can request specific layer/head/step data
   - Paginated/chunked downloads

3. **Research mode: Bulk export**
   - Async job that packages everything
   - Downloads as a .zarr or .h5 file
   - For offline analysis in Python/Jupyter

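For the on-demand tier, drill-down access could look roughly like the following, continuing the h5py layout from the binary-format sketch above (the endpoint path and storage layout are assumptions; the existing `ZarrStorage` would be the real target):

```python
# Hypothetical drill-down endpoint: heavy tensors stay on the backend and the
# frontend fetches one layer/head/step slice at a time (kilobytes, not GB).
import h5py
from fastapi import FastAPI

app = FastAPI()

@app.get("/research/runs/{run_id}/attention")
def get_attention_slice(run_id: str, step: int, layer: int, head: int):
    # h5py slices lazily, so only the requested head's matrix leaves the disk.
    with h5py.File(f"runs/{run_id}.h5", "r") as f:
        matrix = f[f"step_{step}/attention"][layer, head]
    return {"run_id": run_id, "step": step, "layer": layer, "head": head,
            "values": matrix.astype(float).tolist()}
```
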
### Related Files
- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
- `/backend/storage.py` - `ZarrStorage` class already exists
- `/components/research/VerticalPipeline.tsx` - consumes the data

### Notes
- Current implementation prioritizes completeness for PhD research
- Any optimization must not lose data needed for research questions
- Consider making optimization opt-in rather than default