Spaces:
Running
on
CPU Upgrade
Running
on
CPU Upgrade
File size: 4,114 Bytes
3e67ea2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 |
# Future Considerations
## Response Size Optimization for Research Attention Analysis
### The Problem
The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes:
**Current data captured per generation step:**
- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
- Layer metrics, head patterns, token alternatives
**Response sizes observed:**
| Model | Tokens | Response Size |
|-------|--------|---------------|
| CodeGen 350M (20L, 16H) | 8 tokens | ~357MB |
| Devstral Small (40L, 32H) | 8 tokens | ~300-500MB estimated |
| Devstral Small (40L, 32H) | 50+ tokens | **Potentially 2-5GB+** |
### Why This Matters
For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:
1. **Memory pressure** - Browser may crash parsing huge JSON
2. **Transfer time** - Minutes to download over typical connections
3. **Storage costs** - Saving analysis runs becomes expensive
4. **GPU Space costs** - Long-running requests keep expensive GPU active
### Potential Solutions (To Be Evaluated)
#### 1. Binary Format (Zarr/HDF5)
Store tensors in efficient binary format server-side, stream on demand:
- Backend saves to Zarr (already have `ZarrStorage` class)
- Return metadata + URLs to tensor chunks
- Frontend fetches only what's needed for current visualization
- **Pros**: 10x+ size reduction, lazy loading
- **Cons**: More complex architecture, requires persistent storage
#### 2. Selective Detail Levels
Offer analysis modes with different detail levels:
```python
detail_level = request.get("detail_level", "full")
# "summary" - metrics only, no matrices (~1MB)
# "attention_only" - attention matrices, no Q/K/V (~100MB)
# "top_heads" - only top-5 most interesting heads per layer (~50MB)
# "full" - everything (current behavior)
```
- **Pros**: User controls trade-off
- **Cons**: May miss important patterns in "interesting" head selection
#### 3. Streaming Tensor Data
Instead of one giant JSON, stream tensor chunks:
- Send metadata and metrics immediately
- Stream attention matrices layer-by-layer
- Frontend renders progressively as data arrives
- **Pros**: Immediate feedback, can cancel mid-stream
- **Cons**: Complex state management, partial data handling
#### 4. Compression
Apply compression to reduce transfer size:
- gzip the JSON response (typically 70-80% reduction)
- Quantize floats to float16 or int8 with scale factors
- Round to 3-4 decimal places (30-40% reduction)
- **Pros**: Simple to implement
- **Cons**: Still large, some precision loss
#### 5. Server-Side Analysis with Thin Results
Run analysis server-side, return only insights:
- Compute attention patterns, anomalies, statistics on backend
- Return summary metrics + flagged interesting patterns
- Download full tensors only when user drills down
- **Pros**: Massive size reduction for typical use
- **Cons**: Loses raw data for novel research questions
### Recommended Approach (Future)
A hybrid approach combining:
1. **Default: Compressed summary** (~10-50MB)
- Attention metrics per head (entropy, max_weight, pattern type)
- Layer-level aggregates
- Token alternatives and probabilities
2. **On-demand: Full tensor access**
- Store full tensors in Zarr on backend
- User can request specific layer/head/step data
- Paginated/chunked downloads
3. **Research mode: Bulk export**
- Async job that packages everything
- Downloads as .zarr or .h5 file
- For offline analysis in Python/Jupyter
### Related Files
- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
- `/backend/storage.py` - `ZarrStorage` class already exists
- `/components/research/VerticalPipeline.tsx` - consumes the data
### Notes
- Current implementation prioritizes completeness for PhD research
- Any optimization must not lose data needed for research questions
- Consider making optimization opt-in rather than default
|