gary-boon Claude Opus 4.5 committed on
Commit 3e67ea2 · 1 Parent(s): 15a862b

Add future considerations doc for response size optimization


Document the scalability challenge with research attention analysis:
- 8 tokens currently produces a ~357MB response
- A full coding task (50+ tokens) could be 2-5GB+
- Outlines potential solutions: binary format, selective detail,
streaming, compression, server-side analysis

This is tracked for future work - current implementation prioritizes
completeness for PhD research.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Files changed (1)
  1. FUTURE_CONSIDERATIONS.md +102 -0
FUTURE_CONSIDERATIONS.md ADDED
@@ -0,0 +1,102 @@
# Future Considerations

## Response Size Optimization for Research Attention Analysis

### The Problem

The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes:

**Current data captured per generation step:**
- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
- Layer metrics, head patterns, token alternatives

**Response sizes observed:**

| Model | Tokens | Response Size |
|-------|--------|---------------|
| CodeGen 350M (20L, 16H) | 8 tokens | ~357MB |
| Devstral Small (40L, 32H) | 8 tokens | ~300-500MB (estimated) |
| Devstral Small (40L, 32H) | 50+ tokens | **Potentially 2-5GB+** |

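For intuition, these figures can be approximated from the tensor shapes alone. A rough estimator sketch (the ~12 bytes per JSON-serialized value, the 40-token prompt, and CodeGen's head_dim of 64 are assumptions, not measurements; real responses also carry keys, metrics, and token alternatives):

```python
# Back-of-envelope size estimate for the full-capture JSON response.
# Assumes each value serializes to ~12 bytes of JSON text and that seq_len at
# generation step t is prompt_len + t. Treat the result as a floor.

def estimate_response_bytes(n_layers: int, n_heads: int, head_dim: int,
                            prompt_len: int, gen_tokens: int,
                            bytes_per_value: int = 12) -> int:
    total_values = 0
    for t in range(1, gen_tokens + 1):
        seq_len = prompt_len + t
        attention = n_layers * n_heads * seq_len * seq_len     # full attention matrices
        qkv = 3 * n_layers * n_heads * seq_len * head_dim      # Q, K and V
        total_values += attention + qkv
    return total_values * bytes_per_value


if __name__ == "__main__":
    # CodeGen-350M-style config (20 layers, 16 heads, head_dim 64 assumed),
    # with a hypothetical 40-token prompt and 8 generated tokens.
    est = estimate_response_bytes(n_layers=20, n_heads=16, head_dim=64,
                                  prompt_len=40, gen_tokens=8)
    print(f"~{est / 1e6:.0f} MB")
```

The quadratic `seq_len × seq_len` attention term is what pushes longer generations toward the multi-gigabyte range.
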
### Why This Matters

For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:

1. **Memory pressure** - The browser may crash parsing huge JSON
2. **Transfer time** - Minutes to download over typical connections
3. **Storage costs** - Saving analysis runs becomes expensive
4. **GPU Space costs** - Long-running requests keep an expensive GPU active

### Potential Solutions (To Be Evaluated)

#### 1. Binary Format (Zarr/HDF5)
Store tensors in an efficient binary format server-side and stream them on demand (see the sketch after this list):
- Backend saves to Zarr (a `ZarrStorage` class already exists)
- Return metadata + URLs to tensor chunks
- Frontend fetches only what's needed for the current visualization
- **Pros**: 10x+ size reduction, lazy loading
- **Cons**: More complex architecture, requires persistent storage

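A minimal sketch of what this could look like, shown with h5py for brevity; the existing `ZarrStorage` class would be the natural home for it, and the `save_step_tensors` helper, on-disk layout, and manifest fields below are illustrative assumptions rather than the current API:

```python
# Illustrative only: write per-step tensors to a binary file server-side and
# return a thin JSON manifest (shapes + a pointer) instead of raw values.
import os
import h5py
import numpy as np

def save_step_tensors(run_id: str, step: int,
                      attn: np.ndarray, qkv: np.ndarray) -> dict:
    """attn: [n_layers, n_heads, seq_len, seq_len]; qkv: [3, n_layers, n_heads, seq_len, head_dim]."""
    os.makedirs("runs", exist_ok=True)
    with h5py.File(f"runs/{run_id}.h5", "a") as f:
        grp = f.require_group(f"step_{step}")
        grp.create_dataset("attention", data=attn.astype(np.float16), compression="gzip")
        grp.create_dataset("qkv", data=qkv.astype(np.float16), compression="gzip")
    return {
        "step": step,
        "tensor_uri": f"/research/tensors/{run_id}/step_{step}",   # hypothetical fetch endpoint
        "attention_shape": list(attn.shape),
        "qkv_shape": list(qkv.shape),
    }
```

float16 plus gzip alone already shrinks the on-disk footprint well below the JSON text size, and the frontend only downloads the slices it actually renders.
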
#### 2. Selective Detail Levels
Offer analysis modes with different levels of detail (a pruning sketch follows below):
```python
detail_level = request.get("detail_level", "full")
# "summary"        - metrics only, no matrices (~1MB)
# "attention_only" - attention matrices, no Q/K/V (~100MB)
# "top_heads"      - only the top-5 most interesting heads per layer (~50MB)
# "full"           - everything (current behavior)
```
- **Pros**: User controls the trade-off
- **Cons**: May miss important patterns in the "interesting" head selection

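One way the pruning behind these levels might look; the payload field names (`attention`, `qkv`, `top_heads`) and the per-head keying are assumptions about the response schema, not the current implementation:

```python
# Drop the heavy tensor fields unless the requested detail level needs them.
HEAVY_FIELDS = ("attention", "qkv")

def prune_step(step_payload: dict, detail_level: str = "full") -> dict:
    if detail_level == "full":
        return step_payload
    pruned = {k: v for k, v in step_payload.items() if k not in HEAVY_FIELDS}
    if detail_level == "attention_only":
        pruned["attention"] = step_payload["attention"]
    elif detail_level == "top_heads":
        # keep only the heads the backend's own metrics flagged as interesting
        keep = set(step_payload.get("top_heads", []))
        pruned["attention"] = {h: m for h, m in step_payload["attention"].items()
                               if h in keep}
    return pruned  # "summary" falls through with metrics only
```
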
#### 3. Streaming Tensor Data
Instead of one giant JSON response, stream tensor chunks (see the sketch after this list):
- Send metadata and metrics immediately
- Stream attention matrices layer-by-layer
- Frontend renders progressively as data arrives
- **Pros**: Immediate feedback, can cancel mid-stream
- **Cons**: Complex state management, partial-data handling

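A sketch of this as newline-delimited JSON, assuming a FastAPI-style backend (the framework choice and the `compute_layer_payload` helper are assumptions):

```python
# Metadata goes out immediately, then one NDJSON line per layer, so the
# frontend can render progressively and the client can cancel mid-stream.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def compute_layer_payload(prompt: str, layer: int) -> dict:
    # Placeholder for per-layer attention capture / retrieval.
    return {"layer": layer, "attention": [], "metrics": {}}

@app.post("/analyze/research/attention/stream")
def stream_attention(prompt: str, n_layers: int = 40):
    def generate():
        yield json.dumps({"type": "metadata", "n_layers": n_layers}) + "\n"
        for layer in range(n_layers):
            payload = {"type": "layer", "layer": layer,
                       "data": compute_layer_payload(prompt, layer)}
            yield json.dumps(payload) + "\n"
    return StreamingResponse(generate(), media_type="application/x-ndjson")
```
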
#### 4. Compression
Apply compression to reduce transfer size (see the quantization sketch after this list):
- gzip the JSON response (typically 70-80% reduction)
- Quantize floats to float16 or int8 with scale factors
- Round to 3-4 decimal places (30-40% reduction)
- **Pros**: Simple to implement
- **Cons**: Responses stay large, and there is some precision loss

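A rough illustration of the quantization idea: symmetric per-tensor int8 with a scale factor, then gzip over the serialized result (the actual compression ratio will vary with the data):

```python
# int8 quantization with a per-tensor scale factor, plus gzip, as a rough
# illustration of option 4. Precision loss is fine for visualization but
# should be validated against the research questions.
import gzip
import json
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[list, float]:
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0
    q = np.round(x / scale).astype(np.int8)
    return q.tolist(), scale

attn = np.random.rand(16, 24, 24).astype(np.float32)    # toy [heads, seq, seq]
q, scale = quantize_int8(attn)
payload = json.dumps({"scale": scale, "values": q}).encode()
print(len(payload), len(gzip.compress(payload)))          # raw vs gzipped bytes

# client-side reconstruction: values * scale
recovered = np.array(q, dtype=np.int8).astype(np.float32) * scale
print(float(np.abs(recovered - attn).max()))              # worst-case error
```
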
#### 5. Server-Side Analysis with Thin Results
Run the analysis server-side and return only the insights (see the sketch after this list):
- Compute attention patterns, anomalies, and statistics on the backend
- Return summary metrics + flagged interesting patterns
- Download full tensors only when the user drills down
- **Pros**: Massive size reduction for typical use
- **Cons**: Loses raw data for novel research questions

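A sketch of the kind of per-head summary this option would return instead of raw matrices; the specific metrics (mean entropy, max weight, diagonal mass) are illustrative and mirror the summary fields mentioned under the recommended approach below:

```python
# Per-head summary statistics computed server-side; only these scalars are
# returned, not the [seq_len x seq_len] matrices themselves.
import numpy as np

def summarize_heads(attn: np.ndarray) -> list[dict]:
    """attn: [n_layers, n_heads, seq_len, seq_len] attention weights."""
    n_layers, n_heads, seq_len, _ = attn.shape
    eps = 1e-12
    summaries = []
    for layer in range(n_layers):
        for head in range(n_heads):
            w = attn[layer, head]                                  # [seq_len, seq_len]
            row_entropy = -np.sum(w * np.log(w + eps), axis=-1)    # entropy per query position
            summaries.append({
                "layer": layer,
                "head": head,
                "mean_entropy": float(row_entropy.mean()),
                "max_weight": float(w.max()),
                "diag_mass": float(np.trace(w) / seq_len),   # crude "attends to current token" signal
            })
    return summaries
```
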
### Recommended Approach (Future)

A hybrid approach combining:
1. **Default: Compressed summary** (~10-50MB)
   - Attention metrics per head (entropy, max_weight, pattern type)
   - Layer-level aggregates
   - Token alternatives and probabilities

2. **On-demand: Full tensor access** (see the endpoint sketch after this list)
   - Store full tensors in Zarr on the backend
   - User can request specific layer/head/step data
   - Paginated/chunked downloads

3. **Research mode: Bulk export**
   - Async job that packages everything
   - Downloads as a .zarr or .h5 file
   - For offline analysis in Python/Jupyter

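For the on-demand tier, drill-down access could look roughly like the following, continuing the h5py layout from the binary-format sketch above (the endpoint path and storage layout are assumptions; the existing `ZarrStorage` would be the real target):

```python
# Hypothetical drill-down endpoint: heavy tensors stay on the backend and the
# frontend fetches one layer/head/step slice at a time (kilobytes, not GB).
import h5py
from fastapi import FastAPI

app = FastAPI()

@app.get("/research/runs/{run_id}/attention")
def get_attention_slice(run_id: str, step: int, layer: int, head: int):
    # h5py slices lazily, so only the requested head's matrix leaves the disk.
    with h5py.File(f"runs/{run_id}.h5", "r") as f:
        matrix = f[f"step_{step}/attention"][layer, head]
    return {"run_id": run_id, "step": step, "layer": layer, "head": head,
            "values": matrix.astype(float).tolist()}
```
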
### Related Files
- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
- `/backend/storage.py` - `ZarrStorage` class already exists
- `/components/research/VerticalPipeline.tsx` - consumes the data

### Notes
- Current implementation prioritizes completeness for PhD research
- Any optimization must not lose data needed for research questions
- Consider making optimization opt-in rather than default