
api / FUTURE_CONSIDERATIONS.md

gary-boon

Add future considerations doc for response size optimization

3e67ea2 3 days ago


	# Future Considerations

	## Response Size Optimization for Research Attention Analysis

	### The Problem

	The `/analyze/research/attention` endpoint returns massive JSON responses because it captures full tensor data for PhD research purposes.

	Current data captured per generation step:
	- Full attention matrices: `[n_layers × n_heads × seq_len × seq_len]`
	- Q/K/V matrices: `[n_layers × n_heads × seq_len × head_dim]`
	- Layer metrics, head patterns, token alternatives

	Response sizes observed:
	| Model | Tokens | Response Size |
	|-------|--------|---------------|
	| CodeGen 350M (20L, 16H) | 8 | ~357MB |
	| Devstral Small (40L, 32H) | 8 | ~300-500MB (estimated) |
	| Devstral Small (40L, 32H) | 50+ | Potentially 2-5GB+ |
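
To see why these sizes grow so quickly, the element count per step can be worked out directly from the tensor shapes listed above. The following is a rough estimator, not a measurement: the function names, the `bytes_per_value` figure for a float serialized as JSON text, the `head_dim` value, and the assumption that `seq_len` grows by one token per generation step are all illustrative guesses rather than properties of the actual endpoint.

```python
def elements_per_step(n_layers, n_heads, seq_len, head_dim):
    """Count tensor values captured for one generation step."""
    # Attention: [n_layers x n_heads x seq_len x seq_len]
    attention = n_layers * n_heads * seq_len * seq_len
    # Q, K and V: three tensors of [n_layers x n_heads x seq_len x head_dim]
    qkv = 3 * n_layers * n_heads * seq_len * head_dim
    return attention + qkv


def estimate_json_bytes(n_layers, n_heads, head_dim, prompt_len, n_steps,
                        bytes_per_value=10):
    """Rough JSON size, assuming each value costs `bytes_per_value` characters."""
    total = 0
    for step in range(n_steps):
        # Assumption: the sequence is the prompt plus one new token per step.
        seq_len = prompt_len + step + 1
        total += elements_per_step(n_layers, n_heads, seq_len, head_dim) * bytes_per_value
    return total


if __name__ == "__main__":
    # Hypothetical inputs: a 20-layer, 16-head model with head_dim=64.
    size = estimate_json_bytes(20, 16, 64, prompt_len=32, n_steps=8)
    print(f"{size / 1e6:.1f} MB (estimate under the stated assumptions)")
```

The attention term is quadratic in `seq_len`, so it comes to dominate as generations get longer, which is consistent with the jump from hundreds of megabytes to gigabytes in the table.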

	### Why This Matters

	For PhD research on a real coding task (e.g., "write a quicksort algorithm"), generating 50-100 tokens would produce multi-gigabyte responses. This creates:

	1. Memory pressure - The browser may crash while parsing huge JSON
	2. Transfer time - Downloads take minutes over typical connections
	3. Storage costs - Saving analysis runs becomes expensive
	4. GPU Space costs - Long-running requests keep an expensive GPU active

	### Potential Solutions (To Be Evaluated)

	#### 1. Binary Format (Zarr/HDF5)
	Store tensors in efficient binary format server-side, stream on demand:
	- Backend saves to Zarr (already have `ZarrStorage` class)
	- Return metadata + URLs to tensor chunks
	- Frontend fetches only what's needed for current visualization
	- Pros: 10x+ size reduction, lazy loading
	- Cons: More complex architecture, requires persistent storage
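
The "metadata + URLs" idea above could look something like the manifest below. This is only a sketch of the response shape: the URL layout, the field names, and `build_manifest` itself are hypothetical, and it deliberately does not call the existing `ZarrStorage` class, whose interface is not described here.

```python
def build_manifest(run_id, n_steps, n_layers, base_url="/tensors"):
    """Describe where each tensor chunk lives instead of inlining the data."""
    chunks = []
    for step in range(n_steps):
        for layer in range(n_layers):
            chunks.append({
                "step": step,
                "layer": layer,
                # Hypothetical URL scheme: one chunk per (step, layer).
                "url": f"{base_url}/{run_id}/attention/{step}/{layer}",
            })
    return {
        "run_id": run_id,
        "n_steps": n_steps,
        "n_layers": n_layers,
        "chunks": chunks,
    }


if __name__ == "__main__":
    manifest = build_manifest("run-001", n_steps=2, n_layers=3)
    print(manifest["chunks"][0]["url"])
```

With a manifest like this the initial response stays small regardless of model size, and the frontend decides which chunks to fetch for the current view.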

	#### 2. Selective Detail Levels
	Offer analysis modes with different detail levels:
	```python
	detail_level = request.get("detail_level", "full")
	# "summary" - metrics only, no matrices (~1MB)
	# "attention_only" - attention matrices, no Q/K/V (~100MB)
	# "top_heads" - only top-5 most interesting heads per layer (~50MB)
	# "full" - everything (current behavior)
	```
	- Pros: User controls trade-off
	- Cons: May miss important patterns in "interesting" head selection
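
A minimal sketch of how these levels might be applied to one step's data follows. The step layout (`metrics`, `attention`, `qkv` keys), the function names, and in particular the `head_score` heuristic are assumptions made for illustration; how "interesting" heads should really be ranked is exactly the open question noted in the cons above.

```python
VALID_LEVELS = {"summary", "attention_only", "top_heads", "full"}


def head_score(head_matrix):
    """Placeholder ranking: the largest single attention weight in the head."""
    return max(max(row) for row in head_matrix)


def apply_detail_level(step, detail_level="full", top_k=5):
    """Return a reduced copy of one step's data for the requested level."""
    if detail_level not in VALID_LEVELS:
        raise ValueError(f"unknown detail_level: {detail_level!r}")
    if detail_level == "full":
        return step

    reduced = {"metrics": step["metrics"]}
    if detail_level == "summary":
        return reduced
    if detail_level == "attention_only":
        reduced["attention"] = step["attention"]
        return reduced

    # "top_heads": keep only the top_k heads per layer, keyed by head index.
    kept_layers = []
    for layer in step["attention"]:
        ranked = sorted(range(len(layer)),
                        key=lambda h: head_score(layer[h]),
                        reverse=True)
        kept_layers.append({h: layer[h] for h in ranked[:top_k]})
    reduced["attention"] = kept_layers
    return reduced
```

Keying the kept heads by their original index means the frontend can still place them correctly in a per-layer grid.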

	#### 3. Streaming Tensor Data
	Instead of one giant JSON, stream tensor chunks:
	- Send metadata and metrics immediately
	- Stream attention matrices layer-by-layer
	- Frontend renders progressively as data arrives
	- Pros: Immediate feedback, can cancel mid-stream
	- Cons: Complex state management, partial data handling
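
One way to picture the streaming order described above is a generator that emits newline-delimited JSON: metadata first, then one message per layer. The message types and field names here are hypothetical, and the sketch is framework-agnostic; it does not reflect how `analyze_research_attention_stream()` currently frames its output.

```python
import json


def stream_step(metadata, attention_by_layer):
    """Yield newline-delimited JSON messages for one generation step."""
    # Metadata and metrics go first so the UI can render something immediately.
    yield json.dumps({"type": "metadata", **metadata}) + "\n"
    # Attention matrices follow one layer at a time.
    for layer_index, matrix in enumerate(attention_by_layer):
        yield json.dumps({
            "type": "layer",
            "layer": layer_index,
            "attention": matrix,
        }) + "\n"
    # An explicit end marker lets the client tell completion from a dropped connection.
    yield json.dumps({"type": "done"}) + "\n"


if __name__ == "__main__":
    for line in stream_step({"tokens": ["def"]}, [[[1.0]], [[1.0]]]):
        print(line, end="")
```

Because each line is a complete JSON document, the client can parse and render layers as they arrive and simply stop reading to cancel.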

	#### 4. Compression
	Apply compression to reduce transfer size:
	- gzip the JSON response (typically 70-80% reduction)
	- Quantize floats to float16 or int8 with scale factors
	- Round to 3-4 decimal places (30-40% reduction)
	- Pros: Simple to implement
	- Cons: Still large, some precision loss
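
The three techniques above can be sketched with the standard library alone. The helper names are illustrative, and the int8 scheme shown (a single symmetric scale factor per list of values) is one simple choice among several; actual savings will depend on the data and are not measured here.

```python
import gzip
import json


def round_values(values, places=4):
    """Round floats so their JSON text representation is shorter."""
    return [round(v, places) for v in values]


def quantize_int8(values):
    """Map floats to integers in [-127, 127] plus a scale factor."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from quantized integers."""
    return [q * scale for q in quantized]


def gzip_json(payload):
    """Serialize to JSON and gzip the resulting bytes."""
    return gzip.compress(json.dumps(payload).encode("utf-8"))


if __name__ == "__main__":
    weights = [i / 1000 for i in range(1000)]
    raw = json.dumps(weights).encode("utf-8")
    packed = gzip_json(round_values(weights))
    print(len(raw), "bytes raw,", len(packed), "bytes gzipped")
```

Quantization error is bounded by half the scale factor per value, which gives a concrete way to judge whether the precision loss is acceptable for a given research question.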

	#### 5. Server-Side Analysis with Thin Results
	Run analysis server-side, return only insights:
	- Compute attention patterns, anomalies, statistics on backend
	- Return summary metrics + flagged interesting patterns
	- Download full tensors only when user drills down
	- Pros: Massive size reduction for typical use
	- Cons: Loses raw data for novel research questions
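
As an example of the kind of summary the backend could compute, the sketch below reduces one head's attention matrix to two numbers. The definitions are assumptions: entropy is taken per query row (natural log) and averaged, and `max_weight` is the largest weight anywhere in the matrix. The project may define these metrics differently.

```python
import math


def head_metrics(head_matrix):
    """Summarize one [seq_len x seq_len] attention matrix."""
    row_entropies = []
    for row in head_matrix:
        # Shannon entropy of the row's attention distribution; zeros contribute nothing.
        row_entropies.append(-sum(p * math.log(p) for p in row if p > 0))
    return {
        "entropy": sum(row_entropies) / len(row_entropies),
        "max_weight": max(max(row) for row in head_matrix),
    }


if __name__ == "__main__":
    # One sharply focused row and one evenly spread row.
    print(head_metrics([[1.0, 0.0], [0.5, 0.5]]))
```

Low entropy indicates a head that attends to few positions; high entropy indicates diffuse attention. Returning only these scalars replaces `seq_len × seq_len` values per head with two.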

	### Recommended Approach (Future)

	A hybrid approach combining:

	1. Default: Compressed summary (~10-50MB)
	   - Attention metrics per head (entropy, max_weight, pattern type)
	   - Layer-level aggregates
	   - Token alternatives and probabilities

	2. On-demand: Full tensor access
	   - Store full tensors in Zarr on backend
	   - User can request specific layer/head/step data
	   - Paginated/chunked downloads

	3. Research mode: Bulk export
	   - Async job that packages everything
	   - Downloads as .zarr or .h5 file
	   - For offline analysis in Python/Jupyter
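
The on-demand tier could expose a lookup along these lines. This is a sketch only: a plain dictionary keyed by `(step, layer, head)` stands in for the Zarr-backed store, and `get_rows`, its parameters, and the response fields are hypothetical names chosen for illustration.

```python
def get_rows(store, step, layer, head, offset=0, limit=64):
    """Return one page of rows from a single head's attention matrix."""
    matrix = store[(step, layer, head)]
    page = matrix[offset:offset + limit]
    next_offset = offset + len(page)
    return {
        "step": step,
        "layer": layer,
        "head": head,
        "offset": offset,
        "rows": page,
        # None signals that there are no further pages.
        "next_offset": next_offset if next_offset < len(matrix) else None,
    }


if __name__ == "__main__":
    # Stand-in store holding one 3x3 matrix for step 0, layer 1, head 2.
    store = {(0, 1, 2): [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]}
    first = get_rows(store, 0, 1, 2, limit=2)
    print(first["next_offset"])
```

A client follows `next_offset` until it is `None`, so no single response needs to carry more than `limit` rows.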

	### Related Files
	- `/backend/model_service.py` - `analyze_research_attention_stream()` builds the response
	- `/backend/storage.py` - `ZarrStorage` class already exists
	- `/components/research/VerticalPipeline.tsx` - consumes the data

	### Notes
	- Current implementation prioritizes completeness for PhD research
	- Any optimization must not lose data needed for research questions
	- Consider making optimization opt-in rather than default