gary-boon Claude Opus 4.5 committed on
Commit
383a328
·
1 Parent(s): c6f4cc5

Add Phase 5: Performance optimizations to phased plan

Document future performance improvements:
- Lazy loading attention data (recommended approach)
- Server-Sent Events for progress streaming
- Client-side progress estimation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Files changed (1)
  1. docs/devstral-spark-plan-phased.md +98 -0
docs/devstral-spark-plan-phased.md CHANGED
@@ -1960,3 +1960,101 @@ git checkout pre-devstral-phase2-v1
  After Phase 2/2b/2c completion (December 2024):
  - Backend: Contains MistralAdapter, devstral-small config, /models endpoints
  - Frontend: Contains dynamic layer handling, backendFetch conversion
+
+ ---
+
+ ## Phase 5: Performance Optimizations (Future Work)
+
+ **Goal:** Improve the user experience during analysis by reducing perceived latency and providing better progress feedback.
+
+ ### 5.1 Problem Statement
+
+ After clicking "Analyze", there is a significant delay (~10-20 seconds) between:
+ 1. The backend completing generation (visible in the HF logs)
+ 2. The UI updating with results
+
+ **Root Cause:** The `/analyze/research/attention` endpoint returns a very large JSON payload containing:
+ - Attention data for all layers × all heads × all tokens
+ - For Devstral (40 layers, 32 heads), this can be tens of megabytes of JSON
+ - Network transfer and JSON parsing dominate the wait time
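The payload arithmetic above can be sketched as a quick back-of-envelope estimate. The ~8 bytes per serialized number is an assumed average for a JSON-encoded float (digits plus delimiters), not a measured figure:

```typescript
// Rough size estimate for the full-attention JSON payload.
// bytesPerNumber = 8 is an assumed average; real payloads vary with
// float precision and serialization settings.
function estimateAttentionBytes(
  layers: number,
  heads: number,
  tokens: number,
  bytesPerNumber: number = 8,
): number {
  // Each head carries a tokens × tokens attention matrix.
  return layers * heads * tokens * tokens * bytesPerNumber;
}

// A Devstral-style config (40 layers, 32 heads) with only 50 tokens
// already lands in the tens of megabytes.
const mb = estimateAttentionBytes(40, 32, 50) / 1e6;
console.log(`${mb.toFixed(1)} MB`); // 25.6 MB
```

Doubling the token count quadruples the payload, which is why longer prompts feel disproportionately slower.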
+
+ ### 5.2 Lazy Loading Attention Data (Recommended)
+
+ Instead of returning all attention data upfront, use progressive loading:
+
+ **Phase 1: Initial Response (Fast)**
+ ```json
+ {
+   "promptTokens": [...],
+   "generatedTokens": [...],
+   "tokenSections": {...},
+   "modelInfo": {...},
+   "generationTime": 1.23,
+   "attentionAvailable": true,
+   "attentionLayers": 40
+ }
+ ```
+
+ **Phase 2: On-Demand Layer Loading**
+ ```
+ GET /attention/layer/{layer_idx}?token_position={pos}
+ ```
+
+ Returns attention data for a single layer only when the user expands it.
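A minimal client helper for this route might look as follows. The URL shape mirrors the sketch above; `fetchLayer` and its error handling are illustrative, not an existing API:

```typescript
// Builds the per-layer URL sketched above; split out so it is easy to test.
function buildLayerUrl(base: string, layerIdx: number, tokenPos: number): string {
  return `${base}/attention/layer/${layerIdx}?token_position=${tokenPos}`;
}

// Fetches one layer's attention data on demand (e.g. when a layer is expanded).
async function fetchLayer(
  base: string,
  layerIdx: number,
  tokenPos: number,
): Promise<unknown> {
  const res = await fetch(buildLayerUrl(base, layerIdx, tokenPos));
  if (!res.ok) throw new Error(`Layer ${layerIdx} fetch failed: ${res.status}`);
  return res.json(); // attention weights for this single layer
}
```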
+
+ **Benefits:**
+ - Initial response is small and fast (< 100KB)
+ - Attention data loaded only when needed
+ - Users who only inspect a few layers don't download all data
+
+ **Backend Changes Required:**
+ - Store attention data in server-side cache (keyed by session/request ID)
+ - Add `/attention/layer/{idx}` endpoint
+ - Add TTL to clean up cached attention data
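As an illustration of the cache-plus-TTL idea (shown in TypeScript for consistency with the rest of this plan; the real backend may be a different language and would likely use an existing cache library):

```typescript
// Illustrative server-side TTL cache for attention payloads, keyed by
// session/request ID. Expired entries are evicted lazily on read.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) {
      this.entries.delete(key); // lazily evict the expired entry
      return undefined;
    }
    return e.value;
  }
}
```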
+
+ **Frontend Changes Required:**
+ - Update VerticalPipeline to fetch layer data on expand
+ - Add loading state for individual layers
+ - Cache fetched layers client-side
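The client-side cache can be as simple as memoizing the in-flight promise per layer. This is a sketch; `fetchLayer` stands in for whatever helper actually calls the per-layer endpoint, and the `LayerData` shape is assumed:

```typescript
type LayerData = number[][][]; // heads × tokens × tokens (assumed shape)

// Caches the promise rather than the resolved value, so two rapid expands
// of the same layer share one request instead of firing twice.
class LayerCache {
  private layers = new Map<number, Promise<LayerData>>();

  constructor(private fetchLayer: (idx: number) => Promise<LayerData>) {}

  get(idx: number): Promise<LayerData> {
    let entry = this.layers.get(idx);
    if (!entry) {
      entry = this.fetchLayer(idx);
      this.layers.set(idx, entry);
    }
    return entry;
  }
}
```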
+
+ ### 5.3 Server-Sent Events for Progress (Alternative)
+
+ For users who need to see all data, stream progress updates from the backend:
+
+ ```
+ data: {"stage": "tokenizing", "progress": 0.1}
+ data: {"stage": "generating", "progress": 0.3, "tokensGenerated": 4}
+ data: {"stage": "extracting_attention", "progress": 0.5, "layer": 10}
+ data: {"stage": "serializing", "progress": 0.9}
+ data: {"stage": "complete", "progress": 1.0, "data": {...}}
+ ```
+
+ **Implementation:**
+ - Change the endpoint to return `text/event-stream`
+ - Frontend uses `EventSource` or `fetch` with a `ReadableStream`
+ - Progress bar updates with each event
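On the client, each `data:` line parses to one progress event. Pulling the parsing into a small function keeps it usable from either an `EventSource` `onmessage` handler or a `ReadableStream` loop; the event shape below simply mirrors the stream sketched above:

```typescript
interface AnalysisProgress {
  stage: string;
  progress: number;
  [extra: string]: unknown; // tokensGenerated, layer, data, ...
}

// Parses one SSE line; returns null for comments, blank lines, and
// non-data fields so callers can simply skip them.
function parseProgressLine(line: string): AnalysisProgress | null {
  if (!line.startsWith("data:")) return null;
  return JSON.parse(line.slice("data:".length)) as AnalysisProgress;
}
```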
+
+ ### 5.4 Client-Side Progress Estimation (Quick Win)
+
+ As an interim solution before backend changes:
+
+ 1. Show a staged progress indicator: Tokenizing → Generating → Processing → Loading
+ 2. Estimate timing based on the model and prompt length
+ 3. Use optimistic progress that fills smoothly, with a final jump on completion
+
+ **Limitation:** Progress won't perfectly match the actual backend state, but it provides a better UX than empty-then-full.
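One way to sketch the optimistic fill. The base and per-token timing constants are placeholders to tune against real measurements, not observed values:

```typescript
// Crude duration heuristic: fixed overhead plus assumed per-token costs.
// All constants here are made-up placeholders for empirical tuning.
function estimateDurationMs(promptTokens: number, maxNewTokens: number): number {
  const baseMs = 3000;     // fixed overhead (assumed)
  const perTokenMs = 150;  // generation cost per new token (assumed)
  return baseMs + perTokenMs * maxNewTokens + 5 * promptTokens;
}

// Maps elapsed time to displayed progress with an ease-out curve, capped
// at 95% so the bar never claims completion before the response lands;
// the UI jumps to 100% when real data arrives.
function optimisticProgress(elapsedMs: number, estimatedMs: number): number {
  const ratio = Math.min(elapsedMs / estimatedMs, 1);
  return 0.95 * (1 - Math.pow(1 - ratio, 2));
}
```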
+
+ ### 5.5 Validation Criteria
+
+ - [ ] Initial UI response time < 2 seconds after generation completes
+ - [ ] Progress indicator shows meaningful stages
+ - [ ] Users don't see "stuck" empty progress bar
+ - [ ] Attention data available when user expands layers
+
+ ### 5.6 Priority
+
+ **Recommended implementation order:**
+ 1. Client-side progress estimation (quick win, no backend changes)
+ 2. Lazy loading attention data (biggest impact on perceived performance)
+ 3. SSE streaming (nice-to-have for power users)