gary-boon and Claude Opus 4.5 committed
Commit 383a328 · 1 parent: c6f4cc5
Add Phase 5: Performance optimizations to phased plan
Document future performance improvements:
- Lazy loading attention data (recommended approach)
- Server-Sent Events for progress streaming
- Client-side progress estimation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
docs/devstral-spark-plan-phased.md
@@ -1960,3 +1960,101 @@ git checkout pre-devstral-phase2-v1
After Phase 2/2b/2c completion (December 2024):
- Backend: Contains MistralAdapter, devstral-small config, /models endpoints
- Frontend: Contains dynamic layer handling, backendFetch conversion

---

## Phase 5: Performance Optimizations (Future Work)

**Goal:** Improve the user experience during analysis by reducing perceived latency and providing better progress feedback.

### 5.1 Problem Statement

After clicking "Analyze", there's a significant delay (~10-20 seconds) between:
1. The backend completing generation (visible in HF logs)
2. The UI updating with results

**Root Cause:** The `/analyze/research/attention` endpoint returns a massive JSON payload:
- It contains attention data for all layers × all heads × all tokens
- For Devstral (40 layers, 32 heads), this can be tens of MB of JSON
- Network transfer and JSON parsing dominate the wait time
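
To make the "tens of MB" figure concrete, here is a rough back-of-the-envelope estimate. The layer and head counts come from the text above; the token counts and the bytes-per-value figure are illustrative assumptions, not measurements from the app.

```typescript
// Rough size estimate for a full attention payload.
// Assumption: one weight per (layer, head, generated token, attended position).
function estimateAttentionPayloadBytes(
  layers: number,
  heads: number,
  generatedTokens: number, // assumed number of generated tokens
  contextLength: number, // assumed number of positions each token attends to
  bytesPerValue: number // assumed average JSON characters per weight, incl. separator
): number {
  return layers * heads * generatedTokens * contextLength * bytesPerValue;
}

// Devstral figures from the text (40 layers, 32 heads); the rest is illustrative.
const estimatedBytes = estimateAttentionPayloadBytes(40, 32, 20, 100, 8);
console.log((estimatedBytes / 1e6).toFixed(1) + " MB"); // 20.5 MB
```

Even with only 20 generated tokens and a 100-token context, the estimate lands in the tens of MB, and it grows linearly with both token counts.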

### 5.2 Lazy Loading Attention Data (Recommended)

Instead of returning all attention data upfront, use progressive loading:

**Phase 1: Initial Response (Fast)**
```json
{
  "promptTokens": [...],
  "generatedTokens": [...],
  "tokenSections": {...},
  "modelInfo": {...},
  "generationTime": 1.23,
  "attentionAvailable": true,
  "attentionLayers": 40
}
```
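
On the frontend, the initial response could be described with a type and a small runtime check. This is a sketch only: the field names follow the JSON above, but the element and object types are assumptions because the plan elides them, and `InitialAnalysisResponse` and `isInitialAnalysisResponse` are hypothetical names.

```typescript
// Field names follow the JSON above; element and object types are assumptions
// because the plan elides them with [...] and {...}.
interface InitialAnalysisResponse {
  promptTokens: unknown[];
  generatedTokens: unknown[];
  tokenSections: Record<string, unknown>;
  modelInfo: Record<string, unknown>;
  generationTime: number;
  attentionAvailable: boolean;
  attentionLayers: number;
}

// Runtime check so the UI can decide whether to render expandable layers.
function isInitialAnalysisResponse(value: unknown): value is InitialAnalysisResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    Array.isArray(v["promptTokens"]) &&
    Array.isArray(v["generatedTokens"]) &&
    typeof v["generationTime"] === "number" &&
    typeof v["attentionAvailable"] === "boolean" &&
    typeof v["attentionLayers"] === "number"
  );
}
```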

**Phase 2: On-Demand Layer Loading**
```
GET /attention/layer/{layer_idx}?token_position={pos}
```

Returns attention data for a single layer only when the user expands it.

**Benefits:**
- Initial response is small and fast (< 100 KB)
- Attention data loaded only when needed
- Users who only inspect a few layers don't download all data

**Backend Changes Required:**
- Store attention data in server-side cache (keyed by session/request ID)
- Add `/attention/layer/{layer_idx}` endpoint
- Add TTL to clean up cached attention data
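
The cache-with-TTL idea from the backend list can be sketched as follows. The plan does not state the backend's language, so this uses TypeScript purely to illustrate the logic; `AttentionCache` and its methods are hypothetical names, and the injectable clock exists only to make expiry easy to test.

```typescript
// Illustrative TTL cache keyed by request ID (all names are hypothetical).
class AttentionCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(requestId: string, value: T): void {
    this.entries.set(requestId, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns undefined for unknown or expired IDs, evicting expired entries lazily.
  get(requestId: string): T | undefined {
    const entry = this.entries.get(requestId);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(requestId);
      return undefined;
    }
    return entry.value;
  }

  // Periodic sweep so abandoned requests do not hold memory indefinitely.
  // Returns the number of entries removed.
  sweep(): number {
    const t = this.now();
    const expired: string[] = [];
    this.entries.forEach((entry, id) => {
      if (t >= entry.expiresAt) expired.push(id);
    });
    expired.forEach((id) => this.entries.delete(id));
    return expired.length;
  }
}
```

Lazy eviction in `get` keeps lookups correct, while `sweep` covers entries that are never read again; the per-layer endpoint would answer with a "not found" style error once an entry has expired.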

**Frontend Changes Required:**
- Update VerticalPipeline to fetch layer data on expand
- Add loading state for individual layers
- Cache fetched layers client-side
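
A minimal sketch of the fetch-on-expand plus client-side cache behaviour. The URL shape follows the endpoint proposed above; `LayerAttentionLoader`, `layerUrl`, and the injected `FetchJson` function are hypothetical, and how the loader is wired into VerticalPipeline is left open.

```typescript
// Hypothetical fetcher signature; in the app this would wrap a real HTTP call.
type FetchJson = (url: string) => Promise<unknown>;

// Builds the per-layer URL in the shape proposed in the plan.
function layerUrl(layerIdx: number, tokenPosition: number): string {
  return "/attention/layer/" + layerIdx + "?token_position=" + tokenPosition;
}

class LayerAttentionLoader {
  private cache = new Map<string, Promise<unknown>>();
  private fetchJson: FetchJson;

  constructor(fetchJson: FetchJson) {
    this.fetchJson = fetchJson;
  }

  // Called when a layer is expanded; repeated expands reuse the cached promise,
  // so each (layer, position) pair is requested at most once while it succeeds.
  load(layerIdx: number, tokenPosition: number): Promise<unknown> {
    const url = layerUrl(layerIdx, tokenPosition);
    let pending = this.cache.get(url);
    if (pending === undefined) {
      pending = this.fetchJson(url);
      // Drop failed requests so a later expand can retry.
      pending.catch(() => {
        this.cache.delete(url);
      });
      this.cache.set(url, pending);
    }
    return pending;
  }
}
```

Caching the promise rather than the resolved data also deduplicates requests that are still in flight, and the pending promise doubles as the per-layer loading state.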

### 5.3 Server-Sent Events for Progress (Alternative)

For users who need to see all data, stream progress updates:

```typescript
// Backend streams events
data: {"stage": "tokenizing", "progress": 0.1}
data: {"stage": "generating", "progress": 0.3, "tokensGenerated": 4}
data: {"stage": "extracting_attention", "progress": 0.5, "layer": 10}
data: {"stage": "serializing", "progress": 0.9}
data: {"stage": "complete", "progress": 1.0, "data": {...}}
```

**Implementation:**
- Change endpoint to return `text/event-stream`
- Frontend uses `EventSource` or `fetch` with `ReadableStream` (`EventSource` only issues GET requests, so an endpoint that takes a request body needs the `fetch` route)
- Progress bar updates with each event
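
If the frontend reads the stream through `fetch`, it has to split the incoming text into events itself. The sketch below shows one way to do that, assuming events are separated by a blank line and use `\n` line endings; `AnalysisProgressEvent` and `parseSseBuffer` are hypothetical names, and the event fields mirror the examples above.

```typescript
// Shape of the progress events shown above; optional fields vary by stage.
interface AnalysisProgressEvent {
  stage: string;
  progress: number;
  tokensGenerated?: number;
  layer?: number;
  data?: unknown;
}

// Splits buffered stream text into complete events plus the unparsed remainder.
// Assumes "\n" line endings and a blank line between events.
function parseSseBuffer(buffer: string): { events: AnalysisProgressEvent[]; rest: string } {
  const events: AnalysisProgressEvent[] = [];
  const blocks = buffer.split("\n\n");
  const rest = blocks.pop() ?? ""; // the last block may still be incomplete
  for (const block of blocks) {
    const dataLines = block
      .split("\n")
      .filter((line) => line.indexOf("data:") === 0)
      .map((line) => line.slice(5).trim());
    if (dataLines.length === 0) continue; // comments or keep-alive lines
    events.push(JSON.parse(dataLines.join("\n")) as AnalysisProgressEvent);
  }
  return { events, rest };
}
```

The caller appends each decoded chunk to `rest`, calls `parseSseBuffer` again, and updates the progress bar from the returned events.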

### 5.4 Client-Side Progress Estimation (Quick Win)

As an interim solution before backend changes:

1. Show staged progress indicator: Tokenizing → Generating → Processing → Loading
2. Estimate timing based on model and prompt length
3. Use optimistic progress that smoothly fills, with final jump on completion

**Limitation:** The indicator won't perfectly match the actual backend state, but it provides better UX than a bar that jumps from empty to full.
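
The optimistic progress in step 3 could be sketched with an asymptotic curve that approaches a cap but never reaches it, leaving the final jump to 100% for the real completion. The exponential curve, the 0.95 cap, and the equal-width stages are assumptions chosen for illustration; `optimisticProgress` and `stageLabel` are hypothetical names.

```typescript
// Asymptotic progress: rises quickly at first, then approaches `cap` without
// exceeding it. The caller sets progress to 1 when the response arrives.
function optimisticProgress(elapsedMs: number, estimatedMs: number, cap: number = 0.95): number {
  if (estimatedMs <= 0) return cap;
  const raw = 1 - Math.exp(-elapsedMs / estimatedMs);
  return Math.min(cap, Math.max(0, raw));
}

// Stage names from step 1; giving each stage an equal share is an assumption.
const STAGES = ["Tokenizing", "Generating", "Processing", "Loading"];

// Maps a progress value in [0, 1] onto one of the staged labels.
function stageLabel(progress: number): string {
  const idx = Math.floor(progress * STAGES.length);
  return STAGES[Math.max(0, Math.min(STAGES.length - 1, idx))];
}
```

Here `estimatedMs` would come from step 2, for example a per-model baseline scaled by prompt length.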

### 5.5 Validation Criteria

- [ ] Initial UI response time < 2 seconds after generation completes
- [ ] Progress indicator shows meaningful stages
- [ ] Users don't see a "stuck" empty progress bar
- [ ] Attention data available when user expands layers

### 5.6 Priority

**Recommended implementation order:**
1. Client-side progress estimation (quick win, no backend changes)
2. Lazy loading attention data (biggest impact on perceived performance)
3. SSE streaming (nice-to-have for power users)