gary-boon Claude Opus 4.5 committed on
Commit
383a328
·
1 Parent(s): c6f4cc5

Add Phase 5: Performance optimizations to phased plan

Document future performance improvements:
- Lazy loading attention data (recommended approach)
- Server-Sent Events for progress streaming
- Client-side progress estimation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Files changed (1)
  1. docs/devstral-spark-plan-phased.md +98 -0
docs/devstral-spark-plan-phased.md CHANGED
@@ -1960,3 +1960,101 @@ git checkout pre-devstral-phase2-v1
  After Phase 2/2b/2c completion (December 2024):
  - Backend: Contains MistralAdapter, devstral-small config, /models endpoints
  - Frontend: Contains dynamic layer handling, backendFetch conversion
+
+ ---
+
+ ## Phase 5: Performance Optimizations (Future Work)
+
+ **Goal:** Improve the user experience during analysis by reducing perceived latency and providing better progress feedback.
+
+ ### 5.1 Problem Statement
+
+ After clicking "Analyze", there is a significant delay (~10-20 seconds) between:
+ 1. The backend completing generation (visible in the HF logs)
+ 2. The UI updating with results
+
+ **Root Cause:** The `/analyze/research/attention` endpoint returns a very large JSON payload containing:
+ - Attention data for all layers × all heads × all tokens
+ - For Devstral (40 layers, 32 heads), this can be tens of megabytes of JSON
+ - Network transfer and JSON parsing dominate the wait time
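The payload arithmetic above can be sketched as a quick back-of-envelope estimate. The ~8 bytes per serialized number is an assumed average for a JSON-encoded float (digits plus delimiters), not a measured figure:

```typescript
// Rough size estimate for the full-attention JSON payload.
// bytesPerNumber = 8 is an assumed average; real payloads vary with
// float precision and serialization settings.
function estimateAttentionBytes(
  layers: number,
  heads: number,
  tokens: number,
  bytesPerNumber: number = 8,
): number {
  // Each head carries a tokens × tokens attention matrix.
  return layers * heads * tokens * tokens * bytesPerNumber;
}

// A Devstral-style config (40 layers, 32 heads) with only 50 tokens
// already lands in the tens of megabytes.
const mb = estimateAttentionBytes(40, 32, 50) / 1e6;
console.log(`${mb.toFixed(1)} MB`); // 25.6 MB
```

Doubling the token count quadruples the payload, which is why longer prompts feel disproportionately slower.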
+
+ ### 5.2 Lazy Loading Attention Data (Recommended)
+
+ Instead of returning all attention data upfront, use progressive loading:
+
+ **Phase 1: Initial Response (Fast)**
+ ```json
+ {
+   "promptTokens": [...],
+   "generatedTokens": [...],
+   "tokenSections": {...},
+   "modelInfo": {...},
+   "generationTime": 1.23,
+   "attentionAvailable": true,
+   "attentionLayers": 40
+ }
+ ```
+
+ **Phase 2: On-Demand Layer Loading**
+ ```
+ GET /attention/layer/{layer_idx}?token_position={pos}
+ ```
+
+ Returns attention data for a single layer only when the user expands it.
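A minimal client helper for this route might look as follows. The URL shape mirrors the sketch above; `fetchLayer` and its error handling are illustrative, not an existing API:

```typescript
// Builds the per-layer URL sketched above; split out so it is easy to test.
function buildLayerUrl(base: string, layerIdx: number, tokenPos: number): string {
  return `${base}/attention/layer/${layerIdx}?token_position=${tokenPos}`;
}

// Fetches one layer's attention data on demand (e.g. when a layer is expanded).
async function fetchLayer(
  base: string,
  layerIdx: number,
  tokenPos: number,
): Promise<unknown> {
  const res = await fetch(buildLayerUrl(base, layerIdx, tokenPos));
  if (!res.ok) throw new Error(`Layer ${layerIdx} fetch failed: ${res.status}`);
  return res.json(); // attention weights for this single layer
}
```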
+
+ **Benefits:**
+ - Initial response is small and fast (< 100KB)
+ - Attention data loaded only when needed
+ - Users who only inspect a few layers don't download all data
+
+ **Backend Changes Required:**
+ - Store attention data in server-side cache (keyed by session/request ID)
+ - Add `/attention/layer/{idx}` endpoint
+ - Add TTL to clean up cached attention data
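As an illustration of the cache-plus-TTL idea (shown in TypeScript for consistency with the rest of this plan; the real backend may be a different language and would likely use an existing cache library):

```typescript
// Illustrative server-side TTL cache for attention payloads, keyed by
// session/request ID. Expired entries are evicted lazily on read.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) {
      this.entries.delete(key); // lazily evict the expired entry
      return undefined;
    }
    return e.value;
  }
}
```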
+
+ **Frontend Changes Required:**
+ - Update VerticalPipeline to fetch layer data on expand
+ - Add loading state for individual layers
+ - Cache fetched layers client-side
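The client-side cache can be as simple as memoizing the in-flight promise per layer. This is a sketch; `fetchLayer` stands in for whatever helper actually calls the per-layer endpoint, and the `LayerData` shape is assumed:

```typescript
type LayerData = number[][][]; // heads × tokens × tokens (assumed shape)

// Caches the promise rather than the resolved value, so two rapid expands
// of the same layer share one request instead of firing twice.
class LayerCache {
  private layers = new Map<number, Promise<LayerData>>();

  constructor(private fetchLayer: (idx: number) => Promise<LayerData>) {}

  get(idx: number): Promise<LayerData> {
    let entry = this.layers.get(idx);
    if (!entry) {
      entry = this.fetchLayer(idx);
      this.layers.set(idx, entry);
    }
    return entry;
  }
}
```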
+
+ ### 5.3 Server-Sent Events for Progress (Alternative)
+
+ For users who need to see all data, stream progress updates from the backend:
+
+ ```
+ data: {"stage": "tokenizing", "progress": 0.1}
+ data: {"stage": "generating", "progress": 0.3, "tokensGenerated": 4}
+ data: {"stage": "extracting_attention", "progress": 0.5, "layer": 10}
+ data: {"stage": "serializing", "progress": 0.9}
+ data: {"stage": "complete", "progress": 1.0, "data": {...}}
+ ```
+
+ **Implementation:**
+ - Change the endpoint to return `text/event-stream`
+ - Frontend uses `EventSource` or `fetch` with a `ReadableStream`
+ - Progress bar updates with each event
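On the client, each `data:` line parses to one progress event. Pulling the parsing into a small function keeps it usable from either an `EventSource` `onmessage` handler or a `ReadableStream` loop; the event shape below simply mirrors the stream sketched above:

```typescript
interface AnalysisProgress {
  stage: string;
  progress: number;
  [extra: string]: unknown; // tokensGenerated, layer, data, ...
}

// Parses one SSE line; returns null for comments, blank lines, and
// non-data fields so callers can simply skip them.
function parseProgressLine(line: string): AnalysisProgress | null {
  if (!line.startsWith("data:")) return null;
  return JSON.parse(line.slice("data:".length)) as AnalysisProgress;
}
```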
+
+ ### 5.4 Client-Side Progress Estimation (Quick Win)
+
+ As an interim solution before backend changes:
+
+ 1. Show a staged progress indicator: Tokenizing → Generating → Processing → Loading
+ 2. Estimate timing based on the model and prompt length
+ 3. Use optimistic progress that fills smoothly, with a final jump on completion
+
+ **Limitation:** Progress won't perfectly match the actual backend state, but it provides a better UX than empty-then-full.
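One way to sketch the optimistic fill. The base and per-token timing constants are placeholders to tune against real measurements, not observed values:

```typescript
// Crude duration heuristic: fixed overhead plus assumed per-token costs.
// All constants here are made-up placeholders for empirical tuning.
function estimateDurationMs(promptTokens: number, maxNewTokens: number): number {
  const baseMs = 3000;     // fixed overhead (assumed)
  const perTokenMs = 150;  // generation cost per new token (assumed)
  return baseMs + perTokenMs * maxNewTokens + 5 * promptTokens;
}

// Maps elapsed time to displayed progress with an ease-out curve, capped
// at 95% so the bar never claims completion before the response lands;
// the UI jumps to 100% when real data arrives.
function optimisticProgress(elapsedMs: number, estimatedMs: number): number {
  const ratio = Math.min(elapsedMs / estimatedMs, 1);
  return 0.95 * (1 - Math.pow(1 - ratio, 2));
}
```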
+
+ ### 5.5 Validation Criteria
+
+ - [ ] Initial UI response time < 2 seconds after generation completes
+ - [ ] Progress indicator shows meaningful stages
+ - [ ] Users don't see "stuck" empty progress bar
+ - [ ] Attention data available when user expands layers
+
+ ### 5.6 Priority
+
+ **Recommended implementation order:**
+ 1. Client-side progress estimation (quick win, no backend changes)
+ 2. Lazy loading attention data (biggest impact on perceived performance)
+ 3. SSE streaming (nice-to-have for power users)