Commit ce55e28 (parent: 8a91479): rag service

Files changed:
- RAG_REFACTOR.md +196 -0
- api/main.py +5 -13
- api/routes.py +78 -24
- requirements.txt +1 -0
- services/memory_service.py +18 -75
- services/rag_service.py +0 -628
- services/rag_service_supabase.py +595 -0
- supabase_migrations/001_create_code_embeddings.sql +68 -0
RAG_REFACTOR.md
ADDED
@@ -0,0 +1,196 @@
# RAG System Refactoring - Workspace-Scoped with Supabase pgvector

## Overview

The RAG system has been refactored to use Supabase pgvector as the only vector store, with workspace-scoped indexing and querying. This ensures that:

1. ✅ Only user workspace code is indexed (not extension source code)
2. ✅ Each workspace is isolated using `workspace_id`
3. ✅ All embeddings are stored in Supabase (cloud-only, no local vector DBs)
4. ✅ Incremental indexing on file save/create/delete
5. ✅ Free-tier friendly (efficient queries, no background loops)
## Architecture Changes

### Backend

#### New RAG Service (`rag_service_supabase.py`)
- **Stateless**: No local storage, all data in Supabase
- **Workspace-scoped**: All operations require `workspace_id`
- **Supabase pgvector**: Uses PostgreSQL with the pgvector extension
- **Embeddings**: HuggingFace `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)

#### Database Schema
```sql
CREATE TABLE code_embeddings (
    id UUID PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    file_path TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(384),
    chunk_index INTEGER,
    total_chunks INTEGER,
    file_size INTEGER,
    created_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ
);
```
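The `chunk_index` and `total_chunks` columns imply that each file is split into overlapping chunks before embedding, one row per chunk. A minimal sketch of such a chunker; the 1000/200 sizes are borrowed from the text splitter in the deleted `rag_service.py` and may not match the new service exactly:

```python
def chunk_file(content: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split file content into overlapping chunks, one dict per code_embeddings row."""
    step = chunk_size - overlap
    chunks = [content[i:i + chunk_size] for i in range(0, max(len(content), 1), step)]
    return [
        {"chunk_index": i, "total_chunks": len(chunks), "content": c}
        for i, c in enumerate(chunks)
    ]

rows = chunk_file("x" * 2500)
print(len(rows))  # 4 chunks: [0:1000], [800:1800], [1600:2500], [2400:2500]
```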
#### New API Endpoints
- `POST /api/rag/index/workspace` - Index workspace files
- `POST /api/rag/index/file` - Index a single file
- `DELETE /api/rag/index/file` - Delete file embeddings
- `POST /api/rag/query` - Query with `workspace_id` (updated)
- `GET /api/rag/stats?workspace_id=...` - Get stats for a workspace (updated)
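A sketch of how a client might assemble requests against these endpoints. The payload field names follow the request models in `api/routes.py`; the base URL, port, and example values are assumptions. The requests are built but not sent:

```python
import json
import urllib.request

BASE = "http://localhost:8000/api"  # assumed host/port

def build_request(method: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the RAG API."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method=method,
    )

# Index one file, then query it back, both scoped to the same workspace.
req = build_request("POST", "/rag/index/file", {
    "workspace_id": "abc123",
    "file_path": "src/app.py",
    "content": "def main(): ...",
})
query = build_request("POST", "/rag/query", {
    "workspace_id": "abc123",
    "query": "where is main defined?",
    "max_chunks": 5,
})
print(req.get_method(), req.full_url)  # POST http://localhost:8000/api/rag/index/file
```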
### Frontend (VS Code Extension)

#### Workspace Detection
- Detects the active workspace on extension activation
- Generates a stable `workspace_id` (MD5 hash of the workspace path)
- Automatically indexes workspace files on activation
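The stable `workspace_id` scheme is easy to reproduce. The extension computes it in TypeScript; this Python equivalent just illustrates the hashing (the deleted `rag_service.py` used the same MD5-of-path idea, truncated to 16 hex characters for storage directories; whether the new ID is truncated is not shown in this commit, so the full digest is kept here):

```python
import hashlib

def workspace_id(workspace_path: str) -> str:
    """Stable workspace ID: MD5 of the absolute workspace path."""
    return hashlib.md5(workspace_path.encode()).hexdigest()

wid = workspace_id("/home/alice/projects/demo")
print(len(wid))  # 32 hex characters, identical across sessions for the same path
```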
#### File Event Handlers
- `onDidSaveTextDocument` → Updates embeddings for saved files
- `onDidCreateFiles` → Indexes new files
- `onDidDeleteFiles` → Deletes embeddings for deleted files

#### Updated Chat Flow
- All chat messages include `workspace_id`
- RAG queries are scoped to the current workspace
- Context retrieval is workspace-aware
## Setup Instructions

### 1. Run Supabase Migration

Execute the migration SQL in your Supabase SQL Editor:

```bash
# File: augmas-backend/supabase_migrations/001_create_code_embeddings.sql
```

This creates:
- the `code_embeddings` table with pgvector support
- indexes for efficient querying
- the `match_code_embeddings` RPC function for vector similarity search

### 2. Update Environment Variables

Ensure your `.env` file has:
```env
SUPABASE_URL=https://your-project-id.supabase.co
SUPABASE_KEY=your-service-role-key
```

### 3. Install Dependencies

```bash
cd augmas-backend
pip install -r requirements.txt
```

New dependency: `numpy>=1.24.0,<2.0.0`

### 4. Restart Backend

The backend now initializes the new RAG service automatically:
```python
rag_service = RAGServiceSupabase()
```
## Usage

### Extension Activation

1. Open a workspace in VS Code
2. The extension automatically:
   - Detects the workspace
   - Generates a `workspace_id`
   - Scans and indexes all eligible files
   - Sets up file event handlers
### Incremental Indexing

- **File Save**: Automatically updates embeddings
- **File Create**: Automatically indexes new files
- **File Delete**: Automatically removes embeddings
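The three events map onto the new index endpoints one-to-one. A tiny dispatch sketch (in Python for consistency with the rest of this document; the real handlers live in the TypeScript extension, and the endpoint paths come from the API list above):

```python
# Editor file events -> RAG index operations (endpoint paths from the API section).
EVENT_TO_ACTION = {
    "save":   ("POST",   "/api/rag/index/file"),
    "create": ("POST",   "/api/rag/index/file"),
    "delete": ("DELETE", "/api/rag/index/file"),
}

def on_file_event(event: str, file_path: str) -> tuple[str, str, str]:
    """Resolve an editor event to (HTTP method, endpoint, file path)."""
    method, path = EVENT_TO_ACTION[event]
    return method, path, file_path

print(on_file_event("delete", "old.py"))  # ('DELETE', '/api/rag/index/file', 'old.py')
```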
### Chat with RAG

When you send a message in the chat:
1. The extension includes `workspace_id` in the request
2. The backend performs a vector similarity search scoped to the workspace
3. Relevant code chunks are injected into the LLM prompt
4. The response includes workspace-aware context
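Step 3 (injecting retrieved chunks into the prompt) might look like the sketch below. The actual prompt template lives in the backend and is not shown in this commit, so the wording and chunk fields here are illustrative assumptions:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Prepend retrieved code chunks (file path + content) to the user question."""
    context = "\n\n".join(
        f"# {c['file_path']} (chunk {c['chunk_index']})\n{c['content']}"
        for c in chunks
    )
    return (
        "Use the following workspace code as context.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_prompt("what does main do?", [
    {"file_path": "src/app.py", "chunk_index": 0, "content": "def main(): ..."}
])
```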
## Key Features

### Workspace Isolation
- Each workspace has a unique `workspace_id`
- All queries are filtered by `workspace_id`
- No cross-workspace contamination

### Efficient Vector Search
- Uses the Supabase RPC function `match_code_embeddings` for efficient pgvector queries
- Falls back to Python-based similarity if the RPC fails
- Optimized for free-tier limits (limited result sets)
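The Python-based fallback mentioned above amounts to fetching a workspace's stored embeddings and ranking them by cosine similarity client-side. A pure-Python sketch of that ranking (the service presumably uses numpy, which this commit adds to `requirements.txt`; the row shape mirrors the `code_embeddings` columns):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fallback_search(query_vec: list[float], rows: list[dict], top_k: int = 5) -> list[dict]:
    """rows: [{'embedding': [...], 'content': ...}] already filtered to one workspace_id."""
    scored = sorted(rows, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return scored[:top_k]

rows = [
    {"embedding": [1.0, 0.0], "content": "a"},
    {"embedding": [0.0, 1.0], "content": "b"},
]
print(fallback_search([0.9, 0.1], rows)[0]["content"])  # a
```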
### Free-Tier Friendly
- No background reindex loops
- Incremental updates only
- Efficient batch operations
- Respects Supabase rate limits

### Stateless Backend
- No local vector stores
- No filesystem access
- All data in Supabase
- Horizontally scalable
## Migration from Old System

The old `RAGService` (using Qdrant) is no longer used. The new system:

1. **No migration needed**: Old Qdrant data is not migrated
2. **Fresh start**: Each workspace starts with an empty index
3. **Automatic indexing**: Files are indexed on first activation
## Troubleshooting

### "Supabase client not initialized"
- Check `SUPABASE_URL` and `SUPABASE_KEY` in `.env`
- Ensure you're using the **service_role** key (not the anon key)

### "RPC function not found"
- Run the migration SQL in the Supabase SQL Editor
- Ensure the `match_code_embeddings` function exists

### "No workspace detected"
- Open a folder in VS Code (File → Open Folder)
- The extension requires an active workspace folder

### Slow indexing
- Large workspaces may take time to index initially
- Subsequent updates are incremental and fast
- Check the Supabase dashboard for query performance
## Performance Considerations

### Free Tier Limits
- Supabase Free Tier: 500MB database, 2GB bandwidth
- Vector search is efficient but limited to ~1000 results per query
- Batch inserts are optimized to avoid rate limits

### Optimization Tips
1. Exclude large/minified files (already handled)
2. Use `.gitignore` patterns (the extension respects them)
3. Index only source files (not build artifacts)

## Future Enhancements

- [ ] Background indexing progress indicator
- [ ] Manual reindex command
- [ ] Index statistics in extension UI
- [ ] Support for multiple workspace folders
- [ ] Incremental chunk updates (not full file reindex)
api/main.py
CHANGED
@@ -10,7 +10,7 @@ from utils.config import get_settings, get_environment
 from api.routes import router
 from services.langchain_service import LangChainService
 from services.memory_service import MemoryService
-from services.rag_service import RAGService
+from services.rag_service_supabase import RAGServiceSupabase
 
 # Get settings
 settings = get_settings()
@@ -18,7 +18,7 @@ settings = get_settings()
 # Service instances (will be initialized in lifespan)
 langchain_service: LangChainService = None
 memory_service: MemoryService = None
-rag_service: RAGService = None
+rag_service: RAGServiceSupabase = None
 
 
 @asynccontextmanager
@@ -49,19 +49,11 @@ async def lifespan(app: FastAPI):
         print(f"❌ Failed to initialize Memory service: {e}")
         raise
 
-    # Initialize RAG service
+    # Initialize RAG service (stateless, workspace-scoped)
     try:
-        rag_service = RAGService(
-            workspace_root=settings.workspace_root,
-            storage_path=settings.storage_path
-        )
+        rag_service = RAGServiceSupabase()
         await rag_service.initialize()
-        print("✅ RAG service initialized")
-
-        if not rag_service.is_ready() and settings.enable_rag:
-            print("📚 Starting background workspace indexing...")
-            import asyncio
-            asyncio.create_task(rag_service.index_workspace(show_progress=False))
+        print("✅ RAG service initialized (Supabase pgvector)")
     except Exception as e:
         print(f"⚠️ RAG service initialization warning: {e}")
api/routes.py
CHANGED
@@ -6,7 +6,7 @@ import logging
 
 from services.langchain_service import LangChainService, CodeContext, FileContext
 from services.memory_service import MemoryService
-from services.rag_service import RAGService
+from services.rag_service_supabase import RAGServiceSupabase
 from auth.dependencies import get_current_user as get_current_user_id
 
 logger = logging.getLogger(__name__)
@@ -16,7 +16,7 @@ router = APIRouter()
 # Service instances (should be initialized in main.py and passed as dependencies)
 langchain_service: Optional[LangChainService] = None
 memory_service: Optional[MemoryService] = None
-rag_service: Optional[RAGService] = None
+rag_service: Optional[RAGServiceSupabase] = None
 
 
 def get_langchain_service(request: Request) -> LangChainService:
@@ -33,7 +33,7 @@ def get_memory_service(request: Request) -> MemoryService:
     return service
 
 
-def get_rag_service(request: Request) -> RAGService:
+def get_rag_service(request: Request) -> RAGServiceSupabase:
     service = getattr(request.app.state, 'rag_service', None)
     if service is None:
         raise HTTPException(status_code=500, detail="RAG service not initialized")
@@ -44,6 +44,7 @@ def get_rag_service(request: Request) -> RAGService:
 
 class ChatRequest(BaseModel):
     message: str
+    workspace_id: Optional[str] = None
     context: Optional[Dict[str, Any]] = None
     conversation_id: Optional[str] = None
     current_file: Optional[Dict[str, str]] = None
@@ -62,9 +63,26 @@ class FileReference(BaseModel):
 
 class RAGQueryRequest(BaseModel):
     query: str
+    workspace_id: str
     max_chunks: int = 5
 
 
+class IndexWorkspaceRequest(BaseModel):
+    workspace_id: str
+    files: List[Dict[str, str]]  # List of {path: str, content: str}
+
+
+class IndexFileRequest(BaseModel):
+    workspace_id: str
+    file_path: str
+    content: str
+
+
+class DeleteFileRequest(BaseModel):
+    workspace_id: str
+    file_path: str
+
+
 class ModelSwitchRequest(BaseModel):
     model_id: str
 
@@ -221,7 +239,7 @@ async def get_current_user(
 async def chat(
     request: ChatRequest,
     langchain: LangChainService = Depends(get_langchain_service),
-    rag: RAGService = Depends(get_rag_service)
+    rag: RAGServiceSupabase = Depends(get_rag_service)
 ):
     """Process chat message"""
     try:
@@ -247,10 +265,14 @@ async def chat(
         if file_context:
             context.referenced_files.append(file_context)
 
-        # Get RAG context
+        # Get RAG context (workspace-scoped)
         rag_context = ""
-        if rag.is_ready():
-            rag_context = await rag.get_relevant_context(
+        if request.workspace_id and rag.is_ready():
+            rag_context = await rag.get_relevant_context(
+                request.workspace_id,
+                request.message,
+                max_chunks=5
+            )
 
         # Process query
         response = await langchain.process_query(
@@ -273,9 +295,9 @@ async def chat(
 @router.post("/rag/query")
 async def rag_query(
     request: RAGQueryRequest,
-    rag: RAGService = Depends(get_rag_service)
+    rag: RAGServiceSupabase = Depends(get_rag_service)
 ):
-    """Query RAG for relevant context"""
+    """Query RAG for relevant context (workspace-scoped)"""
     try:
         if not rag.is_ready():
             raise HTTPException(
@@ -283,7 +305,11 @@ async def rag_query(
                 detail="RAG not ready. Please index workspace first."
             )
 
-        context = await rag.get_relevant_context(
+        context = await rag.get_relevant_context(
+            request.workspace_id,
+            request.query,
+            request.max_chunks
+        )
         return {"context": context}
 
     except HTTPException:
@@ -292,38 +318,66 @@ async def rag_query(
         raise HTTPException(status_code=500, detail=str(e))
 
 
-@router.post("/rag/index")
-async def index_workspace(
+@router.post("/rag/index/workspace")
+async def index_workspace(
+    request: IndexWorkspaceRequest,
+    rag: RAGServiceSupabase = Depends(get_rag_service)
+):
+    """Index workspace files"""
     try:
-        result = await rag.index_workspace(
+        result = await rag.index_workspace(request.workspace_id, request.files)
         return result
     except Exception as e:
         raise HTTPException(status_code=500, detail=str(e))
 
 
-@router.post("/rag/
-async def
+@router.post("/rag/index/file")
+async def index_file(
+    request: IndexFileRequest,
+    rag: RAGServiceSupabase = Depends(get_rag_service)
+):
+    """Index a single file"""
     try:
-        result = await rag.
+        result = await rag.index_workspace(
+            request.workspace_id,
+            [{'path': request.file_path, 'content': request.content}]
+        )
+        return result
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+
+
+@router.delete("/rag/index/file")
+async def delete_file(
+    workspace_id: str,
+    file_path: str,
+    rag: RAGServiceSupabase = Depends(get_rag_service)
+):
+    """Delete embeddings for a file"""
+    try:
+        result = await rag.delete_file(workspace_id, file_path)
         return result
     except Exception as e:
         raise HTTPException(status_code=500, detail=str(e))
 
 
 @router.get("/rag/stats")
-async def get_rag_stats(
+async def get_rag_stats(
+    workspace_id: str,
+    rag: RAGServiceSupabase = Depends(get_rag_service)
+):
+    """Get RAG indexing statistics for a workspace"""
+    try:
+        return await rag.get_index_stats(workspace_id)
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
 
 
 @router.get("/rag/status")
-async def get_rag_status(rag: RAGService = Depends(get_rag_service)):
+async def get_rag_status(rag: RAGServiceSupabase = Depends(get_rag_service)):
     """Get RAG service status"""
     return {
-        "ready": rag.is_ready(),
-        "files_indexed": rag.get_indexed_files_count()
+        "ready": rag.is_ready()
     }
requirements.txt
CHANGED
@@ -32,3 +32,4 @@ passlib[bcrypt]==1.7.4
 httpx>=0.24.0,<0.26.0
 aiofiles==23.2.1
 colorama==0.4.6
+numpy>=1.24.0,<2.0.0
services/memory_service.py
CHANGED
@@ -412,42 +412,6 @@ class MemoryService:
         except Exception as e:
             logger.error(f"Error updating chat last used: {e}")
 
-    def create_chat(self, user_id: str, title: str) -> str:
-        """Create chat (synchronous for compatibility)"""
-        if not self.client:
-            raise ValueError("Supabase client not initialized")
-
-        now = int(time.time() * 1000)
-
-        result = self.client.table("chat_sessions").insert({
-            "title": title,
-            "created": now,
-            "last_used": now,
-            "user_id": user_id
-        }).execute()
-
-        return result.data[0]["id"]
-
-    def list_chats(self, user_id: str) -> List[Dict]:
-        """List chats (synchronous for compatibility)"""
-        if not self.client:
-            raise ValueError("Supabase client not initialized")
-
-        result = self.client.table("chat_sessions") \
-            .select("*") \
-            .eq("user_id", user_id) \
-            .order("last_used", desc=True) \
-            .execute()
-
-        return result.data
-
-    def delete_chat(self, user_id: str, chat_id: str):
-        """Delete chat (synchronous for compatibility)"""
-        self._verify_chat(user_id, chat_id)
-
-        self.client.table("chat_messages").delete().eq("chat_id", chat_id).execute()
-        self.client.table("chat_sessions").delete().eq("id", chat_id).execute()
-
     # =========================
     # MESSAGES
     # =========================
@@ -573,50 +537,29 @@ class MemoryService:
             "most_active_chat": None
         }
 
-    def add_message(self, user_id: str, chat_id: str, role: str, content: str):
-        """Add message (synchronous for compatibility)"""
-        if role not in ("user", "assistant"):
-            raise ValueError("Invalid role")
-
-        self._verify_chat(user_id, chat_id)
-
-        self.client.table("chat_messages").insert({
-            "chat_id": chat_id,
-            "role": role,
-            "content": content,
-            "timestamp": int(time.time() * 1000)
-        }).execute()
-
-        self.client.table("chat_sessions").update({
-            "last_used": int(time.time() * 1000)
-        }).eq("id", chat_id).execute()
-
-    def get_messages(self, user_id: str, chat_id: str) -> List[Dict]:
-        """Get messages (synchronous for compatibility)"""
-        self._verify_chat(user_id, chat_id)
-
-        result = self.client.table("chat_messages") \
-            .select("*") \
-            .eq("chat_id", chat_id) \
-            .order("timestamp", desc=False) \
-            .execute()
-
-        return result.data
-
     # =========================
     # INTERNAL
     # =========================
 
-    def _verify_chat(self, user_id: str, chat_id: str):
-        """Verify that chat belongs to user"""
+    async def _verify_chat(self, user_id: str, chat_id: str):
+        """Verify that chat belongs to user (ASYNC VERSION)"""
         if not self.client:
             raise ValueError("Supabase client not initialized")
 
+        try:
+            result = self.client.table("chat_sessions") \
+                .select("id") \
+                .eq("id", chat_id) \
+                .eq("user_id", user_id) \
+                .execute()
+
+            if not result.data:
+                logger.warning(f"Chat verification failed - chat_id: {chat_id}, user_id: {user_id}")
+                raise ValueError("Chat not found or access denied")
+
+            logger.debug(f"Chat verified - chat_id: {chat_id}, user_id: {user_id}")
+        except ValueError:
+            raise  # Re-raise our custom error
+        except Exception as e:
+            logger.error(f"Error verifying chat: {e}")
+            self._handle_supabase_error(e, f"_verify_chat(chat_id={chat_id}, user_id={user_id})")
services/rag_service.py
DELETED
@@ -1,628 +0,0 @@
import json
import hashlib
import asyncio
import logging
from pathlib import Path
from typing import List, Dict, Set, Optional, Tuple
from dataclasses import dataclass, field
import aiofiles
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import Distance, VectorParams

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


@dataclass
class IndexingState:
    indexed_files: Set[str] = field(default_factory=set)
    failed_files: Dict[str, str] = field(default_factory=dict)
    last_indexed_at: int = 0
    version: str = "3.0"


@dataclass
class FileProcessResult:
    file_path: str
    success: bool
    documents: Optional[List[Document]] = None
    error: Optional[str] = None
    size: int = 0


class RAGService:
    """Service for indexing and querying codebase with RAG using HuggingFace embeddings and Qdrant"""

    VERSION = "3.0"
    BATCH_SIZE = 10
    MAX_FILE_SIZE = 500_000  # 500KB
    EMBEDDING_BATCH_SIZE = 32
    RATE_LIMIT_DELAY = 0.1
    MAX_CONCURRENT_READS = 5
    CHECKPOINT_INTERVAL = 20
    MAX_RETRIES = 3

    # Popular code embedding models from HuggingFace
    # Options: "BAAI/bge-small-en-v1.5", "sentence-transformers/all-MiniLM-L6-v2"
    EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

    BINARY_EXTENSIONS = {
        '.png', '.jpg', '.jpeg', '.gif', '.bmp', '.ico',
        '.mp3', '.mp4', '.avi', '.mov', '.wav',
        '.zip', '.tar', '.gz', '.rar', '.7z',
        '.exe', '.dll', '.so', '.dylib',
        '.pdf', '.doc', '.docx', '.xls', '.xlsx',
        '.woff', '.woff2', '.ttf', '.eot'
    }

    EXCLUDE_PATTERNS = [
        'node_modules', '.git', 'dist', 'build',
        '.vscode', 'coverage', '__pycache__',
        '.pytest_cache', '.next', 'out', '.DS_Store',
        'venv', 'env', '.env', 'vendor'
    ]

    FILE_EXTENSIONS = [
        '.ts', '.js', '.py', '.jsx', '.tsx', '.java',
        '.go', '.php', '.rs', '.cpp', '.c', '.h', '.hpp',
        '.cs', '.rb', '.swift', '.kt', '.md', '.txt',
        '.json', '.yml', '.yaml', '.sh', '.sql', '.r'
    ]

    def __init__(self, workspace_root: str, storage_path: str):
        self.workspace_root = Path(workspace_root)
        self.storage_path = Path(storage_path)

        logger.info(f"Initializing RAG service for workspace: {workspace_root}")

        # Initialize HuggingFace embeddings
        try:
            self.embeddings = HuggingFaceEmbeddings(
                model_name=self.EMBEDDING_MODEL,
                model_kwargs={'device': 'cpu'},
                encode_kwargs={'normalize_embeddings': True}
            )
            logger.info(f"Loaded embedding model: {self.EMBEDDING_MODEL}")
        except Exception as e:
            logger.error(f"Failed to load embedding model: {e}")
            raise

        # Text splitter for code
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )

        self.vector_store: Optional[QdrantVectorStore] = None
        self.qdrant_client: Optional[QdrantClient] = None
        self.indexing_state = IndexingState()
        self.is_indexing = False

        # Setup storage paths
        workspace_hash = self._hash_workspace_path(str(workspace_root))
        self.base_path = self.storage_path / "vector_stores" / workspace_hash
        self.state_file = self.base_path / "indexing_state.json"
        self.qdrant_path = self.base_path / "qdrant_storage"
        self.collection_name = f"codebase_{workspace_hash}"

    def _hash_workspace_path(self, path: str) -> str:
        """Create hash of workspace path for storage"""
        return hashlib.md5(path.encode()).hexdigest()[:16]

    async def initialize(self):
        """Initialize RAG service and load existing index"""
        try:
            self.base_path.mkdir(parents=True, exist_ok=True)
            self.qdrant_path.mkdir(parents=True, exist_ok=True)

            await self._load_indexing_state()

            # Version check
            if self.indexing_state.version != self.VERSION:
                logger.warning(f"Version mismatch. Expected {self.VERSION}, got {self.indexing_state.version}. Resetting index.")
                await self.reset_index()
                return

            # Initialize Qdrant client
            try:
                self.qdrant_client = QdrantClient(path=str(self.qdrant_path))

                # Check if collection exists
                collections = self.qdrant_client.get_collections().collections
                collection_exists = any(c.name == self.collection_name for c in collections)

                if collection_exists:
                    # Load existing vector store
                    self.vector_store = QdrantVectorStore(
                        client=self.qdrant_client,
                        collection_name=self.collection_name,
                        embedding=self.embeddings
                    )

                    collection_info = self.qdrant_client.get_collection(self.collection_name)
                    vector_count = collection_info.points_count

                    logger.info(f"Loaded existing index with {len(self.indexing_state.indexed_files)} files, {vector_count} vectors")
                else:
                    logger.info("No existing index found. Ready for first-time indexing.")

            except Exception as e:
                logger.error(f"Failed to initialize Qdrant: {e}")
logger.error(f"Failed to initialize Qdrant: {e}")
|
| 160 |
-
await self.reset_index()
|
| 161 |
-
|
| 162 |
-
except Exception as e:
|
| 163 |
-
logger.error(f"Initialization failed: {e}")
|
| 164 |
-
raise
|
| 165 |
-
|
| 166 |
-
async def _load_indexing_state(self):
|
| 167 |
-
"""Load indexing state from disk"""
|
| 168 |
-
try:
|
| 169 |
-
if self.state_file.exists():
|
| 170 |
-
async with aiofiles.open(self.state_file, 'r') as f:
|
| 171 |
-
content = await f.read()
|
| 172 |
-
data = json.loads(content)
|
| 173 |
-
|
| 174 |
-
self.indexing_state = IndexingState(
|
| 175 |
-
indexed_files=set(data.get('indexedFiles', [])),
|
| 176 |
-
failed_files=dict(data.get('failedFiles', {})),
|
| 177 |
-
last_indexed_at=data.get('lastIndexedAt', 0),
|
| 178 |
-
version=data.get('version', '1.0')
|
| 179 |
-
)
|
| 180 |
-
logger.debug(f"Loaded indexing state: {len(self.indexing_state.indexed_files)} files")
|
| 181 |
-
except Exception as e:
|
| 182 |
-
logger.error(f"Failed to load indexing state: {e}")
|
| 183 |
-
self.indexing_state = IndexingState()
|
| 184 |
-
|
| 185 |
-
async def _save_indexing_state(self):
|
| 186 |
-
"""Save indexing state to disk"""
|
| 187 |
-
try:
|
| 188 |
-
state_data = {
|
| 189 |
-
'indexedFiles': list(self.indexing_state.indexed_files),
|
| 190 |
-
'failedFiles': self.indexing_state.failed_files,
|
| 191 |
-
'lastIndexedAt': self.indexing_state.last_indexed_at,
|
| 192 |
-
'version': self.indexing_state.version
|
| 193 |
-
}
|
| 194 |
-
|
| 195 |
-
async with aiofiles.open(self.state_file, 'w') as f:
|
| 196 |
-
await f.write(json.dumps(state_data, indent=2))
|
| 197 |
-
|
| 198 |
-
logger.debug("Saved indexing state")
|
| 199 |
-
except Exception as e:
|
| 200 |
-
logger.error(f"Failed to save indexing state: {e}")
|
| 201 |
-
|
| 202 |
-
async def index_workspace(self, show_progress: bool = True) -> Dict[str, any]:
|
| 203 |
-
"""Index the entire workspace"""
|
| 204 |
-
if self.is_indexing:
|
| 205 |
-
logger.warning("Indexing already in progress")
|
| 206 |
-
return {'error': 'Indexing already in progress'}
|
| 207 |
-
|
| 208 |
-
self.is_indexing = True
|
| 209 |
-
logger.info("Starting workspace indexing")
|
| 210 |
-
|
| 211 |
-
try:
|
| 212 |
-
all_files = await self._get_workspace_files()
|
| 213 |
-
new_files = [
|
| 214 |
-
f for f in all_files
|
| 215 |
-
if f not in self.indexing_state.indexed_files
|
| 216 |
-
]
|
| 217 |
-
|
| 218 |
-
if not new_files:
|
| 219 |
-
self.is_indexing = False
|
| 220 |
-
logger.info("All files already indexed")
|
| 221 |
-
return {
|
| 222 |
-
'success': True,
|
| 223 |
-
'message': 'All files already indexed',
|
| 224 |
-
'indexed': len(self.indexing_state.indexed_files)
|
| 225 |
-
}
|
| 226 |
-
|
| 227 |
-
logger.info(f"Found {len(new_files)} new files to index (total workspace: {len(all_files)})")
|
| 228 |
-
|
| 229 |
-
await self._process_files_in_batches(new_files)
|
| 230 |
-
|
| 231 |
-
success_count = len(new_files) - len([f for f in new_files if f in self.indexing_state.failed_files])
|
| 232 |
-
|
| 233 |
-
self.is_indexing = False
|
| 234 |
-
logger.info(f"Indexing complete: {success_count} files indexed successfully")
|
| 235 |
-
|
| 236 |
-
return {
|
| 237 |
-
'success': True,
|
| 238 |
-
'message': f'Indexed {success_count} files',
|
| 239 |
-
'indexed': success_count,
|
| 240 |
-
'failed': len(self.indexing_state.failed_files),
|
| 241 |
-
'total': len(self.indexing_state.indexed_files)
|
| 242 |
-
}
|
| 243 |
-
|
| 244 |
-
except Exception as e:
|
| 245 |
-
self.is_indexing = False
|
| 246 |
-
logger.error(f"Indexing error: {e}", exc_info=True)
|
| 247 |
-
return {'error': str(e)}
|
| 248 |
-
|
| 249 |
-
async def _process_files_in_batches(self, files: List[str]):
|
| 250 |
-
"""Process files in batches with rate limiting"""
|
| 251 |
-
total_files = len(files)
|
| 252 |
-
processed_count = 0
|
| 253 |
-
documents_buffer = []
|
| 254 |
-
|
| 255 |
-
logger.info(f"Starting batch processing of {total_files} files")
|
| 256 |
-
|
| 257 |
-
for i in range(0, len(files), self.BATCH_SIZE):
|
| 258 |
-
batch = files[i:i + self.BATCH_SIZE]
|
| 259 |
-
batch_start = i + 1
|
| 260 |
-
batch_end = min(i + self.BATCH_SIZE, total_files)
|
| 261 |
-
|
| 262 |
-
logger.info(f"Processing files {batch_start}-{batch_end} of {total_files}")
|
| 263 |
-
|
| 264 |
-
try:
|
| 265 |
-
batch_results = await self._process_batch_with_concurrency(batch)
|
| 266 |
-
|
| 267 |
-
for result in batch_results:
|
| 268 |
-
if result.success and result.documents:
|
| 269 |
-
documents_buffer.extend(result.documents)
|
| 270 |
-
self.indexing_state.indexed_files.add(result.file_path)
|
| 271 |
-
self.indexing_state.failed_files.pop(result.file_path, None)
|
| 272 |
-
elif not result.success:
|
| 273 |
-
self.indexing_state.failed_files[result.file_path] = result.error or "Unknown error"
|
| 274 |
-
logger.warning(f"Failed to index {Path(result.file_path).name}: {result.error}")
|
| 275 |
-
|
| 276 |
-
# Add documents to vector store in batches
|
| 277 |
-
if len(documents_buffer) >= self.EMBEDDING_BATCH_SIZE:
|
| 278 |
-
await self._add_documents_to_qdrant(documents_buffer)
|
| 279 |
-
documents_buffer = []
|
| 280 |
-
|
| 281 |
-
processed_count += len(batch)
|
| 282 |
-
|
| 283 |
-
# Checkpoint
|
| 284 |
-
if processed_count % self.CHECKPOINT_INTERVAL == 0:
|
| 285 |
-
await self._save_checkpoint()
|
| 286 |
-
logger.info(f"Checkpoint saved: {processed_count}/{total_files} files processed")
|
| 287 |
-
|
| 288 |
-
# Rate limiting
|
| 289 |
-
await asyncio.sleep(self.RATE_LIMIT_DELAY)
|
| 290 |
-
|
| 291 |
-
except Exception as e:
|
| 292 |
-
logger.error(f"Batch {batch_start}-{batch_end} processing error: {e}", exc_info=True)
|
| 293 |
-
for file_path in batch:
|
| 294 |
-
self.indexing_state.failed_files[file_path] = "Batch processing failed"
|
| 295 |
-
|
| 296 |
-
# Add remaining documents
|
| 297 |
-
if documents_buffer:
|
| 298 |
-
await self._add_documents_to_qdrant(documents_buffer)
|
| 299 |
-
|
| 300 |
-
await self._save_checkpoint()
|
| 301 |
-
logger.info(f"Indexing complete: {processed_count} files processed")
|
| 302 |
-
|
| 303 |
-
async def _process_batch_with_concurrency(
|
| 304 |
-
self,
|
| 305 |
-
file_paths: List[str]
|
| 306 |
-
) -> List[FileProcessResult]:
|
| 307 |
-
"""Process a batch of files with controlled concurrency"""
|
| 308 |
-
results = []
|
| 309 |
-
|
| 310 |
-
for i in range(0, len(file_paths), self.MAX_CONCURRENT_READS):
|
| 311 |
-
chunk = file_paths[i:i + self.MAX_CONCURRENT_READS]
|
| 312 |
-
|
| 313 |
-
tasks = [self._index_file_with_retry(fp) for fp in chunk]
|
| 314 |
-
chunk_results = await asyncio.gather(*tasks, return_exceptions=True)
|
| 315 |
-
|
| 316 |
-
for j, result in enumerate(chunk_results):
|
| 317 |
-
if isinstance(result, Exception):
|
| 318 |
-
results.append(FileProcessResult(
|
| 319 |
-
file_path=chunk[j],
|
| 320 |
-
success=False,
|
| 321 |
-
error=str(result),
|
| 322 |
-
size=0
|
| 323 |
-
))
|
| 324 |
-
else:
|
| 325 |
-
results.append(FileProcessResult(
|
| 326 |
-
file_path=chunk[j],
|
| 327 |
-
success=True,
|
| 328 |
-
documents=result[0],
|
| 329 |
-
size=result[1]
|
| 330 |
-
))
|
| 331 |
-
|
| 332 |
-
return results
|
| 333 |
-
|
| 334 |
-
async def _index_file_with_retry(
|
| 335 |
-
self,
|
| 336 |
-
file_path: str
|
| 337 |
-
) -> Tuple[List[Document], int]:
|
| 338 |
-
"""Index a file with retry logic"""
|
| 339 |
-
last_error = None
|
| 340 |
-
|
| 341 |
-
for attempt in range(self.MAX_RETRIES):
|
| 342 |
-
try:
|
| 343 |
-
documents = await self._index_file(file_path)
|
| 344 |
-
size = Path(file_path).stat().st_size
|
| 345 |
-
return documents, size
|
| 346 |
-
except Exception as e:
|
| 347 |
-
last_error = e
|
| 348 |
-
if attempt < self.MAX_RETRIES - 1:
|
| 349 |
-
delay = 0.5 * (2 ** attempt)
|
| 350 |
-
logger.debug(f"Retry {attempt + 1} for {Path(file_path).name} after {delay}s")
|
| 351 |
-
await asyncio.sleep(delay)
|
| 352 |
-
|
| 353 |
-
raise last_error or Exception("Unknown indexing error")
|
| 354 |
-
|
| 355 |
-
async def _index_file(self, file_path: str) -> List[Document]:
|
| 356 |
-
"""Index a single file"""
|
| 357 |
-
path = Path(file_path)
|
| 358 |
-
|
| 359 |
-
# Size check
|
| 360 |
-
file_size = path.stat().st_size
|
| 361 |
-
if file_size > self.MAX_FILE_SIZE:
|
| 362 |
-
logger.debug(f"Skipping {path.name}: file too large ({file_size} bytes)")
|
| 363 |
-
return []
|
| 364 |
-
|
| 365 |
-
# Binary check
|
| 366 |
-
if self._is_binary_file(file_path):
|
| 367 |
-
return []
|
| 368 |
-
|
| 369 |
-
# Read file content
|
| 370 |
-
try:
|
| 371 |
-
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
|
| 372 |
-
content = await f.read()
|
| 373 |
-
except UnicodeDecodeError:
|
| 374 |
-
logger.debug(f"Skipping {path.name}: encoding error")
|
| 375 |
-
return []
|
| 376 |
-
except Exception as e:
|
| 377 |
-
if 'No such file' in str(e):
|
| 378 |
-
return []
|
| 379 |
-
raise
|
| 380 |
-
|
| 381 |
-
# Content validation
|
| 382 |
-
if not content or len(content.strip()) < 10:
|
| 383 |
-
return []
|
| 384 |
-
|
| 385 |
-
# Minified file check
|
| 386 |
-
if self._is_minified_file(content, file_path):
|
| 387 |
-
logger.debug(f"Skipping {path.name}: minified file")
|
| 388 |
-
return []
|
| 389 |
-
|
| 390 |
-
# Split into chunks
|
| 391 |
-
try:
|
| 392 |
-
chunks = await asyncio.to_thread(self.text_splitter.split_text, content)
|
| 393 |
-
except Exception as e:
|
| 394 |
-
logger.warning(f"Failed to split {path.name}: {e}")
|
| 395 |
-
return []
|
| 396 |
-
|
| 397 |
-
# Create documents
|
| 398 |
-
relative_path = str(path.relative_to(self.workspace_root))
|
| 399 |
-
documents = []
|
| 400 |
-
|
| 401 |
-
for idx, chunk in enumerate(chunks):
|
| 402 |
-
documents.append(Document(
|
| 403 |
-
page_content=chunk,
|
| 404 |
-
metadata={
|
| 405 |
-
'source': relative_path,
|
| 406 |
-
'filename': path.name,
|
| 407 |
-
'extension': path.suffix,
|
| 408 |
-
'indexed_at': int(asyncio.get_event_loop().time() * 1000),
|
| 409 |
-
'chunk_index': idx,
|
| 410 |
-
'total_chunks': len(chunks),
|
| 411 |
-
'file_size': file_size
|
| 412 |
-
}
|
| 413 |
-
))
|
| 414 |
-
|
| 415 |
-
return documents
|
| 416 |
-
|
| 417 |
-
def _is_binary_file(self, file_path: str) -> bool:
|
| 418 |
-
"""Check if file is binary"""
|
| 419 |
-
ext = Path(file_path).suffix.lower()
|
| 420 |
-
return ext in self.BINARY_EXTENSIONS
|
| 421 |
-
|
| 422 |
-
def _is_minified_file(self, content: str, file_path: str) -> bool:
|
| 423 |
-
"""Check if file is minified"""
|
| 424 |
-
ext = Path(file_path).suffix.lower()
|
| 425 |
-
if ext not in ['.js', '.css', '.json']:
|
| 426 |
-
return False
|
| 427 |
-
|
| 428 |
-
lines = content.split('\n')
|
| 429 |
-
if not lines:
|
| 430 |
-
return False
|
| 431 |
-
|
| 432 |
-
avg_line_length = len(content) / len(lines)
|
| 433 |
-
return avg_line_length > 500 or '.min.' in file_path
|
| 434 |
-
|
| 435 |
-
async def _add_documents_to_qdrant(self, documents: List[Document]):
|
| 436 |
-
"""Add documents to Qdrant vector store"""
|
| 437 |
-
if not documents:
|
| 438 |
-
return
|
| 439 |
-
|
| 440 |
-
logger.info(f"Adding {len(documents)} document chunks to Qdrant")
|
| 441 |
-
|
| 442 |
-
try:
|
| 443 |
-
if not self.vector_store:
|
| 444 |
-
# Create collection and initialize vector store
|
| 445 |
-
embedding_dim = len(self.embeddings.embed_query("test"))
|
| 446 |
-
|
| 447 |
-
self.qdrant_client.create_collection(
|
| 448 |
-
collection_name=self.collection_name,
|
| 449 |
-
vectors_config=VectorParams(
|
| 450 |
-
size=embedding_dim,
|
| 451 |
-
distance=Distance.COSINE
|
| 452 |
-
)
|
| 453 |
-
)
|
| 454 |
-
|
| 455 |
-
self.vector_store = QdrantVectorStore(
|
| 456 |
-
client=self.qdrant_client,
|
| 457 |
-
collection_name=self.collection_name,
|
| 458 |
-
embedding=self.embeddings
|
| 459 |
-
)
|
| 460 |
-
logger.info(f"Created Qdrant collection: {self.collection_name}")
|
| 461 |
-
|
| 462 |
-
# Add documents in batches
|
| 463 |
-
for i in range(0, len(documents), self.EMBEDDING_BATCH_SIZE):
|
| 464 |
-
batch = documents[i:i + self.EMBEDDING_BATCH_SIZE]
|
| 465 |
-
await asyncio.to_thread(self.vector_store.add_documents, batch)
|
| 466 |
-
await asyncio.sleep(0.05)
|
| 467 |
-
|
| 468 |
-
logger.info(f"Successfully added {len(documents)} chunks to vector store")
|
| 469 |
-
|
| 470 |
-
except Exception as e:
|
| 471 |
-
logger.error(f"Failed to add documents to Qdrant: {e}", exc_info=True)
|
| 472 |
-
raise
|
| 473 |
-
|
| 474 |
-
async def _save_checkpoint(self):
|
| 475 |
-
"""Save checkpoint during indexing"""
|
| 476 |
-
try:
|
| 477 |
-
self.indexing_state.last_indexed_at = int(asyncio.get_event_loop().time() * 1000)
|
| 478 |
-
await self._save_indexing_state()
|
| 479 |
-
logger.debug("Checkpoint saved")
|
| 480 |
-
except Exception as e:
|
| 481 |
-
logger.error(f"Failed to save checkpoint: {e}")
|
| 482 |
-
|
| 483 |
-
async def _get_workspace_files(self) -> List[str]:
|
| 484 |
-
"""Get all eligible workspace files"""
|
| 485 |
-
files = []
|
| 486 |
-
|
| 487 |
-
for ext in self.FILE_EXTENSIONS:
|
| 488 |
-
for file_path in self.workspace_root.rglob(f'*{ext}'):
|
| 489 |
-
# Check exclude patterns
|
| 490 |
-
if any(pattern in str(file_path) for pattern in self.EXCLUDE_PATTERNS):
|
| 491 |
-
continue
|
| 492 |
-
|
| 493 |
-
# Skip minified files
|
| 494 |
-
if '.min.' in file_path.name:
|
| 495 |
-
continue
|
| 496 |
-
|
| 497 |
-
files.append(str(file_path))
|
| 498 |
-
|
| 499 |
-
logger.debug(f"Found {len(files)} eligible files in workspace")
|
| 500 |
-
return files
|
| 501 |
-
|
| 502 |
-
async def search_similar_code(
|
| 503 |
-
self,
|
| 504 |
-
query: str,
|
| 505 |
-
k: int = 5
|
| 506 |
-
) -> List[Document]:
|
| 507 |
-
"""Search for similar code chunks"""
|
| 508 |
-
if not self.vector_store:
|
| 509 |
-
logger.error("Vector store not initialized")
|
| 510 |
-
raise ValueError("Vector store not initialized. Please index workspace first.")
|
| 511 |
-
|
| 512 |
-
try:
|
| 513 |
-
logger.debug(f"Searching for: {query[:50]}...")
|
| 514 |
-
results = await asyncio.to_thread(
|
| 515 |
-
self.vector_store.similarity_search,
|
| 516 |
-
query,
|
| 517 |
-
k=k
|
| 518 |
-
)
|
| 519 |
-
logger.debug(f"Found {len(results)} similar documents")
|
| 520 |
-
return results
|
| 521 |
-
except Exception as e:
|
| 522 |
-
logger.error(f"Search error: {e}", exc_info=True)
|
| 523 |
-
return []
|
| 524 |
-
|
| 525 |
-
async def get_relevant_context(
|
| 526 |
-
self,
|
| 527 |
-
query: str,
|
| 528 |
-
max_chunks: int = 5
|
| 529 |
-
) -> str:
|
| 530 |
-
"""Get relevant context for a query"""
|
| 531 |
-
try:
|
| 532 |
-
docs = await self.search_similar_code(query, max_chunks)
|
| 533 |
-
if not docs:
|
| 534 |
-
return ""
|
| 535 |
-
|
| 536 |
-
context_parts = []
|
| 537 |
-
for i, doc in enumerate(docs, 1):
|
| 538 |
-
source = doc.metadata.get('source', 'Unknown')
|
| 539 |
-
chunk_idx = doc.metadata.get('chunk_index', 0)
|
| 540 |
-
context_parts.append(
|
| 541 |
-
f"[Context {i}] File: {source} (chunk {chunk_idx})\n{doc.page_content}\n---"
|
| 542 |
-
)
|
| 543 |
-
|
| 544 |
-
return "\n\n".join(context_parts)
|
| 545 |
-
|
| 546 |
-
except Exception as e:
|
| 547 |
-
logger.error(f"Failed to get relevant context: {e}")
|
| 548 |
-
return ""
|
| 549 |
-
|
| 550 |
-
async def reindex_workspace(self) -> Dict[str, any]:
|
| 551 |
-
"""Reindex entire workspace from scratch"""
|
| 552 |
-
logger.info("Starting full reindex")
|
| 553 |
-
await self.reset_index()
|
| 554 |
-
return await self.index_workspace(show_progress=True)
|
| 555 |
-
|
| 556 |
-
async def reset_index(self):
|
| 557 |
-
"""Reset the entire index"""
|
| 558 |
-
try:
|
| 559 |
-
logger.info("Resetting index")
|
| 560 |
-
|
| 561 |
-
# Delete Qdrant collection
|
| 562 |
-
if self.qdrant_client:
|
| 563 |
-
try:
|
| 564 |
-
self.qdrant_client.delete_collection(self.collection_name)
|
| 565 |
-
logger.info(f"Deleted Qdrant collection: {self.collection_name}")
|
| 566 |
-
except Exception as e:
|
| 567 |
-
logger.warning(f"Could not delete collection: {e}")
|
| 568 |
-
|
| 569 |
-
self.vector_store = None
|
| 570 |
-
self.indexing_state = IndexingState()
|
| 571 |
-
|
| 572 |
-
# Clean up storage
|
| 573 |
-
if self.base_path.exists():
|
| 574 |
-
import shutil
|
| 575 |
-
shutil.rmtree(self.base_path)
|
| 576 |
-
|
| 577 |
-
self.base_path.mkdir(parents=True, exist_ok=True)
|
| 578 |
-
self.qdrant_path.mkdir(parents=True, exist_ok=True)
|
| 579 |
-
|
| 580 |
-
await self._save_indexing_state()
|
| 581 |
-
|
| 582 |
-
# Reinitialize Qdrant client
|
| 583 |
-
self.qdrant_client = QdrantClient(path=str(self.qdrant_path))
|
| 584 |
-
|
| 585 |
-
logger.info("Index reset successfully")
|
| 586 |
-
|
| 587 |
-
except Exception as e:
|
| 588 |
-
logger.error(f"Failed to reset index: {e}", exc_info=True)
|
| 589 |
-
raise
|
| 590 |
-
|
| 591 |
-
def get_indexed_files_count(self) -> int:
|
| 592 |
-
"""Get count of indexed files"""
|
| 593 |
-
return len(self.indexing_state.indexed_files)
|
| 594 |
-
|
| 595 |
-
def get_index_stats(self) -> Dict[str, any]:
|
| 596 |
-
"""Get indexing statistics"""
|
| 597 |
-
from datetime import datetime
|
| 598 |
-
|
| 599 |
-
last_indexed = "Never"
|
| 600 |
-
if self.indexing_state.last_indexed_at > 0:
|
| 601 |
-
dt = datetime.fromtimestamp(self.indexing_state.last_indexed_at / 1000)
|
| 602 |
-
last_indexed = dt.strftime('%Y-%m-%d %H:%M:%S')
|
| 603 |
-
|
| 604 |
-
vector_count = 0
|
| 605 |
-
if self.qdrant_client and self.vector_store:
|
| 606 |
-
try:
|
| 607 |
-
collection_info = self.qdrant_client.get_collection(self.collection_name)
|
| 608 |
-
vector_count = collection_info.points_count
|
| 609 |
-
except:
|
| 610 |
-
pass
|
| 611 |
-
|
| 612 |
-
return {
|
| 613 |
-
'total_indexed': len(self.indexing_state.indexed_files),
|
| 614 |
-
'total_failed': len(self.indexing_state.failed_files),
|
| 615 |
-
'vector_count': vector_count,
|
| 616 |
-
'last_indexed_at': last_indexed,
|
| 617 |
-
'is_ready': self.vector_store is not None,
|
| 618 |
-
'is_indexing': self.is_indexing,
|
| 619 |
-
'version': self.indexing_state.version,
|
| 620 |
-
'embedding_model': self.EMBEDDING_MODEL
|
| 621 |
-
}
|
| 622 |
-
|
| 623 |
-
def is_ready(self) -> bool:
|
| 624 |
-
"""Check if RAG service is ready"""
|
| 625 |
-
return (
|
| 626 |
-
self.vector_store is not None and
|
| 627 |
-
len(self.indexing_state.indexed_files) > 0
|
| 628 |
-
)
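The `_index_file_with_retry` helper above backs off exponentially between attempts (`delay = 0.5 * (2 ** attempt)`, i.e. 0.5 s, then 1 s, then 2 s). A minimal standalone sketch of that schedule, assuming `MAX_RETRIES = 3` as a hypothetical value since the constant is defined elsewhere in the class:

```python
def backoff_delays(max_retries: int = 3, base: float = 0.5) -> list[float]:
    """Delays (in seconds) slept between attempts: base * 2**attempt.

    Mirrors the schedule in _index_file_with_retry: no sleep follows the
    final attempt, so max_retries attempts yield max_retries - 1 waits.
    """
    return [base * (2 ** attempt) for attempt in range(max_retries - 1)]


print(backoff_delays())   # [0.5, 1.0]
print(backoff_delays(4))  # [0.5, 1.0, 2.0]
```

With three attempts the worst case adds only 1.5 s of waiting per file, which keeps a failed file from stalling the whole batch.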
services/rag_service_supabase.py (ADDED)
@@ -0,0 +1,595 @@
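The new service keys every stored and queried row by `workspace_id`, and derives that id as the MD5 hex digest of the workspace path (see `_hash_workspace_path` in the file below). A minimal sketch of just that derivation:

```python
import hashlib


def hash_workspace_path(path: str) -> str:
    """Stable 32-character hex id for a workspace path (MD5, as in the service)."""
    return hashlib.md5(path.encode()).hexdigest()


# Example with a hypothetical path: the id is deterministic, so the same
# workspace always maps to the same rows in code_embeddings.
wid = hash_workspace_path("/home/alice/project")
print(len(wid))  # 32
```

Because the hash is stable, re-indexing the same workspace overwrites or extends its own rows without touching other workspaces' data.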
| 1 |
+
import hashlib
|
| 2 |
+
import asyncio
|
| 3 |
+
import logging
|
| 4 |
+
from typing import List, Dict, Optional, Tuple
|
| 5 |
+
from dataclasses import dataclass
|
| 6 |
+
from datetime import datetime
|
| 7 |
+
import os
|
| 8 |
+
|
| 9 |
+
from langchain_core.documents import Document
|
| 10 |
+
from langchain_text_splitters import RecursiveCharacterTextSplitter
|
| 11 |
+
from langchain_huggingface import HuggingFaceEmbeddings
|
| 12 |
+
from supabase import create_client, Client
|
| 13 |
+
|
| 14 |
+
# Configure logging
|
| 15 |
+
logging.basicConfig(
|
| 16 |
+
level=logging.INFO,
|
| 17 |
+
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
|
| 18 |
+
)
|
| 19 |
+
logger = logging.getLogger(__name__)
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
@dataclass
|
| 23 |
+
class FileProcessResult:
|
| 24 |
+
file_path: str
|
| 25 |
+
success: bool
|
| 26 |
+
documents: Optional[List[Document]] = None
|
| 27 |
+
error: Optional[str] = None
|
| 28 |
+
size: int = 0
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
class RAGServiceSupabase:
|
| 32 |
+
"""Service for indexing and querying codebase with RAG using Supabase pgvector"""
|
| 33 |
+
|
| 34 |
+
VERSION = "4.0"
|
| 35 |
+
BATCH_SIZE = 10 # Files per batch
|
| 36 |
+
MAX_FILE_SIZE = 500_000 # 500KB
|
| 37 |
+
EMBEDDING_BATCH_SIZE = 32 # Chunks per embedding batch
|
| 38 |
+
RATE_LIMIT_DELAY = 0.1
|
| 39 |
+
MAX_CONCURRENT_READS = 5
|
| 40 |
+
|
| 41 |
+
# Using sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
|
| 42 |
+
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
|
| 43 |
+
EMBEDDING_DIM = 384
|
| 44 |
+
|
| 45 |
+
BINARY_EXTENSIONS = {
|
| 46 |
+
'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.ico',
|
| 47 |
+
'.mp3', '.mp4', '.avi', '.mov', '.wav',
|
| 48 |
+
'.zip', '.tar', '.gz', '.rar', '.7z',
|
| 49 |
+
'.exe', '.dll', '.so', '.dylib',
|
| 50 |
+
'.pdf', '.doc', '.docx', '.xls', '.xlsx',
|
| 51 |
+
'.woff', '.woff2', '.ttf', '.eot'
|
| 52 |
+
}
|
| 53 |
+
|
| 54 |
+
EXCLUDE_PATTERNS = [
|
| 55 |
+
'node_modules', '.git', 'dist', 'build',
|
| 56 |
+
'.vscode', 'coverage', '__pycache__',
|
| 57 |
+
'.pytest_cache', '.next', 'out', '.DS_Store',
|
| 58 |
+
'venv', 'env', '.env', 'vendor'
|
| 59 |
+
]
|
| 60 |
+
|
| 61 |
+
FILE_EXTENSIONS = [
|
| 62 |
+
        '.ts', '.js', '.py', '.jsx', '.tsx', '.java',
        '.go', '.php', '.rs', '.cpp', '.c', '.h', '.hpp',
        '.cs', '.rb', '.swift', '.kt', '.md', '.txt',
        '.json', '.yml', '.yaml', '.sh', '.sql', '.r'
    ]

    def __init__(self):
        """Initialize RAG service with Supabase"""
        logger.info("Initializing RAG service with Supabase pgvector")

        # Initialize Supabase client
        supabase_url = os.getenv("SUPABASE_URL")
        supabase_key = os.getenv("SUPABASE_KEY")

        if not supabase_url or not supabase_key:
            logger.warning("Supabase URL or key not configured. RAG features will not work.")
            logger.warning("Please set SUPABASE_URL and SUPABASE_KEY environment variables.")
            self.client: Optional[Client] = None
        else:
            try:
                self.client = create_client(supabase_url, supabase_key)
                # Test connection
                try:
                    self.client.table("code_embeddings").select("id").limit(1).execute()
                    logger.info("Supabase client initialized and verified")
                except Exception as e:
                    logger.error(f"Failed to verify Supabase connection: {e}")
                    self.client = None
            except Exception as e:
                logger.error(f"Failed to initialize Supabase client: {e}")
                self.client = None

        # Initialize HuggingFace embeddings
        try:
            self.embeddings = HuggingFaceEmbeddings(
                model_name=self.EMBEDDING_MODEL,
                model_kwargs={'device': 'cpu'},
                encode_kwargs={'normalize_embeddings': True}
            )
            logger.info(f"Loaded embedding model: {self.EMBEDDING_MODEL}")
        except Exception as e:
            logger.error(f"Failed to load embedding model: {e}")
            raise

        # Text splitter for code
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )

        self.is_indexing = False

    @staticmethod
    def _hash_workspace_path(path: str) -> str:
        """Create stable hash of workspace path"""
        return hashlib.md5(path.encode()).hexdigest()

    async def initialize(self):
        """Initialize RAG service (no-op for stateless service)"""
        if self.client is None:
            logger.warning("RAG service initialized without Supabase client")
        else:
            logger.info("RAG service initialized successfully")
        return True

    async def index_workspace(
        self,
        workspace_id: str,
        files: List[Dict[str, str]]
    ) -> Dict[str, any]:
        """
        Index workspace files

        Args:
            workspace_id: Unique identifier for the workspace
            files: List of dicts with 'path' and 'content' keys
        """
        if self.client is None:
            return {'error': 'Supabase client not initialized'}

        if self.is_indexing:
            logger.warning("Indexing already in progress")
            return {'error': 'Indexing already in progress'}

        self.is_indexing = True
        logger.info(f"Starting workspace indexing for workspace_id: {workspace_id}")

        try:
            total_files = len(files)
            if total_files == 0:
                self.is_indexing = False
                return {
                    'success': True,
                    'message': 'No files to index',
                    'indexed': 0
                }

            logger.info(f"Indexing {total_files} files")

            success_count = 0
            failed_count = 0

            # Process files in batches
            for i in range(0, total_files, self.BATCH_SIZE):
                batch = files[i:i + self.BATCH_SIZE]
                batch_start = i + 1
                batch_end = min(i + self.BATCH_SIZE, total_files)

                logger.info(f"Processing files {batch_start}-{batch_end} of {total_files}")

                try:
                    batch_results = await self._process_batch(workspace_id, batch)

                    for result in batch_results:
                        if result.success:
                            success_count += 1
                        else:
                            failed_count += 1
                            logger.warning(f"Failed to index {result.file_path}: {result.error}")

                    # Rate limiting
                    await asyncio.sleep(self.RATE_LIMIT_DELAY)

                except Exception as e:
                    logger.error(f"Batch {batch_start}-{batch_end} processing error: {e}", exc_info=True)
                    failed_count += len(batch)

            self.is_indexing = False
            logger.info(f"Indexing complete: {success_count} files indexed, {failed_count} failed")

            return {
                'success': True,
                'message': f'Indexed {success_count} files',
                'indexed': success_count,
                'failed': failed_count,
                'total': total_files
            }

        except Exception as e:
            self.is_indexing = False
            logger.error(f"Indexing error: {e}", exc_info=True)
            return {'error': str(e)}

    async def _process_batch(
        self,
        workspace_id: str,
        files: List[Dict[str, str]]
    ) -> List[FileProcessResult]:
        """Process a batch of files"""
        results = []

        # Process files concurrently
        tasks = [self._index_file(workspace_id, file_data) for file_data in files]
        batch_results = await asyncio.gather(*tasks, return_exceptions=True)

        for i, result in enumerate(batch_results):
            if isinstance(result, Exception):
                results.append(FileProcessResult(
                    file_path=files[i].get('path', 'unknown'),
                    success=False,
                    error=str(result),
                    size=0
                ))
            else:
                results.append(result)

        return results

    async def _index_file(
        self,
        workspace_id: str,
        file_data: Dict[str, str]
    ) -> FileProcessResult:
        """Index a single file"""
        file_path = file_data.get('path', '')
        content = file_data.get('content', '')

        try:
            # Size check
            if len(content.encode('utf-8')) > self.MAX_FILE_SIZE:
                return FileProcessResult(
                    file_path=file_path,
                    success=False,
                    error="File too large",
                    size=len(content.encode('utf-8'))
                )

            # Binary check
            if self._is_binary_file(file_path):
                return FileProcessResult(
                    file_path=file_path,
                    success=False,
                    error="Binary file",
                    size=len(content.encode('utf-8'))
                )

            # Content validation
            if not content or len(content.strip()) < 10:
                return FileProcessResult(
                    file_path=file_path,
                    success=False,
                    error="Empty or too short",
                    size=len(content.encode('utf-8'))
                )

            # Minified file check
            if self._is_minified_file(content, file_path):
                return FileProcessResult(
                    file_path=file_path,
                    success=False,
                    error="Minified file",
                    size=len(content.encode('utf-8'))
                )

            # Split into chunks
            try:
                chunks = await asyncio.to_thread(self.text_splitter.split_text, content)
            except Exception as e:
                logger.warning(f"Failed to split {file_path}: {e}")
                return FileProcessResult(
                    file_path=file_path,
                    success=False,
                    error=f"Split error: {e}",
                    size=len(content.encode('utf-8'))
                )

            # Delete existing embeddings for this file
            await self._delete_file_embeddings(workspace_id, file_path)

            # Create documents and embeddings
            documents = []
            for idx, chunk in enumerate(chunks):
                documents.append(Document(
                    page_content=chunk,
                    metadata={
                        'source': file_path,
                        'filename': os.path.basename(file_path),
                        'chunk_index': idx,
                        'total_chunks': len(chunks),
                        'file_size': len(content.encode('utf-8'))
                    }
                ))

            # Generate embeddings and store in Supabase
            await self._store_embeddings(workspace_id, file_path, documents)

            return FileProcessResult(
                file_path=file_path,
                success=True,
                documents=documents,
                size=len(content.encode('utf-8'))
            )

        except Exception as e:
            logger.error(f"Error indexing file {file_path}: {e}", exc_info=True)
            return FileProcessResult(
                file_path=file_path,
                success=False,
                error=str(e),
                size=len(content.encode('utf-8')) if content else 0
            )

    async def _store_embeddings(
        self,
        workspace_id: str,
        file_path: str,
        documents: List[Document]
    ):
        """Store document embeddings in Supabase"""
        if not documents:
            return

        try:
            # Generate embeddings in batches
            for i in range(0, len(documents), self.EMBEDDING_BATCH_SIZE):
                batch = documents[i:i + self.EMBEDDING_BATCH_SIZE]

                # Generate embeddings
                texts = [doc.page_content for doc in batch]
                embeddings_list = await asyncio.to_thread(
                    self.embeddings.embed_documents,
                    texts
                )

                # Prepare records for insertion
                records = []
                for j, (doc, embedding) in enumerate(zip(batch, embeddings_list)):
                    # Convert embedding to list format for Supabase
                    embedding_list = embedding if isinstance(embedding, list) else embedding.tolist()

                    records.append({
                        'workspace_id': workspace_id,
                        'file_path': file_path,
                        'content': doc.page_content,
                        'embedding': embedding_list,
                        'chunk_index': doc.metadata.get('chunk_index', j),
                        'total_chunks': doc.metadata.get('total_chunks', len(documents)),
                        'file_size': doc.metadata.get('file_size', 0)
                    })

                # Insert into Supabase
                if records:
                    self.client.table("code_embeddings").insert(records).execute()
                    logger.debug(f"Stored {len(records)} embeddings for {file_path}")

            logger.info(f"Stored {len(documents)} chunks for {file_path}")

        except Exception as e:
            logger.error(f"Failed to store embeddings for {file_path}: {e}", exc_info=True)
            raise

    async def _delete_file_embeddings(self, workspace_id: str, file_path: str):
        """Delete all embeddings for a file"""
        try:
            self.client.table("code_embeddings").delete().eq(
                "workspace_id", workspace_id
            ).eq("file_path", file_path).execute()
            logger.debug(f"Deleted embeddings for {file_path}")
        except Exception as e:
            logger.error(f"Failed to delete embeddings for {file_path}: {e}")

    def _is_binary_file(self, file_path: str) -> bool:
        """Check if file is binary"""
        ext = os.path.splitext(file_path)[1].lower()
        return ext in self.BINARY_EXTENSIONS

    def _is_minified_file(self, content: str, file_path: str) -> bool:
        """Check if file is minified"""
        ext = os.path.splitext(file_path)[1].lower()
        if ext not in ['.js', '.css', '.json']:
            return False

        lines = content.split('\n')
        if not lines:
            return False

        avg_line_length = len(content) / len(lines)
        return avg_line_length > 500 or '.min.' in file_path

    async def search_similar_code(
        self,
        workspace_id: str,
        query: str,
        k: int = 5
    ) -> List[Document]:
        """Search for similar code chunks using pgvector via Supabase RPC"""
        if self.client is None:
            logger.error("Supabase client not initialized")
            return []

        try:
            # Generate query embedding
            query_embedding = await asyncio.to_thread(
                self.embeddings.embed_query,
                query
            )

            # Convert to list format
            query_embedding_list = query_embedding if isinstance(query_embedding, list) else query_embedding.tolist()

            # Use Supabase RPC function for efficient vector search
            try:
                response = self.client.rpc(
                    'match_code_embeddings',
                    {
                        'query_embedding': query_embedding_list,
                        'workspace_filter': workspace_id,
                        'match_threshold': 0.3,  # Lower threshold for more results
                        'match_count': k
                    }
                ).execute()

                if not response.data:
                    return []

                # Convert to Document objects
                documents = []
                for row in response.data:
                    documents.append(Document(
                        page_content=row['content'],
                        metadata={
                            'source': row['file_path'],
                            'filename': os.path.basename(row['file_path']),
                            'chunk_index': row.get('chunk_index', 0),
                            'total_chunks': row.get('total_chunks', 1),
                            'similarity': float(row.get('similarity', 0.0))
                        }
                    ))

                logger.debug(f"Found {len(documents)} similar documents via RPC")
                return documents

            except Exception as rpc_error:
                # Fallback to Python-based similarity if RPC fails
                logger.warning(f"RPC search failed, using fallback: {rpc_error}")
                return await self._fallback_search(workspace_id, query_embedding_list, k)

        except Exception as e:
            logger.error(f"Search error: {e}", exc_info=True)
            return []

    async def _fallback_search(
        self,
        workspace_id: str,
        query_embedding_list: List[float],
        k: int
    ) -> List[Document]:
        """Fallback search using Python-based similarity computation"""
        try:
            # Fetch embeddings for workspace (limited for free tier)
            response = self.client.table("code_embeddings").select(
                "id, workspace_id, file_path, content, chunk_index, total_chunks, embedding"
            ).eq("workspace_id", workspace_id).limit(500).execute()

            if not response.data:
                return []

            # Compute cosine similarity
            import numpy as np

            query_vec = np.array(query_embedding_list)
            query_vec_norm = np.linalg.norm(query_vec)

            similarities = []
            for row in response.data:
                embedding = row['embedding']
                if embedding:
                    doc_vec = np.array(embedding)
                    doc_vec_norm = np.linalg.norm(doc_vec)

                    if doc_vec_norm > 0 and query_vec_norm > 0:
                        similarity = np.dot(query_vec, doc_vec) / (query_vec_norm * doc_vec_norm)
                        similarities.append((similarity, row))

            # Sort by similarity and take top k
            similarities.sort(key=lambda x: x[0], reverse=True)
            top_results = similarities[:k]

            # Convert to Document objects
            documents = []
            for similarity, row in top_results:
                documents.append(Document(
                    page_content=row['content'],
                    metadata={
                        'source': row['file_path'],
                        'filename': os.path.basename(row['file_path']),
                        'chunk_index': row.get('chunk_index', 0),
                        'total_chunks': row.get('total_chunks', 1),
                        'similarity': float(similarity)
                    }
                ))

            return documents

        except Exception as e:
            logger.error(f"Fallback search error: {e}")
            return []

    async def get_relevant_context(
        self,
        workspace_id: str,
        query: str,
        max_chunks: int = 5
    ) -> str:
        """Get relevant context for a query"""
        try:
            docs = await self.search_similar_code(workspace_id, query, max_chunks)
            if not docs:
                return ""

            context_parts = []
            for i, doc in enumerate(docs, 1):
                source = doc.metadata.get('source', 'Unknown')
                chunk_idx = doc.metadata.get('chunk_index', 0)
                similarity = doc.metadata.get('similarity', 0.0)
                context_parts.append(
                    f"[Context {i}] File: {source} (chunk {chunk_idx}, similarity: {similarity:.3f})\n{doc.page_content}\n---"
                )

            return "\n\n".join(context_parts)

        except Exception as e:
            logger.error(f"Failed to get relevant context: {e}")
            return ""

    async def delete_file(self, workspace_id: str, file_path: str) -> Dict[str, any]:
        """Delete embeddings for a file"""
        if self.client is None:
            return {'error': 'Supabase client not initialized'}

        try:
            await self._delete_file_embeddings(workspace_id, file_path)
            return {
                'success': True,
                'message': f'Deleted embeddings for {file_path}'
            }
        except Exception as e:
            logger.error(f"Failed to delete file embeddings: {e}")
            return {'error': str(e)}

    async def get_index_stats(self, workspace_id: str) -> Dict[str, any]:
        """Get indexing statistics for a workspace"""
        if self.client is None:
            return {'error': 'Supabase client not initialized'}

        try:
            # Count embeddings for this workspace
            response = self.client.table("code_embeddings").select(
                "id, file_path",
                count="exact"
            ).eq("workspace_id", workspace_id).execute()

            total_vectors = response.count if hasattr(response, 'count') else len(response.data)

            # Count unique files
            unique_files = set()
            for row in response.data:
                unique_files.add(row['file_path'])

            return {
                'workspace_id': workspace_id,
                'total_vectors': total_vectors,
                'total_files': len(unique_files),
                'is_ready': total_vectors > 0,
                'embedding_model': self.EMBEDDING_MODEL
            }
        except Exception as e:
            logger.error(f"Failed to get index stats: {e}")
            return {'error': str(e)}

    def is_ready(self, workspace_id: Optional[str] = None) -> bool:
        """Check if RAG service is ready (always true if client is initialized)"""
        return self.client is not None
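When the `match_code_embeddings` RPC is unavailable, `_fallback_search` ranks chunks with a cosine similarity computed client-side (the service uses NumPy for this). A dependency-free sketch of the same ranking logic; `rank_chunks` and the toy rows below are illustrative, not part of the service's API:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two vectors; returns 0.0 for zero-norm
    # vectors, mirroring the norm checks in _fallback_search.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def rank_chunks(query_embedding, rows, k=5):
    # rows: dicts with an 'embedding' key, as fetched from code_embeddings.
    # Returns the top-k (similarity, row) pairs, highest similarity first.
    scored = [(cosine_similarity(query_embedding, r['embedding']), r) for r in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because the stored embeddings are normalized (`normalize_embeddings=True`), this ranking agrees with the `<=>` cosine-distance ordering used by the pgvector RPC path.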
supabase_migrations/001_create_code_embeddings.sql
ADDED
@@ -0,0 +1,68 @@
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create code_embeddings table for workspace-scoped RAG
CREATE TABLE IF NOT EXISTS code_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id TEXT NOT NULL,
    file_path TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(384), -- Using sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
    chunk_index INTEGER DEFAULT 0,
    total_chunks INTEGER DEFAULT 1,
    file_size INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create indexes for efficient querying
CREATE INDEX IF NOT EXISTS idx_code_embeddings_workspace_id ON code_embeddings(workspace_id);
CREATE INDEX IF NOT EXISTS idx_code_embeddings_file_path ON code_embeddings(workspace_id, file_path);
CREATE INDEX IF NOT EXISTS idx_code_embeddings_embedding ON code_embeddings
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- Create index for workspace + file_path combination (for efficient deletion)
CREATE INDEX IF NOT EXISTS idx_code_embeddings_workspace_file ON code_embeddings(workspace_id, file_path);

-- Add comments
COMMENT ON TABLE code_embeddings IS 'Stores code embeddings for RAG with workspace isolation';
COMMENT ON COLUMN code_embeddings.workspace_id IS 'Hash of workspace path for isolation';
COMMENT ON COLUMN code_embeddings.embedding IS 'Vector embedding of code chunk (384 dimensions)';

-- Create function for efficient vector similarity search
CREATE OR REPLACE FUNCTION match_code_embeddings(
    query_embedding vector(384),
    workspace_filter text,
    match_threshold float DEFAULT 0.5,
    match_count int DEFAULT 5
)
RETURNS TABLE (
    id uuid,
    workspace_id text,
    file_path text,
    content text,
    chunk_index integer,
    total_chunks integer,
    file_size integer,
    similarity float
)
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT
        code_embeddings.id,
        code_embeddings.workspace_id,
        code_embeddings.file_path,
        code_embeddings.content,
        code_embeddings.chunk_index,
        code_embeddings.total_chunks,
        code_embeddings.file_size,
        1 - (code_embeddings.embedding <=> query_embedding) AS similarity
    FROM code_embeddings
    WHERE code_embeddings.workspace_id = workspace_filter
        AND 1 - (code_embeddings.embedding <=> query_embedding) > match_threshold
    ORDER BY code_embeddings.embedding <=> query_embedding
    LIMIT match_count;
END;
$$;
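The migration's column comment notes that `workspace_id` is a hash of the workspace path, matching `_hash_workspace_path` in the Python service. A minimal sketch of how a caller could derive the same identifier before inserting or querying rows; the function name `workspace_id_for` is illustrative:

```python
import hashlib

def workspace_id_for(path: str) -> str:
    # Same scheme as the service's _hash_workspace_path:
    # MD5 hex digest of the workspace path (32 lowercase hex chars).
    return hashlib.md5(path.encode()).hexdigest()
```

Since the digest is deterministic, every client that hashes the same workspace path lands in the same `workspace_id` partition, which is what enforces the workspace isolation described in RAG_REFACTOR.md.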