Spaces: amryassin/embedding-bench

AmrYassinIsFree committed on Apr 17

Commit db0da0a

1 Parent(s): a9bc1f8

replace matplot with plotly, add more evals, UI re-org

Files changed (7)

README.md +70 -86
app.py +483 -159
corpus.py +6 -2
dataset_config.py +3 -1
evals/llm_judge.py +194 -0
evals/quality.py +59 -20
requirements.txt +1 -1

README.md CHANGED

@@ -12,22 +12,18 @@ license: mit
 # embedding-bench
-Compare text embedding models across retrieval quality, inference speed, and memory footprint. Everything runs locally — no external API calls.
-## Models
-| Key | Model | Backend | Role |
-|-----|-------|---------|------|
-| `mpnet` | `sentence-transformers/all-mpnet-base-v2` | sbert | Baseline |
-| `bge-small` | `BAAI/bge-small-en-v1.5` | sbert | |
-| `bge-small-fe` | `BAAI/bge-small-en-v1.5` | fastembed | |
-| `all-minilm-fe` | `sentence-transformers/all-MiniLM-L6-v2` | fastembed | |
-Three backends are supported:
-- **sbert** — [sentence-transformers](https://www.sbert.net/) (PyTorch). Default.
-- **fastembed** — [qdrant/fastembed](https://github.com/qdrant/fastembed) (ONNX Runtime). Lighter and often faster.
-- **gguf** — [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for quantised GGUF models.
 ## Setup
@@ -37,9 +33,40 @@ source .venv/bin/activate
 pip install -r requirements.txt
 ```
-## Usage
-### Basic
 ```bash
 # Full benchmark (quality + speed + memory)
@@ -48,78 +75,39 @@ python bench.py
 # Specific models
 python bench.py --models mpnet bge-small
-# Compare the same model across backends
 python bench.py --models bge-small bge-small-fe
 # Skip expensive evals
 python bench.py --skip-quality
 python bench.py --skip-memory
-# Tune corpus size and batch size
-python bench.py --corpus-size 500 --batch-size 32 --num-runs 5
-```
-### Datasets
-By default, quality is evaluated on the STS Benchmark. You can evaluate on multiple HuggingFace datasets using built-in presets:
-| Preset | HF Dataset | Type | Pairs |
-|--------|-----------|------|-------|
-| `sts` | `mteb/stsbenchmark-sts` | Scored (Spearman) | 1,379 |
-| `natural-questions` | `sentence-transformers/natural-questions` | Retrieval (MRR/Recall) | 100,231 |
-| `msmarco` | `sentence-transformers/msmarco-bm25` | Retrieval | 503,000 |
-| `squad` | `sentence-transformers/squad` | Retrieval | 87,599 |
-| `trivia-qa` | `sentence-transformers/trivia-qa` | Retrieval | 73,346 |
-| `gooaq` | `sentence-transformers/gooaq` | Retrieval | 3,012,496 |
-| `hotpotqa` | `sentence-transformers/hotpotqa` | Retrieval | 84,500 |
-```bash
-# Evaluate on multiple datasets
 python bench.py --models mpnet bge-small \
   --datasets sts natural-questions squad \
-  --skip-speed --skip-memory
-# Limit pairs for large datasets
-python bench.py --datasets msmarco gooaq --max-pairs 1000
-# Use a custom HF dataset (overrides --datasets)
 python bench.py --dataset my-org/my-pairs \
   --query-col query --passage-col passage --score-col none
-```
-Scored datasets (with `--score-col`) report **Spearman correlation**. Pair-only datasets (`--score-col none`) report **MRR**, **Recall@1**, **Recall@5**, and **Recall@10**.
-### Export results
-```bash
-# Export to CSV
-python bench.py --csv results.csv
-# Save charts as PNG
-python bench.py --charts ./results
-# Both
-python bench.py --models mpnet bge-small \
-  --datasets sts squad natural-questions \
-  --max-pairs 1000 \
-  --csv results.csv --charts ./results
 ```
-Charts generated:
-- `quality_<dataset>.png` — Spearman bar chart (scored) or grouped MRR/Recall bars (retrieval)
-- `speed.png` — sentences/second comparison
-- `memory.png` — peak memory usage comparison
-## Metrics
-| Dimension | Metric | Method |
-|-----------|--------|--------|
-| Quality (scored) | Spearman rho | Cosine similarity vs gold scores |
-| Quality (pairs) | MRR, Recall@k | Retrieval ranking of positive passages |
-| Speed | Median encode time | Wall-clock over N runs with warmup |
-| Memory | Peak RSS delta | Isolated subprocess via `psutil` |
-## CLI reference
 ```
 --models            Models to benchmark (default: all)
@@ -144,36 +132,32 @@ Charts generated:
 ## Adding a model
-Edit `models.py` and add an entry to `REGISTRY`:
 ```python
-# sentence-transformers backend (default)
 "e5-small": ModelConfig(
     name="e5-small-v2",
     model_id="intfloat/e5-small-v2",
 ),
-# fastembed backend
-"e5-small-fe": ModelConfig(
-    name="e5-small-v2 (fastembed)",
-    model_id="intfloat/e5-small-v2",
-    backend="fastembed",
-),
 ```
 ## Project structure
 ```
 embedding-bench/
 ├── bench.py             # CLI entry point
-├── models.py            # Model registry
-├── wrapper.py           # Backend wrappers (sbert, fastembed, gguf)
 ├── corpus.py            # Sentence corpus builder
 ├── dataset_config.py    # Dataset presets and configuration
-├── report.py            # Table formatting, CSV export, charts
 ├── evals/
-│   ├── quality.py       # STS + retrieval evaluation
 │   ├── speed.py         # Latency measurement
-│   └── memory.py        # Memory measurement
 └── requirements.txt
 ```

 # embedding-bench
+Compare text embedding models on quality, speed, and memory. Includes a Streamlit web UI and a CLI.
+## Features
+- **40+ pre-configured models** — sentence-transformers, BGE, E5, GTE, Nomic, Jina, Arctic, and more
+- **4 backends** — sbert (PyTorch), fastembed (ONNX), gguf (llama-cpp), libembedding
+- **7 built-in datasets** — STS Benchmark, Natural Questions, MS MARCO, SQuAD, TriviaQA, GooAQ, HotpotQA
+- **Custom datasets** — upload your own CSV/TSV or load any HuggingFace dataset
+- **Custom models** — add any HuggingFace embedding model from the UI
+- **11 retrieval metrics** — MRR, MAP@k, NDCG@k, Precision@k, Recall@k (all configurable)
+- **LLM as a Judge** — use OpenAI or Anthropic to rate retrieval relevance
+- **Interactive charts** — Plotly-powered, with hover, zoom, and PNG export
 ## Setup
 pip install -r requirements.txt
 ```
+## Web UI
+```bash
+streamlit run app.py
+```
+The sidebar has three sections:
+1. **Models** — select from the registry or add a custom HuggingFace model
+2. **Datasets** — pick built-in presets, upload a CSV/TSV, or add any HuggingFace dataset
+3. **Evaluation** — configure metrics, speed/memory benchmarks, LLM judge, and max pairs
+### Custom datasets
+You can add datasets two ways from the sidebar:
+- **Upload file** — CSV or TSV (max 50 MB, 50k rows) with a query column and a passage column. Optionally include a numeric score column for Spearman correlation; otherwise retrieval metrics (MRR, Recall@k, etc.) are used.
+- **HuggingFace Hub** — provide the dataset ID (e.g. `mteb/stsbenchmark-sts`), config, split, and column names. The dataset is validated on add.
+### LLM as a Judge
+Enable in the Evaluation section. Provide your OpenAI or Anthropic API key. For each sampled query, the top-5 retrieved passages are rated for relevance (1–5) by the LLM. Reports judge_avg@1, judge_avg@5, and judge_nDCG@5.
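Reviewer sketch (not part of the commit): one way the three judge scores named above could be aggregated from per-query ratings. This is not taken from `evals/llm_judge.py`, whose contents are not shown here. The names `dcg` and `judge_metrics`, the linear gain, and the choice to build the ideal ordering from the same five ratings are all assumptions.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain with linear gain (assumed, not confirmed)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings))


def judge_metrics(ratings_per_query, k=5):
    """Aggregate 1-5 relevance ratings, given in retrieval order, per query.

    Hypothetical helper: each inner list holds the judge's ratings for the
    top-k passages of one query, best-ranked passage first.
    """
    avg1, avgk, ndcg = [], [], []
    for ratings in ratings_per_query:
        top = ratings[:k]
        avg1.append(top[0])                      # rating of the rank-1 passage
        avgk.append(sum(top) / len(top))         # mean rating over the top-k
        ideal = dcg(sorted(top, reverse=True))   # best possible ordering of these ratings
        ndcg.append(dcg(top) / ideal if ideal > 0 else 0.0)
    n = len(ratings_per_query)
    return {
        "judge_avg@1": sum(avg1) / n,
        f"judge_avg@{k}": sum(avgk) / n,
        f"judge_nDCG@{k}": sum(ndcg) / n,
    }
```

Under these assumptions, a query whose passages are already sorted from most to least relevant scores an nDCG of 1.0, and any other ordering of the same ratings scores lower.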
+### Metrics
+| Dimension | Metrics | Method |
+|-----------|---------|--------|
+| Quality (scored) | Spearman | Cosine similarity vs gold scores |
+| Quality (pairs) | MRR, MAP@5/10, NDCG@5/10, Precision@1/5/10, Recall@1/5/10 | Retrieval ranking of positive passages |
+| LLM Judge | Avg@1, Avg@5, nDCG@5 | LLM relevance ratings on retrieved passages |
+| Speed | Median encode time, sent/s | Wall-clock over N runs with warmup |
+| Memory | Peak RSS delta (MB) | Isolated subprocess via `psutil` |
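Reviewer sketch (not part of the commit): the "Quality (pairs)" row can be illustrated with a small pure-Python version of MRR and Recall@k. It assumes each query has exactly one positive passage (passage i for query i) and takes a precomputed similarity matrix. `positive_ranks` and `retrieval_metrics` are hypothetical names, not the functions in `evals/quality.py`.

```python
def positive_ranks(sims):
    """Return the 1-based rank of the positive passage for each query.

    sims[i][j] is the similarity of query i to passage j; passage i is
    assumed to be the only positive for query i. The rank is one plus the
    number of passages that score strictly higher than the positive.
    """
    return [1 + sum(s > row[i] for s in row) for i, row in enumerate(sims)]


def retrieval_metrics(sims, ks=(1, 5, 10)):
    """Compute MRR and Recall@k under the single-positive assumption."""
    ranks = positive_ranks(sims)
    n = len(ranks)
    metrics = {"mrr": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        # With one positive per query, Recall@k is the fraction of queries
        # whose positive lands in the top k.
        metrics[f"recall@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


# Toy example: the positive is ranked 1st, 2nd, and 2nd for the three queries.
example = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.7, 0.6],
]
print(retrieval_metrics(example, ks=(1, 5)))
```

In practice the similarity matrix would come from cosine similarity between query and passage embeddings, as the Method column for the scored row suggests.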
+## CLI
 ```bash
 # Full benchmark (quality + speed + memory)
 # Specific models
 python bench.py --models mpnet bge-small
+# Compare backends
 python bench.py --models bge-small bge-small-fe
 # Skip expensive evals
 python bench.py --skip-quality
 python bench.py --skip-memory
+# Multiple datasets with pair limit
 python bench.py --models mpnet bge-small \
   --datasets sts natural-questions squad \
+  --max-pairs 1000 --skip-speed --skip-memory
+# Custom HF dataset
 python bench.py --dataset my-org/my-pairs \
   --query-col query --passage-col passage --score-col none
+# Export
+python bench.py --csv results.csv --charts ./results
 ```
+### Built-in dataset presets
+| Preset | HF Dataset | Type |
+|--------|-----------|------|
+| `sts` | `mteb/stsbenchmark-sts` | Scored (Spearman) |
+| `natural-questions` | `sentence-transformers/natural-questions` | Retrieval |
+| `msmarco` | `sentence-transformers/msmarco-bm25` | Retrieval |
+| `squad` | `sentence-transformers/squad` | Retrieval |
+| `trivia-qa` | `sentence-transformers/trivia-qa` | Retrieval |
+| `gooaq` | `sentence-transformers/gooaq` | Retrieval |
+| `hotpotqa` | `sentence-transformers/hotpotqa` | Retrieval |
+### CLI flags
 ```
 --models            Models to benchmark (default: all)
 ## Adding a model
+From the web UI, click **Add Custom Model** in the sidebar — just provide a display name and a HuggingFace model ID.
+Or edit `models.py` directly:
 ```python
 "e5-small": ModelConfig(
     name="e5-small-v2",
     model_id="intfloat/e5-small-v2",
 ),
 ```
 ## Project structure
 ```
 embedding-bench/
+├── app.py               # Streamlit web UI
 ├── bench.py             # CLI entry point
+├── models.py            # Model registry (40+ models)
+├── wrapper.py           # Backend wrappers (sbert, fastembed, gguf, libembedding)
 ├── corpus.py            # Sentence corpus builder
 ├── dataset_config.py    # Dataset presets and configuration
+├── report.py            # Table formatting, CSV export, charts (CLI)
 ├── evals/
+│   ├── quality.py       # Quality evaluation (Spearman + retrieval metrics)
 │   ├── speed.py         # Latency measurement
+│   ├── memory.py        # Memory measurement
+│   └── llm_judge.py     # LLM-as-a-Judge evaluation
 └── requirements.txt
 ```

app.py CHANGED

@@ -2,17 +2,19 @@ from __future__ import annotations
 import io
 import csv
 import time
-import matplotlib.pyplot as plt
 import numpy as np
 import streamlit as st
 from datasets import load_dataset
 from corpus import build_corpus
 from dataset_config import DATASET_PRESETS, DatasetConfig
-from evals.quality import evaluate_quality
 from evals.speed import evaluate_speed
 from models import (
     REGISTRY,
@@ -109,11 +111,22 @@ with col_badge:
 st.markdown("<hr class='section-divider'>", unsafe_allow_html=True)
 # ---------------------------------------------------------------------------
 # Sidebar — configuration
 # ---------------------------------------------------------------------------
 st.sidebar.markdown("### ⚙️ Configuration")
 st.sidebar.markdown("**Models**")
 available_models = list(REGISTRY.keys())
 selected_models = st.sidebar.multiselect(
@@ -125,22 +138,34 @@ selected_models = st.sidebar.multiselect(
 with st.sidebar.expander("➕ Add Custom Model"):
     with st.form("add_model_form", clear_on_submit=True):
-        new_key = st.text_input("Registry key", placeholder="my-model")
         new_name = st.text_input("Display name", placeholder="My Custom Model")
         new_model_id = st.text_input("HuggingFace model ID", placeholder="org/model-name")
         new_backend = st.selectbox("Backend", sorted(VALID_BACKENDS))
         new_gguf_file = st.text_input(
-            "GGUF filename (gguf backend only)", value="", placeholder="model.gguf"
         )
-        new_is_baseline = st.checkbox("Mark as baseline", value=False)
-        new_persist = st.checkbox("Save to disk", value=False,
-                                  help="Persist to custom_models.json so it loads next session")
         submitted = st.form_submit_button("Add Model", use_container_width=True)
     if submitted:
-        if not new_key or not new_name or not new_model_id:
-            st.sidebar.error("Key, name, and model ID are required.")
-        elif new_backend == "gguf" and not new_gguf_file:
-            st.sidebar.error("GGUF filename is required for gguf backend.")
         else:
             cfg = ModelConfig(
                 name=new_name,
@@ -157,58 +182,280 @@ with st.sidebar.expander("➕ Add Custom Model"):
             except ValueError as e:
                 st.sidebar.error(str(e))
 st.sidebar.markdown("**Datasets**")
-available_datasets = list(DATASET_PRESETS.keys())
 selected_datasets = st.sidebar.multiselect(
-    "Select dataset presets",
     available_datasets,
-    default=["sts"],
     label_visibility="collapsed",
 )
-max_pairs = st.sidebar.number_input(
-    "Max pairs per dataset",
-    min_value=100,
-    max_value=50000,
-    value=1000,
-    step=100,
-    help="Limits the number of pairs evaluated. Keep low for large datasets.",
-)
-st.sidebar.markdown("---")
-st.sidebar.markdown("**Speed & Memory**")
-run_speed = st.sidebar.checkbox("Speed benchmark", value=False)
-run_memory = st.sidebar.checkbox("Memory benchmark", value=False)
-corpus_size = 500
-num_runs = 3
-batch_size = 64
-if run_speed or run_memory:
-    corpus_size = st.sidebar.number_input("Corpus size", 100, 10000, 500, step=100)
-    batch_size = st.sidebar.number_input("Batch size", 8, 512, 64, step=8)
-if run_speed:
-    num_runs = st.sidebar.number_input("Speed runs", 1, 10, 3)
-st.sidebar.markdown("---")
-st.sidebar.markdown("**Cache**")
-_cache_c1, _cache_c2 = st.sidebar.columns(2)
-with _cache_c1:
-    if st.button("🗑️ Clear All", use_container_width=True,
-                 help="Clear cached models, datasets, and results"):
-        st.cache_resource.clear()
-        st.cache_data.clear()
-        for key in list(st.session_state.keys()):
-            del st.session_state[key]
-        st.rerun()
-with _cache_c2:
-    if st.button("🔄 Results", use_container_width=True,
-                 help="Clear eval results but keep models loaded"):
-        st.cache_data.clear()
-        for key in ["results", "selected_datasets"]:
-            st.session_state.pop(key, None)
-        st.rerun()
-st.sidebar.markdown("---")
 # ---------------------------------------------------------------------------
 # Cached functions
@@ -239,8 +486,9 @@ def cached_evaluate_quality(
     score_col: str | None,
     score_scale: float,
     max_pairs: int | None,
 ) -> dict[str, float]:
-    """Cache quality results keyed by (model, dataset, max_pairs).
     The _model arg is excluded from the hash (underscore prefix).
     model_key is used as a hashable stand-in.
@@ -250,7 +498,10 @@ def cached_evaluate_quality(
         query_col=query_col, passage_col=passage_col,
         score_col=score_col, score_scale=score_scale,
     )
-    return evaluate_quality(_model, ds_cfg, max_pairs=max_pairs)
 @st.cache_data(show_spinner="Building corpus...", ttl=3600)
@@ -274,6 +525,9 @@ def flatten_result(r: dict) -> dict:
     for ds_key, metrics in r.get("quality", {}).items():
         for metric_name, value in metrics.items():
             flat[f"{ds_key}/{metric_name}"] = value
     speed = r.get("speed")
     if speed:
         flat["Speed (sent/s)"] = speed["sentences_per_second"]
@@ -311,23 +565,19 @@ def render_metric_card(label: str, value: str, sub: str = "", css_class: str = "
 # ---------------------------------------------------------------------------
-# Chart style helper
 # ---------------------------------------------------------------------------
 CHART_BG = "#0E1117"
-CHART_TEXT = "#CCCCCC"
-def style_chart(fig, ax):
-    """Apply dark theme to a matplotlib chart."""
-    fig.patch.set_facecolor(CHART_BG)
-    ax.set_facecolor(CHART_BG)
-    ax.spines["top"].set_visible(False)
-    ax.spines["right"].set_visible(False)
-    ax.spines["left"].set_color("#444")
-    ax.spines["bottom"].set_color("#444")
-    ax.tick_params(colors=CHART_TEXT, labelsize=7)
-    ax.yaxis.label.set_color(CHART_TEXT)
-    ax.xaxis.label.set_color(CHART_TEXT)
-    ax.title.set_color("#FAFAFA")
 # ---------------------------------------------------------------------------
@@ -341,13 +591,20 @@ if not selected_datasets:
     st.warning("Select at least one dataset from the sidebar.")
     st.stop()
-run_btn = st.sidebar.button("🚀 Run Benchmark", type="primary", use_container_width=True)
 if run_btn:
-    ds_configs = [DATASET_PRESETS[k] for k in selected_datasets]
     results = []
     progress = st.progress(0, text="Starting...")
-    total_steps = len(selected_models) * (len(ds_configs) + int(run_speed) + int(run_memory))
     step = 0
     for model_key in selected_models:
@@ -363,23 +620,58 @@ if run_btn:
                 step / total_steps,
                 text=f"Evaluating **{cfg.name}** on *{ds_key}*...",
             )
-            quality_results[ds_key] = cached_evaluate_quality(
-                model, model_key,
-                ds_cfg.name, ds_cfg.config, ds_cfg.split,
-                ds_cfg.query_col, ds_cfg.passage_col,
-                ds_cfg.score_col, ds_cfg.score_scale,
-                max_pairs,
-            )
         result["quality"] = quality_results
         if run_speed:
             step += 1
             progress.progress(step / total_steps, text=f"Speed benchmark: **{cfg.name}**...")
             ds0 = ds_configs[0]
-            corpus = cached_build_corpus(
-                corpus_size, ds0.name, ds0.config, ds0.split,
-                ds0.query_col, ds0.passage_col,
-            )
             result["speed"] = evaluate_speed(model, corpus, num_runs=num_runs, batch_size=batch_size)
         if run_memory:
@@ -387,10 +679,13 @@ if run_btn:
             progress.progress(step / total_steps, text=f"Memory benchmark: **{cfg.name}**...")
             from evals.memory import evaluate_memory
             ds0 = ds_configs[0]
-            corpus = cached_build_corpus(
-                corpus_size, ds0.name, ds0.config, ds0.split,
-                ds0.query_col, ds0.passage_col,
-            )
             result["memory_mb"] = evaluate_memory(
                 cfg.model_id, corpus, batch_size=batch_size, backend=cfg.backend,
             )
@@ -412,7 +707,7 @@ if "results" not in st.session_state:
         "<div style='text-align:center; padding:3rem 0; color:#666;'>"
         "<p style='font-size:2.5rem; margin-bottom:0.5rem;'>📐</p>"
         "<p style='font-size:1.1rem;'>Configure models &amp; datasets in the sidebar,<br>"
-        "then hit <b>Run Benchmark</b>.</p></div>",
         unsafe_allow_html=True,
     )
     st.stop()
@@ -434,8 +729,13 @@ for r in results:
 if ds_keys:
     first_ds = ds_keys[0]
     first_metrics_sample = results[0].get("quality", {}).get(first_ds, {})
-    primary_metric = "spearman" if "spearman" in first_metrics_sample else "mrr"
-    primary_label = "Spearman" if primary_metric == "spearman" else "MRR"
     scores = [
         (r["name"], r.get("quality", {}).get(first_ds, {}).get(primary_metric, 0))
@@ -524,47 +824,73 @@ for ds_key in ds_keys:
     if "spearman" in first_metrics:
         values = [r.get("quality", {}).get(ds_key, {}).get("spearman", 0) for r in results]
-        fig, ax = plt.subplots(figsize=(4, 2.4))
-        style_chart(fig, ax)
-        bars = ax.bar(models, values, color="#4C72B0", edgecolor="#5a82c0", linewidth=0.5)
-        ax.set_ylabel("Spearman", fontsize=8)
-        ax.set_title(f"Quality — {ds_key}", fontsize=9, pad=8)
-        ax.set_ylim(0, 1.08)
-        for bar, v in zip(bars, values):
-            ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
-                    f"{v:.4f}", ha="center", va="bottom", fontsize=7, color=CHART_TEXT)
-        plt.xticks(rotation=30, ha="right")
-        plt.tight_layout()
-        st.pyplot(fig, use_container_width=False)
-        plt.close(fig)
     else:
-        metric_names = ["mrr", "recall@1", "recall@5", "recall@10"]
-        x = np.arange(len(models))
-        width = 0.18
-        colors = ["#4C72B0", "#55A868", "#C44E52", "#8172B2"]
-        fig, ax = plt.subplots(figsize=(max(4, len(models) * 1.4), 3.0))
-        style_chart(fig, ax)
-        for i, (metric, color) in enumerate(zip(metric_names, colors)):
             values = [r.get("quality", {}).get(ds_key, {}).get(metric, 0) for r in results]
-            offset = (i - 1.5) * width
-            bars = ax.bar(x + offset, values, width, label=metric, color=color,
-                          edgecolor=color, linewidth=0.3, alpha=0.9)
-            for bar, v in zip(bars, values):
-                ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
-                        f"{v:.2f}", ha="center", va="bottom", fontsize=6, color=CHART_TEXT)
-        ax.set_ylabel("Score", fontsize=8)
-        ax.set_title(f"Retrieval Quality — {ds_key}", fontsize=9, pad=8)
-        ax.set_ylim(0, 1.12)
-        ax.set_xticks(x)
-        ax.set_xticklabels(models, rotation=30, ha="right", fontsize=7)
-        ax.legend(fontsize=6, ncol=4, loc="upper center",
-                  bbox_to_anchor=(0.5, -0.22),
-                  facecolor=CHART_BG, edgecolor="#444", labelcolor=CHART_TEXT)
-        plt.tight_layout()
-        fig.subplots_adjust(bottom=0.28)
-        st.pyplot(fig, use_container_width=False)
-        plt.close(fig)
 # Speed & Memory side by side
 speed_values = [r.get("speed", {}).get("sentences_per_second", 0) for r in results]
@@ -577,36 +903,34 @@ if has_speed or has_memory:
     if has_speed:
         with cols[0]:
-            fig, ax = plt.subplots(figsize=(3.5, 2.4))
-            style_chart(fig, ax)
-            bars = ax.bar(models, speed_values, color="#55A868", edgecolor="#65b878", linewidth=0.5)
-            ax.set_ylabel("Sent / s", fontsize=8)
-            ax.set_title("Encoding Speed", fontsize=9, pad=8)
-            for bar, v in zip(bars, speed_values):
-                if v > 0:
-                    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
-                            str(v), ha="center", va="bottom", fontsize=7, color=CHART_TEXT)
-            plt.xticks(rotation=30, ha="right")
-            plt.tight_layout()
-            st.pyplot(fig, use_container_width=False)
-            plt.close(fig)
     if has_memory:
         col_idx = 1 if has_speed else 0
         with cols[col_idx]:
-            fig, ax = plt.subplots(figsize=(3.5, 2.4))
-            style_chart(fig, ax)
-            bars = ax.bar(models, mem_values, color="#C44E52", edgecolor="#d45e62", linewidth=0.5)
-            ax.set_ylabel("MB", fontsize=8)
-            ax.set_title("Memory Usage", fontsize=9, pad=8)
-            for bar, v in zip(bars, mem_values):
-                if v > 0:
-                    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
-                            str(v), ha="center", va="bottom", fontsize=7, color=CHART_TEXT)
-            plt.xticks(rotation=30, ha="right")
-            plt.tight_layout()
-            st.pyplot(fig, use_container_width=False)
-            plt.close(fig)
 # ---------------------------------------------------------------------------
 # Footer

 import io
 import csv
+import re
 import time
 import numpy as np
+import pandas as pd
+import plotly.graph_objects as go
 import streamlit as st
 from datasets import load_dataset
 from corpus import build_corpus
 from dataset_config import DATASET_PRESETS, DatasetConfig
+from evals.quality import ALL_RETRIEVAL_METRICS, DEFAULT_RETRIEVAL_METRICS, evaluate_quality
 from evals.speed import evaluate_speed
 from models import (
     REGISTRY,
 st.markdown("<hr class='section-divider'>", unsafe_allow_html=True)
+# ---------------------------------------------------------------------------
+# Helper: slugify a display name into a registry key
+# ---------------------------------------------------------------------------
+def _slugify(name: str) -> str:
+    s = name.strip().lower()
+    s = re.sub(r"[^a-z0-9]+", "-", s)
+    return s.strip("-")
 # ---------------------------------------------------------------------------
 # Sidebar — configuration
 # ---------------------------------------------------------------------------
 st.sidebar.markdown("### ⚙️ Configuration")
+# ---- Models ---------------------------------------------------------------
 st.sidebar.markdown("**Models**")
 available_models = list(REGISTRY.keys())
 selected_models = st.sidebar.multiselect(
 with st.sidebar.expander("➕ Add Custom Model"):
     with st.form("add_model_form", clear_on_submit=True):
         new_name = st.text_input("Display name", placeholder="My Custom Model")
         new_model_id = st.text_input("HuggingFace model ID", placeholder="org/model-name")
         new_backend = st.selectbox("Backend", sorted(VALID_BACKENDS))
         new_gguf_file = st.text_input(
+            "GGUF filename", value="", placeholder="model.gguf",
+            help="Only needed for the gguf backend.",
         )
+        _adv_c1, _adv_c2 = st.columns(2)
+        new_is_baseline = _adv_c1.checkbox("Baseline", value=False)
+        new_persist = _adv_c2.checkbox("Save to disk", value=False,
+                                       help="Persist across sessions")
         submitted = st.form_submit_button("Add Model", use_container_width=True)
     if submitted:
+        new_key = _slugify(new_name) if new_name else ""
+        errors: list[str] = []
+        if not new_name:
+            errors.append("Display name is required.")
+        elif new_key in REGISTRY:
+            errors.append(f"A model named '{new_name}' already exists.")
+        if not new_model_id:
+            errors.append("HuggingFace model ID is required.")
+        elif "/" not in new_model_id:
+            errors.append("Model ID should be in `org/model-name` format.")
+        if new_backend == "gguf" and not new_gguf_file:
+            errors.append("GGUF filename is required for gguf backend.")
+        if errors:
+            for err in errors:
+                st.sidebar.error(err)
         else:
             cfg = ModelConfig(
                 name=new_name,
             except ValueError as e:
                 st.sidebar.error(str(e))
+# ---- Datasets -------------------------------------------------------------
 st.sidebar.markdown("**Datasets**")
+# Merge preset + user datasets (need this before the multiselect)
+user_datasets: dict[str, DatasetConfig] = st.session_state.get("user_datasets", {})
+all_datasets = {**DATASET_PRESETS, **user_datasets}
+available_datasets = list(all_datasets.keys())
 selected_datasets = st.sidebar.multiselect(
+    "Select datasets",
     available_datasets,
+    default=["sts"] if "sts" in available_datasets else available_datasets[:1],
     label_visibility="collapsed",
 )
+_MAX_UPLOAD_ROWS = 50_000
+_MAX_UPLOAD_MB = 50
+with st.sidebar.expander("➕ Add Dataset"):
+    ds_source = st.radio(
+        "Source", ["Upload file", "HuggingFace Hub"],
+        horizontal=True, label_visibility="collapsed",
+    )
+    if ds_source == "Upload file":
+        st.caption(
+            "CSV or TSV with query and passage columns. "
+            "Optional numeric score column enables Spearman correlation; "
+            "otherwise MRR & Recall@k are used. Max 50 MB / 50 k rows."
+        )
+        uploaded_file = st.file_uploader(
+            "Upload CSV or TSV", type=["csv", "tsv"], label_visibility="collapsed",
+        )
+        if uploaded_file is not None:
+            file_size_mb = uploaded_file.size / (1024 * 1024)
+            if file_size_mb > _MAX_UPLOAD_MB:
+                st.error(f"File too large ({file_size_mb:.1f} MB). Max {_MAX_UPLOAD_MB} MB.")
+            else:
+                sep = "\t" if uploaded_file.name.endswith(".tsv") else ","
+                try:
+                    user_df = pd.read_csv(uploaded_file, sep=sep)
+                except Exception as e:
+                    st.error(f"Failed to parse: {e}")
+                    user_df = None
+                if user_df is not None:
+                    errs: list[str] = []
+                    if len(user_df.columns) < 2:
+                        errs.append("Need at least 2 columns.")
+                    if len(user_df) == 0:
+                        errs.append("File is empty.")
+                    if len(user_df) > _MAX_UPLOAD_ROWS:
+                        errs.append(f"Too many rows ({len(user_df):,}). Max {_MAX_UPLOAD_ROWS:,}.")
+                    if user_df.columns.duplicated().any():
+                        errs.append("Duplicate column names.")
+                    if errs:
+                        for e in errs:
+                            st.error(e)
+                    else:
+                        cols = list(user_df.columns)
+                        st.dataframe(user_df.head(5), use_container_width=True, hide_index=True)
+                        with st.form("add_dataset_form", clear_on_submit=False):
+                            ds_label = st.text_input(
+                                "Dataset name",
+                                value=uploaded_file.name.rsplit(".", 1)[0],
+                            )
+                            user_query_col = st.selectbox("Query column", cols, index=0)
+                            user_passage_col = st.selectbox(
+                                "Passage column", cols, index=min(1, len(cols) - 1),
+                            )
+                            has_score = st.checkbox("Has score column")
+                            user_score_col = st.selectbox(
+                                "Score column", cols,
+                                index=min(2, len(cols) - 1),
+                                disabled=not has_score,
+                            )
+                            user_score_scale = st.number_input(
+                                "Score scale (max value)",
+                                min_value=1.0, value=5.0, step=1.0,
+                                disabled=not has_score,
+                                help="Scores divided by this to normalise to 0-1.",
+                            )
+                            ds_submitted = st.form_submit_button(
+                                "Add Dataset", use_container_width=True,
+                            )
+                        if ds_submitted:
+                            sub_errs: list[str] = []
+                            if not ds_label:
+                                sub_errs.append("Name is required.")
+                            if user_query_col == user_passage_col:
+                                sub_errs.append("Query and passage columns must differ.")
+                            if has_score and user_score_col in (
+                                user_query_col, user_passage_col,
+                            ):
+                                sub_errs.append("Score column must differ from query/passage.")
+                            if user_df[user_query_col].astype(str).str.strip().eq("").all():
+                                sub_errs.append(f"Query column '{user_query_col}' is empty.")
+                            if user_df[user_passage_col].astype(str).str.strip().eq("").all():
+                                sub_errs.append(f"Passage column '{user_passage_col}' is empty.")
+                            if has_score:
+                                try:
+                                    pd.to_numeric(user_df[user_score_col], errors="raise")
+                                except (ValueError, TypeError):
+                                    sub_errs.append(f"Score column '{user_score_col}' must be numeric.")
+                            if sub_errs:
+                                for e in sub_errs:
+                                    st.error(e)
+                            else:
+                                data_dict = {c: user_df[c].astype(str).tolist() for c in cols}
+                                if has_score:
+                                    data_dict[user_score_col] = [
+                                        float(v) for v in user_df[user_score_col]
+                                    ]
+                                user_ds_cfg = DatasetConfig(
+                                    name=f"user/{ds_label}",
+                                    query_col=user_query_col,
+                                    passage_col=user_passage_col,
+                                    score_col=user_score_col if has_score else None,
+                                    score_scale=user_score_scale if has_score else 1.0,
+                                    data=data_dict,
+                                )
+                                if "user_datasets" not in st.session_state:
+                                    st.session_state["user_datasets"] = {}
+                                st.session_state["user_datasets"][ds_label] = user_ds_cfg
+                                st.success(f"Added **{ds_label}** ({len(user_df):,} rows)")
+    else:  # HuggingFace Hub
+        st.caption("Load any dataset from [huggingface.co/datasets](https://huggingface.co/datasets).")
+        with st.form("add_hf_dataset_form", clear_on_submit=True):
+            hf_ds_label = st.text_input("Dataset name", placeholder="my-dataset")
+            hf_ds_id = st.text_input("HuggingFace ID", placeholder="org/dataset-name")
+            _hf_c1, _hf_c2 = st.columns(2)
+            hf_ds_config = _hf_c1.text_input("Config", value="", help="Leave blank if none.")
+            hf_ds_split = _hf_c2.text_input("Split", value="test")
+            hf_query_col = st.text_input("Query column", placeholder="query")
+            hf_passage_col = st.text_input("Passage column", placeholder="passage")
+            hf_has_score = st.checkbox("Has score column")
+            hf_score_col = st.text_input(
+                "Score column", placeholder="score", disabled=not hf_has_score,
+            )
+            hf_score_scale = st.number_input(
+                "Score scale (max value)", min_value=1.0, value=5.0, step=1.0,
+                disabled=not hf_has_score,
+                help="Scores are divided by this value to normalise them to 0-1.",
+            )
+            hf_submitted = st.form_submit_button("Add Dataset", use_container_width=True)
+        if hf_submitted:
+            hf_errors: list[str] = []
+            if not hf_ds_label:
+                hf_errors.append("Dataset name is required.")
+            if not hf_ds_id:
+                hf_errors.append("HuggingFace ID is required.")
+            if not hf_query_col:
+                hf_errors.append("Query column is required.")
+            if not hf_passage_col:
+                hf_errors.append("Passage column is required.")
+            if hf_query_col and hf_passage_col and hf_query_col == hf_passage_col:
+                hf_errors.append("Query and passage columns must differ.")
+            if hf_has_score and not hf_score_col:
+                hf_errors.append("Score column is required when enabled.")
+            if hf_has_score and hf_score_col in (hf_query_col, hf_passage_col):
+                hf_errors.append("Score column must differ from query/passage.")
+            if hf_errors:
+                for err in hf_errors:
+                    st.error(err)
+            else:
+                try:
+                    _cfg_arg = hf_ds_config or None
+                    _test_ds = load_dataset(hf_ds_id, _cfg_arg, split=hf_ds_split)
+                    _ds_cols = _test_ds.column_names
+                    _missing = [
+                        c for c in [hf_query_col, hf_passage_col]
+                        + ([hf_score_col] if hf_has_score else [])
+                        if c not in _ds_cols
+                    ]
+                    if _missing:
+                        st.error(
+                            f"Column(s) not found: {', '.join(_missing)}. "
+                            f"Available: {', '.join(_ds_cols)}"
+                        )
+                    else:
+                        hf_ds_cfg = DatasetConfig(
+                            name=hf_ds_id,
+                            config=_cfg_arg,
+                            split=hf_ds_split,
+                            query_col=hf_query_col,
+                            passage_col=hf_passage_col,
+                            score_col=hf_score_col if hf_has_score else None,
+                            score_scale=hf_score_scale if hf_has_score else 1.0,
+                        )
+                        if "user_datasets" not in st.session_state:
+                            st.session_state["user_datasets"] = {}
+                        st.session_state["user_datasets"][hf_ds_label] = hf_ds_cfg
+                        st.success(f"Added **{hf_ds_label}**")
+                        st.rerun()
+                except Exception as e:
+                    st.error(f"Failed to load: {e}")
+# ---- Evaluation options ---------------------------------------------------
+_LLM_PROVIDERS = {"openai": "OpenAI", "anthropic": "Anthropic"}
+_DEFAULT_MODELS = {"openai": "gpt-4o-mini", "anthropic": "claude-haiku-4-5-20251001"}
+with st.sidebar.expander("⚙️ Evaluation"):
+    max_pairs = st.number_input(
+        "Max pairs per dataset",
+        min_value=100, max_value=50000, value=1000, step=100,
+        help="Caps the number of pairs evaluated per dataset.",
+    )
+    selected_metrics = st.multiselect(
+        "Retrieval metrics",
+        ALL_RETRIEVAL_METRICS,
+        default=DEFAULT_RETRIEVAL_METRICS,
+        help="Metrics for pair-based datasets (no score column). Scored datasets always use Spearman.",
+    )
+    st.markdown("---")
+    run_speed = st.checkbox("Speed benchmark")
+    run_memory = st.checkbox("Memory benchmark")
+    corpus_size = 500
+    num_runs = 3
+    batch_size = 64
+    if run_speed or run_memory:
+        _sp_c1, _sp_c2 = st.columns(2)
+        corpus_size = _sp_c1.number_input("Corpus size", 100, 10000, 500, step=100)
+        batch_size = _sp_c2.number_input("Batch size", 8, 512, 64, step=8)
+    if run_speed:
+        num_runs = st.number_input("Speed runs", 1, 10, 3)
+    st.markdown("---")
+    run_llm_judge = st.checkbox("LLM as a Judge")
+    llm_provider = "openai"
+    llm_api_key = ""
+    llm_model = ""
+    llm_max_samples = 50
+    if run_llm_judge:
+        st.caption(
+            "An LLM rates how relevant retrieved passages are to each query (1-5). "
+            "API charges apply."
+        )
+        llm_provider = st.selectbox(
+            "Provider", list(_LLM_PROVIDERS.keys()),
+            format_func=lambda k: _LLM_PROVIDERS[k],
+        )
+        llm_api_key = st.text_input(
+            "API key", type="password", placeholder="sk-...",
+        )
+        llm_model = st.text_input("Model", value=_DEFAULT_MODELS[llm_provider])
+        llm_max_samples = st.number_input(
+            "Samples to judge", min_value=5, max_value=500, value=50, step=5,
+            help="Number of queries sampled. Each query costs 5 API calls (one per top-5 passage).",
+        )
+    st.markdown("---")
+    _cache_c1, _cache_c2 = st.columns(2)
+    with _cache_c1:
+        if st.button("🗑 Clear All", use_container_width=True):
+            st.cache_resource.clear()
+            st.cache_data.clear()
+            for key in list(st.session_state.keys()):
+                del st.session_state[key]
+            st.rerun()
+    with _cache_c2:
+        if st.button("🔄 Results", use_container_width=True):
+            st.cache_data.clear()
+            for key in ["results", "selected_datasets"]:
+                st.session_state.pop(key, None)
+            st.rerun()
 # ---------------------------------------------------------------------------
 # Cached functions
     score_col: str | None,
     score_scale: float,
     max_pairs: int | None,
+    metrics: tuple[str, ...] | None = None,
 ) -> dict[str, float]:
+    """Cache quality results keyed by (model, dataset, max_pairs, metrics).
     The _model arg is excluded from the hash (underscore prefix).
     model_key is used as a hashable stand-in.
         query_col=query_col, passage_col=passage_col,
         score_col=score_col, score_scale=score_scale,
     )
+    return evaluate_quality(
+        _model, ds_cfg, max_pairs=max_pairs,
+        metrics=list(metrics) if metrics else None,
+    )
 @st.cache_data(show_spinner="Building corpus...", ttl=3600)
     for ds_key, metrics in r.get("quality", {}).items():
         for metric_name, value in metrics.items():
             flat[f"{ds_key}/{metric_name}"] = value
+    for ds_key, metrics in r.get("llm_judge", {}).items():
+        for metric_name, value in metrics.items():
+            flat[f"{ds_key}/{metric_name}"] = value
     speed = r.get("speed")
     if speed:
         flat["Speed (sent/s)"] = speed["sentences_per_second"]
 # ---------------------------------------------------------------------------
+# Chart helpers
 # ---------------------------------------------------------------------------
 CHART_BG = "#0E1117"
+_PLOTLY_LAYOUT = dict(
+    paper_bgcolor=CHART_BG,
+    plot_bgcolor=CHART_BG,
+    font=dict(color="#CCCCCC", size=11),
+    margin=dict(l=50, r=20, t=40, b=60),
+    bargap=0.25,
+    xaxis=dict(gridcolor="#2a2d35", zerolinecolor="#2a2d35"),
+    yaxis=dict(gridcolor="#2a2d35", zerolinecolor="#2a2d35"),
+)
 # ---------------------------------------------------------------------------
     st.warning("Select at least one dataset from the sidebar.")
     st.stop()
+if run_llm_judge and not llm_api_key:
+    st.warning("Enter an API key in the sidebar to use LLM judge evaluation.")
+    run_llm_judge = False
+run_btn = st.sidebar.button("🚀 Run", type="primary", use_container_width=True)
 if run_btn:
+    ds_configs = [all_datasets[k] for k in selected_datasets]
     results = []
     progress = st.progress(0, text="Starting...")
+    total_steps = len(selected_models) * (
+        len(ds_configs) + int(run_speed) + int(run_memory)
+        + (len(ds_configs) if run_llm_judge else 0)
+    )
     step = 0
     for model_key in selected_models:
                 step / total_steps,
                 text=f"Evaluating **{cfg.name}** on *{ds_key}*...",
             )
+            _metrics = selected_metrics or None
+            if ds_cfg.data is not None:
+                quality_results[ds_key] = evaluate_quality(
+                    model, ds_cfg, max_pairs=max_pairs, metrics=_metrics,
+                )
+            else:
+                quality_results[ds_key] = cached_evaluate_quality(
+                    model, model_key,
+                    ds_cfg.name, ds_cfg.config, ds_cfg.split,
+                    ds_cfg.query_col, ds_cfg.passage_col,
+                    ds_cfg.score_col, ds_cfg.score_scale,
+                    max_pairs,
+                    metrics=tuple(_metrics) if _metrics else None,
+                )
         result["quality"] = quality_results
+        if run_llm_judge:
+            from evals.llm_judge import LLMJudgeConfig, evaluate_llm_judge
+            judge_cfg = LLMJudgeConfig(
+                provider=llm_provider,
+                api_key=llm_api_key,
+                model=llm_model,
+                max_samples=llm_max_samples,
+            )
+            judge_results = {}
+            for ds_cfg in ds_configs:
+                ds_key = ds_cfg.name.split("/")[-1]
+                step += 1
+                progress.progress(
+                    step / total_steps,
+                    text=f"LLM judge: **{cfg.name}** on *{ds_key}*...",
+                )
+                try:
+                    judge_results[ds_key] = evaluate_llm_judge(
+                        model, ds_cfg, judge_cfg, max_pairs=max_pairs,
+                    )
+                except Exception as e:
+                    st.warning(f"LLM judge failed for {cfg.name}/{ds_key}: {e}")
+                    judge_results[ds_key] = {}
+            result["llm_judge"] = judge_results
         if run_speed:
             step += 1
             progress.progress(step / total_steps, text=f"Speed benchmark: **{cfg.name}**...")
             ds0 = ds_configs[0]
+            if ds0.data is not None:
+                corpus = build_corpus(corpus_size, ds0)
+            else:
+                corpus = cached_build_corpus(
+                    corpus_size, ds0.name, ds0.config, ds0.split,
+                    ds0.query_col, ds0.passage_col,
+                )
             result["speed"] = evaluate_speed(model, corpus, num_runs=num_runs, batch_size=batch_size)
         if run_memory:
+            step += 1
             progress.progress(step / total_steps, text=f"Memory benchmark: **{cfg.name}**...")
             from evals.memory import evaluate_memory
             ds0 = ds_configs[0]
+            if ds0.data is not None:
+                corpus = build_corpus(corpus_size, ds0)
+            else:
+                corpus = cached_build_corpus(
+                    corpus_size, ds0.name, ds0.config, ds0.split,
+                    ds0.query_col, ds0.passage_col,
+                )
             result["memory_mb"] = evaluate_memory(
                 cfg.model_id, corpus, batch_size=batch_size, backend=cfg.backend,
             )
         "<div style='text-align:center; padding:3rem 0; color:#666;'>"
         "<p style='font-size:2.5rem; margin-bottom:0.5rem;'>📐</p>"
         "<p style='font-size:1.1rem;'>Configure models &amp; datasets in the sidebar,<br>"
+        "then hit <b>Run</b>.</p></div>",
         unsafe_allow_html=True,
     )
     st.stop()
 if ds_keys:
     first_ds = ds_keys[0]
     first_metrics_sample = results[0].get("quality", {}).get(first_ds, {})
+    if "spearman" in first_metrics_sample:
+        primary_metric = "spearman"
+        primary_label = "Spearman"
+    else:
+        # Use the first available retrieval metric
+        primary_metric = next(iter(first_metrics_sample), "mrr")
+        primary_label = primary_metric.upper()
     scores = [
         (r["name"], r.get("quality", {}).get(first_ds, {}).get(primary_metric, 0))
     if "spearman" in first_metrics:
         values = [r.get("quality", {}).get(ds_key, {}).get("spearman", 0) for r in results]
+        fig = go.Figure(go.Bar(
+            x=models, y=values,
+            marker_color="#4C72B0",
+            text=[f"{v:.4f}" for v in values],
+            textposition="outside",
+        ))
+        fig.update_layout(
+            **_PLOTLY_LAYOUT,
+            title=f"Quality — {ds_key}",
+            yaxis_title="Spearman",
+            yaxis_range=[0, 1.08],
+        )
+        st.plotly_chart(fig, use_container_width=True)
     else:
+        metric_names = list(first_metrics.keys())
+        _palette = [
+            "#4C72B0", "#55A868", "#C44E52", "#8172B2",
+            "#E5AE38", "#DD8452", "#64B5CD", "#8C8C8C",
+            "#D4A6C8", "#6ACC65", "#D65F5F",
+        ]
+        fig = go.Figure()
+        for i, metric in enumerate(metric_names):
+            color = _palette[i % len(_palette)]
             values = [r.get("quality", {}).get(ds_key, {}).get(metric, 0) for r in results]
+            fig.add_trace(go.Bar(
+                name=metric, x=models, y=values,
+                marker_color=color,
+                text=[f"{v:.2f}" for v in values],
+                textposition="outside",
+            ))
+        fig.update_layout(
+            **_PLOTLY_LAYOUT,
+            title=f"Retrieval Quality — {ds_key}",
+            yaxis_title="Score",
+            yaxis_range=[0, 1.12],
+            barmode="group",
+            legend=dict(orientation="h", yanchor="bottom", y=-0.25, xanchor="center", x=0.5),
+        )
+        st.plotly_chart(fig, use_container_width=True)
+# LLM Judge charts
+for ds_key in ds_keys:
+    has_judge = any(r.get("llm_judge", {}).get(ds_key) for r in results)
+    if not has_judge:
+        continue
+    judge_metrics = ["judge_avg@1", "judge_avg@5", "judge_ndcg@5"]
+    judge_labels = ["Avg@1", "Avg@5", "nDCG@5"]
+    colors = ["#E5AE38", "#DD8452", "#C44E52"]
+    fig = go.Figure()
+    for metric, label, color in zip(judge_metrics, judge_labels, colors):
+        values = [r.get("llm_judge", {}).get(ds_key, {}).get(metric, 0) for r in results]
+        fig.add_trace(go.Bar(
+            name=label, x=models, y=values,
+            marker_color=color,
+            text=[f"{v:.2f}" for v in values],
+            textposition="outside",
+        ))
+    fig.update_layout(
+        **_PLOTLY_LAYOUT,
+        title=f"LLM Judge — {ds_key}",
+        yaxis_title="Score",
+        yaxis_range=[0, 1.12],
+        barmode="group",
+        legend=dict(orientation="h", yanchor="bottom", y=-0.25, xanchor="center", x=0.5),
+    )
+    st.plotly_chart(fig, use_container_width=True)
 # Speed & Memory side by side
 speed_values = [r.get("speed", {}).get("sentences_per_second", 0) for r in results]
     if has_speed:
         with cols[0]:
+            fig = go.Figure(go.Bar(
+                x=models, y=speed_values,
+                marker_color="#55A868",
+                text=[str(v) if v > 0 else "" for v in speed_values],
+                textposition="outside",
+            ))
+            fig.update_layout(
+                **_PLOTLY_LAYOUT,
+                title="Encoding Speed",
+                yaxis_title="Sent / s",
+            )
+            st.plotly_chart(fig, use_container_width=True)
     if has_memory:
         col_idx = 1 if has_speed else 0
         with cols[col_idx]:
+            fig = go.Figure(go.Bar(
+                x=models, y=mem_values,
+                marker_color="#C44E52",
+                text=[str(v) if v > 0 else "" for v in mem_values],
+                textposition="outside",
+            ))
+            fig.update_layout(
+                **_PLOTLY_LAYOUT,
+                title="Memory Usage",
+                yaxis_title="MB",
+            )
+            st.plotly_chart(fig, use_container_width=True)
 # ---------------------------------------------------------------------------
 # Footer

corpus.py CHANGED

@@ -9,8 +9,12 @@ def build_corpus(size: int, ds_cfg: DatasetConfig | None = None) -> list[str]:
     """Build a corpus of real sentences from the configured dataset."""
     if ds_cfg is None:
         ds_cfg = DatasetConfig()
-    dataset = load_dataset(ds_cfg.name, ds_cfg.config, split=ds_cfg.split)
-    sentences = list(dataset[ds_cfg.query_col]) + list(dataset[ds_cfg.passage_col])
     full: list[str] = []
     while len(full) < size:
         full.extend(sentences)

     """Build a corpus of real sentences from the configured dataset."""
     if ds_cfg is None:
         ds_cfg = DatasetConfig()
+    if ds_cfg.data is not None:
+        data = ds_cfg.data
+    else:
+        dataset = load_dataset(ds_cfg.name, ds_cfg.config, split=ds_cfg.split)
+        data = {col: list(dataset[col]) for col in dataset.column_names}
+    sentences = list(data[ds_cfg.query_col]) + list(data[ds_cfg.passage_col])
     full: list[str] = []
     while len(full) < size:
         full.extend(sentences)
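
The `data` branch in this hunk lets `build_corpus` work on an in-memory dataset without a Hub download. A minimal sketch of that path follows. `MiniDatasetConfig` and `corpus_from_data` are hypothetical stand-ins written for illustration (the real `DatasetConfig` has more fields), and the final truncation to `size` is an assumption about the part of `build_corpus` that this hunk does not show.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniDatasetConfig:
    """Hypothetical stand-in for DatasetConfig (illustration only)."""
    query_col: str = "sentence1"
    passage_col: str = "sentence2"
    # Column name -> list of values; when set, no download is needed.
    data: Optional[dict] = field(default=None, repr=False)


def corpus_from_data(size: int, cfg: MiniDatasetConfig) -> list:
    """Build a corpus of `size` sentences from pre-loaded columns."""
    if cfg.data is None:
        raise ValueError("this sketch only covers the in-memory path")
    sentences = list(cfg.data[cfg.query_col]) + list(cfg.data[cfg.passage_col])
    if not sentences:
        # Guard against an endless loop on an empty dataset.
        raise ValueError("dataset has no sentences")
    full = []
    while len(full) < size:
        full.extend(sentences)  # repeat the sentences until there are enough
    return full[:size]  # assumed truncation, not shown in the hunk


cfg = MiniDatasetConfig(data={"sentence1": ["a", "b"], "sentence2": ["c"]})
corpus = corpus_from_data(5, cfg)
print(corpus)  # ['a', 'b', 'c', 'a', 'b']
```

Small datasets are simply cycled, so a corpus larger than the dataset contains repeated sentences.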

dataset_config.py CHANGED

@@ -1,6 +1,6 @@
 from __future__ import annotations
-from dataclasses import dataclass
 @dataclass
@@ -14,6 +14,8 @@ class DatasetConfig:
     passage_col: str = "sentence2"
     score_col: str | None = "score"
     score_scale: float = 5.0
 DATASET_PRESETS: dict[str, DatasetConfig] = {

 from __future__ import annotations
+from dataclasses import dataclass, field
 @dataclass
     passage_col: str = "sentence2"
     score_col: str | None = "score"
     score_scale: float = 5.0
+    # Pre-loaded data (dict of column-name -> list). When set, skip HF download.
+    data: dict[str, list] | None = field(default=None, repr=False)
 DATASET_PRESETS: dict[str, DatasetConfig] = {

evals/llm_judge.py ADDED

	@@ -0,0 +1,194 @@

+from __future__ import annotations
+import json
+import random
+import urllib.request
+import urllib.error
+from dataclasses import dataclass
+import numpy as np
+from dataset_config import DatasetConfig
+@dataclass
+class LLMJudgeConfig:
+    provider: str  # "openai" or "anthropic"
+    api_key: str
+    model: str
+    max_samples: int = 50
+# ---------------------------------------------------------------------------
+# Provider-specific API calls
+# ---------------------------------------------------------------------------
+_SYSTEM_PROMPT = (
+    "You are an impartial relevance judge. Given a query and a passage, "
+    "rate how relevant the passage is to the query on a scale of 1 to 5.\n\n"
+    "1 = Completely irrelevant\n"
+    "2 = Slightly relevant\n"
+    "3 = Moderately relevant\n"
+    "4 = Highly relevant\n"
+    "5 = Perfectly relevant\n\n"
+    "Respond with ONLY a single integer (1-5), nothing else."
+)
+def _build_user_prompt(query: str, passage: str) -> str:
+    return f"Query: {query}\n\nPassage: {passage}"
+def _call_openai(api_key: str, model: str, query: str, passage: str) -> int:
+    body = json.dumps({
+        "model": model,
+        "messages": [
+            {"role": "system", "content": _SYSTEM_PROMPT},
+            {"role": "user", "content": _build_user_prompt(query, passage)},
+        ],
+        "max_tokens": 4,
+        "temperature": 0.0,
+    }).encode()
+    req = urllib.request.Request(
+        "https://api.openai.com/v1/chat/completions",
+        data=body,
+        headers={
+            "Authorization": f"Bearer {api_key}",
+            "Content-Type": "application/json",
+        },
+    )
+    with urllib.request.urlopen(req, timeout=30) as resp:
+        data = json.loads(resp.read())
+    text = data["choices"][0]["message"]["content"].strip()
+    return _parse_score(text)
+def _call_anthropic(api_key: str, model: str, query: str, passage: str) -> int:
+    body = json.dumps({
+        "model": model,
+        "max_tokens": 4,
+        "system": _SYSTEM_PROMPT,
+        "messages": [
+            {"role": "user", "content": _build_user_prompt(query, passage)},
+        ],
+    }).encode()
+    req = urllib.request.Request(
+        "https://api.anthropic.com/v1/messages",
+        data=body,
+        headers={
+            "x-api-key": api_key,
+            "anthropic-version": "2023-06-01",
+            "Content-Type": "application/json",
+        },
+    )
+    with urllib.request.urlopen(req, timeout=30) as resp:
+        data = json.loads(resp.read())
+    text = data["content"][0]["text"].strip()
+    return _parse_score(text)
+def _parse_score(text: str) -> int:
+    for ch in text:
+        if ch in "12345":
+            return int(ch)
+    return 3  # fallback to neutral
+_PROVIDERS = {
+    "openai": _call_openai,
+    "anthropic": _call_anthropic,
+}
+# ---------------------------------------------------------------------------
+# Main evaluation entry point
+# ---------------------------------------------------------------------------
+def evaluate_llm_judge(
+    model,
+    ds_cfg: DatasetConfig,
+    judge_cfg: LLMJudgeConfig,
+    max_pairs: int | None = None,
+    progress_callback=None,
+) -> dict[str, float]:
+    """Use an LLM to judge retrieval relevance for top-k results.
+    For each sampled query, retrieves the top-5 passages by embedding
+    similarity and asks the LLM to rate each one on a 1-5 scale. Returns
+    the mean normalised (0-1) rating at cut-offs 1 and 5, plus nDCG@5.
+    """
+    from datasets import load_dataset
+    if ds_cfg.data is not None:
+        data = ds_cfg.data
+    else:
+        dataset = load_dataset(ds_cfg.name, ds_cfg.config, split=ds_cfg.split)
+        data = {col: list(dataset[col]) for col in dataset.column_names}
+    queries = list(data[ds_cfg.query_col])
+    passages = list(data[ds_cfg.passage_col])
+    if max_pairs is not None and len(queries) > max_pairs:
+        queries = queries[:max_pairs]
+        passages = passages[:max_pairs]
+    # Encode
+    emb_q = model.encode(queries, is_query=True)
+    emb_p = model.encode(passages, is_query=False)
+    # Normalise
+    emb_q = emb_q / np.linalg.norm(emb_q, axis=1, keepdims=True)
+    emb_p = emb_p / np.linalg.norm(emb_p, axis=1, keepdims=True)
+    # Sample queries to judge
+    n = len(queries)
+    sample_size = min(judge_cfg.max_samples, n)
+    sample_indices = sorted(random.sample(range(n), sample_size))
+    call_fn = _PROVIDERS[judge_cfg.provider]
+    top_k = 5
+    # For each sampled query, get top-k passages and judge them
+    relevance_at_k: list[list[int]] = []  # shape: (sample_size, top_k)
+    total_calls = sample_size * top_k
+    calls_done = 0
+    for idx in sample_indices:
+        query_emb = emb_q[idx : idx + 1]
+        sims = (query_emb @ emb_p.T).flatten()
+        top_indices = np.argsort(-sims)[:top_k]
+        scores_for_query = []
+        for passage_idx in top_indices:
+            try:
+                score = call_fn(
+                    judge_cfg.api_key, judge_cfg.model,
+                    queries[idx], passages[int(passage_idx)],
+                )
+            except Exception:
+                score = 0  # treat API errors as 0
+            scores_for_query.append(score)
+            calls_done += 1
+            if progress_callback:
+                progress_callback(calls_done, total_calls)
+        relevance_at_k.append(scores_for_query)
+    arr = np.array(relevance_at_k, dtype=float)  # (sample_size, top_k)
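+    # Normalise scores to 0-1 (from 1-5 scale). Failed calls are recorded as 0,
+    # which would map to -0.25, so clamp the result to [0, 1].
+    arr_norm = np.clip((arr - 1.0) / 4.0, 0.0, 1.0)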
+    # nDCG@5
+    def _dcg(scores: np.ndarray) -> np.ndarray:
+        positions = np.arange(1, scores.shape[1] + 1)
+        return np.sum(scores / np.log2(positions + 1), axis=1)
+    dcg = _dcg(arr_norm)
+    ideal = _dcg(np.sort(arr_norm, axis=1)[:, ::-1])
+    ndcg = np.where(ideal > 0, dcg / ideal, 0.0)
+    return {
+        "judge_avg@1": round(float(np.mean(arr_norm[:, 0])), 4),
+        "judge_avg@5": round(float(np.mean(arr_norm)), 4),
+        "judge_ndcg@5": round(float(np.mean(ndcg)), 4),
+    }
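
The scoring step at the end of `evaluate_llm_judge` can be followed on a single query with plain Python. The sketch below is an illustration under assumptions, not the module's code: `normalise`, `dcg`, and `ndcg` are hypothetical names, and `math` replaces NumPy so the example has no dependencies. It maps 1-5 ratings onto 0-1 and divides the DCG of the retrieved order by the DCG of the best possible order of the same ratings.

```python
import math


def normalise(rating: int) -> float:
    """Map a 1-5 judge rating onto the 0-1 range."""
    return (rating - 1.0) / 4.0


def dcg(gains: list) -> float:
    """Discounted cumulative gain; positions are 1-indexed."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg(ratings: list) -> float:
    """nDCG of one query's ratings, listed in retrieval order."""
    gains = [normalise(r) for r in ratings]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


best_first = ndcg([5, 4, 3, 2, 1])   # already in ideal order
worst_first = ndcg([1, 2, 3, 4, 5])  # same ratings, reversed
print(best_first, worst_first)
```

Because the ideal ordering is built from the same five ratings, nDCG here measures only how well the embedding model ordered its own top-5, not whether better passages exist elsewhere in the dataset. A rating outside 1-5, such as `0`, normalises to a value outside 0-1 (`normalise(0)` is `-0.25`).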

evals/quality.py CHANGED

@@ -13,8 +13,26 @@ def _normalize(emb: np.ndarray) -> np.ndarray:
     return emb / norms
-def _retrieval_metrics(emb_q: np.ndarray, emb_p: np.ndarray) -> dict[str, float]:
-    """Compute MRR and Recall@k assuming query i matches passage i."""
     emb_q = _normalize(emb_q)
     emb_p = _normalize(emb_p)
@@ -22,40 +40,61 @@ def _retrieval_metrics(emb_q: np.ndarray, emb_p: np.ndarray) -> dict[str, float]
     sims = emb_q @ emb_p.T
     n = sims.shape[0]
-    # For each query, rank passages by descending similarity
-    # ranks[i] = rank of the correct passage (0-indexed)
     sorted_indices = np.argsort(-sims, axis=1)
     ranks = np.array([int(np.where(sorted_indices[i] == i)[0][0]) for i in range(n)])
-    mrr = float(np.mean(1.0 / (ranks + 1)))
-    recall_1 = float(np.mean(ranks < 1))
-    recall_5 = float(np.mean(ranks < 5))
-    recall_10 = float(np.mean(ranks < 10))
-    return {
-        "mrr": round(mrr, 4),
-        "recall@1": round(recall_1, 4),
-        "recall@5": round(recall_5, 4),
-        "recall@10": round(recall_10, 4),
-    }
 def evaluate_quality(
     model,
     ds_cfg: DatasetConfig | None = None,
     max_pairs: int | None = None,
 ) -> dict[str, float]:
     """Evaluate embedding quality on a dataset.
     Returns a dict with either {"spearman": float} for scored datasets
-    or {"mrr", "recall@1", "recall@5", "recall@10"} for pair datasets.
     """
     if ds_cfg is None:
         ds_cfg = DatasetConfig()
-    dataset = load_dataset(ds_cfg.name, ds_cfg.config, split=ds_cfg.split)
-    queries = list(dataset[ds_cfg.query_col])
-    passages = list(dataset[ds_cfg.passage_col])
     if max_pairs is not None and len(queries) > max_pairs:
         queries = queries[:max_pairs]
@@ -66,7 +105,7 @@ def evaluate_quality(
     if ds_cfg.score_col is not None:
         # Scored mode: Spearman correlation
-        scores = list(dataset[ds_cfg.score_col])
         if max_pairs is not None and len(scores) > max_pairs:
             scores = scores[:max_pairs]
         gold_scores = [s / ds_cfg.score_scale for s in scores]
@@ -79,4 +118,4 @@ def evaluate_quality(
         return {"spearman": round(float(correlation), 4)}
     # Pair mode: retrieval metrics
-    return _retrieval_metrics(emb_q, emb_p)

     return emb / norms
+ALL_RETRIEVAL_METRICS = [
+    "mrr",
+    "map@5", "map@10",
+    "ndcg@5", "ndcg@10",
+    "precision@1", "precision@5", "precision@10",
+    "recall@1", "recall@5", "recall@10",
+]
+DEFAULT_RETRIEVAL_METRICS = ["mrr", "recall@1", "recall@5", "recall@10"]
+def _retrieval_metrics(
+    emb_q: np.ndarray,
+    emb_p: np.ndarray,
+    metrics: list[str] | None = None,
+) -> dict[str, float]:
+    """Compute retrieval metrics assuming query i matches passage i."""
+    if metrics is None:
+        metrics = DEFAULT_RETRIEVAL_METRICS
     emb_q = _normalize(emb_q)
     emb_p = _normalize(emb_p)
     sims = emb_q @ emb_p.T
     n = sims.shape[0]
     sorted_indices = np.argsort(-sims, axis=1)
     ranks = np.array([int(np.where(sorted_indices[i] == i)[0][0]) for i in range(n)])
+    results: dict[str, float] = {}
+    for m in metrics:
+        if m == "mrr":
+            results["mrr"] = round(float(np.mean(1.0 / (ranks + 1))), 4)
+        elif m.startswith("recall@"):
+            k = int(m.split("@")[1])
+            results[m] = round(float(np.mean(ranks < k)), 4)
+        elif m.startswith("precision@"):
+            k = int(m.split("@")[1])
+            # Single relevant doc per query: precision@k = 1/k if hit, else 0
+            results[m] = round(float(np.mean((ranks < k) / k)), 4)
+        elif m.startswith("map@"):
+            k = int(m.split("@")[1])
+            # Single relevant doc: AP = 1/(rank+1) if rank < k, else 0
+            ap = np.where(ranks < k, 1.0 / (ranks + 1), 0.0)
+            results[m] = round(float(np.mean(ap)), 4)
+        elif m.startswith("ndcg@"):
+            k = int(m.split("@")[1])
+            # Single relevant doc: DCG = 1/log2(rank+2) if rank < k, else 0
+            # ideal DCG = 1/log2(2) = 1.0
+            dcg = np.where(ranks < k, 1.0 / np.log2(ranks + 2), 0.0)
+            results[m] = round(float(np.mean(dcg)), 4)
+    return results
 def evaluate_quality(
     model,
     ds_cfg: DatasetConfig | None = None,
     max_pairs: int | None = None,
+    metrics: list[str] | None = None,
 ) -> dict[str, float]:
     """Evaluate embedding quality on a dataset.
     Returns a dict with either {"spearman": float} for scored datasets
+    or selected retrieval metrics for pair datasets.
     """
     if ds_cfg is None:
         ds_cfg = DatasetConfig()
+    if ds_cfg.data is not None:
+        data = ds_cfg.data
+    else:
+        dataset = load_dataset(ds_cfg.name, ds_cfg.config, split=ds_cfg.split)
+        data = {col: list(dataset[col]) for col in dataset.column_names}
+    queries = list(data[ds_cfg.query_col])
+    passages = list(data[ds_cfg.passage_col])
     if max_pairs is not None and len(queries) > max_pairs:
         queries = queries[:max_pairs]
     if ds_cfg.score_col is not None:
         # Scored mode: Spearman correlation
+        scores = list(data[ds_cfg.score_col])
         if max_pairs is not None and len(scores) > max_pairs:
             scores = scores[:max_pairs]
         gold_scores = [s / ds_cfg.score_scale for s in scores]
         return {"spearman": round(float(correlation), 4)}
     # Pair mode: retrieval metrics
+    return _retrieval_metrics(emb_q, emb_p, metrics=metrics)
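
With exactly one relevant passage per query, every metric in `_retrieval_metrics` reduces to a function of the 0-indexed rank of that passage. The sketch below works through the formulas from the comments above on three example ranks. `metrics_from_ranks` is a hypothetical helper written for illustration, uses plain Python instead of NumPy, and skips the rounding that the real function applies.

```python
import math


def metrics_from_ranks(ranks: list, k: int) -> dict:
    """Single-relevant-document metrics from 0-indexed ranks."""
    n = len(ranks)
    hits = [r < k for r in ranks]  # True when the match is in the top k
    return {
        "mrr": sum(1.0 / (r + 1) for r in ranks) / n,
        f"recall@{k}": sum(hits) / n,
        # One relevant doc: precision@k is 1/k on a hit, else 0.
        f"precision@{k}": sum(h / k for h in hits) / n,
        # One relevant doc: AP is the reciprocal rank, cut off at k.
        f"map@{k}": sum(1.0 / (r + 1) if r < k else 0.0 for r in ranks) / n,
        # Ideal DCG is 1/log2(2) = 1, so nDCG equals DCG.
        f"ndcg@{k}": sum(1.0 / math.log2(r + 2) if r < k else 0.0 for r in ranks) / n,
    }


# Three queries whose correct passage was ranked 1st, 2nd, and 8th.
m = metrics_from_ranks([0, 1, 7], k=5)
print(m)
```

For these ranks, `map@5` is `(1 + 0.5 + 0) / 3 = 0.5`, while `mrr` still credits the 8th-place hit with `1/8`. Under this setup `map@k` is MRR truncated at `k`, and `precision@k` is `recall@k / k`, so these metrics move together.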

requirements.txt CHANGED Viewed

@@ -7,5 +7,5 @@ fastembed
 libembedding
 numpy
 scipy
-matplotlib
 streamlit

 libembedding
 numpy
 scipy
+plotly
 streamlit