--- title: SQL Debug RL Environment emoji: 🗄️ colorFrom: purple colorTo: indigo sdk: docker app_port: 7860 short_description: SQL RLVE — 7 tasks with live DuckDB verification tags: - reinforcement-learning - sql - duckdb - data-engineering - openenv - rlve - agent license: apache-2.0 ---
# 🗄️ SQL Debug Environment (OpenEnv) **An execution-based Reinforcement Learning Sandbox for Data Engineering AI Models** [![OpenEnv Standard](https://img.shields.io/badge/OpenEnv-Compatible-blue.svg)](https://openenv.ai) [![DuckDB Built](https://img.shields.io/badge/DuckDB-In--Memory-yellow.svg)](https://duckdb.org/) [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-green.svg)](https://www.python.org/)
--- ## 📌 The Problem Traditional Large Language Models (LLMs) are primarily trained on static datasets to imitate code syntax. While they can often produce code that *looks* right, they frequently hallucinate logic or fail on semantic edge cases in rigorous data tasks like SQL generation and ETL pipelines. When a model generates a bad SQL query during standard training, the pipeline only knows if it's an exact string match to an answer key. This is a fundamentally flawed signal: many different SQL queries can yield the exact same correct data, and conversely, a completely wrong string could be functionally correct. **AI models need verifiable, execution-based feedback loops to improve their logic.** ## 💡 The Solution This project provides a state-of-the-art **execution-based Reinforcement Learning (RL) environment** built specifically for training AI agents on database operations and SQL debugging. Instead of relying on static string matching, this environment wraps an ephemeral, in-memory **DuckDB** instance. When an AI agent submits a SQL script, the system: 1. Dynamically generates mock tables, schemas, and live data in DuckDB. 2. Sandboxes and executes the AI's generated SQL query natively. 3. Performs structural AST validation and execution validation. 4. Computes a **continuous, dense fractional reward** comparing the AI's output dataframe against the ground-truth dataframe down to the cell level. This project strictly adheres to the [OpenEnv Specifications](https://openenv.ai), making it instantly compatible with agentic frameworks and standard RL algorithms (e.g., PPO or GRPO via HuggingFace's TRL). --- ## 🚀 QuickStart & Installation ### 1. Requirements You will need Python 3.10+ installed on your system. It's recommended to use a virtual environment. ### 2. Setup the Environment You can install dependencies using either `pip` or modern tools like `uv`: ```bash # Clone the repository git clone https://github.com/Sairishwanth89/sql-debug-env.git cd sql-debug-env # Install dependencies (DuckDB, FastAPI, Pandas, etc.) pip install -e . ``` ### 3. Initialize the Server Since this is an OpenEnv server, you simply run it using `uvicorn`. This boots up the DuckDB evaluation engine and opens the REST endpoints. ```bash uvicorn app:app --host 0.0.0.0 --port 7860 ``` *The server will be live at `http://localhost:7860`. You can test it by visiting the Swagger UI documentation at `http://localhost:7860/docs`.* --- ## 🏗️ Project Architecture ```text sql_env/ ├── openenv.yaml # 🔧 Manifest: Defines environment capabilities, tasks, and reward structure ├── app.py # 🧠 Server: Core OpenEnv FastAPI application & DuckDB execution logic ├── models.py # 📦 Schemas: Pydantic models for API interfaces (State, Reset, Step) ├── client.py # 🤝 Client: Python wrapper to cleanly interact with the local environment ├── inference.py # 🤖 Agent Loop: Example of an AI agent "playing" the environment ├── train_grpo.py # 📈 Training: Example of hooking the env into RL algorithms (TRL/GRPO) ├── pyproject.toml / uv.lock # ⚙️ Config: Modern Python packaging and strict dependency locking ├── Dockerfile # 🐳 Deployment: Container configuration for production ├── deploy_hf_space.md # ☁️ Hugging Face Spaces deployment instructions └── README.md # 📖 Documentation ``` --- ## 🎯 Supported Tasks The environment supports **7 tasks** — 4 foundational and 3 advanced RLVE challenges. Initialize any task via `POST /reset` with the `task_id`. ### Foundational Tasks | Task ID | Difficulty | Objective | |---|---|---| | `task_1_easy` | **Easy** | Fix a SQL query with a missing comma between column names. | | `task_2_medium` | **Medium** | Add a missing `GROUP BY` clause to an aggregation query. | | `task_3_hard` | **Hard** | Add `PARTITION BY` to a window function that ranks globally instead of per-department. | | `task_4_expert` | **Expert** | Fix an invalid date literal (month 13) inside a CTE. | ### Advanced RLVE Tasks (Live DuckDB Verifier) | Task ID | Difficulty | Verifier Logic | |---|---|---| | `task_5_optimization` | **Advanced** | Rewrite a CROSS JOIN query to use `INNER JOIN`. Reward only if output matches baseline **and** `EXPLAIN` shows no `CROSS_PRODUCT`. | | `task_6_migration` | **Advanced** | Normalize a denormalized `messy_dump` table into 3NF (`users` + `orders`). Destructive early `DROP` triggers -0.3 penalty and ends episode. | | `task_7_chaos` | **Advanced** | Live ETL corrupts data every step (duplicate IDs + NULL emails). Apply a patch and `UNIQUE` index before the pipeline contaminates the DB again. | --- ## 🏆 Dense Reward System and Anti-Cheating To prevent the "sparse gradient" problem where RL agents receive flat zero-rewards until they randomly achieve perfection, we implement a **dense multi-stepped reward function**. A maximum score is `1.0`. Here is how an agent is graded (Tasks 1, 2, 4): * `+0.10`: **Parser Validation** - Did the SQL successfully parse via AST (no syntax errors)? * `+0.20`: **Execution Validation** - Did DuckDB successfully run the query against the schema? * `+0.10`: **Column Accuracy** - Do the returned columns match the expected datatypes and shape? * `+0.30`: **Data Similarity (Jaccard)** - Fractional reward given based on how closely the dataframe matches the ground-truth data. * `+0.30`: **Exact Match Bonus** - Strict cell-for-cell match. ### 🛡️ Penalties The environment also automatically deducts points via server-side execution analysis to enforce best practices: * `-0.10`: Submitting a duplicate query already attempted in the episode. * `-0.20`: Efficiency penalties (excessive joins or full table scans). * `-0.30`: Destructive actions (`DROP`, `DELETE` clauses). * `-0.50`: Hardcoding values to bypass logic.