---
title: SQL Debug RL Environment
emoji: 🗄️
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
short_description: SQL RLVE — 7 tasks with live DuckDB verification
tags:
  - reinforcement-learning
  - sql
  - duckdb
  - data-engineering
  - openenv
  - rlve
  - agent
license: apache-2.0
---

<div align="center">
  
# 🗄️ SQL Debug Environment (OpenEnv)
**An execution-based Reinforcement Learning Sandbox for Data Engineering AI Models**

[![OpenEnv Standard](https://img.shields.io/badge/OpenEnv-Compatible-blue.svg)](https://openenv.ai)
[![DuckDB Built](https://img.shields.io/badge/DuckDB-In--Memory-yellow.svg)](https://duckdb.org/)
[![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-green.svg)](https://www.python.org/)

</div>

---

## 📌 The Problem
Traditional Large Language Models (LLMs) are primarily trained on static datasets to imitate code syntax. While they can often produce code that *looks* right, they frequently hallucinate logic or fail on semantic edge cases in rigorous data tasks like SQL generation and ETL pipelines. 

When a model generates a bad SQL query during standard training, the pipeline only knows if it's an exact string match to an answer key. This is a fundamentally flawed signal: many different SQL queries can yield the exact same correct data, and conversely, a completely wrong string could be functionally correct. **AI models need verifiable, execution-based feedback loops to improve their logic.**

## 💡 The Solution
This project provides a state-of-the-art **execution-based Reinforcement Learning (RL) environment** built specifically for training AI agents on database operations and SQL debugging. 

Instead of relying on static string matching, this environment wraps an ephemeral, in-memory **DuckDB** instance. When an AI agent submits a SQL script, the system:
1. Dynamically generates mock tables, schemas, and live data in DuckDB.
2. Sandboxes and executes the AI's generated SQL query natively.
3. Performs structural AST validation and execution validation.
4. Computes a **continuous, dense fractional reward** comparing the AI's output dataframe against the ground-truth dataframe down to the cell level.

This project strictly adheres to the [OpenEnv Specifications](https://openenv.ai), making it instantly compatible with agentic frameworks and standard RL algorithms (e.g., PPO or GRPO via HuggingFace's TRL).

---

## 🚀 QuickStart & Installation

### 1. Requirements 
You will need Python 3.10+ installed on your system. It's recommended to use a virtual environment.

### 2. Setup the Environment
You can install dependencies using either `pip` or modern tools like `uv`:

```bash
# Clone the repository
git clone https://github.com/Sairishwanth89/sql-debug-env.git
cd sql-debug-env

# Install dependencies (DuckDB, FastAPI, Pandas, etc.)
pip install -e .
```

### 3. Initialize the Server
Since this is an OpenEnv server, you simply run it using `uvicorn`. This boots up the DuckDB evaluation engine and opens the REST endpoints.

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
*The server will be live at `http://localhost:7860`. You can test it by visiting the Swagger UI documentation at `http://localhost:7860/docs`.*

---

## 🏗️ Project Architecture

```text
sql_env/
├── openenv.yaml               # 🔧 Manifest: Defines environment capabilities, tasks, and reward structure
├── app.py                     # 🧠 Server: Core OpenEnv FastAPI application & DuckDB execution logic
├── models.py                  # 📦 Schemas: Pydantic models for API interfaces (State, Reset, Step)
├── client.py                  # 🤝 Client: Python wrapper to cleanly interact with the local environment
├── inference.py               # 🤖 Agent Loop: Example of an AI agent "playing" the environment
├── train_grpo.py              # 📈 Training: Example of hooking the env into RL algorithms (TRL/GRPO)
├── pyproject.toml / uv.lock   # ⚙️ Config: Modern Python packaging and strict dependency locking
├── Dockerfile                 # 🐳 Deployment: Container configuration for production
├── deploy_hf_space.md         # ☁️ Hugging Face Spaces deployment instructions
└── README.md                  # 📖 Documentation
```

---

## 🎯 Supported Tasks

The environment supports **7 tasks** — 4 foundational and 3 advanced RLVE challenges. Initialize any task via `POST /reset` with the `task_id`.

### Foundational Tasks

| Task ID | Difficulty | Objective |
|---|---|---|
| `task_1_easy` | **Easy** | Fix a SQL query with a missing comma between column names. |
| `task_2_medium` | **Medium** | Add a missing `GROUP BY` clause to an aggregation query. |
| `task_3_hard` | **Hard** | Add `PARTITION BY` to a window function that ranks globally instead of per-department. |
| `task_4_expert` | **Expert** | Fix an invalid date literal (month 13) inside a CTE. |

### Advanced RLVE Tasks (Live DuckDB Verifier)

| Task ID | Difficulty | Verifier Logic |
|---|---|---|
| `task_5_optimization` | **Advanced** | Rewrite a CROSS JOIN query to use `INNER JOIN`. Reward only if output matches baseline **and** `EXPLAIN` shows no `CROSS_PRODUCT`. |
| `task_6_migration` | **Advanced** | Normalize a denormalized `messy_dump` table into 3NF (`users` + `orders`). Destructive early `DROP` triggers -0.3 penalty and ends episode. |
| `task_7_chaos` | **Advanced** | Live ETL corrupts data every step (duplicate IDs + NULL emails). Apply a patch and `UNIQUE` index before the pipeline contaminates the DB again. |

---

## 🏆 Dense Reward System and Anti-Cheating

To prevent the "sparse gradient" problem where RL agents receive flat zero-rewards until they randomly achieve perfection, we implement a **dense multi-stepped reward function**. 

A maximum score is `1.0`. Here is how an agent is graded (Tasks 1, 2, 4):
* `+0.10`: **Parser Validation** - Did the SQL successfully parse via AST (no syntax errors)?
* `+0.20`: **Execution Validation** - Did DuckDB successfully run the query against the schema?
* `+0.10`: **Column Accuracy** - Do the returned columns match the expected datatypes and shape?
* `+0.30`: **Data Similarity (Jaccard)** - Fractional reward given based on how closely the dataframe matches the ground-truth data.
* `+0.30`: **Exact Match Bonus** - Strict cell-for-cell match.

### 🛡️ Penalties
The environment also automatically deducts points via server-side execution analysis to enforce best practices:
* `-0.10`: Submitting a duplicate query already attempted in the episode.
* `-0.20`: Efficiency penalties (excessive joins or full table scans).
* `-0.30`: Destructive actions (`DROP`, `DELETE` clauses).
* `-0.50`: Hardcoding values to bypass logic.