# **CUGA Evaluation**

An evaluation framework for **CUGA** that lets you **test your APIs** against structured test cases with detailed scoring and reporting.

---

## **Features**

- ✅ Validate **API responses** against expected outputs
- ✅ Score **keywords**, **tool calls**, and **response similarity**
- ✅ Generate **JSON** and **CSV** reports for easy analysis

---

## **Test File Schema**

Your test file must be a **JSON** file following this structure:

```json
{
  "name": "name for the test suite",
  "title": "TestCases",
  "type": "object",
  "properties": {
    "test_cases": {
      "type": "array",
      "items": { "$ref": "#/definitions/TestCase" }
    }
  },
  "required": ["test_cases"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "args": { "type": "object" }
      },
      "required": ["name", "args"]
    },
    "ExpectedOutput": {
      "type": "object",
      "properties": {
        "response": { "type": "string" },
        "keywords": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_calls": {
          "type": "array",
          "items": { "$ref": "#/definitions/ToolCall" }
        }
      },
      "required": ["response", "keywords", "tool_calls"]
    },
    "TestCase": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "intent": { "type": "string" },
        "expected_output": { "$ref": "#/definitions/ExpectedOutput" }
      },
      "required": ["name", "description", "intent", "expected_output"]
    }
  }
}
```

---

### **Schema Overview**

| Entity             | Description                                                  |
|--------------------|--------------------------------------------------------------|
| **ToolCall**       | Represents a tool invocation with `name` and `args`.         |
| **ExpectedOutput** | Expected response, keywords, and tool calls.                 |
| **TestCase**       | Defines a single test case with intent and expected output.  |

---

## **Output Format**

The evaluation generates **two files**:

- `results.json`
- `results.csv`

### **JSON Structure**

```json
{
  "summary": {
    "total_tests": "...",
    "avg_keyword_score": "...",
    "avg_tool_call_score": "...",
    "avg_response_score": "..."
  },
  "results": [
    {
      "index": "...",
      "test_name": "...",
      "score": {
        "keyword_score": "...",
        "tool_call_score": "...",
        "response_score": "...",
        "response_scoring_type": "..."
      },
      "details": {
        "missing_keywords": "...",
        "expected_keywords": "...",
        "expected_tool_calls": "...",
        "tool_call_mismatches": "...",
        "response_expected": "...",
        "response_actual": "...",
        "response_scoring_type": "..."
      }
    }
  ]
}
```

## **Langfuse Tracing (Optional)**

### Setup Langfuse

In a separate folder (not under CUGA), run:

```bash
# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse

# Start Langfuse with Docker Compose
docker compose up
```

### Get API Keys

1. **Access the Langfuse UI**: Open a web browser and navigate to the URL where your self-hosted Langfuse instance is running (e.g., http://localhost:3000 if running locally with default ports).
2. **Log in**: Sign in with the user account you created during the initial setup, or create a new account.
3. **Navigate to project settings**: Click the "Project" menu (usually in the sidebar or top navigation) and select "Settings".
4. **View API keys**: In the settings area you will find a section for API keys, where you can view or regenerate your `LANGFUSE_PUBLIC_KEY` (username) and `LANGFUSE_SECRET_KEY` (password). The secret key is hidden by default; you may need to click an eye icon or a dedicated button to reveal and copy it.
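Before wiring the keys into CUGA, you can optionally sanity-check them with the Langfuse Python SDK. This is a minimal sketch, assuming the `langfuse` package is installed (`pip install langfuse`) and your instance runs on the default port; it is not part of the CUGA workflow itself:

```python
# Optional sanity check for Langfuse credentials (assumed setup:
# `pip install langfuse` and a local instance on port 3000).
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="your-public-key",  # LANGFUSE_PUBLIC_KEY
    secret_key="your-secret-key",  # LANGFUSE_SECRET_KEY
    host="http://localhost:3000",  # LANGFUSE_HOST
)

# auth_check() returns True when the keys and host are valid.
print(langfuse.auth_check())
```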
### Configure `.env`

Add the API keys and host to your `.env` file:

```dotenv
LANGFUSE_SECRET_KEY="your-secret-key"
LANGFUSE_PUBLIC_KEY="your-public-key"
LANGFUSE_HOST="http://localhost:3000"
```

### Update settings

Then, in `vendor/cuga-agent/src/cuga/settings.toml`, set:

```toml
langfuse_tracing = true
```

---

## **Quick Start Example**

Run the evaluation on our default `digital_sales` API using our example test case. This is the example input JSON:

```json
{
  "name": "digital-sales",
  "test_cases": [
    {
      "name": "test_get_top_account",
      "description": "gets the top account by revenue",
      "intent": "get my top account by revenue",
      "expected_output": {
        "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
        "keywords": ["Andromeda Inc.", "9,700,000"],
        "tool_calls": [
          {
            "name": "digital_sales_get_my_accounts_my_accounts_get",
            "args": {}
          }
        ]
      }
    }
  ]
}
```

First, set `tracker_enabled = true` in `settings.toml`. Now you can run the example:

1. **Update the API URL** in [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml):

   ```yaml
   url: http://localhost:8000/openapi.json
   ```

2. **Start the API server**:

   ```bash
   uv run digital_sales_openapi
   ```

3. **Run the evaluation**:

   ```bash
   cuga evaluate docs/examples/evaluation/input_example.json
   ```

You'll get `results.json` and `results.csv` in the project root.

---

## **Usage**

```bash
cuga evaluate -t <test_file> -r <report_path>
```

Steps:

1. Update [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml) with your APIs, or create a new YAML file and run:

   ```shell
   export MCP_SERVERS_FILE=<path_to_your_yaml>
   ```

2. Create a test file following the schema.
3. Run the evaluation command above.

---
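Once a run completes, the `results.json` report can be post-processed like any other JSON file. Below is a minimal sketch that prints the summary block and flags low-scoring tests; the field names come from the JSON structure documented above, while the numeric score values and the 0.8 threshold are assumptions for illustration:

```python
# Minimal sketch: summarize an evaluation run from results.json.
# Field names follow the JSON structure documented above; numeric
# scores and the 0.8 threshold are assumptions for illustration.
import json

with open("results.json") as f:
    report = json.load(f)

summary = report["summary"]
print(f"Tests run:           {summary['total_tests']}")
print(f"Avg keyword score:   {summary['avg_keyword_score']}")
print(f"Avg tool-call score: {summary['avg_tool_call_score']}")
print(f"Avg response score:  {summary['avg_response_score']}")

# Flag tests whose keyword score falls below the example threshold.
for result in report["results"]:
    if result["score"]["keyword_score"] < 0.8:
        missing = result["details"]["missing_keywords"]
        print(f"- {result['test_name']}: missing keywords {missing}")
```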