CUGA Evaluation
An evaluation framework for CUGA, enabling you to test your APIs against structured test cases with detailed scoring and reporting.
Features
- ✅ Validate API responses against expected outputs
- ✅ Score keywords, tool calls, and response similarity
- ✅ Generate JSON and CSV reports for easy analysis
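For intuition, here is one plausible way the three scores could be computed. This is an illustrative sketch only, using standard-library string matching; CUGA's actual scoring logic may differ.
import json
from difflib import SequenceMatcher

def keyword_score(expected_keywords: list[str], actual_response: str) -> float:
    """Fraction of expected keywords that appear in the actual response."""
    if not expected_keywords:
        return 1.0
    hits = sum(kw in actual_response for kw in expected_keywords)
    return hits / len(expected_keywords)

def tool_call_score(expected_calls: list[dict], actual_calls: list[dict]) -> float:
    """Fraction of expected tool calls matched by name and args."""
    def canon(call: dict) -> str:
        # Canonical JSON form so dict key order does not affect comparison
        return json.dumps({"name": call["name"], "args": call.get("args", {})}, sort_keys=True)
    if not expected_calls:
        return 1.0
    actual = {canon(c) for c in actual_calls}
    return sum(canon(c) in actual for c in expected_calls) / len(expected_calls)

def response_score(expected: str, actual: str) -> float:
    """Character-level similarity between expected and actual responses."""
    return SequenceMatcher(None, expected, actual).ratio()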
Test File Schema
Your test file must be a JSON document following this schema:
{
"name": "name for the test suite",
"title": "TestCases",
"type": "object",
"properties": {
"test_cases": {
"type": "array",
"items": {
"$ref": "#/definitions/TestCase"
}
}
},
"required": ["test_cases"],
"definitions": {
"ToolCall": {
"type": "object",
"properties": {
"name": { "type": "string" },
"args": { "type": "object" }
},
"required": ["name", "arguments"]
},
"ExpectedOutput": {
"type": "object",
"properties": {
"response": { "type": "string" },
"keywords": {
"type": "array",
"items": { "type": "string" }
},
"tool_calls": {
"type": "array",
"items": { "$ref": "#/definitions/ToolCall" }
}
},
"required": ["response", "keywords", "tool_calls"]
},
"TestCase": {
"type": "object",
"properties": {
"name": { "type": "string" },
"description": { "type": "string" },
"intent": { "type": "string" },
"expected_output": { "$ref": "#/definitions/ExpectedOutput" }
},
"required": ["name", "description", "intent", "expected_output"]
}
}
}
Schema Overview
| Entity | Description |
|---|---|
| ToolCall | Represents a tool invocation with name and args. |
| ExpectedOutput | Expected response, keywords, and tool calls. |
| TestCase | Defines a single test case with intent and expected output. |
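Before running an evaluation, you can sanity-check a test file against this schema. A minimal sketch, assuming the schema above is saved as schema.json, your tests as tests.json, and the third-party jsonschema package is installed:
# Validate a test file against the schema before running an evaluation.
import json
from jsonschema import validate, ValidationError

with open("schema.json") as f:
    schema = json.load(f)
with open("tests.json") as f:
    test_file = json.load(f)

try:
    validate(instance=test_file, schema=schema)
    print("Test file is valid.")
except ValidationError as e:
    print(f"Invalid test file: {e.message}")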
Output Format
The evaluation generates two files:
- results.json
- results.csv
JSON Structure
{
"summary": {
"total_tests": "...",
"avg_keyword_score": "...",
"avg_tool_call_score": "...",
"avg_response_score": "..."
},
"results": [
{
"index": "...",
"test_name": "...",
"score": {
"keyword_score": "...",
"tool_call_score": "...",
"response_score": "...",
"response_scoring_type": "..."
},
"details": {
"missing_keywords": "...",
"expected_keywords": "...",
"expected_tool_calls": "...",
"tool_call_mismatches": "...",
"response_expected": "...",
"response_actual": "...",
"response_scoring_type": "..."
}
}
]
}
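A minimal sketch (standard library only) for inspecting results.json, printing the summary and flagging tests that score below an illustrative threshold:
# Inspect results.json and flag low-scoring tests.
import json

THRESHOLD = 0.8  # illustrative cutoff, not part of CUGA

with open("results.json") as f:
    report = json.load(f)

print("Summary:", report["summary"])
for result in report["results"]:
    low = {k: v for k, v in result["score"].items()
           if isinstance(v, (int, float)) and v < THRESHOLD}
    if low:
        print(f"{result['test_name']}: below threshold -> {low}")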
Langfuse Tracing (Optional)
Setup Langfuse
In a different folder (not under CUGA), run:
# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse
# Run the langfuse docker compose
docker compose up
Get API Keys
- Access the Langfuse UI: Open a web browser and navigate to the URL where your self-hosted Langfuse instance is running (e.g., http://localhost:3000 if running locally with default ports).
- Log in: Sign in with the user account you created during the initial setup or create a new account.
- Navigate to Project Settings: Click on the "Project" menu (usually in the sidebar or top navigation). Select "Settings".
- View API Keys: In the settings area, you will find a section for API keys. You can view or regenerate your LANGFUSE_PUBLIC_KEY (username) and LANGFUSE_SECRET_KEY (password) there. The secret key is hidden by default; you may need to click an eye icon or a specific button to reveal and copy it.
- Add the API keys and host to your .env file
LANGFUSE_SECRET_KEY="your-secret-key"
LANGFUSE_PUBLIC_KEY="your-public-key"
LANGFUSE_HOST="http://localhost:3000"
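To confirm the keys work before enabling tracing, a small sketch assuming the python-dotenv and langfuse Python packages are installed:
# Sanity-check Langfuse credentials from .env (assumes python-dotenv
# and the langfuse SDK are installed).
from dotenv import load_dotenv
from langfuse import Langfuse

load_dotenv()  # loads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
client = Langfuse()  # picks the keys and host up from the environment
print("Langfuse credentials valid:", client.auth_check())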
Update settings
Then, in vendor/cuga-agent/src/cuga/settings.toml, set:
langfuse_tracing = true
Quick Start Example
Run the evaluation on our default digital_sales API using our example test case.
This is the example input JSON:
{
"name": "digital-sales",
"test_cases": [
{
"name": "test_get_top_account",
"description": "gets the top account by revenue",
"intent": "get my top account by revenue",
"expected_output": {
"response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
"keywords": ["Andromeda Inc.", "9,700,000"],
"tool_calls": [
{
"name": "digital_sales_get_my_accounts_my_accounts_get",
"args": {
}
}
]
}
}
]
}
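If you generate test cases programmatically, the same file can be built and written from Python; a minimal sketch reproducing the example above:
# Build the example test suite in code and write it to a JSON file.
import json

test_suite = {
    "name": "digital-sales",
    "test_cases": [
        {
            "name": "test_get_top_account",
            "description": "gets the top account by revenue",
            "intent": "get my top account by revenue",
            "expected_output": {
                "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. "
                            "- **Revenue:** $9,700,000 - **Account ID:** acc_49",
                "keywords": ["Andromeda Inc.", "9,700,000"],
                "tool_calls": [
                    {"name": "digital_sales_get_my_accounts_my_accounts_get",
                     "args": {}}
                ],
            },
        }
    ],
}

with open("input_example.json", "w") as f:
    json.dump(test_suite, f, indent=2)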
First, set tracker_enabled = true in settings.toml. Then run the example:
- Update the API URL in mcp_servers.yaml:
url: http://localhost:8000/openapi.json
- Start the API server:
uv run digital_sales_openapi
- Run the evaluation:
cuga evaluate docs/examples/evaluation/input_example.json
You’ll get results.json and results.csv in the project root.
Usage
cuga evaluate -t <test file path> -r <results file path>
Steps:
- Update mcp_servers.yaml with your APIs, or create a new YAML file and point to it:
export MCP_SERVERS_FILE=<location>
- Create a test file following the schema.
- Run the evaluation command above.
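If you prefer driving evaluations from a script (for example, in CI), a minimal sketch that shells out to the CLI; the file paths are illustrative:
# Run the evaluation CLI from Python and fail loudly on errors.
import subprocess

result = subprocess.run(
    ["cuga", "evaluate", "-t", "tests.json", "-r", "results.json"],
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode != 0:
    raise RuntimeError(f"Evaluation failed:\n{result.stderr}")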