CUGA Evaluation

An evaluation framework for CUGA, enabling you to test your APIs against structured test cases with detailed scoring and reporting.


Features

  • ✅ Validate API responses against expected outputs
  • ✅ Score keywords, tool calls, and response similarity (see the scoring sketch below)
  • ✅ Generate JSON and CSV reports for easy analysis
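
For intuition, keyword scoring can be read as the fraction of expected keywords that appear in the actual response. The sketch below is illustrative only, not CUGA's actual implementation; the function name and the case-insensitive substring matching are assumptions:

def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the response (hypothetical scoring)."""
    if not expected_keywords:
        return 1.0  # nothing to check, treat as a full score
    hits = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
    return hits / len(expected_keywords)

# One of the two expected keywords is present, so the score is 0.5.
print(keyword_score("Top account: Andromeda Inc.", ["Andromeda Inc.", "9,700,000"]))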

Test File Schema

Your test file must be a JSON file that follows this structure:

{
  "name": "name for the test suite",
  "title": "TestCases",
  "type": "object",
  "properties": {
    "test_cases": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/TestCase"
      }
    }
  },
  "required": ["test_cases"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "args": { "type": "object" }
      },
      "required": ["name", "arguments"]
    },
    "ExpectedOutput": {
      "type": "object",
      "properties": {
        "response": { "type": "string" },
        "keywords": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_calls": {
          "type": "array",
          "items": { "$ref": "#/definitions/ToolCall" }
        }
      },
      "required": ["response", "keywords", "tool_calls"]
    },
    "TestCase": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "intent": { "type": "string" },
        "expected_output": { "$ref": "#/definitions/ExpectedOutput" }
      },
      "required": ["name", "description", "intent", "expected_output"]
    }
  }
}

Schema Overview

Entity           Description
ToolCall         Represents a tool invocation with name and args.
ExpectedOutput   Expected response, keywords, and tool calls.
TestCase         Defines a single test case with intent and expected output.
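
To catch malformed test files before running an evaluation, you can validate them against this schema yourself. A minimal sketch using the third-party jsonschema package; the file paths are placeholders:

import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Load the schema shown above and the test file to check (paths are placeholders).
with open("test_schema.json") as f:
    schema = json.load(f)
with open("my_tests.json") as f:
    test_file = json.load(f)

try:
    validate(instance=test_file, schema=schema)
    print("Test file matches the schema.")
except ValidationError as err:
    print(f"Schema violation at {list(err.absolute_path)}: {err.message}")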

Output Format

The evaluation generates two files:

  • results.json
  • results.csv

JSON Structure

{
  "summary": {
    "total_tests": "...",
    "avg_keyword_score": "...",
    "avg_tool_call_score": "...",
    "avg_response_score": "..."
  },
  "results": [
    {
      "index": "...",
      "test_name": "...",
      "score": {
        "keyword_score": "...",
        "tool_call_score": "...",
        "response_score": "...",
        "response_scoring_type": "..."
      },
      "details": {
        "missing_keywords": "...",
        "expected_keywords": "...",
        "expected_tool_calls": "...",
        "tool_call_mismatches": "...",
        "response_expected": "...",
        "response_actual": "...",
        "response_scoring_type": "..."
      }
    }
  ]
}
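
Because results.json has a fixed layout, post-processing is straightforward. A minimal sketch that prints the summary and flags weak tests, assuming the score fields are numeric between 0 and 1 (the 0.8 threshold is arbitrary):

import json

with open("results.json") as f:
    report = json.load(f)

summary = report["summary"]
print(f"Ran {summary['total_tests']} tests; avg keyword score {summary['avg_keyword_score']}")

# Flag tests whose keyword score falls below an arbitrary threshold.
for result in report["results"]:
    if result["score"]["keyword_score"] < 0.8:
        print(f"Low keyword score for {result['test_name']}: "
              f"missing {result['details']['missing_keywords']}")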

Langfuse Tracing (Optional)

Set up Langfuse

In a separate folder (not under CUGA), run:

# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse

# Run the langfuse docker compose
docker compose up

Get API Keys

  1. Access the Langfuse UI: Open a web browser and navigate to the URL where your self-hosted Langfuse instance is running (e.g., http://localhost:3000 if running locally with default ports).
  2. Log in: Sign in with the user account you created during the initial setup or create a new account.
  3. Navigate to Project Settings: Click on the "Project" menu (usually in the sidebar or top navigation). Select "Settings".
  4. View API Keys: In the settings area, you will find a section for API keys. You can view or regenerate your LANGFUSE_PUBLIC_KEY (username) and LANGFUSE_SECRET_KEY (password) there. The secret key is hidden by default; you may need to click an eye icon or a specific button to reveal and copy it.
  5. Add the API keys and host to your .env file:
LANGFUSE_SECRET_KEY="your-secret-key"
LANGFUSE_PUBLIC_KEY="your-public-key"
LANGFUSE_HOST="http://localhost:3000"
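
Before enabling tracing, you can sanity-check the keys with the Langfuse Python SDK, which reads these variables from the environment (assumes the langfuse and python-dotenv packages are installed):

from dotenv import load_dotenv  # pip install python-dotenv
from langfuse import Langfuse   # pip install langfuse

load_dotenv()  # pull the LANGFUSE_* variables from the .env file into the environment

client = Langfuse()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
print("Langfuse credentials valid:", client.auth_check())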

Update settings

Then, in vendor/cuga-agent/src/cuga/settings.toml, set:

langfuse_tracing = true

Quick Start Example

Run the evaluation on our default digital_sales API using our example test case.

This is the example input JSON:

{
  "name": "digital-sales",
  "test_cases": [
    {
      "name": "test_get_top_account",
      "description": "gets the top account by revenue",
      "intent": "get my top account by revenue",
      "expected_output": {
        "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
        "keywords": ["Andromeda Inc.", "9,700,000"],
        "tool_calls": [
                  {
          "name": "digital_sales_get_my_accounts_my_accounts_get",
          "args": {
          }
        }
        ]
      }
    }
  ]
}

First, set tracker_enabled = true in settings.toml.

Now you can start running the example.

  1. Update API URL in mcp_servers.yaml:
    url: http://localhost:8000/openapi.json
    
  2. Start the API server:
    uv run digital_sales_openapi
    
  3. Run evaluation:
    cuga evaluate docs/examples/evaluation/input_example.json
    

You’ll get results.json and results.csv in the project root.


Usage

cuga evaluate -t <test file path> -r <results file path>

Steps:

  1. Update mcp_servers.yaml with your APIs, or create a new YAML file and run
     export MCP_SERVERS_FILE=<location>
  2. Create a test file following the schema.
  3. Run the evaluation command above.