CUGA Evaluation

An evaluation framework for CUGA, enabling you to test your APIs against structured test cases with detailed scoring and reporting.


Features

  • ✅ Validate API responses against expected outputs
  • ✅ Score keywords, tool calls, and response similarity (see the scoring sketch below)
  • ✅ Generate JSON and CSV reports for easy analysis
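
For intuition, keyword scoring can be read as the fraction of expected keywords that appear in the actual response. The sketch below is illustrative only, not CUGA's actual implementation; the function name and the case-insensitive substring matching are assumptions:

def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the response (hypothetical scoring)."""
    if not expected_keywords:
        return 1.0  # nothing to check, treat as a full score
    hits = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
    return hits / len(expected_keywords)

# One of the two expected keywords is present, so the score is 0.5.
print(keyword_score("Top account: Andromeda Inc.", ["Andromeda Inc.", "9,700,000"]))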

Test File Schema

Your test file must be a JSON file that follows this structure:

{
  "name": "name for the test suite",
  "title": "TestCases",
  "type": "object",
  "properties": {
    "test_cases": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/TestCase"
      }
    }
  },
  "required": ["test_cases"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "args": { "type": "object" }
      },
      "required": ["name", "arguments"]
    },
    "ExpectedOutput": {
      "type": "object",
      "properties": {
        "response": { "type": "string" },
        "keywords": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_calls": {
          "type": "array",
          "items": { "$ref": "#/definitions/ToolCall" }
        }
      },
      "required": ["response", "keywords", "tool_calls"]
    },
    "TestCase": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "intent": { "type": "string" },
        "expected_output": { "$ref": "#/definitions/ExpectedOutput" }
      },
      "required": ["name", "description", "intent", "expected_output"]
    }
  }
}

Schema Overview

Entity           Description
ToolCall         Represents a tool invocation with name and args.
ExpectedOutput   Expected response, keywords, and tool calls.
TestCase         Defines a single test case with intent and expected output.
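
To catch malformed test files before running an evaluation, you can validate them against this schema yourself. A minimal sketch using the third-party jsonschema package; the file paths are placeholders:

import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Load the schema shown above and the test file to check (paths are placeholders).
with open("test_schema.json") as f:
    schema = json.load(f)
with open("my_tests.json") as f:
    test_file = json.load(f)

try:
    validate(instance=test_file, schema=schema)
    print("Test file matches the schema.")
except ValidationError as err:
    print(f"Schema violation at {list(err.absolute_path)}: {err.message}")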

Output Format

The evaluation generates two files:

  • results.json
  • results.csv

JSON Structure

{
  "summary": {
    "total_tests": "...",
    "avg_keyword_score": "...",
    "avg_tool_call_score": "...",
    "avg_response_score": "..."
  },
  "results": [
    {
      "index": "...",
      "test_name": "...",
      "score": {
        "keyword_score": "...",
        "tool_call_score": "...",
        "response_score": "...",
        "response_scoring_type": "..."
      },
      "details": {
        "missing_keywords": "...",
        "expected_keywords": "...",
        "expected_tool_calls": "...",
        "tool_call_mismatches": "...",
        "response_expected": "...",
        "response_actual": "...",
        "response_scoring_type": "..."
      }
    }
  ]
}
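
Because results.json has a fixed layout, post-processing is straightforward. A minimal sketch that prints the summary and flags weak tests, assuming the score fields are numeric between 0 and 1 (the 0.8 threshold is arbitrary):

import json

with open("results.json") as f:
    report = json.load(f)

summary = report["summary"]
print(f"Ran {summary['total_tests']} tests; avg keyword score {summary['avg_keyword_score']}")

# Flag tests whose keyword score falls below an arbitrary threshold.
for result in report["results"]:
    if result["score"]["keyword_score"] < 0.8:
        print(f"Low keyword score for {result['test_name']}: "
              f"missing {result['details']['missing_keywords']}")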

Langfuse Tracing (Optional)

Set up Langfuse

In a separate folder (not under CUGA), run:

# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse

# Run the langfuse docker compose
docker compose up

Get API Keys

  1. Access the Langfuse UI: Open a web browser and navigate to the URL where your self-hosted Langfuse instance is running (e.g., http://localhost:3000 if running locally with default ports).
  2. Log in: Sign in with the user account you created during the initial setup or create a new account.
  3. Navigate to Project Settings: Click on the "Project" menu (usually in the sidebar or top navigation). Select "Settings".
  4. View API Keys: In the settings area, you will find a section for API keys. You can view or regenerate your LANGFUSE_PUBLIC_KEY (username) and LANGFUSE_SECRET_KEY (password) there. The secret key is hidden by default; you may need to click an eye icon or a specific button to reveal and copy it.
  5. Add the API keys and host to your .env file:
LANGFUSE_SECRET_KEY="your-secret-key"
LANGFUSE_PUBLIC_KEY="your-public-key"
LANGFUSE_HOST="http://localhost:3000"
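
Before enabling tracing, you can sanity-check the keys with the Langfuse Python SDK, which reads these variables from the environment (assumes the langfuse and python-dotenv packages are installed):

from dotenv import load_dotenv  # pip install python-dotenv
from langfuse import Langfuse   # pip install langfuse

load_dotenv()  # pull the LANGFUSE_* variables from the .env file into the environment

client = Langfuse()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
print("Langfuse credentials valid:", client.auth_check())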

Update settings

Then, in vendor/cuga-agent/src/cuga/settings.toml, set:

langfuse_tracing = true

Quick Start Example

Run the evaluation on our default digital_sales API using our example test case.

This is the example input JSON:

{
  "name": "digital-sales",
  "test_cases": [
    {
      "name": "test_get_top_account",
      "description": "gets the top account by revenue",
      "intent": "get my top account by revenue",
      "expected_output": {
        "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
        "keywords": ["Andromeda Inc.", "9,700,000"],
        "tool_calls": [
                  {
          "name": "digital_sales_get_my_accounts_my_accounts_get",
          "args": {
          }
        }
        ]
      }
    }
  ]
}

First, set tracker_enabled = true in settings.toml.

Now you can start running the example.

  1. Update API URL in mcp_servers.yaml:
    url: http://localhost:8000/openapi.json
    
  2. Start the API server:
    uv run digital_sales_openapi
    
  3. Run evaluation:
    cuga evaluate docs/examples/evaluation/input_example.json
    

You’ll get results.json and results.csv in the project root.


Usage

cuga evaluate -t <test file path> -r <results file path>

Steps:

  1. Update mcp_servers.yaml with your APIs, or create a new YAML file and run
     export MCP_SERVERS_FILE=<location>
  2. Create a test file following the schema.
  3. Run the evaluation command above.