Contact Kolena to enable this feature for your Organization.
Agent Validation lets you compare your Agent’s output to known correct answers (called Ground Truths). Kolena will score how well the Agent matches these Ground Truths. This allows you to:
  • Estimate the overall accuracy of the Agent at a given task
  • Perform change management by testing accuracy before and after changes to your Prompts

Overview

The validation workflow has three steps:
  1. Create a Ground Truth for your Agent
  2. Perform Validation
  3. Review results of the Validation

Prerequisites

To perform validation on an Agent, you must ensure:
  • The Agent has Prompts created
  • Runs are uploaded to the Agent using inputs for which you have known correct outputs
For example, if you wish to validate a Lease Abstraction Agent:
  • Upload the lease documents you wish to abstract into Runs
  • Define all Prompts necessary to perform the abstraction
  • Ensure you have known answers to compare against for those input documents

Creating a Ground Truth

Using the UI, navigate to the “Validation” tab for your Agent. Click “Create New Ground Truth”. Provide a Name for your Ground Truth and click “Continue”.

Associate Ground Truth to Runs

In order to compare Agent outputs to Ground Truths, you need a way to associate the two. Kolena uses a label called user_defined_id for this purpose: both your Ground Truth entries and your Agent Runs must carry this label so they can be joined. Click “Assign User Defined IDs” and provide natural language instructions describing how to derive this label for each Run on your Agent. Kolena follows your instructions to assign the label to each Run. You can check progress and continue once labeling is complete.
For example, if you have a Loan Review Agent, use an identifier like the loan ID as the label. Assign User Defined IDs by providing instructions like “Grab the loan ID from the cover sheet”.
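The join described above can be sketched in plain Python. This is illustrative only, not Kolena code; the record shapes and field names besides user_defined_id are assumptions:

```python
# Sketch: how a shared user_defined_id label pairs Agent Runs with
# Ground Truth entries. Only "user_defined_id" comes from the docs;
# the surrounding record shapes are illustrative.

runs = [
    {"user_defined_id": "loan-123", "output": {"borrower": "Jane Doe"}},
    {"user_defined_id": "loan-456", "output": {"borrower": "John Roe"}},
]
ground_truths = [
    {"user_defined_id": "loan-123", "data": {"borrower": "Jane Doe"}},
    {"user_defined_id": "loan-456", "data": {"borrower": "J. Roe"}},
]

# Index ground truths by the shared label, then pair each Run's output
# with its expected values for comparison.
gt_by_id = {gt["user_defined_id"]: gt["data"] for gt in ground_truths}
pairs = [(run["output"], gt_by_id[run["user_defined_id"]]) for run in runs]
```

If a Run's label has no matching Ground Truth entry, it simply cannot be scored, which is why the label must be stable on both sides.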

Upload Ground Truth

There are two ways to upload Ground Truth:
  1. Convert from an existing format like an Excel spreadsheet
  2. Directly upload in the expected schema
If your correct answers are in a spreadsheet (.xlsx or .csv), use option 1. Kolena will map your spreadsheet columns to the correct Prompt names and generate the Ground Truth JSON for you.

Ground Truth Schema

If uploading Ground Truth directly as JSON, the following structure is expected:
{
  "evaluation_instructions": "Values should match exactly.",
  "runs": [
    {
      "user_defined_id": "lease-001",
      "data": {
        "tenant_name": "Acme Corp",
        "lease_start_date": "01/15/2024",
        "monthly_rent": 4500
      }
    },
    {
      "user_defined_id": "lease-002",
      "data": {
        "tenant_name": "Globex LLC",
        "lease_start_date": "03/01/2024",
        "monthly_rent": 6200
      }
    }
  ]
}
Fields:
  • evaluation_instructions — Natural language guidance for how Kolena should compare outputs to ground truths. Useful for specifying acceptable formats, equivalences (e.g. “$4,500” vs 4500), or what counts as a correct partial match.
  • runs — A list of expected results, one per Agent Run.
    • user_defined_id — A stable identifier that matches the ground truth to a Run in the Agent.
    • data — A map from Prompt name to expected value.
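Before uploading, it can be useful to sanity-check a document against this schema. The helper below is an illustrative sketch, not part of Kolena:

```python
# Illustrative pre-upload check (not Kolena code): verify a Ground Truth
# document has the fields described above.

def check_ground_truth(doc: dict) -> list:
    """Return a list of problems found; an empty list means the shape looks valid."""
    problems = []
    if not isinstance(doc.get("evaluation_instructions"), str):
        problems.append("missing evaluation_instructions string")
    runs = doc.get("runs")
    if not isinstance(runs, list) or not runs:
        problems.append("runs must be a non-empty list")
        return problems
    for i, run in enumerate(runs):
        if not run.get("user_defined_id"):
            problems.append(f"runs[{i}] missing user_defined_id")
        if not isinstance(run.get("data"), dict):
            problems.append(f"runs[{i}] data must map Prompt names to values")
    return problems
```

A document that passes this check matches the structure shown above; whether the data keys actually correspond to your Prompt names still has to be verified against the Agent itself.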

Run a Validation

  1. Open a Ground Truth from the Validation tab.
  2. Click Run Validation.
  3. Results stream in real time as each cell is evaluated. When complete, you’ll see:
    • An overall accuracy score (0–100) for the sheet
    • Per-Prompt scores showing which Prompts are most/least accurate
    • Per-Run scores showing which Runs had the most errors
    • Cell-level reasoning explaining why a value was marked incorrect

Editing Ground Truth

You can update individual expected values without re-uploading the entire file:
  1. Open a Validation and click on a cell.
  2. Click “Edit Ground Truth” and edit the expected value in the side pane.
  3. Click Save. Kolena creates a new version of the Ground Truth, preserving history.

How Scoring Works

Each matching Prompt+Run combination is compared to its expected value using the following methodology:
  • If the values match exactly, the score is 100
  • Otherwise, fuzzy evaluation is performed using the evaluation_instructions
All scores are on a 0–100 scale. Scores are then aggregated by Run and by Prompt, and an overall matching score is provided.
Use evaluation_instructions to tell Kolena how to handle acceptable variations — for example, equivalent date formats, optional whitespace, or acceptable abbreviations. This prevents the evaluator from penalizing correct-but-differently-formatted outputs.
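The exact-match-then-fuzzy flow and the aggregation described above can be sketched as follows. The fuzzy step here is a placeholder; Kolena's actual instruction-driven evaluator is not public:

```python
# Sketch of the scoring flow described above. fuzzy_score is a stand-in:
# the real evaluation interprets the natural-language evaluation_instructions.

def fuzzy_score(actual, expected):
    # Placeholder for instruction-driven fuzzy evaluation (0-100).
    return 0

def score_cell(actual, expected):
    if str(actual).strip() == str(expected).strip():
        return 100  # exact match short-circuits fuzzy evaluation
    return fuzzy_score(actual, expected)

def aggregate(cells):
    """cells maps (run_id, prompt_name) -> cell score (0-100)."""
    by_run, by_prompt = {}, {}
    for (run_id, prompt), s in cells.items():
        by_run.setdefault(run_id, []).append(s)
        by_prompt.setdefault(prompt, []).append(s)
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "overall": mean(list(cells.values())),
        "by_run": {r: mean(v) for r, v in by_run.items()},
        "by_prompt": {p: mean(v) for p, v in by_prompt.items()},
    }
```

With a stub fuzzy evaluator, "$4,500" vs 4500 would score 0 here; in practice that is exactly the kind of variation evaluation_instructions should declare equivalent.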

Using the API

Validation can also be performed programmatically using Kolena’s API.
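As a rough illustration of what a programmatic upload involves, the sketch below builds a Ground Truth payload in the schema above and shows where a request would be sent. The base URL, endpoint path, and auth header are hypothetical placeholders, not Kolena's actual API; consult the API reference for the real endpoints:

```python
# Hypothetical sketch: the endpoint path, auth header, and base URL below
# are NOT from Kolena's docs. Only the payload schema mirrors the
# Ground Truth structure documented above.
import json

API_BASE = "https://api.example.com"  # placeholder base URL

ground_truth_payload = {
    "name": "Q1 lease abstraction answers",
    "evaluation_instructions": "Dates in MM/DD/YYYY are equivalent to ISO dates.",
    "runs": [
        {
            "user_defined_id": "lease-001",
            "data": {"tenant_name": "Acme Corp", "monthly_rent": 4500},
        },
    ],
}
body = json.dumps(ground_truth_payload)

# Illustrative only -- real endpoint and auth come from the API reference:
# requests.post(f"{API_BASE}/agents/<agent-id>/ground-truths",
#               data=body, headers={"Authorization": "Bearer <token>"})
```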