Contact Kolena to enable this feature for your Organization.
Agent Validation lets you compare your Agent’s output to known correct answers (called Ground Truths). Kolena will score how well the Agent matches these Ground Truths. This allows you to:
  • Estimate the overall accuracy of the Agent at a given task
  • Perform change management by testing accuracy before and after changes to your Prompts

Overview

The validation workflow has three steps:
  1. Create a Ground Truth for your Agent
  2. Perform Validation
  3. Review results of the Validation

Prerequisites

To perform validation on an Agent, you must ensure:
  • The Agent has Prompts created
  • Runs are uploaded to the Agent using inputs for which you have known correct outputs
For example, if you wish to validate a Lease Abstraction Agent:
  • Upload the lease documents you wish to abstract into Runs
  • Define all Prompts necessary to perform the abstraction
  • Ensure you have known answers to compare against for those input documents

Creating a Ground Truth

Using the UI, navigate to the “Validation” tab for your Agent. Click “Create New Ground Truth”. Provide a Name for your Ground Truth and click “Continue”.

Associate Ground Truth to Runs

In order to compare Agent outputs to Ground Truths, you need a way to associate the two. Kolena uses a label called user_defined_id for this purpose: both your Ground Truth entries and your Agent Runs must carry this label so they can be joined. Click “Assign User Defined IDs” and provide natural language instructions describing how to derive this label for each Run on your Agent. Kolena follows your instructions to assign the label to each Run. You can check progress and continue once labeling is complete.
For example, if you have a Loan Review Agent, use an identifier like the loan ID as the label. Assign User Defined IDs by providing instructions like “Grab the loan ID from the cover sheet”.
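The join described above can be sketched in plain Python. This is illustrative only, not Kolena code; the record shapes and field names besides user_defined_id are assumptions:

```python
# Sketch: how a shared user_defined_id label pairs Agent Runs with
# Ground Truth entries. Only "user_defined_id" comes from the docs;
# the surrounding record shapes are illustrative.

runs = [
    {"user_defined_id": "loan-123", "output": {"borrower": "Jane Doe"}},
    {"user_defined_id": "loan-456", "output": {"borrower": "John Roe"}},
]
ground_truths = [
    {"user_defined_id": "loan-123", "data": {"borrower": "Jane Doe"}},
    {"user_defined_id": "loan-456", "data": {"borrower": "J. Roe"}},
]

# Index ground truths by the shared label, then pair each Run's output
# with its expected values for comparison.
gt_by_id = {gt["user_defined_id"]: gt["data"] for gt in ground_truths}
pairs = [(run["output"], gt_by_id[run["user_defined_id"]]) for run in runs]
```

If a Run's label has no matching Ground Truth entry, it simply cannot be scored, which is why the label must be stable on both sides.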

Upload Ground Truth

There are two ways to upload Ground Truth:
  1. Convert from an existing format like an Excel spreadsheet
  2. Directly upload in the expected schema
If your correct answers are in a spreadsheet (.xlsx or .csv), use option 1. Kolena will map your spreadsheet columns to the correct Prompt names and generate the Ground Truth JSON for you.

Ground Truth Schema

If uploading Ground Truth directly as JSON, the following structure is expected:
{
  "evaluation_instructions": "Values should match exactly.",
  "runs": [
    {
      "user_defined_id": "lease-001",
      "data": {
        "tenant_name": "Acme Corp",
        "lease_start_date": "01/15/2024",
        "monthly_rent": 4500
      }
    },
    {
      "user_defined_id": "lease-002",
      "data": {
        "tenant_name": "Globex LLC",
        "lease_start_date": "03/01/2024",
        "monthly_rent": 6200
      }
    }
  ]
}
Fields:
  • evaluation_instructions — Natural language guidance for how Kolena should compare outputs to ground truths. Useful for specifying acceptable formats, equivalences (e.g. “$4,500” vs 4500), or what counts as a correct partial match.
  • runs — A list of expected results, one per Agent Run.
    • user_defined_id — A stable identifier that matches the ground truth to a Run in the Agent.
    • data — A map from Prompt name to expected value.
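Before uploading, it can be useful to sanity-check a document against this schema. The helper below is an illustrative sketch, not part of Kolena:

```python
# Illustrative pre-upload check (not Kolena code): verify a Ground Truth
# document has the fields described above.

def check_ground_truth(doc: dict) -> list:
    """Return a list of problems found; an empty list means the shape looks valid."""
    problems = []
    if not isinstance(doc.get("evaluation_instructions"), str):
        problems.append("missing evaluation_instructions string")
    runs = doc.get("runs")
    if not isinstance(runs, list) or not runs:
        problems.append("runs must be a non-empty list")
        return problems
    for i, run in enumerate(runs):
        if not run.get("user_defined_id"):
            problems.append(f"runs[{i}] missing user_defined_id")
        if not isinstance(run.get("data"), dict):
            problems.append(f"runs[{i}] data must map Prompt names to values")
    return problems
```

A document that passes this check matches the structure shown above; whether the data keys actually correspond to your Prompt names still has to be verified against the Agent itself.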

Run a Validation

  1. Open a Ground Truth from the Validation tab.
  2. Click Run Validation.
  3. Results stream in real time as each cell is evaluated. When complete, you’ll see:
    • An overall accuracy score (0–100) for the sheet
    • Per-Prompt scores showing which Prompts are most/least accurate
    • Per-Run scores showing which Runs had the most errors
    • Cell-level reasoning explaining why a value was marked incorrect

Editing Ground Truth

You can update individual expected values without re-uploading the entire file:
  1. Open a Validation and click on a cell.
  2. Click “Edit Ground Truth” and edit the expected value in the side pane.
  3. Click Save. Kolena creates a new version of the Ground Truth, preserving history.

How Scoring Works

Each matching Prompt+Run combination is compared to its expected value using the following methodology:
  • If the values match exactly, the score is 100
  • Otherwise, fuzzy evaluation is performed using the evaluation_instructions
All scores are on a 0–100 scale. Scores are then aggregated by Run and by Prompt, and an overall matching score is provided.
Use evaluation_instructions to tell Kolena how to handle acceptable variations — for example, equivalent date formats, optional whitespace, or acceptable abbreviations. This prevents the evaluator from penalizing correct-but-differently-formatted outputs.
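The exact-match-then-fuzzy flow and the aggregation described above can be sketched as follows. The fuzzy step here is a placeholder; Kolena's actual instruction-driven evaluator is not public:

```python
# Sketch of the scoring flow described above. fuzzy_score is a stand-in:
# the real evaluation interprets the natural-language evaluation_instructions.

def fuzzy_score(actual, expected):
    # Placeholder for instruction-driven fuzzy evaluation (0-100).
    return 0

def score_cell(actual, expected):
    if str(actual).strip() == str(expected).strip():
        return 100  # exact match short-circuits fuzzy evaluation
    return fuzzy_score(actual, expected)

def aggregate(cells):
    """cells maps (run_id, prompt_name) -> cell score (0-100)."""
    by_run, by_prompt = {}, {}
    for (run_id, prompt), s in cells.items():
        by_run.setdefault(run_id, []).append(s)
        by_prompt.setdefault(prompt, []).append(s)
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "overall": mean(list(cells.values())),
        "by_run": {r: mean(v) for r, v in by_run.items()},
        "by_prompt": {p: mean(v) for p, v in by_prompt.items()},
    }
```

With a stub fuzzy evaluator, "$4,500" vs 4500 would score 0 here; in practice that is exactly the kind of variation evaluation_instructions should declare equivalent.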

Using the API

Validation can also be performed programmatically using Kolena’s API.
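As a rough illustration of what a programmatic upload involves, the sketch below builds a Ground Truth payload in the schema above and shows where a request would be sent. The base URL, endpoint path, and auth header are hypothetical placeholders, not Kolena's actual API; consult the API reference for the real endpoints:

```python
# Hypothetical sketch: the endpoint path, auth header, and base URL below
# are NOT from Kolena's docs. Only the payload schema mirrors the
# Ground Truth structure documented above.
import json

API_BASE = "https://api.example.com"  # placeholder base URL

ground_truth_payload = {
    "name": "Q1 lease abstraction answers",
    "evaluation_instructions": "Dates in MM/DD/YYYY are equivalent to ISO dates.",
    "runs": [
        {
            "user_defined_id": "lease-001",
            "data": {"tenant_name": "Acme Corp", "monthly_rent": 4500},
        },
    ],
}
body = json.dumps(ground_truth_payload)

# Illustrative only -- real endpoint and auth come from the API reference:
# requests.post(f"{API_BASE}/agents/<agent-id>/ground-truths",
#               data=body, headers={"Authorization": "Bearer <token>"})
```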