# Measuring Accuracy

> Use Agent Validation to Compare Agent Output Against Known Answers

<Note>
  Contact Kolena to enable this feature for your Organization.
</Note>

**Agent Validation** lets you compare your Agent's output to known correct answers (called **Ground Truths**).
Kolena will score how well the Agent matches these Ground Truths.
This allows you to:

* Estimate the **overall accuracy** of the Agent at a given task
* Perform **change management** by testing accuracy before and after changes to your Prompts

## Overview

The validation workflow has three steps:

1. Create a Ground Truth for your Agent
2. Perform Validation
3. Review results of the Validation

## Prerequisites

To perform validation on an Agent, you must ensure:

* The Agent has Prompts created
* Runs are uploaded to the Agent using inputs for which you have known correct outputs

<Tip>
  For example, if you wish to validate a Lease Abstraction Agent:

  * Upload the lease documents you wish to abstract into Runs
  * Define all Prompts necessary to perform the abstraction
  * Ensure you have known answers to compare against for those input documents
</Tip>

## Creating a Ground Truth

Using the UI, navigate to the "Validation" tab for your Agent.
Click "Create New Ground Truth".
Provide a Name for your Ground Truth and click "Continue".

### Associate Ground Truth to Runs

To compare Agent outputs to Ground Truths, you need a way to associate the two.
Kolena uses a label called `user_defined_id` to achieve this.
Each Ground Truth entry and its corresponding Agent Run must carry the same `user_defined_id` so that Kolena can join them.

Click "Assign User Defined IDs" and provide natural language instructions on how to assign this label to each Run on your Agent.
Kolena will take your instruction and define the label for each Run.
You can check progress and then continue when complete.

<Tip>
  For example, if you have a Loan Review Agent, use an identifier like the loan ID as the label.
  Assign User Defined IDs by providing instructions like "Grab the loan ID from the cover sheet".
</Tip>
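
The join described above can be pictured with a minimal Python sketch. It is an illustration only, not Kolena's implementation: the `outputs` key on each Run and the `join_by_user_defined_id` helper are hypothetical names chosen for this example.

```python
def join_by_user_defined_id(ground_truth_runs, agent_runs):
    """Pair each Ground Truth entry with the Agent Run that shares its user_defined_id."""
    # Index the Agent Runs by label so each lookup is a single dictionary access.
    runs_by_id = {run["user_defined_id"]: run for run in agent_runs}
    matched, unmatched = [], []
    for entry in ground_truth_runs:
        run = runs_by_id.get(entry["user_defined_id"])
        if run is None:
            # No Run carries this label, so there is nothing to compare against.
            unmatched.append(entry["user_defined_id"])
        else:
            matched.append((entry, run))
    return matched, unmatched


# Hypothetical data: "outputs" is an assumed key, not a documented Kolena field.
ground_truth_runs = [
    {"user_defined_id": "loan-1001", "data": {"borrower": "Jane Doe"}},
    {"user_defined_id": "loan-1002", "data": {"borrower": "John Roe"}},
]
agent_runs = [
    {"user_defined_id": "loan-1001", "outputs": {"borrower": "Jane Doe"}},
]

matched, unmatched = join_by_user_defined_id(ground_truth_runs, agent_runs)
print(f"matched: {len(matched)}, unmatched ids: {unmatched}")
```

An entry whose label has no counterpart on the other side cannot be scored, which is why the identifier should be something stable like the loan ID.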

### Upload Ground Truth

There are two ways to upload Ground Truth:

1. Convert from an existing format, such as an Excel spreadsheet
2. Directly upload in the expected schema

If your correct answers are in a spreadsheet (.xlsx or .csv), use option 1.
Kolena will map your spreadsheet columns to the correct Prompt names and generate the Ground Truth JSON for you.
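
If you prefer to prepare the JSON yourself, the conversion is roughly of the following shape. This is a sketch under stated assumptions: the CSV columns are already named after your Prompts, one column (here the hypothetical `lease_id`) holds the `user_defined_id`, and `csv_to_ground_truth` is an illustrative helper, not part of Kolena.

```python
import csv
import io
import json


def csv_to_ground_truth(csv_text, id_column, evaluation_instructions):
    """Build a Ground Truth payload from CSV text with one row per Run."""
    runs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row = dict(row)
        # The ID column becomes user_defined_id; every other column is treated
        # as a Prompt name mapped to its expected value.
        user_defined_id = row.pop(id_column)
        runs.append({"user_defined_id": user_defined_id, "data": row})
    return {"evaluation_instructions": evaluation_instructions, "runs": runs}


csv_text = """lease_id,tenant_name,lease_start_date,monthly_rent
lease-001,Acme Corp,01/15/2024,4500
lease-002,Globex LLC,03/01/2024,6200
"""

ground_truth = csv_to_ground_truth(csv_text, "lease_id", "Values should match exactly.")
# Note: CSV values are read as strings, so monthly_rent is "4500", not 4500.
# Cast numeric columns yourself if the expected values should be numbers.
print(json.dumps(ground_truth, indent=2))
```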

#### Ground Truth Schema

If uploading Ground Truth directly as JSON, the following structure is expected:

```json theme={null}
{
  "evaluation_instructions": "Values should match exactly.",
  "runs": [
    {
      "user_defined_id": "lease-001",
      "data": {
        "tenant_name": "Acme Corp",
        "lease_start_date": "01/15/2024",
        "monthly_rent": 4500
      }
    },
    {
      "user_defined_id": "lease-002",
      "data": {
        "tenant_name": "Globex LLC",
        "lease_start_date": "03/01/2024",
        "monthly_rent": 6200
      }
    }
  ]
}
```

**Fields:**

* `evaluation_instructions` — Natural language guidance for how Kolena should compare outputs to ground truths. Useful for specifying acceptable formats, equivalences (e.g. "\$4,500" vs `4500`), or what counts as a correct partial match.
* `runs` — A list of expected results, one per Agent Run.
  * `user_defined_id` — A stable identifier that matches the ground truth to a Run in the Agent.
  * `data` — A map from Prompt name to expected value.
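
Before uploading, it can help to check a payload against this structure locally. The sketch below assumes that every field shown above is required and that each `user_defined_id` appears only once; the documentation does not state either rule, so treat them as assumptions. `check_ground_truth` is a hypothetical helper.

```python
def check_ground_truth(payload):
    """Return a list of structural problems; an empty list means the shape looks right."""
    problems = []
    if not isinstance(payload.get("evaluation_instructions"), str):
        problems.append("evaluation_instructions should be a string")
    runs = payload.get("runs")
    if not isinstance(runs, list):
        return problems + ["runs should be a list"]

    seen_ids = set()
    for i, run in enumerate(runs):
        if not isinstance(run, dict):
            problems.append(f"runs[{i}] should be an object")
            continue
        uid = run.get("user_defined_id")
        if not isinstance(uid, str) or not uid:
            problems.append(f"runs[{i}]: user_defined_id should be a non-empty string")
        elif uid in seen_ids:
            # Assumption: duplicate IDs would make the join to Runs ambiguous.
            problems.append(f"runs[{i}]: duplicate user_defined_id {uid!r}")
        else:
            seen_ids.add(uid)
        if not isinstance(run.get("data"), dict):
            problems.append(f"runs[{i}]: data should map Prompt names to expected values")
    return problems


ground_truth = {
    "evaluation_instructions": "Values should match exactly.",
    "runs": [
        {"user_defined_id": "lease-001", "data": {"tenant_name": "Acme Corp"}},
        {"user_defined_id": "lease-002", "data": {"tenant_name": "Globex LLC"}},
    ],
}
print(check_ground_truth(ground_truth) or "structure looks OK")
```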

## Run a Validation

1. Open a Ground Truth from the **Validation** tab.
2. Click **Run Validation**.
3. Results appear in real time as each cell is evaluated. When complete, you'll see:
   * An **overall accuracy score** (0–100) for the sheet
   * **Per-Prompt scores** showing which Prompts are most/least accurate
   * **Per-Run scores** showing which Runs had the most errors
   * **Cell-level reasoning** explaining why a value was marked incorrect

## Editing Ground Truth

You can update individual expected values without re-uploading the entire file:

1. Open a Validation and click on a cell.
2. Click "Edit Ground Truth" and edit the expected value in the side pane.
3. Click **Save**. Kolena creates a new version of the Ground Truth, preserving history.

## How Scoring Works

Each matching Prompt+Run combination is compared to its expected value using the following methodology:

* If an exact match is found, the score is 100%.
* Otherwise, fuzzy evaluation is performed using the `evaluation_instructions`.

All scores are on a 0-100 scale. Scores are then aggregated by Run and by Prompt, and an overall matching score is provided.
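
The flow can be sketched in Python as follows. Two parts are assumptions rather than documented behavior: the fuzzy step uses `difflib.SequenceMatcher` purely as a stand-in for Kolena's instruction-driven evaluation, and aggregation is shown as a simple mean, which the documentation does not specify.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from statistics import mean


def score_cell(expected, actual):
    """Score one Prompt+Run cell on a 0-100 scale."""
    if expected == actual:
        return 100.0  # exact match
    # Stand-in for fuzzy evaluation; Kolena uses evaluation_instructions here.
    return 100.0 * SequenceMatcher(None, str(expected), str(actual)).ratio()


def aggregate(cells):
    """cells: iterable of (run_id, prompt_name, expected, actual) tuples."""
    by_run, by_prompt, all_scores = defaultdict(list), defaultdict(list), []
    for run_id, prompt, expected, actual in cells:
        score = score_cell(expected, actual)
        by_run[run_id].append(score)
        by_prompt[prompt].append(score)
        all_scores.append(score)
    # Assumption: each aggregate is an unweighted mean of its cell scores.
    per_run = {run_id: mean(scores) for run_id, scores in by_run.items()}
    per_prompt = {prompt: mean(scores) for prompt, scores in by_prompt.items()}
    return per_run, per_prompt, mean(all_scores)


cells = [
    ("lease-001", "tenant_name", "Acme Corp", "Acme Corp"),
    ("lease-001", "monthly_rent", 4500, 4500),
    ("lease-002", "tenant_name", "Globex LLC", "Globex"),
    ("lease-002", "monthly_rent", 6200, 6200),
]
per_run, per_prompt, overall = aggregate(cells)
print(per_run, per_prompt, overall)
```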

### Evaluation Modes

When you start a Validation, you can choose how outputs are compared to Ground Truth:

* **Standard** (default) — Each value is compared to its expected value. If an exact match is found the
  score is 100%; otherwise fuzzy evaluation is performed using your `evaluation_instructions`. Results
  are scored on a 0-100 scale.
* **Script-Based Evaluation** *(Beta)* — An AI coding agent inspects your data and writes and runs a
  script to compare outputs **field-by-field**, producing deterministic match / no-match results.
  A second pass is performed on remaining mismatches using natural language comparison.

<Note>
  **Beta:** The behavior and output of Script-Based Evaluation may change.
</Note>

To use Script-Based Evaluation, select it from the **Start Validation** dropdown, or choose
**Rerun Validation (Script-Based)** from the Ground Truth menu.
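
To give a sense of what a field-by-field comparison script might look like, here is a hand-written sketch. The scripts Kolena's coding agent generates depend on your data and will differ; the `normalize` rule and function names below are assumptions made for illustration.

```python
def normalize(value):
    """Illustrative normalization: collapse whitespace and ignore case for strings."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    return value


def compare_fields(expected, actual):
    """Return a deterministic match / no-match result for every expected field."""
    return {
        field: normalize(actual.get(field)) == normalize(value)
        for field, value in expected.items()
    }


expected = {"tenant_name": "Acme Corp", "lease_start_date": "01/15/2024", "monthly_rent": 4500}
actual = {"tenant_name": " acme corp ", "lease_start_date": "2024-01-15", "monthly_rent": 4500}

results = compare_fields(expected, actual)
# Fields that fail the deterministic check are the ones left for the
# second, natural language pass.
needs_second_pass = [field for field, ok in results.items() if not ok]
print(results)
print("second pass:", needs_second_pass)
```

In this example the date is the same day written in a different format, so the deterministic check reports a mismatch and the field is deferred to the second pass.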

### Custom Instructions

**Evaluation Instructions** can be set to customize the scoring logic.
Use these to tell Kolena how to handle acceptable variations — for example, equivalent date formats, optional whitespace, or acceptable abbreviations.
This prevents the evaluator from penalizing correct-but-differently-formatted outputs.

<Tip>
  **Evaluation Instructions** can be used to provide custom logic per Prompt or to ignore differences in specific outputs.

  For example,

  > #### General instructions
  >
  > When comparing booleans, consider "yes"/"true" and "no"/"false" as equals
  >
  > ##### My Prompt
  >
  > When comparing "My Prompt", ignore columns called "index" in the comparison
</Tip>

## Using the API

Validation can also be performed programmatically using [Kolena's API](/api-reference/ground-truths/create-ground-truth).
