Contact Kolena to enable this feature for your Organization.
- Estimate the overall accuracy of the Agent at a given task
- Perform change management by testing accuracy before and after changes to your Prompts
Overview
The validation workflow has three steps:
- Create a Ground Truth for your Agent
- Perform Validation
- Review results of the Validation
Prerequisites
To perform validation on an Agent, make sure that:
- The Agent has Prompts created
- Runs are uploaded to the Agent using inputs for which you have known correct outputs
Creating a Ground Truth
Using the UI, navigate to the “Validation” tab for your Agent. Click “Create New Ground Truth”. Provide a Name for your Ground Truth and click “Continue”.
Associate Ground Truth to Runs
To compare Agent outputs to Ground Truths, you need a way to associate the two. Kolena uses a label called user_defined_id to achieve this.
Both your Ground Truth entries and your Agent Runs must carry this label so that Kolena can join them.
Click “Assign User Defined IDs” and provide natural language instructions on how to assign this label to each Run on your Agent.
Kolena will take your instruction and define the label for each Run.
You can check progress and then continue when complete.
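The join on user_defined_id can be pictured with a small sketch. This is an illustration only, not Kolena's implementation: the function name join_by_user_defined_id and the shape of the Run records (an "outputs" map) are assumptions made for the example.

```python
def join_by_user_defined_id(runs, ground_truth_runs):
    """Pair each Agent Run with the ground truth entry sharing its user_defined_id.

    Hypothetical helper for illustration; the record shapes are assumed.
    """
    expected_by_id = {gt["user_defined_id"]: gt for gt in ground_truth_runs}
    matched, unmatched = [], []
    for run in runs:
        gt = expected_by_id.get(run.get("user_defined_id"))
        if gt is None:
            unmatched.append(run)  # no ground truth carries this label
        else:
            matched.append((run, gt))
    return matched, unmatched


# Assumed example data: two Runs, two ground truth entries, one shared label.
runs = [
    {"user_defined_id": "invoice-001", "outputs": {"Total Amount": "$4,500"}},
    {"user_defined_id": "invoice-003", "outputs": {"Total Amount": "$120"}},
]
ground_truth_runs = [
    {"user_defined_id": "invoice-001", "data": {"Total Amount": "4500"}},
    {"user_defined_id": "invoice-002", "data": {"Total Amount": "980"}},
]

matched, unmatched = join_by_user_defined_id(runs, ground_truth_runs)
```

In this sketch only invoice-001 is paired. A Run whose label has no counterpart in the Ground Truth simply has nothing to be compared against, which is why both sides need to use the same identifiers.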
Upload Ground Truth
There are two ways to upload Ground Truth:
- Convert from an existing format such as an Excel spreadsheet
- Directly upload in the expected schema
Ground Truth Schema
If uploading Ground Truth directly as JSON, the following structure is expected:
- evaluation_instructions: Natural language guidance for how Kolena should compare outputs to ground truths. Useful for specifying acceptable formats, equivalences (e.g. “$4,500” vs 4500), or what counts as a correct partial match.
- runs: A list of expected results, one per Agent Run. Each entry contains:
  - user_defined_id: A stable identifier that matches the ground truth to a Run in the Agent.
  - data: A map from Prompt name to expected value.
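As a rough illustration of that structure, the sketch below builds a Ground Truth document in Python and serializes it to JSON. The field names come from the schema above; the Prompt names, identifiers, and values are invented for the example. The nesting of user_defined_id and data inside each runs entry, and the checks in find_schema_problems (a hypothetical helper, not part of Kolena), are assumptions about how the schema is meant to be read.

```python
import json

# Example document; Prompt names ("Total Amount", "Vendor Name") are made up.
ground_truth = {
    "evaluation_instructions": (
        "Treat currency formatting differences as equivalent, "
        "e.g. '$4,500' and 4500."
    ),
    "runs": [
        {
            "user_defined_id": "invoice-001",
            "data": {"Total Amount": "4500", "Vendor Name": "Acme Corp"},
        },
        {
            "user_defined_id": "invoice-002",
            "data": {"Total Amount": "980", "Vendor Name": "Globex"},
        },
    ],
}


def find_schema_problems(doc: dict) -> list[str]:
    """Return a list of structural problems (assumed rules, for illustration)."""
    problems = []
    if not isinstance(doc.get("evaluation_instructions"), str):
        problems.append("evaluation_instructions must be a string")
    runs = doc.get("runs")
    if not isinstance(runs, list):
        return problems + ["runs must be a list"]
    seen = set()
    for i, run in enumerate(runs):
        uid = run.get("user_defined_id")
        if not uid:
            problems.append(f"runs[{i}] is missing user_defined_id")
        elif uid in seen:
            problems.append(f"runs[{i}] repeats user_defined_id {uid!r}")
        seen.add(uid)
        if not isinstance(run.get("data"), dict):
            problems.append(f"runs[{i}].data must be a map")
    return problems


problems = find_schema_problems(ground_truth)
payload = json.dumps(ground_truth, indent=2)  # text suitable for saving as a .json file
```

Running a local check like this before uploading can catch a missing or duplicated user_defined_id early, although the rules Kolena actually enforces on upload may differ.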
Run a Validation
- Open a Ground Truth from the Validation tab.
- Click Run Validation.
- Results stream in real time as each cell is evaluated. When complete, you’ll see:
  - An overall accuracy score (0–100) for the sheet
  - Per-Prompt scores showing which Prompts are most and least accurate
  - Per-Run scores showing which Runs had the most errors
  - Cell-level reasoning explaining why a value was marked incorrect
Editing Ground Truth
You can update individual expected values without re-uploading the entire file:
- Open a Validation and click on a cell.
- Click “Edit Ground Truth” and edit the expected value in the side pane.
- Click Save. Kolena creates a new version of the Ground Truth, preserving history.
How Scoring Works
Each matching Prompt+Run combination is compared to its expected value using the following methodology:
- If an exact match is found, the score is 100%.
- Otherwise, fuzzy evaluation is performed using the evaluation_instructions.
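The two-step methodology above can be sketched as follows. This is a simplified illustration under stated assumptions, not Kolena's scoring code: the fuzzy step is represented by a pluggable function, the toy currency_insensitive evaluator stands in for whatever evaluation Kolena performs from the natural language evaluation_instructions, and averaging cell scores into an overall figure is an assumption about the aggregation.

```python
from typing import Callable


def score_cell(actual: str, expected: str,
               fuzzy: Callable[[str, str], float]) -> float:
    """Score one Prompt+Run cell on a 0-100 scale."""
    if actual == expected:
        return 100.0  # exact match short-circuits the fuzzy step
    return fuzzy(actual, expected)


def currency_insensitive(actual: str, expected: str) -> float:
    """Toy stand-in for fuzzy evaluation: ignore '$', ',' and outer whitespace."""
    def normalize(value: str) -> str:
        return value.replace("$", "").replace(",", "").strip()
    return 100.0 if normalize(actual) == normalize(expected) else 0.0


# (actual output, expected value) pairs; the data is invented for the example.
cells = [("$4,500", "4500"), ("Acme Corp", "Acme Corp"), ("980", "890")]
scores = [score_cell(a, e, currency_insensitive) for a, e in cells]

# Assumed aggregation: the mean of the cell scores.
overall = sum(scores) / len(scores)
```

Here the first cell passes only because of the fuzzy step, the second is an exact match, and the third fails both checks. That is the kind of difference the cell-level reasoning in the results is meant to explain.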
