Testing & Benchmarks

Check that your agents behave as expected. Capture conversations as test cases, replay them as test runs, and compare models or configurations with benchmarks.

Manual | What it covers
Test Cases | Define expected agent behaviour, including saving a conversation as a test case.
Test Runs | Replay test cases and review the results step by step.
Benchmarks | Benchmark runs and their per-item results.

Related: test cases can be created directly from the chat window.