Testing & Benchmarks
Check that your agents behave as expected. Capture conversations as test cases, replay them as test runs, and compare models or configurations with benchmarks.
| Manual | What it covers |
|---|---|
| Test Cases | Define expected agent behaviour, including saving a conversation as a test case. |
| Test Runs | Replay test cases and review the results step by step. |
| Benchmarks | Run benchmarks to compare models or configurations, and review the per-item results. |
Related: test cases can be created directly from the chat window.