Evaluations (Evals)

Systematic testing of agent performance: accuracy, safety, reliability.

Why it matters

You wouldn't deploy software without tests. Evals are the test suite for AI agents.

In practice

Our QA Judge subagent runs boolean pass/fail criteria against every story in our PRD — automated quality assurance.

Related terms

Using a second AI model to evaluate the quality of a primary agent's output.

An autonomous optimization loop: propose change, experiment, measure, keep or revert, repeat.

Rules and constraints that prevent an agent from taking harmful or unauthorized actions.