Evaluations (Evals)
Systematic testing of agent performance: accuracy, safety, reliability.
Why it matters
You wouldn't deploy software without tests. Evals are the test suite for AI agents.
In practice
Our QA Judge subagent runs boolean pass/fail criteria against every story in our PRD — automated quality assurance.
Related terms
LLM-as-Judge
Using a second AI model to evaluate the quality of a primary agent's output.
AutoResearch (Karpathy Method)
An autonomous optimization loop: propose change, experiment, measure, keep or revert, repeat.
Guardrails
Rules and constraints that prevent an agent from taking harmful or unauthorized actions.