📄️ Overview and basic usage
Evals are static, ground-truth datasets designed to be run repeatedly against your agents to assess their performance. Each dataset is composed of multiple test cases, each pairing an input (query or plan) with an expected output (plan or plan run).
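As an illustration of the concept (plain Python, not the SDK's actual classes), a test case is simply an input paired with its ground-truth output, and a dataset is a fixed collection of such pairs:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One eval test case: an input paired with its ground-truth output."""
    input_query: str       # the query (or plan) given to the agent
    expected_output: str   # the plan (or plan run) we expect back

# A static dataset is a fixed collection of such pairs,
# re-run against the agent on every eval to track performance over time.
dataset = [
    TestCase("What's the weather in London?", "call weather tool for London"),
    TestCase("Email the report to Ana", "draft email, then call send-email tool"),
]
```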
📄️ Custom evaluators
Evaluators are responsible for calculating metrics. To help you get started quickly, Steel Thread provides a range of built-in evaluators that you can configure from the dashboard and then pass to your EvalRun via the DefaultEvaluator class. This is explained in the previous section on basic usage.
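Conceptually, an evaluator takes the expected output and the agent's actual output and returns one or more metric scores. The sketch below is illustrative plain Python rather than the Steel Thread API; the real DefaultEvaluator and custom evaluators are configured as described in that section:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """A named score produced by an evaluator, e.g. in the range 0.0-1.0."""
    name: str
    score: float

def exact_match_evaluator(expected: str, actual: str) -> Metric:
    """A toy custom evaluator: scores 1.0 only when the agent's output
    exactly matches the ground-truth output, otherwise 0.0."""
    return Metric(name="exact_match", score=1.0 if expected == actual else 0.0)

# Hypothetical usage: an eval run would apply each configured evaluator
# to every test case and aggregate the resulting metrics.
print(exact_match_evaluator("plan: send email", "plan: send email"))  # score=1.0
```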
📄️ Tool stubbing
When running evals, your agent may call tools like weatherlookup, search, or sendemail. If those tools hit live systems, you'll get non-deterministic results, which can make evaluation noisy and inconsistent. These tool calls can also have undesirable real-world effects (e.g. emails actually being sent) and incur costs when you're simply trying to run evals!
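The idea behind stubbing is to replace each live tool call with a deterministic canned response. A minimal illustration in plain Python (the stubbing mechanism and response shapes here are hypothetical, not the SDK's):

```python
# Map each tool name to a deterministic stubbed response so eval runs
# never hit live systems: no real emails sent, no API costs, stable outputs.
TOOL_STUBS = {
    "weatherlookup": lambda args: {"temperature_c": 18, "conditions": "cloudy"},
    "search": lambda args: {"results": ["stubbed result 1", "stubbed result 2"]},
    "sendemail": lambda args: {"status": "stubbed - no email actually sent"},
}

def call_tool(name: str, args: dict) -> dict:
    """During evals, route tool calls to their stubs; in production this
    would dispatch to the real tool implementations instead."""
    if name in TOOL_STUBS:
        return TOOL_STUBS[name](args)
    raise KeyError(f"No stub registered for tool: {name}")

print(call_tool("weatherlookup", {"city": "London"}))
```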
📄️ Visualise Eval results
Results from Evals are pushed to the Portia UI for visualisation, letting you quickly see trends over time and drill into why metrics may have changed. Clicking into a dataset shows a summary of its current metrics, plotted by run.