📄️ Overview and basic usage
Evals are static, ground-truth datasets designed to be run repeatedly against your agents to assess their performance. Each dataset is composed of multiple test cases, each pairing an input (query or plan) with an expected output (plan or plan run).
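As an illustration of the concept (plain Python, not the SDK's actual classes), a test case is simply an input paired with its ground-truth output, and a dataset is a fixed collection of such pairs:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One eval test case: an input paired with its ground-truth output."""
    input_query: str       # the query (or plan) given to the agent
    expected_output: str   # the plan (or plan run) we expect back

# A static dataset is a fixed collection of such pairs,
# re-run against the agent on every eval to track performance over time.
dataset = [
    TestCase("What's the weather in London?", "call weather tool for London"),
    TestCase("Email the report to Ana", "draft email, then call send-email tool"),
]
```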
📄️ Custom evaluators
Evaluators are responsible for calculating metrics. To help you get started quickly, Steel Thread provides a range of built-in evaluators that you can configure from the dashboard and then pass to your EvalRun via the DefaultEvaluator class. This is explained in the previous section on basic usage.
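Conceptually, an evaluator takes the expected output and the agent's actual output and returns one or more metric scores. The sketch below is illustrative plain Python rather than the Steel Thread API; the real DefaultEvaluator and custom evaluators are configured as described in that section:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """A named score produced by an evaluator, e.g. in the range 0.0-1.0."""
    name: str
    score: float

def exact_match_evaluator(expected: str, actual: str) -> Metric:
    """A toy custom evaluator: scores 1.0 only when the agent's output
    exactly matches the ground-truth output, otherwise 0.0."""
    return Metric(name="exact_match", score=1.0 if expected == actual else 0.0)

# Hypothetical usage: an eval run would apply each configured evaluator
# to every test case and aggregate the resulting metrics.
print(exact_match_evaluator("plan: send email", "plan: send email"))  # score=1.0
```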
📄️ Tool stubbing
When running evals, your agent may call tools like weatherlookup, search, or sendemail. If those tools hit live systems, you'll get non-deterministic results, which can make evaluation noisy and inconsistent. These tool calls can also have undesirable real-world effects (e.g. emails actually being sent) and incur costs when you're simply trying to run evals!
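The idea behind stubbing is to replace each live tool call with a deterministic canned response. A minimal illustration in plain Python (the stubbing mechanism and response shapes here are hypothetical, not the SDK's):

```python
# Map each tool name to a deterministic stubbed response so eval runs
# never hit live systems: no real emails sent, no API costs, stable outputs.
TOOL_STUBS = {
    "weatherlookup": lambda args: {"temperature_c": 18, "conditions": "cloudy"},
    "search": lambda args: {"results": ["stubbed result 1", "stubbed result 2"]},
    "sendemail": lambda args: {"status": "stubbed - no email actually sent"},
}

def call_tool(name: str, args: dict) -> dict:
    """During evals, route tool calls to their stubs; in production this
    would dispatch to the real tool implementations instead."""
    if name in TOOL_STUBS:
        return TOOL_STUBS[name](args)
    raise KeyError(f"No stub registered for tool: {name}")

print(call_tool("weatherlookup", {"city": "London"}))
```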
📄️ Visualise Eval results
Results from Evals are pushed to the Portia UI for visualisation, letting you quickly see trends over time and drill into why metrics may have changed. Clicking into a dataset shows a summary of its current metrics, plotted by run.