# Tool stubbing
When running evals, your agent may call tools like `weather_lookup`, `search`, or `send_email`. If those tools hit live systems, you'll get non-deterministic results, which makes evaluation noisy and inconsistent. There are also undesirable real-world effects (e.g. emails actually being sent) and costs to these tool calls when you're simply trying to run evals!
To solve this, Steel Thread supports stubbing tools. This makes your tests:
- Deterministic: the same input always produces the same output
- Isolated: no external API calls or flaky systems
- Repeatable: easy to track regressions across changes
## When to stub
Use tool stubs when:
- You're writing evals
- The tool response affects the plan or output
- You want consistent scoring across iterations
## How tool stubbing works
Steel Thread provides a `ToolStubRegistry`, a drop-in replacement for Portia's default registry. You can wrap your existing tools and selectively override individual tools by ID. Tool stubs are simple Python functions, and the `ToolStubContext` passed to each stub carries all of the original tool's context to help you generate realistic responses. Below is an example that uses a tool stub for the open-source `weather_tool` available in the Portia SDK.
```python
from portia import Portia, Config, DefaultToolRegistry
from steelthread.steelthread import SteelThread, EvalConfig
from steelthread.portia.tools import ToolStubRegistry, ToolStubContext
from dotenv import load_dotenv

load_dotenv(override=True)


# Define stub behavior
def weather_stub_response(ctx: ToolStubContext) -> str:
    """Stub for weather tool to return deterministic weather."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"
    return f"Unknown city: {city}"


config = Config.from_default()

# Run evals with stubs
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "weather_tool": weather_stub_response,
        },
    ),
)

SteelThread().run_evals(
    portia,
    EvalConfig(
        eval_dataset_name="your-dataset-name-here",
        config=config,
        iterations=5,
    ),
)
```
With the stubbed tool in place, your evals will be clean, fast, and reproducible. The rest of your tools still work as normal; only the stubbed one is overridden.
A few tips when stubbing:

- Stub only the tools that matter for evaluation
- Use consistent return types (e.g. the same type as the real tool)
- Use `tool_call_index` if you want per-run variance (see the sketch below)
- Combine stubbing with assertions to detect misuse (e.g. a tool called too many times)
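The last two tips can be sketched together. This is a minimal, hypothetical example, assuming `ToolStubContext` exposes a `tool_call_index` counter for the current run; the `call_counts` bookkeeping and the threshold in the final assertion are illustrative, not part of the Steel Thread API.

```python
from steelthread.portia.tools import ToolStubContext

# Illustrative bookkeeping so a test can assert on call volume afterwards.
call_counts = {"weather_tool": 0}


def weather_stub_with_variance(ctx: ToolStubContext) -> str:
    """Return deterministic but slightly varying temperatures per call.

    Assumes ctx.tool_call_index increments with each call in a run.
    """
    call_counts["weather_tool"] += 1

    city = ctx.kwargs.get("city", "").lower()
    base = {"sydney": 33.28, "london": 2.00}.get(city)
    if base is None:
        return f"Unknown city: {city}"

    # Per-run variance: nudge the reading based on which call this is.
    return f"{base + 0.1 * ctx.tool_call_index:.2f}"


# After run_evals(...) completes, a simple misuse check might look like:
# assert call_counts["weather_tool"] <= 5, "weather_tool called more often than expected"
```

Register it with `ToolStubRegistry` exactly as in the example above, in place of `weather_stub_response`.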