
🚀 Getting Started with Steel Thread

Steel Thread lets you evaluate your agents, both during development and in production, using real data, real metrics, and minimal boilerplate.

This page walks you through getting started in three steps:


1️⃣ Sign Up on Portia Cloud

To use Steel Thread, you need data from your agent runs. This means integrating your agents with Portia Cloud.

  • Go to app.portialabs.ai
  • Log in or create a new account
  • Create a new API key
  • Run your agent, making sure to enable the Portia Cloud integration by setting the API key in the config
  • These runs (plans, tool calls, outputs) will form the basis of your evaluation datasets

💡 Every run you create in Portia can be used as an evaluation input, with no extra data labeling required.


2️⃣ Create a Dataset to Evaluate

There are two types of datasets that can be created via the UI.

📦 Offline Dataset

Use this to create a static, repeatable set of evals.

  • In the Portia UI, create a new Offline Eval Set with a distinct name, e.g. offline_evals_v1.
  • Add test cases to the newly created dataset using the Add to Dataset wizard.

🌐 Online Dataset

Use this to continuously evaluate production runs.

  • In the Portia UI, create a new Online Eval Set with a distinct name, e.g. online_evals_v1.
  • Test cases will be automatically sampled based on the config you provide.
  • Note: Online Datasets only sample runs made after the dataset is created, so you will need to generate new runs after creating it.
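The "after creation" rule is easy to miss, so here is a purely conceptual sketch of it. This is not Steel Thread's implementation: the Run class, the sample_rate parameter, and both helper functions are hypothetical names invented for illustration, and the real sampling config in the Portia UI may work differently.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Run:
    """Hypothetical stand-in for an agent run stored in Portia Cloud."""
    run_id: str
    created_at: datetime


def eligible_runs(runs, dataset_created_at):
    """Keep only runs created after the online dataset existed."""
    return [r for r in runs if r.created_at > dataset_created_at]


def sample_runs(runs, dataset_created_at, sample_rate, seed=0):
    """Pick each eligible run with probability sample_rate (an assumed knob)."""
    rng = random.Random(seed)
    return [
        r for r in eligible_runs(runs, dataset_created_at)
        if rng.random() < sample_rate
    ]
```

The point of the sketch: a run that predates the dataset is never a candidate, no matter how the sampling itself is configured.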

3️⃣ Run Evals Locally Using the SDK

First, install the Steel Thread SDK:

# Using pip
pip install steel-thread

# Using poetry
poetry add steel-thread

# Using uv
uv pip install steel-thread
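To confirm the install landed in the environment you expect, a small standard-library check like the following can help. It assumes the importable module is named steelthread, as in the import statements used on this page; is_installed is a hypothetical helper, not part of the SDK.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return find_spec(module_name) is not None
```

Calling is_installed("steelthread") should return True once the package is installed in the active interpreter.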

Then set the required environment variables. At a minimum, you need PORTIA_API_KEY and one LLM provider key (e.g. OPENAI_API_KEY).

export OPENAI_API_KEY=""
export PORTIA_API_KEY=""
export ANTHROPIC_API_KEY=""
export MISTRAL_API_KEY=""
export GOOGLE_API_KEY=""
export AZURE_OPENAI_API_KEY=""
export AZURE_OPENAI_ENDPOINT=""
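A quick preflight check can catch an empty or missing key before an eval run starts. The sketch below uses only the variable names from the export list above; check_env is a hypothetical helper, and pairing the Azure key with its endpoint is an assumption, not documented SDK behavior.

```python
import os

# Provider key names copied from the export list; which providers the SDK
# actually accepts is not verified here.
LLM_PROVIDER_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "MISTRAL_API_KEY",
    "GOOGLE_API_KEY",
    "AZURE_OPENAI_API_KEY",
)


def check_env(env=None):
    """Return a list of problems; an empty list means the minimum is met."""
    env = os.environ if env is None else env

    def is_set(name):
        # Treat empty or whitespace-only values (e.g. export KEY="") as unset.
        return bool(env.get(name, "").strip())

    problems = []
    if not is_set("PORTIA_API_KEY"):
        problems.append("PORTIA_API_KEY is not set")
    if not any(is_set(key) for key in LLM_PROVIDER_KEYS):
        problems.append("no LLM provider key is set")
    # Assumption: an Azure OpenAI key is only usable with its endpoint.
    if is_set("AZURE_OPENAI_API_KEY") and not is_set("AZURE_OPENAI_ENDPOINT"):
        problems.append("AZURE_OPENAI_ENDPOINT is not set")
    return problems
```

Calling check_env() with no arguments inspects the current process environment; passing a dict makes the check easy to test.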

Then, run your evals:

🧪 Offline Example

from portia import Config, Portia
from steelthread.steelthread import SteelThread, OfflineEvalConfig

# Build the Portia config from defaults and environment variables.
config = Config.from_default()

# Evaluate against the offline dataset created in step 2.
SteelThread().run_offline(
    Portia(config),
    OfflineEvalConfig(
        data_set_name="offline_evals_v1",
        config=config,
        iterations=3,
    ),
)

📈 Online Example

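from portia import Config
from steelthread.steelthread import SteelThread, OnlineEvalConfig

# Build the Portia config from defaults and environment variables.
config = Config.from_default()

# data_set_name must match the name of an Online Eval Set in the Portia UI.
SteelThread().run_online(
    OnlineEvalConfig(
        data_set_name="prod_online_evals",
        config=config,
    )
)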
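If you run both kinds of evals regularly, the two snippets above could be folded into one small command-line wrapper. This is a sketch, not an official entry point: build_parser, main, and the flag names are hypothetical, and the SDK calls are copied unchanged from the examples on this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define the (hypothetical) CLI: a mode plus the eval set name."""
    parser = argparse.ArgumentParser(description="Run Steel Thread evals.")
    parser.add_argument("mode", choices=["offline", "online"])
    parser.add_argument("--dataset", required=True,
                        help="Eval set name as created in the Portia UI")
    parser.add_argument("--iterations", type=int, default=3,
                        help="Used in offline mode only")
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)

    # Deferred imports: argument parsing and --help work without the SDK.
    from portia import Config, Portia
    from steelthread.steelthread import (
        OfflineEvalConfig,
        OnlineEvalConfig,
        SteelThread,
    )

    config = Config.from_default()
    if args.mode == "offline":
        SteelThread().run_offline(
            Portia(config),
            OfflineEvalConfig(
                data_set_name=args.dataset,
                config=config,
                iterations=args.iterations,
            ),
        )
    else:
        SteelThread().run_online(
            OnlineEvalConfig(data_set_name=args.dataset, config=config)
        )
```

Calling main(["offline", "--dataset", "offline_evals_v1"]) would mirror the offline example; wiring main() to a script entry point is left to your project layout.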

✅ You're Set!

Once you're running evals, you can:

  • View metrics in your terminal or save them to dashboards
  • Catch regressions before they ship
  • Monitor live agent quality in production
  • Iterate faster with confidence