Crawl Tool
Usage
Portia offers both open source tools and a cloud-hosted library of tools to save you development time. You can dig into the specs of those tools in our open source repo (SDK repo ↗).
You can import our open source tools into your project using `from portia.open_source_tools.registry import open_source_tool_registry` and load them into an `InMemoryToolRegistry` object. You can also combine them with cloud or custom tools as explained in the docs (Add custom tools ↗).
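A minimal setup might look like the sketch below. The top-level `InMemoryToolRegistry` import and the `from_local_tools` / `get_tools` helpers are assumptions based on typical Portia SDK usage, so check the SDK repo for your version.

```python
# Minimal sketch of loading the open source tools into a registry.
# The InMemoryToolRegistry import path and the from_local_tools/get_tools
# helpers are assumptions; confirm against the SDK repo for your version.
from portia import InMemoryToolRegistry
from portia.open_source_tools.registry import open_source_tool_registry

# Copy the open source tools into an in-memory registry. You could append
# your own custom tools to this list before constructing the registry.
tool_registry = InMemoryToolRegistry.from_local_tools(
    open_source_tool_registry.get_tools()
)
```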
Tool details
Tool ID: crawl_tool
Tool description: Crawls websites using a graph-based website traversal tool that can explore hundreds of paths in parallel, with built-in extraction and intelligent discovery. Provide a starting URL and optional instructions for what to find, and the tool will navigate and extract relevant content from multiple pages. It supports depth control, domain filtering, and path selection for comprehensive site exploration.
Usage notes:
This tool uses the Tavily API. You can sign up to obtain a Tavily API key (↗) and set it in the environment variable `TAVILY_API_KEY`.
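In practice you would set the key in your shell or a .env file; the snippet below is just an illustrative runtime check that it is present before the tool is used.

```python
import os

# Fail fast if the Tavily API key is missing. Set TAVILY_API_KEY in your
# shell or .env file rather than hard-coding it in source.
if not os.getenv("TAVILY_API_KEY"):
    raise RuntimeError("TAVILY_API_KEY is not set; the crawl tool needs it to call Tavily")
```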
Args schema:
{
"description": "Input for CrawlTool.",
"properties": {
"url": {
"description": "The root URL to begin the crawl (e.g., 'https://docs.tavily.com')",
"title": "Url",
"type": "string"
},
"instructions": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Natural language instructions for the crawler (e.g., 'Find all pages on the Python SDK')",
"title": "Instructions"
},
"max_depth": {
"default": 1,
"description": "Max depth of the crawl. Defines how far from the base URL the crawler can explore",
"maximum": 5,
"minimum": 1,
"title": "Max Depth",
"type": "integer"
},
"max_breadth": {
"default": 20,
"description": "Max number of links to follow per level of the tree (i.e., per page)",
"maximum": 100,
"minimum": 1,
"title": "Max Breadth",
"type": "integer"
},
"limit": {
"default": 50,
"description": "Total number of links the crawler will process before stopping",
"maximum": 500,
"minimum": 1,
"title": "Limit",
"type": "integer"
},
"select_paths": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Regex patterns to select only URLs with specific path patterns (e.g., ['/docs/.*', '/api/v1.*'])",
"title": "Select Paths"
},
"select_domains": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Regex patterns to select crawling to specific domains or subdomains (e.g., ['^docs\\.example\\.com$'])",
"title": "Select Domains"
},
"exclude_paths": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Regex patterns to exclude URLs with specific path patterns (e.g., ['/private/.*', '/admin/.*'])",
"title": "Exclude Paths"
},
"exclude_domains": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Regex patterns to exclude specific domains or subdomains from crawling (e.g., ['^private\\.example\\.com$'])",
"title": "Exclude Domains"
},
"allow_external": {
"default": false,
"description": "Whether to allow following links that go to external domains",
"title": "Allow External",
"type": "boolean"
}
},
"required": [
"url"
],
"title": "CrawlToolSchema",
"type": "object"
}
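As an illustration, the hypothetical argument payload below (not taken from the SDK) validates against this schema and shows how the regex path and domain filters are expressed.

```python
# Hypothetical argument payload that satisfies CrawlToolSchema above.
crawl_args = {
    "url": "https://docs.tavily.com",
    "instructions": "Find all pages on the Python SDK",
    "max_depth": 2,                       # explore up to two levels from the root URL
    "max_breadth": 20,                    # follow at most 20 links per page
    "limit": 100,                         # stop after processing 100 links in total
    "select_paths": [r"/docs/.*"],        # only follow documentation paths
    "exclude_domains": [r"^private\.example\.com$"],  # skip this subdomain
    "allow_external": False,              # stay on the starting site
}
```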
Output schema:
('str', 'str: crawled content and discovered pages')
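Putting it together, a run might look like the sketch below. The `Config.from_default()` constructor and the `plan_run.outputs.final_output` accessor are assumptions based on common Portia SDK patterns, so confirm them against the SDK repo for your version.

```python
# End-to-end sketch: ask Portia to crawl a site using the open source tools.
# Config.from_default() and plan_run.outputs.final_output are assumptions
# based on common Portia SDK usage; check the SDK repo for your version.
from portia import Config, Portia
from portia.open_source_tools.registry import open_source_tool_registry

portia = Portia(config=Config.from_default(), tools=open_source_tool_registry)

plan_run = portia.run(
    "Crawl https://docs.tavily.com and find all pages about the Python SDK"
)
print(plan_run.outputs.final_output)
```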