Extract Tool
Usage
Portia offers both open source tools and a cloud-hosted library of tools to save you development time. You can dig into the specs of those tools in our open source repo (SDK repo ↗).
You can import our open source tools into your project using from portia.open_source_tools.registry import open_source_tool_registry and load them into an InMemoryToolRegistry object. You can also combine their use with cloud or custom tools as explained in the docs (Add custom tools ↗).
Tool details
Tool ID: extract_tool
Tool description: Extracts web page content from one or more specified URLs using Tavily Extract and returns the raw content, images, and metadata from those pages. The extract tool can access publicly available web pages but cannot extract content from pages that block automated access.
Usage notes:
This tool uses the Tavily API. You can sign up to obtain a Tavily API key (↗) and set it in the environment variable TAVILY_API_KEY.
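Because the tool reads its key from TAVILY_API_KEY, it can help to check for the variable before a run starts so that a missing key fails early with a clear message. The sketch below is one way to do that using only the Python standard library. The helper name require_tavily_api_key and its env parameter are hypothetical and are not part of the Portia SDK or the Tavily client.

```python
import os


def require_tavily_api_key(env=None):
    """Return the Tavily API key, or raise if it is missing or blank.

    `env` is a hypothetical parameter for testing; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = env.get("TAVILY_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TAVILY_API_KEY is not set. Export it before using extract_tool."
        )
    return key
```

Calling require_tavily_api_key() once at startup is usually enough; the extract tool itself still reads the variable from the environment.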
Args schema:
{
  "description": "Input for ExtractTool.",
  "properties": {
    "urls": {
      "description": "List of URLs to extract content from",
      "items": {
        "type": "string"
      },
      "title": "Urls",
      "type": "array"
    },
    "include_images": {
      "default": false,
      "description": "Whether to include images in the extraction",
      "title": "Include Images",
      "type": "boolean"
    },
    "include_favicon": {
      "default": false,
      "description": "Whether to include favicon in the extraction",
      "title": "Include Favicon",
      "type": "boolean"
    },
    "extract_depth": {
      "default": "basic",
      "description": "The depth of the extraction process. Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency. Basic extraction costs 1 credit per 5 successful URL extractions, while advanced extraction costs 2 credits per 5 successful URL extractions.",
      "title": "Extract Depth",
      "type": "string"
    },
    "format": {
      "default": "markdown",
      "description": "Output format: 'markdown' or 'text'",
      "title": "Format",
      "type": "string"
    }
  },
  "required": [
    "urls"
  ],
  "title": "ExtractToolSchema",
  "type": "object"
}
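To make the schema concrete, here is a minimal sketch that builds an arguments dictionary with the defaults listed above and estimates credit usage from the rates quoted in the extract_depth description. Both helpers, build_extract_args and estimate_credits, are hypothetical illustrations and are not part of the Portia SDK. The value "advanced" for extract_depth is inferred from the description text, and since the schema does not say how partial batches of 5 are billed, the estimate is simply proportional.

```python
# Defaults copied from the args schema; only "urls" is required.
DEFAULTS = {
    "include_images": False,
    "include_favicon": False,
    "extract_depth": "basic",
    "format": "markdown",
}

# Credits per 5 successful URL extractions, from the extract_depth description.
CREDITS_PER_FIVE = {"basic": 1, "advanced": 2}


def build_extract_args(urls, **overrides):
    """Return a dict matching ExtractToolSchema, filling in schema defaults."""
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise TypeError("urls must be a list of strings")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    args = {"urls": list(urls), **DEFAULTS, **overrides}
    # Check the basic types declared in the schema.
    for name in ("include_images", "include_favicon"):
        if not isinstance(args[name], bool):
            raise TypeError(f"{name} must be a boolean")
    for name in ("extract_depth", "format"):
        if not isinstance(args[name], str):
            raise TypeError(f"{name} must be a string")
    return args


def estimate_credits(successful_urls, extract_depth="basic"):
    """Proportional credit estimate; actual billing granularity is an assumption."""
    return successful_urls * CREDITS_PER_FIVE[extract_depth] / 5
```

For example, build_extract_args(["https://example.com"], extract_depth="advanced") keeps the other three options at their defaults, and estimate_credits(10, "advanced") gives the proportional cost of ten successful advanced extractions.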
Output schema:
('str', 'str: extracted content from URLs')