
Add CRW web scraping tools #5108

Closed
us wants to merge 2 commits into crewAIInc:main from us:feat/add-crw-tools

Conversation


us commented Mar 26, 2026

Summary

  • Add CRW integration with three new tools:
    • CrwScrapeWebsiteTool – scrape a single page and return clean markdown
    • CrwCrawlWebsiteTool – async BFS crawl across multiple pages with polling
    • CrwMapWebsiteTool – discover all URLs on a site via sitemap + link traversal
  • CRW is an open-source web scraper built for AI agents. It can run self-hosted (free) or via the managed cloud at fastcrw.com.
  • Configuration via CRW_API_URL and CRW_API_KEY environment variables
  • Follows the same pattern as existing Firecrawl tools
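Resolving that configuration might look roughly like the sketch below. The function name, the localhost default URL, and the optional-key fallback are assumptions for illustration, not taken from the PR's implementation:

```python
import os

def resolve_crw_config() -> tuple:
    """Resolve the CRW endpoint and optional API key from the environment.

    CRW_API_URL points at either a self-hosted instance or the managed
    cloud; CRW_API_KEY may be absent for self-hosted deployments. The
    localhost default below is illustrative only.
    """
    api_url = os.environ.get("CRW_API_URL", "http://localhost:3002").rstrip("/")
    api_key = os.environ.get("CRW_API_KEY")  # None is fine when self-hosted
    return api_url, api_key
```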

Test plan

  • Verify CrwScrapeWebsiteTool scrapes a page and returns markdown
  • Verify CrwCrawlWebsiteTool starts a crawl job and polls until completion
  • Verify CrwMapWebsiteTool returns discovered URLs
  • Confirm tools work with both self-hosted and cloud (fastcrw.com) endpoints

Copilot AI review requested due to automatic review settings March 26, 2026 12:25

Copilot AI left a comment


Pull request overview

Adds a new CRW (crw-project) integration to CrewAI Tools, providing web scraping, site mapping, and crawl-with-polling capabilities similar to the existing Firecrawl tools pattern.

Changes:

  • Introduce three new tools: CrwScrapeWebsiteTool, CrwMapWebsiteTool, and CrwCrawlWebsiteTool.
  • Export the new tools through crewai_tools.tools and top-level crewai_tools package init files.
  • Add a crw_tool package initializer to expose the new tool classes.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:

  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_scrape_tool.py — New CRW “scrape single page” tool calling /v1/scrape and returning markdown (with fallbacks).
  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_map_tool.py — New CRW “map website” tool calling /v1/map and returning discovered links.
  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_crawl_tool.py — New CRW “crawl website” tool starting /v1/crawl then polling /v1/crawl/{id} until completion.
  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/__init__.py — Exposes the CRW tool classes via __all__.
  • lib/crewai-tools/src/crewai_tools/tools/__init__.py — Re-exports CRW tools from the main tools module for discovery/spec generation.
  • lib/crewai-tools/src/crewai_tools/__init__.py — Re-exports CRW tools from the package top-level API.


```python
                return "\n\n---\n\n".join(combined) if combined else "No content found."

            if status_data["status"] == "failed":
                raise RuntimeError("CRW crawl job failed")
```

Copilot AI Mar 26, 2026


When the crawl job reports failed, the raised error drops any details returned by the CRW API, which makes debugging harder. Consider including any available error fields from status_data (or the full response body) in the exception message.

Suggested change

```diff
-raise RuntimeError("CRW crawl job failed")
+error_detail = (
+    status_data.get("error")
+    or status_data.get("message")
+    or status_data.get("reason")
+    or str(status_data)
+)
+raise RuntimeError(f"CRW crawl job failed: {error_detail}")
```

Comment on lines +15 to +20
```python
class CrwCrawlWebsiteTool(BaseTool):
    """Crawl websites using CRW and return content from multiple pages.

    CRW performs async BFS crawling with rate limiting, robots.txt respect,
    and sitemap support. Runs self-hosted (free) or via fastcrw.com cloud.
```

Copilot AI Mar 26, 2026


No tests are added for this new tool. Please add unit tests that mock requests.post/requests.get to cover: start-job success, polling until completion, handling of failed, and timeout behavior (without actually sleeping—e.g., patch time.sleep).
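Such a test could be shaped like the sketch below. `poll_crawl` is a stand-in for the tool's polling loop (the real tool calls requests.get against /v1/crawl/{id}); injecting `sleep` is one way to avoid real waiting, as the comment suggests. Names and the status values are assumptions:

```python
import time

def poll_crawl(get_status, job_id, poll_interval=2, max_wait=300, sleep=time.sleep):
    """Poll a crawl job's status until it completes, fails, or times out.

    Illustrative shape only: `get_status` stands in for the HTTP status
    check; injecting `sleep` lets tests run without real delays.
    """
    waited = 0
    while waited < max_wait:
        status = get_status(job_id)
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"CRW crawl job failed: {status.get('error')}")
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError("CRW crawl job did not finish in time")

# Test runs instantly because sleep is a no-op.
statuses = iter([{"status": "scraping"}, {"status": "completed", "data": []}])
done = poll_crawl(lambda _id: next(statuses), "job-1", sleep=lambda _s: None)
assert done["status"] == "completed"
```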

Comment on lines +25 to +36
```python
    Configuration options:
        formats (list[str]): Output formats. Default: ["markdown"]
            Options: "markdown", "html", "rawHtml", "plainText", "links", "json"
        only_main_content (bool): Strip nav/footer/sidebar. Default: True
        render_js (bool|None): None=auto, True=force JS, False=HTTP only. Default: None
        wait_for (int): ms to wait after JS rendering. Default: None
        include_tags (list[str]): CSS selectors to include. Default: []
        exclude_tags (list[str]): CSS selectors to exclude. Default: []
        css_selector (str): Extract specific CSS selector. Default: None
        xpath (str): Extract specific XPath. Default: None
        headers (dict): Custom HTTP headers. Default: {}
        json_schema (dict): JSON Schema for LLM extraction. Default: None
```

Copilot AI Mar 26, 2026


The docstring describes config options using snake_case keys (e.g., only_main_content, render_js, wait_for), but the tool’s default config and payload use CRW’s camelCase keys (e.g., onlyMainContent). This mismatch makes it unclear what users should pass in config; either update the documentation to match the actual request keys or add a translation layer to accept the documented snake_case keys and convert them before calling CRW.
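The translation layer could be as small as the sketch below. The helper names are hypothetical, and the assumption is that every documented key maps to camelCase purely mechanically (underscore-free keys like "formats" pass through unchanged):

```python
def to_camel(key: str) -> str:
    """Convert a snake_case key to camelCase (e.g. only_main_content -> onlyMainContent)."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)

def normalize_config(config: dict) -> dict:
    """Accept the documented snake_case config keys and emit CRW's camelCase keys."""
    return {to_camel(key): value for key, value in config.items()}

print(normalize_config({"only_main_content": True, "max_depth": 2, "formats": ["markdown"]}))
# {'onlyMainContent': True, 'maxDepth': 2, 'formats': ['markdown']}
```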

Comment on lines +14 to +18
```python
class CrwScrapeWebsiteTool(BaseTool):
    """Scrape webpages using CRW and return clean markdown content.

    CRW is an open-source web scraper built for AI agents. It can run
    self-hosted (free) or via the managed cloud at fastcrw.com.
```

Copilot AI Mar 26, 2026


No tests are added for this new tool. The repo has tool-level tests (including for Firecrawl tools); please add unit tests (e.g., mock requests.post response) to cover success/failure handling and the format fallback behavior (markdown/plainText/html/json).
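A test for the fallback behavior could be built around a small helper like this. The shape is hypothetical: the field names and the fallback order mirror the formats named in this comment, not the PR's actual extraction code:

```python
def extract_content(data: dict, formats=("markdown", "plainText", "html", "json")) -> str:
    """Return the first non-empty format from a CRW scrape response payload.

    The fallback order follows the formats named in the review comment;
    the response shape is an assumption for illustration.
    """
    for fmt in formats:
        value = data.get(fmt)
        if value:
            return value if isinstance(value, str) else str(value)
    return "No content found."

assert extract_content({"markdown": "# Title"}) == "# Title"
assert extract_content({"plainText": "Title"}) == "Title"
assert extract_content({}) == "No content found."
```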

Comment on lines +26 to +27
```python
        max_depth (int): Maximum discovery depth. Default: 2
        use_sitemap (bool): Also read sitemap.xml. Default: True
```

Copilot AI Mar 26, 2026


The docstring lists map configuration options as snake_case (max_depth, use_sitemap), but the tool sends camelCase keys (maxDepth, useSitemap). Please align the docstring with the actual payload keys or support the documented snake_case keys by converting them before the request.

Suggested change

```diff
-max_depth (int): Maximum discovery depth. Default: 2
-use_sitemap (bool): Also read sitemap.xml. Default: True
+maxDepth (int): Maximum discovery depth. Default: 2
+useSitemap (bool): Also read sitemap.xml. Default: True
```

Comment on lines +14 to +18
```python
class CrwMapWebsiteTool(BaseTool):
    """Discover all URLs on a website using CRW's map endpoint.

    Useful for understanding site structure before targeted scraping.
    Uses sitemap.xml and link discovery to find all pages.
```

Copilot AI Mar 26, 2026


No tests are added for this new tool. Please add unit tests (mocking requests.post) to validate the request shape (URL + config), success parsing, and the empty-links branch that returns "No links discovered.".

Comment on lines +29 to +32
```python
        max_depth (int): Maximum link-follow depth. Default: 2
        max_pages (int): Maximum pages to scrape. Default: 10
        formats (list[str]): Output formats per page. Default: ["markdown"]
        only_main_content (bool): Strip boilerplate. Default: True
```

Copilot AI Mar 26, 2026


The docstring describes crawl config options in snake_case (max_depth, max_pages, only_main_content), but the tool’s config and payload use camelCase (maxDepth, maxPages, onlyMainContent). Please align docs with the actual request keys or accept the documented snake_case keys and convert them before sending.

Suggested change

```diff
-max_depth (int): Maximum link-follow depth. Default: 2
-max_pages (int): Maximum pages to scrape. Default: 10
-formats (list[str]): Output formats per page. Default: ["markdown"]
-only_main_content (bool): Strip boilerplate. Default: True
+maxDepth (int): Maximum link-follow depth. Default: 2
+maxPages (int): Maximum pages to scrape. Default: 10
+formats (list[str]): Output formats per page. Default: ["markdown"]
+onlyMainContent (bool): Strip boilerplate. Default: True
```

Comment on lines +46 to +47
```python
    poll_interval: int = 2
    max_wait: int = 300
```

Copilot AI Mar 26, 2026


poll_interval and max_wait are user-configurable but aren’t validated. If poll_interval is set to 0 (or negative), this loop can become a tight request loop; if max_wait is <= 0 it will always time out immediately. Consider enforcing positive values via Field(gt=0) (and possibly le/reasonable defaults) to prevent accidental misconfiguration.

Suggested change

```diff
-poll_interval: int = 2
-max_wait: int = 300
+poll_interval: int = Field(default=2, gt=0)
+max_wait: int = Field(default=300, gt=0)
```

- Include error details from API response when crawl job fails
- Fix docstrings to use camelCase keys matching actual config
- Add Field(gt=0) validation for poll_interval and max_wait
- Add unit tests for scrape, map, and crawl tools
@greysonlalonde
Contributor

Author

us commented Mar 27, 2026

Thanks for the pointer! Published as a standalone package per the custom tools guide:

Three tools: CrwScrapeWebsiteTool, CrwCrawlWebsiteTool, CrwMapWebsiteTool

Works with both self-hosted CRW and the managed cloud at fastcrw.com.
