Conversation
Pull request overview
Adds a new CRW (crw-project) integration to CrewAI Tools, providing web scraping, site mapping, and crawl-with-polling capabilities similar to the existing Firecrawl tools pattern.
Changes:
- Introduce three new tools: `CrwScrapeWebsiteTool`, `CrwMapWebsiteTool`, and `CrwCrawlWebsiteTool`.
- Export the new tools through `crewai_tools.tools` and the top-level `crewai_tools` package init files.
- Add a `crw_tool` package initializer to expose the new tool classes.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_scrape_tool.py | New CRW “scrape single page” tool calling /v1/scrape and returning markdown (with fallbacks). |
| lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_map_tool.py | New CRW “map website” tool calling /v1/map and returning discovered links. |
| lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_crawl_tool.py | New CRW “crawl website” tool starting /v1/crawl then polling /v1/crawl/{id} until completion. |
| lib/crewai-tools/src/crewai_tools/tools/crw_tool/__init__.py | Exposes the CRW tool classes via __all__. |
| lib/crewai-tools/src/crewai_tools/tools/__init__.py | Re-exports CRW tools from the main tools module for discovery/spec generation. |
| lib/crewai-tools/src/crewai_tools/__init__.py | Re-exports CRW tools from the package top-level API. |
```python
return "\n\n---\n\n".join(combined) if combined else "No content found."

if status_data["status"] == "failed":
    raise RuntimeError("CRW crawl job failed")
```
When the crawl job reports failed, the raised error drops any details returned by the CRW API, which makes debugging harder. Consider including any available error fields from status_data (or the full response body) in the exception message.
Suggested change:
```diff
-raise RuntimeError("CRW crawl job failed")
+error_detail = (
+    status_data.get("error")
+    or status_data.get("message")
+    or status_data.get("reason")
+    or str(status_data)
+)
+raise RuntimeError(f"CRW crawl job failed: {error_detail}")
```
```python
class CrwCrawlWebsiteTool(BaseTool):
    """Crawl websites using CRW and return content from multiple pages.

    CRW performs async BFS crawling with rate limiting, robots.txt respect,
    and sitemap support. Runs self-hosted (free) or via fastcrw.com cloud.
```
No tests are added for this new tool. Please add unit tests that mock requests.post/requests.get to cover: start-job success, polling until completion, handling of failed, and timeout behavior (without actually sleeping—e.g., patch time.sleep).
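One way to structure such a test without real waiting, sketched against a hypothetical stand-in for the tool's polling loop (`poll_crawl` and its signature are illustrative; the actual method names in `crw_crawl_tool.py` may differ):

```python
import time
from unittest import mock

def poll_crawl(get, job_url, poll_interval=2, max_wait=300):
    """Hypothetical stand-in for the crawl tool's polling loop."""
    waited = 0
    while waited < max_wait:
        status = get(job_url).json()
        if status["status"] == "completed":
            return status["data"]
        if status["status"] == "failed":
            raise RuntimeError(f"CRW crawl job failed: {status.get('error')}")
        time.sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError("CRW crawl did not finish within max_wait")

def test_polls_until_completion():
    # Two poll responses: still scraping, then completed.
    responses = [
        mock.Mock(json=mock.Mock(return_value={"status": "scraping"})),
        mock.Mock(json=mock.Mock(return_value={"status": "completed", "data": ["page"]})),
    ]
    fake_get = mock.Mock(side_effect=responses)
    with mock.patch("time.sleep") as fake_sleep:  # never actually sleep
        result = poll_crawl(fake_get, "http://crw.example/v1/crawl/abc")
    assert result == ["page"]
    assert fake_sleep.call_count == 1
```

Patching `time.sleep` keeps the polling test instant, and `side_effect` sequences the status transitions without a real server.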
```python
    Configuration options:
        formats (list[str]): Output formats. Default: ["markdown"]
            Options: "markdown", "html", "rawHtml", "plainText", "links", "json"
        only_main_content (bool): Strip nav/footer/sidebar. Default: True
        render_js (bool|None): None=auto, True=force JS, False=HTTP only. Default: None
        wait_for (int): ms to wait after JS rendering. Default: None
        include_tags (list[str]): CSS selectors to include. Default: []
        exclude_tags (list[str]): CSS selectors to exclude. Default: []
        css_selector (str): Extract specific CSS selector. Default: None
        xpath (str): Extract specific XPath. Default: None
        headers (dict): Custom HTTP headers. Default: {}
        json_schema (dict): JSON Schema for LLM extraction. Default: None
```
The docstring describes config options using snake_case keys (e.g., only_main_content, render_js, wait_for), but the tool’s default config and payload use CRW’s camelCase keys (e.g., onlyMainContent). This mismatch makes it unclear what users should pass in config; either update the documentation to match the actual request keys or add a translation layer to accept the documented snake_case keys and convert them before calling CRW.
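If the translation-layer route is chosen, a minimal sketch (`normalize_config` and `snake_to_camel` are hypothetical helper names, and the actual CRW key spellings should be verified against the API):

```python
def snake_to_camel(key: str) -> str:
    # only_main_content -> onlyMainContent; keys without underscores pass through
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)

def normalize_config(config: dict) -> dict:
    """Accept the documented snake_case keys and emit camelCase request keys."""
    return {snake_to_camel(k): v for k, v in config.items()}
```

Applied just before building the request payload, this lets the docstring keep the snake_case spellings while the wire format stays camelCase.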
```python
class CrwScrapeWebsiteTool(BaseTool):
    """Scrape webpages using CRW and return clean markdown content.

    CRW is an open-source web scraper built for AI agents. It can run
    self-hosted (free) or via the managed cloud at fastcrw.com.
```
No tests are added for this new tool. The repo has tool-level tests (including for Firecrawl tools); please add unit tests (e.g., mock requests.post response) to cover success/failure handling and the format fallback behavior (markdown/plainText/html/json).
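A possible shape for such a test, using a hypothetical `extract_content`/`scrape` pair in place of the tool's real internals (the names, injected `post` callable, and fallback order are assumptions based on this review comment, not the actual implementation):

```python
from unittest import mock

def extract_content(data: dict) -> str:
    """Hypothetical fallback across the formats the review mentions."""
    for key in ("markdown", "plainText", "html", "json"):
        if data.get(key):
            value = data[key]
            return value if isinstance(value, str) else str(value)
    return "No content found."

def scrape(post, url):
    """Hypothetical request wrapper; `post` is injected so tests can mock it."""
    resp = post("http://crw.example/v1/scrape", json={"url": url})
    return extract_content(resp.json().get("data", {}))

def test_scrape_success_and_fallback():
    resp = mock.Mock()
    resp.json.return_value = {"data": {"plainText": "hello"}}  # no markdown key
    fake_post = mock.Mock(return_value=resp)
    assert scrape(fake_post, "https://example.com") == "hello"
    fake_post.assert_called_once()
    # Fallback order prefers markdown when present
    assert extract_content({"markdown": "# hi", "plainText": "hi"}) == "# hi"
    assert extract_content({}) == "No content found."
```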
```python
        max_depth (int): Maximum discovery depth. Default: 2
        use_sitemap (bool): Also read sitemap.xml. Default: True
```
The docstring lists map configuration options as snake_case (max_depth, use_sitemap), but the tool sends camelCase keys (maxDepth, useSitemap). Please align the docstring with the actual payload keys or support the documented snake_case keys by converting them before the request.
Suggested change:
```diff
-    max_depth (int): Maximum discovery depth. Default: 2
-    use_sitemap (bool): Also read sitemap.xml. Default: True
+    maxDepth (int): Maximum discovery depth. Default: 2
+    useSitemap (bool): Also read sitemap.xml. Default: True
```
```python
class CrwMapWebsiteTool(BaseTool):
    """Discover all URLs on a website using CRW's map endpoint.

    Useful for understanding site structure before targeted scraping.
    Uses sitemap.xml and link discovery to find all pages.
```
No tests are added for this new tool. Please add unit tests (mocking requests.post) to validate the request shape (URL + config), success parsing, and the empty-links branch that returns "No links discovered.".
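A sketch of the requested assertions, against a hypothetical `map_site` stand-in (the real tool's request building and response parsing may differ; the endpoint URL here is a placeholder):

```python
from unittest import mock

def map_site(post, url, config):
    """Hypothetical stand-in for the map tool's request/response handling."""
    resp = post("http://crw.example/v1/map", json={"url": url, **config})
    links = resp.json().get("links", [])
    return "\n".join(links) if links else "No links discovered."

def test_map_request_shape_and_empty_links():
    resp = mock.Mock()
    resp.json.return_value = {"links": []}
    fake_post = mock.Mock(return_value=resp)
    # Empty-links branch returns the sentinel string
    assert map_site(fake_post, "https://example.com", {"maxDepth": 2}) == "No links discovered."
    # Request shape: URL plus config merged into the JSON body
    fake_post.assert_called_once_with(
        "http://crw.example/v1/map",
        json={"url": "https://example.com", "maxDepth": 2},
    )
```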
```python
        max_depth (int): Maximum link-follow depth. Default: 2
        max_pages (int): Maximum pages to scrape. Default: 10
        formats (list[str]): Output formats per page. Default: ["markdown"]
        only_main_content (bool): Strip boilerplate. Default: True
```
The docstring describes crawl config options in snake_case (max_depth, max_pages, only_main_content), but the tool’s config and payload use camelCase (maxDepth, maxPages, onlyMainContent). Please align docs with the actual request keys or accept the documented snake_case keys and convert them before sending.
Suggested change:
```diff
-    max_depth (int): Maximum link-follow depth. Default: 2
-    max_pages (int): Maximum pages to scrape. Default: 10
-    formats (list[str]): Output formats per page. Default: ["markdown"]
-    only_main_content (bool): Strip boilerplate. Default: True
+    maxDepth (int): Maximum link-follow depth. Default: 2
+    maxPages (int): Maximum pages to scrape. Default: 10
+    formats (list[str]): Output formats per page. Default: ["markdown"]
+    onlyMainContent (bool): Strip boilerplate. Default: True
```
```python
    poll_interval: int = 2
    max_wait: int = 300
```
poll_interval and max_wait are user-configurable but aren’t validated. If poll_interval is set to 0 (or negative), this loop can become a tight request loop; if max_wait is <= 0 it will always time out immediately. Consider enforcing positive values via Field(gt=0) (and possibly le/reasonable defaults) to prevent accidental misconfiguration.
Suggested change:
```diff
-    poll_interval: int = 2
-    max_wait: int = 300
+    poll_interval: int = Field(default=2, gt=0)
+    max_wait: int = Field(default=300, gt=0)
```
- Include error details from API response when crawl job fails
- Fix docstrings to use camelCase keys matching actual config
- Add `Field(gt=0)` validation for `poll_interval` and `max_wait`
- Add unit tests for scrape, map, and crawl tools
Thanks for the pointer! Published as a standalone package per the custom tools guide.

Three tools: `CrwScrapeWebsiteTool`, `CrwCrawlWebsiteTool`, and `CrwMapWebsiteTool`. Works with both self-hosted CRW and the managed cloud at fastcrw.com.
Summary
- `CRW_API_URL` and `CRW_API_KEY` environment variables

Test plan
- `CrwScrapeWebsiteTool` scrapes a page and returns markdown
- `CrwCrawlWebsiteTool` starts a crawl job and polls until completion
- `CrwMapWebsiteTool` returns discovered URLs