
Add CRW web scraping tools #5108

Closed
us wants to merge 2 commits into crewAIInc:main from us:feat/add-crw-tools

Conversation


us commented Mar 26, 2026

Summary

  • Add CRW integration with three new tools:
    • CrwScrapeWebsiteTool – scrape a single page and return clean markdown
    • CrwCrawlWebsiteTool – async BFS crawl across multiple pages with polling
    • CrwMapWebsiteTool – discover all URLs on a site via sitemap + link traversal
  • CRW is an open-source web scraper built for AI agents. It can run self-hosted (free) or via the managed cloud at fastcrw.com.
  • Configuration via CRW_API_URL and CRW_API_KEY environment variables
  • Follows the same pattern as existing Firecrawl tools
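Resolving that configuration might look roughly like the sketch below. The function name, the localhost default URL, and the optional-key fallback are assumptions for illustration, not taken from the PR's implementation:

```python
import os

def resolve_crw_config() -> tuple:
    """Resolve the CRW endpoint and optional API key from the environment.

    CRW_API_URL points at either a self-hosted instance or the managed
    cloud; CRW_API_KEY may be absent for self-hosted deployments. The
    localhost default below is illustrative only.
    """
    api_url = os.environ.get("CRW_API_URL", "http://localhost:3002").rstrip("/")
    api_key = os.environ.get("CRW_API_KEY")  # None is fine when self-hosted
    return api_url, api_key
```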

Test plan

  • Verify CrwScrapeWebsiteTool scrapes a page and returns markdown
  • Verify CrwCrawlWebsiteTool starts a crawl job and polls until completion
  • Verify CrwMapWebsiteTool returns discovered URLs
  • Confirm tools work with both self-hosted and cloud (fastcrw.com) endpoints

Copilot AI review requested due to automatic review settings March 26, 2026 12:25

Copilot AI left a comment


Pull request overview

Adds a new CRW (crw-project) integration to CrewAI Tools, providing web scraping, site mapping, and crawl-with-polling capabilities similar to the existing Firecrawl tools pattern.

Changes:

  • Introduce three new tools: CrwScrapeWebsiteTool, CrwMapWebsiteTool, and CrwCrawlWebsiteTool.
  • Export the new tools through crewai_tools.tools and top-level crewai_tools package init files.
  • Add a crw_tool package initializer to expose the new tool classes.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:

  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_scrape_tool.py — New CRW “scrape single page” tool calling /v1/scrape and returning markdown (with fallbacks).
  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_map_tool.py — New CRW “map website” tool calling /v1/map and returning discovered links.
  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/crw_crawl_tool.py — New CRW “crawl website” tool starting /v1/crawl then polling /v1/crawl/{id} until completion.
  • lib/crewai-tools/src/crewai_tools/tools/crw_tool/__init__.py — Exposes the CRW tool classes via __all__.
  • lib/crewai-tools/src/crewai_tools/tools/__init__.py — Re-exports CRW tools from the main tools module for discovery/spec generation.
  • lib/crewai-tools/src/crewai_tools/__init__.py — Re-exports CRW tools from the package top-level API.


```python
                return "\n\n---\n\n".join(combined) if combined else "No content found."

            if status_data["status"] == "failed":
                raise RuntimeError("CRW crawl job failed")
```

Copilot AI Mar 26, 2026


When the crawl job reports failed, the raised error drops any details returned by the CRW API, which makes debugging harder. Consider including any available error fields from status_data (or the full response body) in the exception message.

Suggested change

```diff
-raise RuntimeError("CRW crawl job failed")
+error_detail = (
+    status_data.get("error")
+    or status_data.get("message")
+    or status_data.get("reason")
+    or str(status_data)
+)
+raise RuntimeError(f"CRW crawl job failed: {error_detail}")
```

Comment on lines +15 to +20
```python
class CrwCrawlWebsiteTool(BaseTool):
    """Crawl websites using CRW and return content from multiple pages.

    CRW performs async BFS crawling with rate limiting, robots.txt respect,
    and sitemap support. Runs self-hosted (free) or via fastcrw.com cloud.
```

Copilot AI Mar 26, 2026


No tests are added for this new tool. Please add unit tests that mock requests.post/requests.get to cover: start-job success, polling until completion, handling of failed, and timeout behavior (without actually sleeping—e.g., patch time.sleep).
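Such a test could be shaped like the sketch below. `poll_crawl` is a stand-in for the tool's polling loop (the real tool calls requests.get against /v1/crawl/{id}); injecting `sleep` is one way to avoid real waiting, as the comment suggests. Names and the status values are assumptions:

```python
import time

def poll_crawl(get_status, job_id, poll_interval=2, max_wait=300, sleep=time.sleep):
    """Poll a crawl job's status until it completes, fails, or times out.

    Illustrative shape only: `get_status` stands in for the HTTP status
    check; injecting `sleep` lets tests run without real delays.
    """
    waited = 0
    while waited < max_wait:
        status = get_status(job_id)
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"CRW crawl job failed: {status.get('error')}")
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError("CRW crawl job did not finish in time")

# Test runs instantly because sleep is a no-op.
statuses = iter([{"status": "scraping"}, {"status": "completed", "data": []}])
done = poll_crawl(lambda _id: next(statuses), "job-1", sleep=lambda _s: None)
assert done["status"] == "completed"
```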

Comment on lines +25 to +36
```python
    Configuration options:
        formats (list[str]): Output formats. Default: ["markdown"]
            Options: "markdown", "html", "rawHtml", "plainText", "links", "json"
        only_main_content (bool): Strip nav/footer/sidebar. Default: True
        render_js (bool|None): None=auto, True=force JS, False=HTTP only. Default: None
        wait_for (int): ms to wait after JS rendering. Default: None
        include_tags (list[str]): CSS selectors to include. Default: []
        exclude_tags (list[str]): CSS selectors to exclude. Default: []
        css_selector (str): Extract specific CSS selector. Default: None
        xpath (str): Extract specific XPath. Default: None
        headers (dict): Custom HTTP headers. Default: {}
        json_schema (dict): JSON Schema for LLM extraction. Default: None
```

Copilot AI Mar 26, 2026


The docstring describes config options using snake_case keys (e.g., only_main_content, render_js, wait_for), but the tool’s default config and payload use CRW’s camelCase keys (e.g., onlyMainContent). This mismatch makes it unclear what users should pass in config; either update the documentation to match the actual request keys or add a translation layer to accept the documented snake_case keys and convert them before calling CRW.
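The translation layer could be as small as the sketch below. The helper names are hypothetical, and the assumption is that every documented key maps to camelCase purely mechanically (underscore-free keys like "formats" pass through unchanged):

```python
def to_camel(key: str) -> str:
    """Convert a snake_case key to camelCase (e.g. only_main_content -> onlyMainContent)."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)

def normalize_config(config: dict) -> dict:
    """Accept the documented snake_case config keys and emit CRW's camelCase keys."""
    return {to_camel(key): value for key, value in config.items()}

print(normalize_config({"only_main_content": True, "max_depth": 2, "formats": ["markdown"]}))
# {'onlyMainContent': True, 'maxDepth': 2, 'formats': ['markdown']}
```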

Comment on lines +14 to +18
```python
class CrwScrapeWebsiteTool(BaseTool):
    """Scrape webpages using CRW and return clean markdown content.

    CRW is an open-source web scraper built for AI agents. It can run
    self-hosted (free) or via the managed cloud at fastcrw.com.
```

Copilot AI Mar 26, 2026


No tests are added for this new tool. The repo has tool-level tests (including for Firecrawl tools); please add unit tests (e.g., mock requests.post response) to cover success/failure handling and the format fallback behavior (markdown/plainText/html/json).
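A test for the fallback behavior could be built around a small helper like this. The shape is hypothetical: the field names and the fallback order mirror the formats named in this comment, not the PR's actual extraction code:

```python
def extract_content(data: dict, formats=("markdown", "plainText", "html", "json")) -> str:
    """Return the first non-empty format from a CRW scrape response payload.

    The fallback order follows the formats named in the review comment;
    the response shape is an assumption for illustration.
    """
    for fmt in formats:
        value = data.get(fmt)
        if value:
            return value if isinstance(value, str) else str(value)
    return "No content found."

assert extract_content({"markdown": "# Title"}) == "# Title"
assert extract_content({"plainText": "Title"}) == "Title"
assert extract_content({}) == "No content found."
```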

Comment on lines +26 to +27
```python
        max_depth (int): Maximum discovery depth. Default: 2
        use_sitemap (bool): Also read sitemap.xml. Default: True
```

Copilot AI Mar 26, 2026


The docstring lists map configuration options as snake_case (max_depth, use_sitemap), but the tool sends camelCase keys (maxDepth, useSitemap). Please align the docstring with the actual payload keys or support the documented snake_case keys by converting them before the request.

Suggested change

```diff
-max_depth (int): Maximum discovery depth. Default: 2
-use_sitemap (bool): Also read sitemap.xml. Default: True
+maxDepth (int): Maximum discovery depth. Default: 2
+useSitemap (bool): Also read sitemap.xml. Default: True
```

Comment on lines +14 to +18
```python
class CrwMapWebsiteTool(BaseTool):
    """Discover all URLs on a website using CRW's map endpoint.

    Useful for understanding site structure before targeted scraping.
    Uses sitemap.xml and link discovery to find all pages.
```

Copilot AI Mar 26, 2026


No tests are added for this new tool. Please add unit tests (mocking requests.post) to validate the request shape (URL + config), success parsing, and the empty-links branch that returns "No links discovered.".

Comment on lines +29 to +32
```python
        max_depth (int): Maximum link-follow depth. Default: 2
        max_pages (int): Maximum pages to scrape. Default: 10
        formats (list[str]): Output formats per page. Default: ["markdown"]
        only_main_content (bool): Strip boilerplate. Default: True
```

Copilot AI Mar 26, 2026


The docstring describes crawl config options in snake_case (max_depth, max_pages, only_main_content), but the tool’s config and payload use camelCase (maxDepth, maxPages, onlyMainContent). Please align docs with the actual request keys or accept the documented snake_case keys and convert them before sending.

Suggested change

```diff
-max_depth (int): Maximum link-follow depth. Default: 2
-max_pages (int): Maximum pages to scrape. Default: 10
-formats (list[str]): Output formats per page. Default: ["markdown"]
-only_main_content (bool): Strip boilerplate. Default: True
+maxDepth (int): Maximum link-follow depth. Default: 2
+maxPages (int): Maximum pages to scrape. Default: 10
+formats (list[str]): Output formats per page. Default: ["markdown"]
+onlyMainContent (bool): Strip boilerplate. Default: True
```

Comment on lines +46 to +47
```python
    poll_interval: int = 2
    max_wait: int = 300
```

Copilot AI Mar 26, 2026


poll_interval and max_wait are user-configurable but aren’t validated. If poll_interval is set to 0 (or negative), this loop can become a tight request loop; if max_wait is <= 0 it will always time out immediately. Consider enforcing positive values via Field(gt=0) (and possibly le/reasonable defaults) to prevent accidental misconfiguration.

Suggested change

```diff
-poll_interval: int = 2
-max_wait: int = 300
+poll_interval: int = Field(default=2, gt=0)
+max_wait: int = Field(default=300, gt=0)
```

- Include error details from API response when crawl job fails
- Fix docstrings to use camelCase keys matching actual config
- Add Field(gt=0) validation for poll_interval and max_wait
- Add unit tests for scrape, map, and crawl tools
@greysonlalonde
Contributor

Author

us commented Mar 27, 2026

Thanks for the pointer! Published as a standalone package per the custom tools guide:

Three tools: CrwScrapeWebsiteTool, CrwCrawlWebsiteTool, CrwMapWebsiteTool

Works with both self-hosted CRW and the managed cloud at fastcrw.com.
