mcp-server-fetch hard-codes use_readability=True which requires Node.js (undeclared); fails silently when Node is unavailable or misconfigured #4199

**Describe the bug**
`mcp-server-fetch` hard-codes `use_readability=True` when calling `readabilipy.simple_json.simple_json_from_html_string` in `extract_content_from_html` (`server.py`, ~line 36). This silently shells out to a bundled Node.js script that imports Mozilla's `@mozilla/readability`. Node.js is not declared as a dependency in `pyproject.toml` and is not mentioned in the README, and the call has no fallback or timeout. If the Node subprocess hangs (e.g., due to an unrelated npm config issue, a slow Node startup, or Node being absent), `fetch_url` blocks indefinitely. The MCP client eventually times out, and the agent receives only `Connection lost: Timed out while waiting for response to ClientRequest`, with no actionable detail.


**To Reproduce**
Steps to reproduce the behavior:
1. Install in a Python-only environment: `pip install mcp-server-fetch` (no Node.js / npm configuration assumed).
2. Spawn the server via stdio and call the fetch tool against any HTML page, e.g. https://en.wikipedia.org/wiki/Main_Page. Minimal harness:

```
import os, asyncio, time
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(
        command="python3",
        args=["-m", "mcp_server_fetch", "--ignore-robots-txt"],
        env=dict(os.environ),
    )
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as s:
            await s.initialize()
            t0 = time.time()
            try:
                res = await asyncio.wait_for(
                    s.call_tool("fetch", {"url": "https://en.wikipedia.org/wiki/Main_Page", "max_length": 2000}),
                    timeout=70,
                )
                print(f"OK in {time.time()-t0:.1f}s; isError={res.isError}")
            except asyncio.TimeoutError:
                print(f"TIMEOUT after {time.time()-t0:.1f}s")
asyncio.run(main())
```

3. Observe: the harness times out at 70s. Trace logging added inside `fetch_url` shows that the HTTP request itself completes in ~0.3s with status 200 and a 224 KB body; the hang is entirely inside `readabilipy.simple_json.simple_json_from_html_string(html, use_readability=True)`.
4. Workaround that confirms the diagnosis: in the installed package's mcp_server_fetch/server.py (resolvable via `python3 -c "import mcp_server_fetch, inspect; print(inspect.getfile(mcp_server_fetch))"`), edit `extract_content_from_html` (~line 36) and change `use_readability=True` to `use_readability=False`:

```
ret = readabilipy.simple_json.simple_json_from_html_string(
    html, use_readability=False  # was True
)
```

The probe then completes in ~0.6–1.2s with markdown content. (`use_readability=False` uses readabilipy's pure-Python lxml/regex path with no Node dependency.)



**Expected behavior**
One of the following:
- `pip install mcp-server-fetch` should not silently introduce a runtime requirement on Node.js. Either declare it explicitly in `pyproject.toml` / README, or default to `use_readability=False` (Python-only).
- If Node.js is required, the absence or misbehavior of Node should fail fast with a clear error rather than blocking on a subprocess with no timeout.
- A user-facing flag (e.g., `--no-readability-js` / `--readability-backend=python`) so operators can opt out without monkey-patching the installed package.
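
To make the last two points concrete, here is a minimal sketch of how an opt-out flag could map onto the `use_readability` argument. The flag name `--readability-backend`, its `auto`/`node`/`python` choices, and the helpers `build_parser` and `resolve_use_readability` are all hypothetical; none of them exist in the current CLI.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser with a hypothetical backend flag (not part of the current CLI)."""
    parser = argparse.ArgumentParser(prog="mcp-server-fetch")
    parser.add_argument(
        "--readability-backend",
        choices=["auto", "node", "python"],
        default="auto",
        help="HTML simplification backend (hypothetical option)",
    )
    return parser


def resolve_use_readability(backend: str, node_path: Optional[str]) -> bool:
    """Translate the backend choice into a value for use_readability.

    node_path is the result of shutil.which("node"), passed in so the
    decision can be checked without depending on the local machine.
    """
    if backend == "python":
        return False
    if backend == "node":
        if node_path is None:
            # Fail fast with an actionable message instead of hanging later.
            raise RuntimeError(
                "--readability-backend=node was requested, "
                "but no 'node' executable was found on PATH"
            )
        return True
    # "auto": use the Node path only when Node is actually present.
    return node_path is not None
```

At startup the server would call something like `resolve_use_readability(args.readability_backend, shutil.which("node"))` once and pass the result through to `extract_content_from_html`. Note that `auto` only detects a missing Node; it does not protect against a Node that is present but wedged.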


**Logs**
Instrumentation used in `mcp_server_fetch/server.py` to produce the trace below. Only the tracing code (the `_trace` helper, its imports, the `t0` timer, and the `_trace(...)` calls) was added; everything else is the upstream code:

```
async def fetch_url(url: str, user_agent: str, force_raw: bool = False) -> Tuple[str, str]:
    from httpx import AsyncClient, HTTPError
    import time, os
    def _trace(msg):
        with open("/tmp/fetch_trace.log", "a") as f:
            f.write(f"{time.time():.3f} {msg}\n")
    _trace(f"[fetch_url] entered url={url} HTTPS_PROXY={os.environ.get('HTTPS_PROXY')!r}")
    t0 = time.time()
    async with AsyncClient() as client:
        _trace(f"[fetch_url] client created at +{time.time()-t0:.1f}s")
        try:
            response = await client.get(
                url,
                follow_redirects=True,
                headers={"User-Agent": user_agent},
                timeout=30,
            )
            _trace(f"[fetch_url] got response status={response.status_code} at +{time.time()-t0:.1f}s")
        except HTTPError as e:
            _trace(f"[fetch_url] HTTPError after +{time.time()-t0:.1f}s: {e!r}")
            raise McpError(ErrorData(code=INTERNAL_ERROR, message=f"Failed to fetch {url}: {e!r}"))
        if response.status_code >= 400:
            raise McpError(ErrorData(
                code=INTERNAL_ERROR,
                message=f"Failed to fetch {url} - status code {response.status_code}",
            ))
        _trace(f"[fetch_url] reading response.text at +{time.time()-t0:.1f}s")
        page_raw = response.text
        _trace(f"[fetch_url] page_raw len={len(page_raw)} at +{time.time()-t0:.1f}s")

    content_type = response.headers.get("content-type", "")
    is_page_html = (
        "<html" in page_raw[:100] or "text/html" in content_type or not content_type
    )
    _trace(f"[fetch_url] is_page_html={is_page_html} ct={content_type!r}")

    if is_page_html and not force_raw:
        _trace(f"[fetch_url] calling extract_content_from_html at +{time.time()-t0:.1f}s")
        extracted = extract_content_from_html(page_raw)
        _trace(f"[fetch_url] extract done len={len(extracted)} at +{time.time()-t0:.1f}s")
        return extracted, ""

    _trace(f"[fetch_url] returning raw at +{time.time()-t0:.1f}s")
    return (
        page_raw,
        f"Content type {content_type} cannot be simplified to markdown, but here is the raw content:\n",
    )
```

Trace output (Node installed in container, npm config emits unrelated warnings on startup):

```
[fetch_url] entered url=https://en.wikipedia.org/wiki/Main_Page HTTPS_PROXY='http://...:3128'
[fetch_url] client created at +0.0s
[fetch_url] got response status=200 at +0.3s
[fetch_url] reading response.text at +0.3s
[fetch_url] page_raw len=223644 at +0.3s
[fetch_url] is_page_html=True ct='text/html; charset=UTF-8'
[fetch_url] calling extract_content_from_html at +0.3s
<no further trace; harness times out at 70s>
```

npm warnings appear on stderr at the start of each call (npm warn Unknown user config "chromedriver_cdnurl", etc.), confirming a Node subprocess is being launched.

After the workaround (use_readability=False), 5 consecutive runs:

```
[fetch_url] calling extract_content_from_html at +0.3-0.9s
[fetch_url] extract done len=12671 at +0.6-1.2s
```

**Additional context**
- Versions: mcp-server-fetch (current PyPI), readabilipy (current PyPI), Python 3.11, Linux x86_64.
- The Node call originates in readabilipy/simple_json.py → readabilipy/javascript/ExtractArticle.js, invoked via `subprocess.check_output` with no timeout.
- The MCP client surfaces the hang as `Connection lost: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.`, which is opaque from the agent's perspective.
- Concrete suggestions:
  a. Default `use_readability=False`, or surface as a CLI flag.
  b. Probe `shutil.which("node")` at startup; warn or fail clearly when `use_readability=True` is requested without Node.
  c. Pass a `timeout=` to the underlying subprocess so a wedged Node is recoverable.
  d. Document the Node requirement in `README.md` and add it to extras (`pip install "mcp-server-fetch[readability]"`).
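
Suggestions (b) and (c) could look roughly like the sketch below. It is stdlib-only and illustrative: `ExtractorError`, `find_node`, and `run_with_deadline` are hypothetical names, and the real subprocess call lives inside readabilipy, so the timeout would have to be applied there or in a wrapper around it.

```python
import shutil
import subprocess
import sys
from typing import Optional, Sequence


class ExtractorError(RuntimeError):
    """Raised when an external extractor cannot be run to completion."""


def find_node() -> Optional[str]:
    """Return the path to the node executable, or None if it is not on PATH."""
    return shutil.which("node")


def run_with_deadline(cmd: Sequence[str], timeout_s: float) -> bytes:
    """Run cmd and return its stdout, converting failures into clear errors.

    subprocess.run kills the child process when the timeout expires, so a
    wedged subprocess cannot block the caller indefinitely.
    """
    try:
        completed = subprocess.run(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
            check=True,
        )
    except FileNotFoundError as exc:
        raise ExtractorError(f"executable not found: {cmd[0]}") from exc
    except subprocess.TimeoutExpired as exc:
        raise ExtractorError(
            f"{cmd[0]} did not finish within {timeout_s}s"
        ) from exc
    except subprocess.CalledProcessError as exc:
        raise ExtractorError(
            f"{cmd[0]} exited with status {exc.returncode}"
        ) from exc
    return completed.stdout
```

A caller that catches `ExtractorError` can then fall back to the pure-Python path or return an MCP error with the message attached, instead of letting the client time out after 60 seconds.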


