mcp-server-fetch hard-codes use_readability=True which requires Node.js (undeclared); fails silently when Node is unavailable or misconfigured #4199

@yxie326

Describe the bug
mcp-server-fetch hard-codes use_readability=True when calling readabilipy.simple_json.simple_json_from_html_string in extract_content_from_html (server.py, ~line 36). This silently shells out to a bundled Node.js script that imports Mozilla's @mozilla/readability. Node.js is not declared as a dependency in pyproject.toml and is not mentioned in the README, and there is no fallback or timeout: if the Node subprocess hangs (e.g., due to an unrelated npm config issue, a slow Node startup, or Node being absent), fetch_url blocks indefinitely. The MCP client eventually times out and the agent receives the error "Connection lost: Timed out while waiting for response to ClientRequest", which carries no actionable detail.

To Reproduce
Steps to reproduce the behavior:

  1. Install in a Python-only environment: pip install mcp-server-fetch (no Node.js / npm configuration assumed).
  2. Spawn the server via stdio and call the fetch tool against any HTML page, e.g. https://en.wikipedia.org/wiki/Main_Page. Minimal harness:
import os, asyncio, time
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(
        command="python3",
        args=["-m", "mcp_server_fetch", "--ignore-robots-txt"],
        env=dict(os.environ),
    )
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as s:
            await s.initialize()
            t0 = time.time()
            try:
                res = await asyncio.wait_for(
                    s.call_tool("fetch", {"url": "https://en.wikipedia.org/wiki/Main_Page", "max_length": 2000}),
                    timeout=70,
                )
                print(f"OK in {time.time()-t0:.1f}s; isError={res.isError}")
            except asyncio.TimeoutError:
                print(f"TIMEOUT after {time.time()-t0:.1f}s")
asyncio.run(main())
  3. Observe: the harness times out at 70s. Probing inside fetch_url via stderr logging shows the HTTP request itself completes in ~0.3s with status 200 and a 224 KB body; the hang is entirely inside readabilipy.simple_json_from_html_string(html, use_readability=True).
  4. Workaround that confirms the diagnosis: in the installed package's mcp_server_fetch/server.py (resolvable via python3 -c "import mcp_server_fetch, inspect; print(inspect.getfile(mcp_server_fetch))"), edit extract_content_from_html (~line 36) and change use_readability=True to use_readability=False:
ret = readabilipy.simple_json.simple_json_from_html_string(
    html, use_readability=False  # was True
)

The probe then completes in ~0.6–1.2s with markdown content. (use_readability=False uses readabilipy's pure-Python lxml/regex path with no Node dependency.)
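
For anyone who would rather not edit the installed file, the same workaround can probably be applied from a small launcher script. The following is only a sketch and has not been run against mcp-server-fetch itself: force_kwargs is a hypothetical helper, and it assumes server.py looks the function up as readabilipy.simple_json.simple_json_from_html_string at call time and passes use_readability by keyword (as in the snippet above). A stand-in object replaces readabilipy so the example runs on its own.

```python
import functools
import types


def force_kwargs(module, func_name, **forced):
    """Replace module.func_name with a wrapper that always passes `forced` kwargs.

    Hypothetical helper for illustration. It only works when the caller passes
    the forced arguments by keyword (or not at all); a positional value for the
    same parameter would raise TypeError.
    """
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        kwargs.update(forced)  # caller-supplied keyword values are overridden
        return original(*args, **kwargs)

    setattr(module, func_name, wrapper)
    return original  # returned so the patch can be undone


# Stand-in for readabilipy.simple_json so the sketch runs without the package.
fake_simple_json = types.SimpleNamespace(
    simple_json_from_html_string=lambda html, use_readability=False: {
        "backend": "node" if use_readability else "python"
    }
)

force_kwargs(fake_simple_json, "simple_json_from_html_string", use_readability=False)

# The caller still asks for Readability.js, but the patch forces the Python path.
result = fake_simple_json.simple_json_from_html_string("<p>hi</p>", use_readability=True)
print(result["backend"])  # prints "python"
```

In a real launcher the stand-in would be replaced by importing readabilipy.simple_json and patching it before handing control to the server's entry point. This is still monkey-patching, so it is a stopgap rather than a fix.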

Expected behavior
One of the following:

  • pip install mcp-server-fetch should not silently introduce a runtime requirement on Node.js. Either declare it explicitly in pyproject.toml / README, or default to use_readability=False (Python-only).
  • If Node.js is required, the absence or misbehavior of Node should fail fast with a clear error rather than blocking on a subprocess with no timeout.
  • A user-facing flag (e.g., --no-readability-js / --readability-backend=python) so operators can opt out without monkey-patching the installed package.
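
To make the third option concrete, here is a rough sketch of how such a flag might be parsed. The flag name --readability-backend comes from the suggestion above and does not exist in mcp-server-fetch today; build_parser and use_readability_from_args are hypothetical names, and defaulting to "python" is one possible choice, not the project's current behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mcp-server-fetch")
    # Hypothetical flag: not an existing mcp-server-fetch option.
    parser.add_argument(
        "--readability-backend",
        choices=["node", "python"],
        default="python",  # assumed default: stay Python-only unless asked
        help="'node' uses Readability.js via Node.js; 'python' stays Python-only",
    )
    return parser


def use_readability_from_args(argv: list[str]) -> bool:
    """Map the flag to the boolean that would be passed as use_readability."""
    args = build_parser().parse_args(argv)
    return args.readability_backend == "node"


print(use_readability_from_args([]))                              # False
print(use_readability_from_args(["--readability-backend=node"]))  # True
```

The resulting boolean would then be threaded through to extract_content_from_html instead of the hard-coded True.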

Logs
Instrumentation used in mcp_server_fetch/server.py to produce the trace below — only _trace(...) calls were added; everything else is the upstream code:

async def fetch_url(url: str, user_agent: str, force_raw: bool = False) -> Tuple[str, str]:
    from httpx import AsyncClient, HTTPError
    import time, os
    def _trace(msg):
        with open("/tmp/fetch_trace.log", "a") as f:
            f.write(f"{time.time():.3f} {msg}\n")
    _trace(f"[fetch_url] entered url={url} HTTPS_PROXY={os.environ.get('HTTPS_PROXY')!r}")
    t0 = time.time()
    async with AsyncClient() as client:
        _trace(f"[fetch_url] client created at +{time.time()-t0:.1f}s")
        try:
            response = await client.get(
                url,
                follow_redirects=True,
                headers={"User-Agent": user_agent},
                timeout=30,
            )
            _trace(f"[fetch_url] got response status={response.status_code} at +{time.time()-t0:.1f}s")
        except HTTPError as e:
            _trace(f"[fetch_url] HTTPError after +{time.time()-t0:.1f}s: {e!r}")
            raise McpError(ErrorData(code=INTERNAL_ERROR, message=f"Failed to fetch {url}: {e!r}"))
        if response.status_code >= 400:
            raise McpError(ErrorData(
                code=INTERNAL_ERROR,
                message=f"Failed to fetch {url} - status code {response.status_code}",
            ))
        _trace(f"[fetch_url] reading response.text at +{time.time()-t0:.1f}s")
        page_raw = response.text
        _trace(f"[fetch_url] page_raw len={len(page_raw)} at +{time.time()-t0:.1f}s")

    content_type = response.headers.get("content-type", "")
    is_page_html = (
        "<html" in page_raw[:100] or "text/html" in content_type or not content_type
    )
    _trace(f"[fetch_url] is_page_html={is_page_html} ct={content_type!r}")

    if is_page_html and not force_raw:
        _trace(f"[fetch_url] calling extract_content_from_html at +{time.time()-t0:.1f}s")
        extracted = extract_content_from_html(page_raw)
        _trace(f"[fetch_url] extract done len={len(extracted)} at +{time.time()-t0:.1f}s")
        return extracted, ""

    _trace(f"[fetch_url] returning raw at +{time.time()-t0:.1f}s")
    return (
        page_raw,
        f"Content type {content_type} cannot be simplified to markdown, but here is the raw content:\n",
    )

Trace output (Node installed in container, npm config emits unrelated warnings on startup):

[fetch_url] entered url=https://en.wikipedia.org/wiki/Main_Page HTTPS_PROXY='http://...:3128'
[fetch_url] client created at +0.0s
[fetch_url] got response status=200 at +0.3s
[fetch_url] reading response.text at +0.3s
[fetch_url] page_raw len=223644 at +0.3s
[fetch_url] is_page_html=True ct='text/html; charset=UTF-8'
[fetch_url] calling extract_content_from_html at +0.3s
<no further trace; harness times out at 70s>

npm warnings appear on stderr at the start of each call (npm warn Unknown user config "chromedriver_cdnurl", etc.), confirming a Node subprocess is being launched.

After the workaround (use_readability=False), 5 consecutive runs:

[fetch_url] calling extract_content_from_html at +0.3-0.9s
[fetch_url] extract done len=12671 at +0.6-1.2s

Additional context

  • Versions: mcp-server-fetch (current PyPI), readabilipy (current PyPI), Python 3.11, Linux x86_64.
  • The Node call originates in readabilipy/simple_json.py → readabilipy/javascript/ExtractArticle.js, invoked via subprocess.check_output with no timeout.
  • The MCP client reports the hang as "Connection lost: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.", which is opaque from the agent's perspective.
  • Concrete suggestions:
    a. Default use_readability=False, or surface as a CLI flag.
    b. Probe shutil.which("node") at startup; warn or fail clearly when use_readability=True is requested without Node.
    c. Pass a timeout= to the underlying subprocess so a wedged Node is recoverable.
    d. Document the Node requirement in README.md and add it to extras (pip install "mcp-server-fetch[readability]").
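
Suggestions (b) and (c) could look roughly like the sketch below. node_available and run_with_timeout are hypothetical helper names, and the demo uses the Python interpreter as a stand-in for a wedged Node process so it runs anywhere. Since the subprocess call lives in readabilipy rather than in mcp-server-fetch, the timeout would presumably have to be added upstream or the call wrapped; that part is an assumption.

```python
import shutil
import subprocess
import sys


def node_available() -> bool:
    """Suggestion (b): True if a `node` executable is found on PATH."""
    return shutil.which("node") is not None


def run_with_timeout(cmd: list[str], timeout_s: float) -> str:
    """Suggestion (c): run cmd and return stdout, failing fast instead of hanging."""
    try:
        completed = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising TimeoutExpired.
        raise RuntimeError(f"{cmd[0]} did not finish within {timeout_s}s") from exc
    except (OSError, subprocess.CalledProcessError) as exc:
        raise RuntimeError(f"{cmd[0]} failed: {exc}") from exc
    return completed.stdout


if not node_available():
    print("warning: node not found on PATH; Readability.js path unavailable",
          file=sys.stderr)

# Stand-in for a hung Node subprocess: a child that sleeps far past the timeout.
try:
    run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5)
except RuntimeError as err:
    print(f"failed fast: {err}")
```

With something like this in place, a wedged subprocess would surface as an ordinary error that the server can report to the client, rather than as a 60-second transport timeout.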
