fix: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (e.g. French PDFs) #1895

Open
echavet wants to merge 2 commits into microsoft:main from echavet:fix/ipynb-converter-unicode-decode-error

@echavet echavet commented May 18, 2026

Summary

IpynbConverter.accepts() can raise UnicodeDecodeError when processing files with non-ASCII content (e.g. PDFs with French or other accented text). This propagates uncaught through the conversion pipeline and crashes the entire conversion — even though the file has nothing to do with Jupyter notebooks.

Problem

The engine in _markitdown.py only protects accepts() calls against NotImplementedError:

_accepts = False
try:
    _accepts = converter.accepts(file_stream, stream_info, **_kwargs)
except NotImplementedError:
    pass

Any other exception raised by accepts() propagates up and crashes the caller. IpynbConverter.accepts() reads the file stream and decodes it to look for nbformat markers. If the file contains bytes that cannot be decoded (binary content, wrong charset), a UnicodeDecodeError is raised.
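
The propagation can be reproduced in isolation. The sketch below uses hypothetical stand-ins (NotebookLikeConverter and try_accepts are illustrative names, not markitdown classes) and only mirrors the shape of the guard quoted above:

```python
import io


class NotebookLikeConverter:
    """Hypothetical stand-in for a converter whose accepts() decodes the stream."""

    def accepts(self, file_stream, charset="ascii"):
        cur_pos = file_stream.tell()
        try:
            # Decoding raises UnicodeDecodeError on bytes outside the charset.
            text = file_stream.read().decode(charset)
            return "nbformat" in text
        finally:
            # The stream position is restored even when decoding fails.
            file_stream.seek(cur_pos)


def try_accepts(converter, file_stream):
    """Same shape as the engine guard: only NotImplementedError is swallowed."""
    accepted = False
    try:
        accepted = converter.accepts(file_stream)
    except NotImplementedError:
        pass
    return accepted


# UTF-8 bytes of accented text, decoded with the wrong (ascii) charset.
stream = io.BytesIO("Facture réglée".encode("utf-8"))
try:
    try_accepts(NotebookLikeConverter(), stream)
    outcome = "returned normally"
except UnicodeDecodeError as exc:
    outcome = f"propagated: {exc.reason}"
print(outcome)
```

The guard lets the UnicodeDecodeError through to the caller, which is the crash described in this PR.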

Traceback (production environment — Windows Server)

File "markitdown\converters\_ipynb_converter.py", line 36, in accepts
    notebook_content = file_stream.read().decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 43916: ordinal not in range(128)

The affected file was a standard French-language PDF invoice. The byte 0xc3 is the first byte of a multi-byte UTF-8 sequence representing accented characters (é, è, à, etc.).
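
For illustration, the same failure can be triggered with a single accented character. This is a standalone snippet using only the Python standard library, not code from the PR:

```python
# Each of these accented characters encodes to two bytes in UTF-8,
# and the first byte is 0xc3 in every case.
lead_bytes = {ch: hex(ch.encode("utf-8")[0]) for ch in "éèà"}
print(lead_bytes)

encoded = "é".encode("utf-8")
print(encoded)  # b'\xc3\xa9'

try:
    encoded.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first undecodable byte.
    error = (exc.encoding, hex(exc.object[exc.start]), exc.reason)
print(error)
```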

Root Cause

accepts() is a predicate — per the contract defined in _base_converter.py, it must return bool and never raise. The decode block was not protected against encoding failures:

# Before — unsafe
encoding = stream_info.charset or "utf-8"
notebook_content = file_stream.read().decode(encoding)  # 💥 raises on binary/wrong-charset files
return "nbformat" in notebook_content and "nbformat_minor" in notebook_content

Note: the "utf-8" fallback applies only when stream_info.charset is unset. MIME/charset detection can explicitly set it to "ascii" for certain file types, which is why the traceback above shows the 'ascii' codec and why the issue is reproducible despite the fallback.

Fix

Wrap the decode + check block in try/except (UnicodeDecodeError, ValueError) and return False on failure. A file that cannot be decoded is definitively not a Jupyter notebook.

# After — safe
try:
    encoding = stream_info.charset or "utf-8"
    notebook_content = file_stream.read().decode(encoding)
    return (
        "nbformat" in notebook_content
        and "nbformat_minor" in notebook_content
    )
except (UnicodeDecodeError, ValueError):
    # File contains non-decodable bytes — definitely not a notebook
    return False

The finally: file_stream.seek(cur_pos) block is preserved, ensuring the stream position contract is always respected.
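
Putting the pieces together, the patched check behaves roughly as follows. This is a minimal standalone sketch: looks_like_notebook and its charset parameter are hypothetical names standing in for accepts() and stream_info.charset, and the surrounding converter logic is omitted.

```python
import io


def looks_like_notebook(file_stream, charset=None):
    """Hypothetical standalone version of the patched decode + check block."""
    cur_pos = file_stream.tell()
    try:
        encoding = charset or "utf-8"
        notebook_content = file_stream.read().decode(encoding)
        return "nbformat" in notebook_content and "nbformat_minor" in notebook_content
    except (UnicodeDecodeError, ValueError):
        # Non-decodable bytes: report "not a notebook" instead of raising.
        return False
    finally:
        # Runs on both the return and the except path.
        file_stream.seek(cur_pos)


# Binary-looking content read from a non-zero offset with an ascii charset.
binary = io.BytesIO(b"%PDF-1.7\n\xc3\xa9\xff\xfe")
binary.seek(3)
result_binary = looks_like_notebook(binary, charset="ascii")

# A minimal notebook-like JSON payload.
notebook = io.BytesIO(b'{"nbformat": 4, "nbformat_minor": 5, "cells": []}')
result_notebook = looks_like_notebook(notebook)

print(result_binary, binary.tell(), result_notebook, notebook.tell())
```

Two details about the except clause may be worth keeping in mind. In Python, UnicodeDecodeError is already a subclass of ValueError, so the tuple catches the same exceptions as ValueError alone. An unrecognized charset name raises LookupError instead, which this clause does not cover.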

Compatibility

  • ✅ No behavior change for valid .ipynb files or JSON streams
  • ✅ No new dependencies
  • ✅ No changes to method signatures
  • ✅ Respects the accepts() contract: returns bool, resets stream position

Related

Type of Change

  • Bug fix (non-breaking change — adds error handling only)

echavet added 2 commits April 30, 2026 09:46
Add ConversionProgress dataclass and ProgressCallback Protocol to enable
real-time progress reporting during document conversion.

Converters emit progress events for each logical unit processed:
- PdfConverter: per page
- PptxConverter: per slide
- EpubConverter: per chapter
- XlsxConverter / XlsConverter: per sheet

The callback is optional and passed via kwargs (progress_callback).
Converters that do not support progress simply ignore it.
Fully backward-compatible — no changes to existing API signatures.

Signed-off-by: Eric Chavet <echavet@gmail.com>
…I files

accepts() must never raise — it is a predicate that returns True/False.
When a file contains non-ASCII bytes (e.g. a French PDF with accented
characters encoded as multi-byte UTF-8 sequences like 0xc3...), decoding
with 'ascii' or even 'utf-8' can fail if the stream contains arbitrary
binary content.

The fix wraps the decode + check block in a try/except
(UnicodeDecodeError, ValueError) and returns False on failure.
A file that cannot be decoded is definitively not a Jupyter notebook.

Fixes: microsoft#1894