fix: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (e.g. French PDFs) by echavet · Pull Request #1895 · microsoft/markitdown

echavet · 2026-05-18T08:57:53Z

Summary

IpynbConverter.accepts() can raise UnicodeDecodeError when processing files with non-ASCII content (e.g. PDFs with French or other accented text). This propagates uncaught through the conversion pipeline and crashes the entire conversion — even though the file has nothing to do with Jupyter notebooks.

Problem

The engine in _markitdown.py only protects accepts() calls against NotImplementedError:

_accepts = False
try:
    _accepts = converter.accepts(file_stream, stream_info, **_kwargs)
except NotImplementedError:
    pass

Any other exception raised by accepts() propagates up and crashes the caller. IpynbConverter.accepts() reads the file stream and decodes it to look for nbformat markers. If the file contains bytes that cannot be decoded (binary content, wrong charset), a UnicodeDecodeError is raised.
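The propagation described above can be reproduced in isolation. The following is a minimal sketch, not the markitdown engine itself: `RaisingConverter`, `guarded_accepts`, and `demo` are hypothetical names introduced for illustration, and only the try/except shape is copied from the snippet quoted earlier.

```python
import io


class RaisingConverter:
    """Hypothetical converter whose accepts() fails while decoding."""

    def accepts(self, file_stream, stream_info, **kwargs):
        # Mimics a decode step that cannot handle the bytes in the stream.
        return "nbformat" in file_stream.read().decode("ascii")


def guarded_accepts(converter, file_stream, stream_info, **kwargs):
    """Same guard shape as the quoted engine code: only NotImplementedError is swallowed."""
    _accepts = False
    try:
        _accepts = converter.accepts(file_stream, stream_info, **kwargs)
    except NotImplementedError:
        pass
    return _accepts


def demo():
    """Return the name of the exception that escapes the guard, if any."""
    stream = io.BytesIO("Facture n° 42, total à payer".encode("utf-8"))
    try:
        guarded_accepts(RaisingConverter(), stream, stream_info=None)
    except UnicodeDecodeError as exc:
        return type(exc).__name__
    return "no error"
```

In this sketch the decode error passes straight through the guard, which is the failure mode the PR addresses from the converter side.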

Traceback (production environment — Windows Server)

File "markitdown\converters\_ipynb_converter.py", line 36, in accepts
    notebook_content = file_stream.read().decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 43916: ordinal not in range(128)

The affected file was a standard French-language PDF invoice. The byte 0xc3 is the first byte of a multi-byte UTF-8 sequence representing accented characters (é, è, à, etc.).
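To illustrate why 0xc3 trips the 'ascii' codec, here is a small sketch using only the standard library. The helper name `first_undecodable_byte` and the sample string are assumptions made for the example; they are not part of markitdown.

```python
from typing import Optional, Tuple


def first_undecodable_byte(data: bytes, encoding: str) -> Optional[Tuple[int, int]]:
    """Return (position, byte value) of the first byte that fails to decode, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the offending byte in exc.object.
        return exc.start, exc.object[exc.start]
    return None


# "é" is encoded in UTF-8 as the two bytes 0xc3 0xa9.
sample = "Facture payée".encode("utf-8")
ascii_failure = first_undecodable_byte(sample, "ascii")
utf8_failure = first_undecodable_byte(sample, "utf-8")
```

With the 'ascii' codec the first failure is reported at the 0xc3 byte, while the same bytes decode cleanly as UTF-8.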

Root Cause

accepts() is a predicate — per the contract defined in _base_converter.py, it must return bool and never raise. The decode block was not protected against encoding failures:

# Before — unsafe
encoding = stream_info.charset or "utf-8"
notebook_content = file_stream.read().decode(encoding)  # 💥 raises on binary/wrong-charset files
return "nbformat" in notebook_content and "nbformat_minor" in notebook_content

Note: stream_info.charset can be explicitly set to "ascii" by MIME/charset detection for certain file types, making the issue reproducible even when "utf-8" is the fallback.

Fix

Wrap the decode + check block in try/except (UnicodeDecodeError, ValueError) and return False on failure. A file that cannot be decoded is definitively not a Jupyter notebook.

# After — safe
try:
    encoding = stream_info.charset or "utf-8"
    notebook_content = file_stream.read().decode(encoding)
    return (
        "nbformat" in notebook_content
        and "nbformat_minor" in notebook_content
    )
except (UnicodeDecodeError, ValueError):
    # File contains non-decodable bytes — definitely not a notebook
    return False

The finally: file_stream.seek(cur_pos) block is preserved, ensuring the stream position contract is always respected.
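Putting the pieces together, the guarded predicate might look roughly like the sketch below. It is a standalone approximation rather than the actual method: the function name `looks_like_notebook` and its `charset` parameter are hypothetical stand-ins for `IpynbConverter.accepts()` and `stream_info.charset`.

```python
import io
from typing import BinaryIO, Optional


def looks_like_notebook(file_stream: BinaryIO, charset: Optional[str] = None) -> bool:
    """Return True if the stream decodes and contains the nbformat markers."""
    cur_pos = file_stream.tell()
    try:
        encoding = charset or "utf-8"
        notebook_content = file_stream.read().decode(encoding)
        return "nbformat" in notebook_content and "nbformat_minor" in notebook_content
    except (UnicodeDecodeError, ValueError):
        # Bytes that cannot be decoded are treated as "not a notebook".
        return False
    finally:
        # Runs on every path, so the caller always gets the stream back
        # at the position it had before the check.
        file_stream.seek(cur_pos)


notebook = io.BytesIO(b'{"nbformat": 4, "nbformat_minor": 5, "cells": []}')
invoice = io.BytesIO("Facture payée".encode("utf-8"))

is_notebook = looks_like_notebook(notebook)
is_invoice_notebook = looks_like_notebook(invoice, charset="ascii")
```

In Python, UnicodeDecodeError is already a subclass of ValueError, so the tuple behaves the same as catching ValueError alone. An unrecognized charset name would raise LookupError, which this sketch, like the snippet above, does not catch.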

Compatibility

✅ No behavior change for valid .ipynb files or JSON streams
✅ No new dependencies
✅ No changes to method signatures
✅ Respects the accepts() contract: returns bool, resets stream position

Type of Change

Bug fix (non-breaking change — adds error handling only)

Add ConversionProgress dataclass and ProgressCallback Protocol to enable real-time progress reporting during document conversion.

Converters emit progress events for each logical unit processed:

- PdfConverter: per page
- PptxConverter: per slide
- EpubConverter: per chapter
- XlsxConverter / XlsConverter: per sheet

The callback is optional and passed via kwargs (progress_callback). Converters that do not support progress simply ignore it. Fully backward-compatible: no changes to existing API signatures.

Signed-off-by: Eric Chavet <echavet@gmail.com>
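The commit message above names a ConversionProgress dataclass and a ProgressCallback Protocol but does not show their definitions. The sketch below is one plausible shape for that pattern, not the commit's code: the field names (`unit`, `current`, `total`) and the `convert_pages` function are assumptions made purely for illustration. Only the `progress_callback` kwarg name comes from the message.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Protocol


@dataclass(frozen=True)
class ConversionProgress:
    # Field names are illustrative assumptions, not taken from the commit.
    unit: str                    # e.g. "page", "slide", "chapter", "sheet"
    current: int                 # 1-based index of the unit just processed
    total: Optional[int] = None  # total number of units, if known


class ProgressCallback(Protocol):
    def __call__(self, progress: ConversionProgress) -> None: ...


def convert_pages(pages: List[str], **kwargs: Any) -> str:
    """Hypothetical per-page converter loop that reports progress if asked to."""
    callback: Optional[ProgressCallback] = kwargs.get("progress_callback")
    converted = []
    for index, page in enumerate(pages, start=1):
        converted.append(page.upper())  # stand-in for real conversion work
        if callback is not None:
            callback(ConversionProgress("page", index, len(pages)))
    return "\n".join(converted)


events: List[ConversionProgress] = []
result = convert_pages(["first", "second"], progress_callback=events.append)
```

Because the callback is read from kwargs and defaults to None, callers that never pass it see no change in behavior, which matches the backward-compatibility claim in the message.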

…I files

accepts() must never raise: it is a predicate that returns True/False. When a file contains non-ASCII bytes (e.g. a French PDF with accented characters encoded as multi-byte UTF-8 sequences like 0xc3...), decoding with 'ascii' or even 'utf-8' can fail if the stream contains arbitrary binary content.

The fix wraps the decode + check block in a try/except (UnicodeDecodeError, ValueError) and returns False on failure. A file that cannot be decoded is definitively not a Jupyter notebook.

Fixes: microsoft#1894

echavet added 2 commits April 30, 2026 09:46

echavet wants to merge 2 commits into microsoft:main from echavet:fix/ipynb-converter-unicode-decode-error
