fix: recover PDF text after inline images by he-yufeng · Pull Request #1889 · microsoft/markitdown

he-yufeng · 2026-05-15T21:52:24Z

Summary

- add a lazy PyMuPDF text-recovery pass for PDFs with inline image operators
- keep PyMuPDF optional: no default dependency change, and the fallback only runs when the package is installed
- warn instead of silently returning likely partial text when inline-image recovery is needed but PyMuPDF is unavailable
- add regression coverage for both the recovery and warning paths
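
The behavior described above can be sketched roughly as follows. This is an illustrative outline and not the code in this PR: `load_pymupdf`, `recover_text`, the `needs_recovery` flag, the warning text, and the "keep the longer result" rule are all hypothetical names and choices made for the example. Only `open(stream=..., filetype="pdf")` and `Page.get_text()` are real PyMuPDF calls.

```python
import importlib
import warnings
from types import ModuleType
from typing import Callable, Optional


def load_pymupdf() -> Optional[ModuleType]:
    """Import PyMuPDF lazily; return None when it is not installed."""
    for name in ("pymupdf", "fitz"):  # current and legacy import names
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def recover_text(
    pdf_bytes: bytes,
    primary_text: str,
    needs_recovery: bool,
    loader: Callable[[], Optional[ModuleType]] = load_pymupdf,
) -> str:
    """Hypothetical fallback: re-extract text only when recovery is needed."""
    if not needs_recovery:
        # No inline image operators were seen, so the primary text is trusted.
        return primary_text

    pymupdf = loader()
    if pymupdf is None:
        # Optional dependency is missing: warn rather than fail silently.
        warnings.warn(
            "PDF contains inline images and the extracted text may be partial; "
            "install PyMuPDF to enable text recovery.",
            RuntimeWarning,
            stacklevel=2,
        )
        return primary_text

    doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
    try:
        recovered = "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()

    # Assumption for this sketch: prefer whichever extraction kept more text.
    return recovered if len(recovered) > len(primary_text) else primary_text
```

Passing the loader as a parameter is only a convenience for this sketch: it lets both the recovery path and the warning path be exercised without PyMuPDF installed, for example `recover_text(data, text, True, loader=lambda: None)` to trigger the warning.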

Addresses #1870.

python -m pytest packages/markitdown/tests/test_pdf_memory.py -q -k "inline_image"
python -m pytest packages/markitdown/tests/test_pdf_memory.py packages/markitdown/tests/test_pdf_tables.py -q
python -m py_compile packages/markitdown/src/markitdown/converters/_pdf_converter.py packages/markitdown/tests/test_pdf_memory.py
python -m mypy --ignore-missing-imports packages/markitdown/src/markitdown/converters/_pdf_converter.py packages/markitdown/tests/test_pdf_memory.py
python -m pip install -e "packages/markitdown[pymupdf]"
git diff --check

fix: recover PDF text after inline images

b75be1c