
fix: raise FileConversionException for corrupt or non-OOXML files #1893

Open
Aleeza-Bhatti wants to merge 1 commit into microsoft:main from Aleeza-Bhatti:fix/invalid-ooxml-raises-exception


Aleeza-Bhatti commented May 18, 2026

Fixes Issue #1408

Problem

If you pass a corrupt or invalid .docx, .pptx, or .xlsx file to MarkItDown, it doesn't raise an exception. Instead, it returns the raw file contents as markdown as if the conversion had succeeded, so the caller has no way to tell whether the conversion actually worked.

Cause

When the specific converter (DocxConverter, PptxConverter, etc.) fails on the file, the loop doesn't stop; it keeps trying other converters. Magika classifies the file contents as plain text and attaches a charset, which causes PlainTextConverter to accept the file on the next attempt and return the raw contents as if the conversion had succeeded.

Fix

Added a flag in _convert() that gets set when a specific-format converter accepts a file and fails. When that happens, the loop stops before any generic converter gets a chance to run. Normal conversions on valid files are not affected.
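
For reviewers who want the control flow at a glance, here is a minimal, self-contained sketch of the guard described above. It is not the actual `_convert()` code: `Registration`, `ConversionFailed`, `convert_with_guard`, the string-based guesses, and the two toy converters are all hypothetical stand-ins, and the priority constants only mimic the idea of "specific sorts before generic".

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in priority values (assumption): lower runs first, and anything
# below PRIORITY_GENERIC counts as a specific-format converter.
PRIORITY_SPECIFIC = 0.0
PRIORITY_GENERIC = 10.0


class ConversionFailed(Exception):
    """Hypothetical stand-in for FileConversionException."""

    def __init__(self, failed: List[str]):
        super().__init__("conversion failed in: " + ", ".join(failed))
        self.failed = failed


@dataclass
class Registration:
    name: str
    priority: float
    accepts: Callable[[str, bytes], bool]  # (format guess, data) -> bool
    convert: Callable[[bytes], str]


def convert_with_guard(data: bytes, guesses: List[str],
                       registrations: List[Registration]) -> str:
    failed: List[str] = []
    specific_failed = False  # set when a specific converter accepts and fails
    ordered = sorted(registrations, key=lambda r: r.priority)
    for guess in guesses:
        for reg in ordered:
            if not reg.accepts(guess, data):
                continue
            try:
                return reg.convert(data)
            except Exception:
                failed.append(reg.name)
                if reg.priority < PRIORITY_GENERIC:
                    specific_failed = True
        # Checked once per guess: stop before a later guess lets a generic
        # converter "succeed" on a file a specific converter rejected.
        if specific_failed:
            break
    raise ConversionFailed(failed)


def _toy_docx_convert(data: bytes) -> str:
    # Toy check: OOXML files are ZIP archives, which start with b"PK".
    if not data.startswith(b"PK"):
        raise ValueError("not a ZIP archive")
    return "# converted docx"


REGISTRATIONS = [
    Registration("PlainTextConverter", PRIORITY_GENERIC,
                 lambda guess, data: guess == "text",
                 lambda data: data.decode("utf-8")),
    Registration("DocxConverter", PRIORITY_SPECIFIC,
                 lambda guess, data: guess == "docx",
                 _toy_docx_convert),
]
```

In this sketch, `convert_with_guard(b"just text", ["docx", "text"], REGISTRATIONS)` raises `ConversionFailed` naming only `DocxConverter`, because the loop breaks before the `"text"` guess is tried. A payload that begins with `b"PK"` still converts, and a file whose only guess is `"text"` still goes through the plain-text path.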

Tests

Added test_invalid_ooxml_raises_exception, which writes a plain-text file with each of the three extensions and checks that FileConversionException is raised and that the error names the expected converter. All existing tests still pass.
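
The shape of that test can be sketched roughly as follows. This is an illustration, not the test added in this PR: `write_fake_ooxml_files` and `failure_message` are hypothetical helper names, and the conversion function is passed in as a parameter so the snippet runs without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable, Dict, Type

OOXML_EXTENSIONS = (".docx", ".pptx", ".xlsx")


def write_fake_ooxml_files(directory: Path) -> Dict[str, Path]:
    """Write one plain-text file per OOXML extension and return the paths."""
    paths: Dict[str, Path] = {}
    for ext in OOXML_EXTENSIONS:
        path = directory / f"not_really{ext}"
        # Plain text, so it is not a valid ZIP/OOXML archive.
        path.write_text("This is plain text, not a ZIP archive.",
                        encoding="utf-8")
        paths[ext] = path
    return paths


def failure_message(convert: Callable[[str], object], path: Path,
                    expected: Type[BaseException]) -> str:
    """Return the error text if convert(path) raises `expected`.

    Raises AssertionError if the conversion succeeds instead, which is
    the silent-success behavior this PR is meant to prevent.
    """
    try:
        convert(str(path))
    except expected as exc:
        return str(exc)
    raise AssertionError(
        f"{path.name}: expected {expected.__name__}, but conversion succeeded"
    )
```

With MarkItDown installed, one would presumably call `failure_message(MarkItDown().convert, path, FileConversionException)` for each path and assert that the returned text mentions the matching converter (for example `DocxConverter` for `.docx`), assuming the converter name appears in the exception message as the description above suggests.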

…crosoft#1408)

When a file with a .docx, .pptx, or .xlsx extension is not a valid
Office Open XML archive, the specific converter (DocxConverter,
PptxConverter, XlsxConverter) accepts the file by extension, attempts
conversion, and fails. Previously the conversion loop continued to
subsequent stream_info guesses, where Magika's charset detection caused
PlainTextConverter to accept the file and return its raw contents as
markdown -- silently masking the failure.

Fix by tracking when a specific-format converter (priority <
PRIORITY_GENERIC_FILE_FORMAT) has accepted and failed. After each
stream_info guess iteration, break out of the loop if this has occurred
so that generic converters cannot silently succeed on a known-format file
that failed to convert.

Add test_invalid_ooxml_raises_exception to verify that FileConversionException
is raised (with the correct converter named) for all three affected formats.

Aleeza-Bhatti (Author) commented

@microsoft-github-policy-service agree
