fix: raise FileConversionException for corrupt or non-OOXML files by Aleeza-Bhatti · Pull Request #1893 · microsoft/markitdown

Aleeza-Bhatti · 2026-05-18T06:14:25Z

Fixes Issue #1408

Problem

If you pass a corrupt or invalid .docx, .pptx, or .xlsx file to MarkItDown, it doesn't raise an exception. It returns the raw file contents as markdown and acts as if everything went fine, so there's no way to tell whether the conversion actually worked.

Cause

When the specific converter (DocxConverter, PptxConverter, etc.) fails on the file, the loop doesn't stop; it keeps trying other converters. Magika sees the file contents as plain text and flags it with a charset, which causes PlainTextConverter to pick it up on the next attempt and return the raw bytes as if the conversion had succeeded.

Fix

Added a flag in _convert() that gets set when a specific-format converter accepts a file and fails. When that happens, the loop stops before any generic converter gets a chance to run. Normal conversions on valid files are not affected.
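To make the control flow concrete, here is a minimal, self-contained model of the loop. It is a sketch under assumptions, not the code in this PR: the `Converter` dataclass, the `convert()` helper, the `GUESSES` list, the `stop_on_specific_failure` switch, the priority values, and the stand-in exception class are all hypothetical and only mirror the names used in the description.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative constants: the second name mirrors the PR text, the values are assumed.
PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0


class FileConversionException(Exception):
    """Stand-in for the library's exception; records which converters failed."""

    def __init__(self, failed: List[str]):
        super().__init__("conversion failed in: " + ", ".join(failed))
        self.failed = failed


@dataclass
class Converter:
    name: str
    priority: float
    accepts: Callable[[bytes, dict], bool]
    convert: Callable[[bytes], str]


def _docx_convert(data: bytes) -> str:
    # OOXML files are ZIP archives, and ZIP archives start with b"PK".
    if not data.startswith(b"PK"):
        raise ValueError("not a ZIP archive")
    return "# converted docx"


CONVERTERS = [
    Converter("DocxConverter", PRIORITY_SPECIFIC_FILE_FORMAT,
              lambda data, guess: guess.get("extension") == ".docx",
              _docx_convert),
    Converter("PlainTextConverter", PRIORITY_GENERIC_FILE_FORMAT,
              lambda data, guess: guess.get("charset") is not None,
              lambda data: data.decode("utf-8")),
]

# Guess 1 comes from the file extension; guess 2 models a content-based
# guess that sees plain text and attaches a charset.
GUESSES = [{"extension": ".docx"}, {"extension": ".txt", "charset": "utf-8"}]


def convert(data: bytes, guesses: List[dict],
            stop_on_specific_failure: bool = True) -> str:
    failed: List[str] = []
    specific_failed = False  # the flag described in the Fix section
    for guess in guesses:
        for conv in sorted(CONVERTERS, key=lambda c: c.priority):
            if not conv.accepts(data, guess):
                continue
            try:
                return conv.convert(data)
            except Exception:
                failed.append(conv.name)
                if conv.priority < PRIORITY_GENERIC_FILE_FORMAT:
                    specific_failed = True
        # Checked once per guess: stop before a later guess lets a
        # generic converter "succeed" on a file that already failed.
        if stop_on_specific_failure and specific_failed:
            break
    raise FileConversionException(failed)
```

In this model, `convert(b"hello", GUESSES, stop_on_specific_failure=False)` reproduces the old behavior and returns the raw text, while the default call raises `FileConversionException` naming `DocxConverter`. Input that starts with `b"PK"` converts normally in both modes.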

Tests

Added test_invalid_ooxml_raises_exception which writes a plain-text file with each of the three extensions and checks that FileConversionException is raised with the right converter in the error. All existing tests still pass.
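The shape of that test can be sketched as follows. This is an illustration only, not the test added in this PR: `convert_path()` is a hypothetical stand-in for the library's convert call so the snippet runs without MarkItDown installed, and the exception class and `check_invalid_ooxml_raises()` helper are likewise assumed names.

```python
import os
import tempfile
import zipfile


class FileConversionException(Exception):
    """Stand-in for the library's exception type."""


# Extension -> converter expected to be named in the error.
EXPECTED = {
    ".docx": "DocxConverter",
    ".pptx": "PptxConverter",
    ".xlsx": "XlsxConverter",
}


def convert_path(path: str) -> str:
    """Hypothetical stand-in for the real convert call."""
    ext = os.path.splitext(path)[1].lower()
    # OOXML documents are ZIP archives; anything else is rejected here.
    if ext in EXPECTED and not zipfile.is_zipfile(path):
        raise FileConversionException(f"{EXPECTED[ext]} failed on {path}")
    with open(path, encoding="utf-8") as fh:
        return fh.read()


def check_invalid_ooxml_raises() -> list:
    """Write a plain-text file per extension and require an exception."""
    checked = []
    with tempfile.TemporaryDirectory() as tmp:
        for ext, converter_name in EXPECTED.items():
            path = os.path.join(tmp, "invalid" + ext)
            with open(path, "w", encoding="utf-8") as fh:
                fh.write("this is not an OOXML archive")
            try:
                convert_path(path)
            except FileConversionException as exc:
                # The error should name the converter that failed.
                assert converter_name in str(exc)
                checked.append(ext)
            else:
                raise AssertionError(f"no exception raised for {ext}")
    return checked
```

Checking the converter name, and not only the exception type, guards against the error being raised for an unrelated reason.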

…crosoft#1408)

When a file with a .docx, .pptx, or .xlsx extension is not a valid Office Open XML archive, the specific converter (DocxConverter, PptxConverter, XlsxConverter) accepts the file by extension, attempts conversion, and fails. Previously the conversion loop continued to subsequent stream_info guesses, where Magika's charset detection caused PlainTextConverter to accept the file and return its raw contents as markdown -- silently masking the failure.

Fix by tracking when a specific-format converter (priority < PRIORITY_GENERIC_FILE_FORMAT) has accepted and failed. After each stream_info guess iteration, break out of the loop if this has occurred so that generic converters cannot silently succeed on a known-format file that failed to convert.

Add test_invalid_ooxml_raises_exception to verify that FileConversionException is raised (with the correct converter named) for all three affected formats.

Aleeza-Bhatti · 2026-05-18T06:19:00Z

@microsoft-github-policy-service agree

Aleeza-Bhatti wants to merge 1 commit into microsoft:main from Aleeza-Bhatti:fix/invalid-ooxml-raises-exception
