fix: raise FileConversionException for corrupt or non-OOXML files #1893
Open
Aleeza-Bhatti wants to merge 1 commit into
…crosoft#1408)

When a file with a .docx, .pptx, or .xlsx extension is not a valid Office Open XML archive, the specific converter (DocxConverter, PptxConverter, XlsxConverter) accepts the file by extension, attempts conversion, and fails. Previously the conversion loop continued to subsequent stream_info guesses, where Magika's charset detection caused PlainTextConverter to accept the file and return its raw contents as markdown, silently masking the failure.

Fix by tracking when a specific-format converter (priority < PRIORITY_GENERIC_FILE_FORMAT) has accepted and failed. After each stream_info guess iteration, break out of the loop if this has occurred so that generic converters cannot silently succeed on a known-format file that failed to convert.

Add test_invalid_ooxml_raises_exception to verify that FileConversionException is raised (with the correct converter named) for all three affected formats.
Author
@microsoft-github-policy-service agree
Fixes Issue #1408
Problem
If you pass a corrupt or invalid .docx, .pptx, or .xlsx file to MarkItDown, it doesn't raise an exception. Instead, it returns the raw file contents as markdown and acts like everything went fine, so there's no way to tell whether the conversion actually worked.
Cause
When the specific converter (DocxConverter, PptxConverter, etc.) fails on the file, the loop doesn't stop; it keeps trying other converters. Magika sees the file contents as plain text and flags them with a charset, which causes PlainTextConverter to pick the file up on the next attempt and return the raw bytes as if the conversion had succeeded.
Fix
Added a flag in _convert() that gets set when a specific-format converter accepts a file and fails. When that happens, the loop stops before any generic converter gets a chance to run. Normal conversions on valid files are not affected.
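For reviewers who want the control flow at a glance, here is a minimal, self-contained sketch of the idea. It is not the actual `_convert()` code: `Registration`, `convert_with_guard`, the dict-based guesses, and the numeric priority values are all assumptions made for illustration. Only the names `PRIORITY_GENERIC_FILE_FORMAT`, `FileConversionException`, `DocxConverter`, and `PlainTextConverter` come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed values for illustration; only the name PRIORITY_GENERIC_FILE_FORMAT
# appears in this PR.
PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0


class FileConversionException(Exception):
    """Stand-in for the library exception of the same name."""


@dataclass
class Registration:  # hypothetical container, not the library's type
    name: str
    priority: float
    accepts: Callable[[Dict[str, str]], bool]
    convert: Callable[[bytes], str]


def convert_with_guard(data: bytes, guesses: List[Dict[str, str]],
                       registrations: List[Registration]) -> str:
    """Try each guess in order; stop once a specific converter accepted and failed."""
    failures: List[str] = []
    specific_failed = False  # the flag described in the Fix section
    for guess in guesses:
        for reg in sorted(registrations, key=lambda r: r.priority):
            if not reg.accepts(guess):
                continue
            try:
                return reg.convert(data)
            except Exception as exc:
                failures.append(f"{reg.name}: {exc}")
                if reg.priority < PRIORITY_GENERIC_FILE_FORMAT:
                    specific_failed = True
        # Checked after each guess iteration, so later guesses cannot hand
        # the file to a generic converter.
        if specific_failed:
            break
    raise FileConversionException("; ".join(failures) or "no converter accepted the file")


def _toy_docx_convert(data: bytes) -> str:
    # Toy validity check: real OOXML files are zip archives starting with "PK".
    if not data.startswith(b"PK"):
        raise ValueError("not a zip archive")
    return "# converted"


REGISTRATIONS = [
    Registration("PlainTextConverter", PRIORITY_GENERIC_FILE_FORMAT,
                 lambda g: "charset" in g, lambda d: d.decode("utf-8")),
    Registration("DocxConverter", PRIORITY_SPECIFIC_FILE_FORMAT,
                 lambda g: g.get("extension") == ".docx", _toy_docx_convert),
]

# First guess comes from the file extension, the second mimics a
# content-based guess that reports a charset.
GUESSES = [{"extension": ".docx"}, {"extension": ".txt", "charset": "utf-8"}]
```

With these toy registrations, calling `convert_with_guard(b"plain text", GUESSES, REGISTRATIONS)` raises `FileConversionException` naming `DocxConverter`, while a payload starting with `PK` converts normally and a plain `.txt` guess still reaches the generic converter.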
Tests
Added test_invalid_ooxml_raises_exception, which writes a plain-text file with each of the three extensions and checks that FileConversionException is raised with the right converter named in the error. All existing tests still pass.
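The shape of such a test can be sketched as follows. This is not the test added in the PR: the helper names `write_invalid_ooxml_files` and `assert_all_raise` are hypothetical, and the sketch takes the convert callable and exception type as parameters so it runs without the library installed. It also omits the check on the converter name in the error message.

```python
import zipfile
from pathlib import Path
from typing import Callable, List, Type

OOXML_EXTENSIONS = (".docx", ".pptx", ".xlsx")


def write_invalid_ooxml_files(directory: str) -> List[Path]:
    """Write one plain-text file per OOXML extension and return the paths."""
    paths = []
    for ext in OOXML_EXTENSIONS:
        path = Path(directory) / f"invalid{ext}"
        path.write_text("This is plain text, not an OOXML archive.", encoding="utf-8")
        # Sanity check: the fixture must not be a valid zip container.
        assert not zipfile.is_zipfile(path)
        paths.append(path)
    return paths


def assert_all_raise(convert: Callable[[str], object], paths: List[Path],
                     exc_type: Type[BaseException]) -> None:
    """Fail if convert() returns normally for any of the given paths."""
    for path in paths:
        try:
            convert(str(path))
        except exc_type:
            continue
        raise AssertionError(f"{path.name}: expected {exc_type.__name__}")
```

With the library installed, one would presumably pass `MarkItDown().convert` and `FileConversionException` (assuming both are importable from `markitdown`) together with paths created in a temporary directory.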