feat: add DocConverter for legacy .doc files using unword parser #1869

Open
yousrae2004 wants to merge 3 commits into microsoft:main from yousrae2004:main

yousrae2004 commented May 9, 2026

Closes #23
Closes #56
(Note: This PR covers both issues because I forgot to open a separate branch for each.)

Summary

Issue 23 - Adds support for converting legacy Microsoft Word .doc files (Word 97-2003 OLE format) to Markdown. The unword library serves as the parser backend and extracts body text with heading levels, page breaks, and textbox contents.
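
For reviewers unfamiliar with the format: a Word 97-2003 file is an OLE2 compound file, which always begins with the same 8-byte signature. The snippet below is only a minimal sketch of how such a file could be recognized before handing it to a parser. The helper name `looks_like_legacy_doc` is hypothetical and is not the code in this PR, and no unword call is shown because its API is not described here.

```python
import io

# Signature that opens every OLE2 compound file (the container for Word 97-2003).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_legacy_doc(stream, filename=""):
    """Hypothetical helper: True if the stream is an OLE2 file named *.doc.

    The signature alone is not enough, because .xls and .ppt files share the
    same container, so the extension is checked as well.
    """
    pos = stream.tell()
    header = stream.read(len(OLE_MAGIC))
    stream.seek(pos)  # restore the position so a parser can read from the start
    return header == OLE_MAGIC and filename.lower().endswith(".doc")


if __name__ == "__main__":
    fake_doc = io.BytesIO(OLE_MAGIC + b"\x00" * 16)
    print(looks_like_legacy_doc(fake_doc, "report.doc"))
```

Checking both the signature and the extension keeps the check from claiming other OLE-based formats, at the cost of missing .doc files that were renamed.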

Issue 56 - Extracts images from pptx files and saves them to disk. Images are written to the current directory by default, or to a custom directory passed via the output_dir kwarg.

Changes

  • Added _doc_converter.py with a new DocConverter class
  • Registered DocConverter in converters/__init__.py and _markitdown.py
  • Added test.doc test file and test vector in _test_vectors.py
  • Changed _pptx_converter.py to save image blobs to disk, using the file extension that matches each image's content type.
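
The image-saving behavior described above can be illustrated roughly as follows. This is a sketch under assumptions, not the code in the PR: the function name `save_image_blob` and the `stem` parameter are hypothetical, and the extension lookup uses the standard library's `mimetypes` module, which may differ from how the converter derives it. Only the `output_dir` kwarg and the fallback to the current directory come from the description.

```python
import mimetypes
from pathlib import Path


def save_image_blob(blob, content_type, stem, output_dir=None):
    """Write image bytes to disk and return the path that was written.

    blob         -- raw image bytes (for example, from a slide picture)
    content_type -- MIME type such as "image/png"
    stem         -- hypothetical base filename without extension
    output_dir   -- target directory; defaults to the current directory
    """
    target_dir = Path(output_dir) if output_dir else Path.cwd()
    target_dir.mkdir(parents=True, exist_ok=True)

    # Map the MIME type to an extension; fall back to ".bin" if it is unknown.
    ext = mimetypes.guess_extension(content_type) or ".bin"

    path = target_dir / f"{stem}{ext}"
    path.write_bytes(blob)
    return path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        saved = save_image_blob(b"\x89PNG\r\n", "image/png", "slide1_image1", output_dir=tmp)
        print(saved.name)
```

Deriving the extension from the content type rather than from the original filename keeps saved files openable even when the embedded image has no usable name.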

Testing

All local tests pass

@yousrae2004
Author

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

PPTX: Extract images
Support for .doc extensions