feat: add DocConverter for legacy .doc files using unword parser by yousrae2004 · Pull Request #1869 · microsoft/markitdown

yousrae2004 · 2026-05-09T02:56:33Z

Closes #23
Closes #56
(Note: this PR covers both issues because I forgot to open a separate branch for the second one.)

Summary

Issue 23 - Adds support for converting Microsoft Word .doc files
(Word 97-2003 OLE format) to Markdown. Uses the unword library as the parser backend to extract body text with heading levels, page breaks, and textbox contents.
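For reviewers unfamiliar with the legacy format, here is a minimal sketch of how a converter might decide whether a stream is a Word 97-2003 file before handing it to a parser. It is not the code in this PR: `looks_like_legacy_doc` and its parameters are hypothetical names, and the unword API is deliberately not shown. The only fixed fact used is the 8-byte OLE compound file signature.

```python
import io
from pathlib import Path

# Signature found at the start of every OLE2 compound file, the container
# format used by Word 97-2003 (.doc) as well as legacy .xls and .ppt files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_legacy_doc(stream, filename=None, mimetype=None):
    """Hypothetical helper: return True if the stream is plausibly a .doc file.

    The OLE signature alone cannot tell Word apart from other OLE files,
    so the extension or MIME type is checked first.
    """
    ext_ok = filename is not None and Path(filename).suffix.lower() == ".doc"
    mime_ok = mimetype is not None and mimetype.lower().startswith("application/msword")
    if not (ext_ok or mime_ok):
        return False

    # Peek at the header, then restore the position so a later parser
    # sees the stream from where it started.
    pos = stream.tell()
    try:
        header = stream.read(len(OLE_MAGIC))
    finally:
        stream.seek(pos)
    return header == OLE_MAGIC


if __name__ == "__main__":
    sample = io.BytesIO(OLE_MAGIC + b"\x00" * 16)
    print(looks_like_legacy_doc(sample, filename="report.doc"))
```

Checking the signature in addition to the extension guards against .docx or plain-text files that were renamed to .doc.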

Issue 56 - Extracts images from .pptx files and saves them to disk. Images are written to the current
directory by default, or to a custom directory passed via the output_dir kwarg.
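The behaviour described for issue 56 can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `save_image_blob` and its `stem` parameter are hypothetical, and the extension lookup uses the standard library's `mimetypes` module, which may differ from what the converter actually does. Only the `output_dir` name comes from the description above.

```python
import mimetypes
from pathlib import Path


def save_image_blob(blob, content_type, stem, output_dir=None):
    """Hypothetical helper: write image bytes to disk and return the path.

    The extension is derived from the image content type (for example
    "image/png" maps to ".png"). Unknown types fall back to ".bin".
    """
    ext = mimetypes.guess_extension(content_type) or ".bin"

    # Default to the current working directory when no output_dir is given.
    target_dir = Path(output_dir) if output_dir is not None else Path.cwd()
    target_dir.mkdir(parents=True, exist_ok=True)

    path = target_dir / f"{stem}{ext}"
    path.write_bytes(blob)
    return path
```

A caller would pass each picture's bytes and content type along with a unique stem (for instance one built from the slide and shape names) so that images from different slides do not overwrite each other.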

Changes

Added _doc_converter.py with a new DocConverter class
Registered DocConverter in converters/__init__.py and _markitdown.py
Added test.doc test file and test vector in _test_vectors.py
Updated _pptx_converter.py to save image blobs to disk, using the file extension that matches the image content type.

Testing

All local tests pass

yousrae2004 · 2026-05-09T21:43:30Z

@microsoft-github-policy-service agree

feat: add DocConverter for legacy .doc files using unword parser

adf68e7

yousrae2004 added 2 commits May 16, 2026 15:36

feat: extract and save PPTX images to disk for issue microsoft#56

5890fb5

apply review suggestions for doc and pptx conversion code

3ad7625

yousrae2004 wants to merge 3 commits into microsoft:main from yousrae2004:main

2 participants
