feat: Add Azure Content Understanding converter by chienyuanchang · Pull Request #1865 · microsoft/markitdown

chienyuanchang · 2026-05-06T21:33:42Z

Add Azure Content Understanding converter

Summary

Adds a new ContentUnderstandingConverter that integrates Azure Content Understanding into MarkItDown, enabling high-quality cloud-based conversion for documents, images, audio, and video with structured field extraction via YAML front matter.

Motivation

MarkItDown's built-in converters are offline and format-specific. Azure Content Understanding provides:

- Multimodal support: documents, images, audio, and video through a single API
- Structured field extraction: custom analyzers extract domain-specific fields (invoice amounts, receipt dates, etc.), serialized as YAML front matter
- Higher quality: cloud-based layout analysis, OCR, professional transcription, and video summarization
- Zero-config defaults: auto-selects the right analyzer per file type (`prebuilt-documentSearch`, `prebuilt-videoSearch`, `prebuilt-audioSearch`)

This follows the same pattern as the existing Azure Document Intelligence integration.

What's included

| File | Change |
| --- | --- |
| `converters/_cu_converter.py` | New converter: 34 extensions (27 unique types), smart routing, MIME alias normalization, lazy dependency loading |
| `_markitdown.py` | Register converter when `cu_endpoint` is provided (registered after Doc Intel so CU is tried first) |
| `__main__.py` | CLI flags: `--use-cu`, `--cu-endpoint`, `--cu-analyzer`, `--cu-file-types` |
| `pyproject.toml` | Optional dependency group `[az-content-understanding]` |
| `converters/__init__.py` | Export `ContentUnderstandingConverter` and `ContentUnderstandingFileType` |
| `README.md` | Usage docs with capability comparison table |
| `tests/test_cu_converter.py` | 124 unit tests (no network calls) |

Supported file types

| Modality | Extensions | Default Analyzer |
| --- | --- | --- |
| Documents | `.pdf`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.txt`, `.md`, `.rtf`, `.xml` | `prebuilt-documentSearch` |
| Email | `.eml`, `.msg` | `prebuilt-documentSearch` |
| Images | `.jpg`, `.jpeg`, `.jpe`, `.png`, `.bmp`, `.tiff`, `.heif`, `.heic` | `prebuilt-documentSearch` |
| Video | `.mp4`, `.m4v`, `.mov`, `.avi`, `.mkv`, `.webm`, `.flv`, `.wmv` | `prebuilt-videoSearch` |
| Audio | `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.aac`, `.wma` | `prebuilt-audioSearch` |
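
As an illustration of the zero-config defaults in the table above, the extension-to-analyzer mapping could be sketched as a plain lookup. This is not the converter's actual code: `DEFAULT_ANALYZER` and `default_analyzer_for` are hypothetical names, and only the extensions and analyzer IDs come from the table.

```python
from pathlib import Path
from typing import Optional

# Extension groups copied from the supported-file-types table.
DOCUMENT_EXTS = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".txt", ".md", ".rtf", ".xml"}
EMAIL_EXTS = {".eml", ".msg"}
IMAGE_EXTS = {".jpg", ".jpeg", ".jpe", ".png", ".bmp", ".tiff", ".heif", ".heic"}
VIDEO_EXTS = {".mp4", ".m4v", ".mov", ".avi", ".mkv", ".webm", ".flv", ".wmv"}
AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".aac", ".wma"}

# Hypothetical lookup table: extension -> default prebuilt analyzer.
DEFAULT_ANALYZER = {
    **dict.fromkeys(DOCUMENT_EXTS | EMAIL_EXTS | IMAGE_EXTS, "prebuilt-documentSearch"),
    **dict.fromkeys(VIDEO_EXTS, "prebuilt-videoSearch"),
    **dict.fromkeys(AUDIO_EXTS, "prebuilt-audioSearch"),
}


def default_analyzer_for(filename: str) -> Optional[str]:
    """Return the default analyzer for a filename, or None if unsupported."""
    return DEFAULT_ANALYZER.get(Path(filename).suffix.lower())
```

Under this sketch, an unsupported extension such as `.csv` yields `None`, which mirrors how unsupported formats fall through to the built-in converters.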

Key design decisions

- Smart routing: When `cu_analyzer_id` is set, the converter resolves the analyzer's base modality once at init, using a built-in cache for known `prebuilt-*` names (no API call) and falling back to `get_analyzer()` for custom analyzers or unknown prebuilts. Only compatible file types route to the custom analyzer; incompatible modalities auto-route to default prebuilts. Image-specific analyzers (`prebuilt-image`, `prebuilt-imageSearch`) route only image file types.
- Fail-fast on bad analyzer ID: If `get_analyzer()` fails (e.g., typo in `cu_analyzer_id`, missing permissions), `MarkItDown(...)` construction raises `ValueError` immediately rather than failing on the first `convert()` call.
- `to_llm_input()` delegation: Output formatting uses the CU SDK's `to_llm_input()` helper, which produces YAML front matter + page-numbered Markdown.
- MIME normalization: Alias MIME types (e.g., `audio/x-wav`) are normalized to canonical types (e.g., `audio/wav`) before sending to the CU API. Extension-only inputs derive the content type from the file type.
- Credential chain: Explicit `cu_credential` → `AZURE_API_KEY` env var → `DefaultAzureCredential` (same as Doc Intel).
- Lazy imports: CU SDK dependencies are imported at module load with a `try`/`except`, raising `MissingDependencyException` only at converter instantiation, so users without the `[az-content-understanding]` extra are unaffected.
- CLI mutual exclusion: `--use-cu` and `--use-docintel` are placed in an argparse mutually exclusive group so users get a clear error instead of silent precedence if both are specified.
- Registration order: When both `cu_endpoint` and `docintel_endpoint` are provided, CU is registered after Doc Intel so it appears first in the converter chain and takes precedence for overlapping formats (PDF, DOCX, images).
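
The smart-routing and fail-fast decisions above can be sketched roughly as follows. This is a simplified illustration, not the converter's implementation: `KNOWN_PREBUILTS`, `COMPATIBLE`, `resolve_modality`, and `pick_analyzer` are hypothetical names, the `lookup` callable stands in for a `get_analyzer()` call, and the rule that a document analyzer also accepts image files is an assumption based on images defaulting to `prebuilt-documentSearch`.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical cache: known prebuilt analyzer IDs -> base modality (no API call).
KNOWN_PREBUILTS: Dict[str, str] = {
    "prebuilt-documentSearch": "document",
    "prebuilt-videoSearch": "video",
    "prebuilt-audioSearch": "audio",
    "prebuilt-image": "image",
    "prebuilt-imageSearch": "image",
}

# Default prebuilt analyzer per file modality, per the supported-file-types table.
DEFAULT_FOR_MODALITY: Dict[str, str] = {
    "document": "prebuilt-documentSearch",
    "image": "prebuilt-documentSearch",
    "video": "prebuilt-videoSearch",
    "audio": "prebuilt-audioSearch",
}

# Which file modalities an analyzer of a given base modality may handle.
# Assumption: document analyzers also take images; image analyzers take only images.
COMPATIBLE: Dict[str, Set[str]] = {
    "document": {"document", "image"},
    "image": {"image"},
    "video": {"video"},
    "audio": {"audio"},
}


def resolve_modality(analyzer_id: str, lookup: Callable[[str], str]) -> str:
    """Resolve the base modality once, failing fast with ValueError on a bad ID."""
    if analyzer_id in KNOWN_PREBUILTS:
        return KNOWN_PREBUILTS[analyzer_id]  # cache hit
    try:
        return lookup(analyzer_id)  # stand-in for a get_analyzer() round trip
    except Exception as exc:
        raise ValueError(f"Could not resolve analyzer {analyzer_id!r}") from exc


def pick_analyzer(
    file_modality: str, custom_id: Optional[str], custom_modality: Optional[str]
) -> str:
    """Route compatible files to the custom analyzer, everything else to defaults."""
    if custom_id is not None and file_modality in COMPATIBLE.get(custom_modality, set()):
        return custom_id
    return DEFAULT_FOR_MODALITY[file_modality]
```

Resolving the modality at construction time is what lets a typo in the analyzer ID surface as a `ValueError` before any file is converted.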

Usage

# CLI
markitdown report.pdf --use-cu --cu-endpoint "https://..."

# Python
from markitdown import MarkItDown
md = MarkItDown(cu_endpoint="https://...")
result = md.convert("report.pdf")
print(result.markdown)

Testing

pip install -e "packages/markitdown[az-content-understanding,dev]"
python -m pytest packages/markitdown/tests/test_cu_converter.py -v

All 124 tests are unit tests using mocks — no Azure credentials or network calls required. Coverage includes:

- `accepts()` for every supported and unsupported extension/MIME type, plus `cu_file_types` restriction
- Smart routing for known prebuilts (cache hit), unknown prebuilts (fallback to `get_analyzer()`), and custom analyzers across all four modalities
- `convert()` mock paths for document, image, audio, and video inputs (verifying analyzer selection, content-type, and `to_llm_input()` delegation)
- CLI argument parsing for `--use-cu`, `--cu-endpoint`, `--cu-analyzer`, `--cu-file-types`, including the mutual exclusion with `--use-docintel`
- `MissingDependencyException` when the optional extra is not installed
- Registration priority: CU runs before Doc Intel when both are configured; unsupported formats (CSV, JSON, ZIP, EPub) fall through to built-in converters
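
For readers unfamiliar with the mutual-exclusion check in the coverage list above, here is a minimal standalone sketch of how an argparse mutually exclusive group behaves. The parser below is illustrative only (`build_parser` and the `markitdown-sketch` program name are made up); it is not the code in `__main__.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with --use-cu and --use-docintel in one exclusive group."""
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename", nargs="?")
    # argparse rejects any command line that sets more than one flag in this group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--use-cu", action="store_true")
    group.add_argument("--use-docintel", action="store_true")
    parser.add_argument("--cu-endpoint")
    return parser
```

Passing both flags makes argparse print a usage error and exit with status 2 (raised as `SystemExit`), so a test can assert on that instead of on silent precedence.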

Backward compatibility

No changes to existing converter behavior. The CU converter is only registered when cu_endpoint is explicitly provided. Core MarkItDown installs (without the [az-content-understanding] extra) are unaffected — the new module's imports are lazy and degrade to a clear MissingDependencyException only if a user instantiates the converter without installing the extra.

chienyuanchang added 7 commits May 6, 2026 12:41

inital version

bdcec69

improve mime type detection

1c70a82

prebuilt-image custom analzyer route to image

24ba4f2

enhance cu priority over di

d91d5dd

fix: apply black formatting

f5e7008

update cache of known prebuilt name and README improvement

e4b585a

add test cases, run black

6c7f5e7

chienyuanchang marked this pull request as ready for review May 7, 2026 21:11

yungshinlintw reviewed May 7, 2026


Comment thread README.md

yungshinlintw reviewed May 7, 2026


Comment thread README.md Outdated

yungshinlintw reviewed May 7, 2026


Comment thread packages/markitdown/tests/test_cu_converter.py

yungshinlintw approved these changes May 7, 2026


chienyuanchang added 2 commits May 7, 2026 16:58

update readme and deriving content_type from the resolved file_type

7a804cf

update readme

2ed5af7

afourney merged commit a01d74d into microsoft:main May 22, 2026
3 checks passed

afourney merged 9 commits into microsoft:main from chienyuanchang:feature/cu-converter
