feat: Add Azure Content Understanding converter#1865

Merged
afourney merged 9 commits into microsoft:main from chienyuanchang:feature/cu-converter
May 22, 2026
@chienyuanchang commented May 6, 2026

Add Azure Content Understanding converter

Summary

Adds a new ContentUnderstandingConverter that integrates Azure Content Understanding into MarkItDown, enabling high-quality cloud-based conversion for documents, images, audio, and video with structured field extraction via YAML front matter.

Motivation

MarkItDown's built-in converters are offline and format-specific. Azure Content Understanding provides:

  • Multimodal support — documents, images, audio, and video through a single API
  • Structured field extraction — custom analyzers extract domain-specific fields (invoice amounts, receipt dates, etc.) serialized as YAML front matter
  • Higher quality — cloud-based layout analysis, OCR, professional transcription, and video summarization
  • Zero-config defaults — auto-selects the right analyzer per file type (prebuilt-documentSearch, prebuilt-videoSearch, prebuilt-audioSearch)

This follows the same pattern as the existing Azure Document Intelligence integration.

What's included

| File | Change |
| --- | --- |
| converters/_cu_converter.py | New converter: 34 extensions (27 unique types), smart routing, MIME alias normalization, lazy dependency loading |
| _markitdown.py | Register converter when cu_endpoint is provided (registered after Doc Intel so CU is tried first) |
| __main__.py | CLI flags: --use-cu, --cu-endpoint, --cu-analyzer, --cu-file-types |
| pyproject.toml | Optional dependency group [az-content-understanding] |
| converters/__init__.py | Export ContentUnderstandingConverter and ContentUnderstandingFileType |
| README.md | Usage docs with capability comparison table |
| tests/test_cu_converter.py | 124 unit tests (no network calls) |

Supported file types

| Modality | Extensions | Default Analyzer |
| --- | --- | --- |
| Documents | .pdf, .docx, .pptx, .xlsx, .html, .txt, .md, .rtf, .xml | prebuilt-documentSearch |
| Email | .eml, .msg | prebuilt-documentSearch |
| Images | .jpg, .jpeg, .jpe, .png, .bmp, .tiff, .heif, .heic | prebuilt-documentSearch |
| Video | .mp4, .m4v, .mov, .avi, .mkv, .webm, .flv, .wmv | prebuilt-videoSearch |
| Audio | .wav, .mp3, .m4a, .flac, .ogg, .aac, .wma | prebuilt-audioSearch |
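
The zero-config default selection in the table above can be sketched as a plain extension lookup. This is only an illustration: `DEFAULT_ANALYZERS`, `EXTENSION_TO_ANALYZER`, and `default_analyzer_for` are hypothetical names, and the converter's actual data structures may differ. Only the extensions and analyzer IDs are taken from the table.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical table mirroring the "Supported file types" list;
# the real converter may organize this differently.
DEFAULT_ANALYZERS = {
    "prebuilt-documentSearch": {
        # Documents
        ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".txt", ".md", ".rtf", ".xml",
        # Email
        ".eml", ".msg",
        # Images
        ".jpg", ".jpeg", ".jpe", ".png", ".bmp", ".tiff", ".heif", ".heic",
    },
    "prebuilt-videoSearch": {
        ".mp4", ".m4v", ".mov", ".avi", ".mkv", ".webm", ".flv", ".wmv",
    },
    "prebuilt-audioSearch": {
        ".wav", ".mp3", ".m4a", ".flac", ".ogg", ".aac", ".wma",
    },
}

# Invert to extension -> analyzer so each lookup is a single dict access.
EXTENSION_TO_ANALYZER = {
    ext: analyzer
    for analyzer, extensions in DEFAULT_ANALYZERS.items()
    for ext in extensions
}


def default_analyzer_for(filename: str) -> Optional[str]:
    """Return the default analyzer ID for a filename, or None if unsupported."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_TO_ANALYZER.get(suffix)
```

Returning `None` for an unknown extension matches the described fall-through behavior, where unsupported formats are left to the built-in converters.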

Key design decisions

  • Smart routing: When cu_analyzer_id is set, the converter resolves the analyzer's base modality once at init — using a built-in cache for known prebuilt-* names (no API call) and falling back to get_analyzer() for custom analyzers or unknown prebuilts. Only compatible file types route to the custom analyzer; incompatible modalities auto-route to default prebuilts. Image-specific analyzers (prebuilt-image, prebuilt-imageSearch) route only image file types.
  • Fail-fast on bad analyzer ID: If get_analyzer() fails (e.g., typo in cu_analyzer_id, missing permissions), MarkItDown(...) construction raises ValueError immediately rather than failing on the first convert() call.
  • to_llm_input() delegation: Output formatting uses the CU SDK's to_llm_input() helper, which produces YAML front matter + page-numbered Markdown.
  • MIME normalization: Alias MIME types (e.g., audio/x-wav) are normalized to canonical types (e.g., audio/wav) before sending to the CU API. Extension-only inputs derive the content type from the file type.
  • Credential chain: Explicit cu_credential → AZURE_API_KEY env var → DefaultAzureCredential (same as Doc Intel).
  • Lazy imports: CU SDK dependencies are imported at module load with a try/except, raising MissingDependencyException only at converter instantiation, so users without the [az-content-understanding] extra are unaffected.
  • CLI mutual exclusion: --use-cu and --use-docintel are placed in an argparse mutually exclusive group so users get a clear error instead of silent precedence if both are specified.
  • Registration order: When both cu_endpoint and docintel_endpoint are provided, CU is registered after Doc Intel so it appears first in the converter chain and takes precedence for overlapping formats (PDF, DOCX, images).
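
The smart-routing and fail-fast decisions above can be sketched roughly as follows. All class, function, and dictionary names here are hypothetical; only the `prebuilt-*` analyzer IDs come from this description. The `lookup_modality` callable stands in for the `get_analyzer()` fallback, and the rule that a document analyzer also accepts image files is an assumption made for the example, not something confirmed by the PR.

```python
from typing import Callable, Optional

# Default prebuilt analyzer per file modality (IDs from the PR description).
DEFAULT_BY_MODALITY = {
    "document": "prebuilt-documentSearch",
    "image": "prebuilt-documentSearch",
    "video": "prebuilt-videoSearch",
    "audio": "prebuilt-audioSearch",
}

# Illustrative cache of known prebuilts, resolved without any API call.
KNOWN_PREBUILT_MODALITY = {
    "prebuilt-documentSearch": "document",
    "prebuilt-videoSearch": "video",
    "prebuilt-audioSearch": "audio",
    "prebuilt-image": "image",
    "prebuilt-imageSearch": "image",
}


class Router:
    """Hypothetical router: resolves the analyzer's modality once at init."""

    def __init__(
        self,
        analyzer_id: Optional[str],
        lookup_modality: Callable[[str], str],
    ) -> None:
        self.analyzer_id = analyzer_id
        self.modality: Optional[str] = None
        if analyzer_id is None:
            return
        cached = KNOWN_PREBUILT_MODALITY.get(analyzer_id)
        if cached is not None:
            self.modality = cached  # cache hit: no remote lookup
            return
        try:
            # Stand-in for the get_analyzer() fallback.
            self.modality = lookup_modality(analyzer_id)
        except Exception as exc:
            # Fail fast at construction instead of on the first convert().
            raise ValueError(f"Could not resolve analyzer {analyzer_id!r}") from exc

    def _compatible(self, file_modality: str) -> bool:
        # Assumption: document analyzers also accept images, while
        # image-specific analyzers accept only images.
        if self.modality == "document":
            return file_modality in ("document", "image")
        return self.modality == file_modality

    def analyzer_for(self, file_modality: str) -> str:
        """Pick the custom analyzer when compatible, else the default prebuilt."""
        if self.analyzer_id is not None and self._compatible(file_modality):
            return self.analyzer_id
        return DEFAULT_BY_MODALITY[file_modality]
```

With this shape, a router built for `prebuilt-imageSearch` handles image files itself and sends audio files to `prebuilt-audioSearch`, and a lookup failure surfaces as `ValueError` before any conversion is attempted.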

Usage

# CLI
markitdown report.pdf --use-cu --cu-endpoint "https://..."

# Python
from markitdown import MarkItDown
md = MarkItDown(cu_endpoint="https://...")
result = md.convert("report.pdf")
print(result.markdown)

Testing

pip install -e "packages/markitdown[az-content-understanding,dev]"
python -m pytest packages/markitdown/tests/test_cu_converter.py -v

All 124 tests are unit tests using mocks — no Azure credentials or network calls required. Coverage includes:

  • accepts() for every supported and unsupported extension/MIME type, plus cu_file_types restriction
  • Smart routing for known prebuilts (cache hit), unknown prebuilts (fallback to get_analyzer()), and custom analyzers across all four modalities
  • convert() mock paths for document, image, audio, and video inputs (verifying analyzer selection, content-type, and to_llm_input() delegation)
  • CLI argument parsing for --use-cu, --cu-endpoint, --cu-analyzer, --cu-file-types, including the mutual exclusion with --use-docintel
  • MissingDependencyException when the optional extra is not installed
  • Registration priority — CU runs before Doc Intel when both are configured; unsupported formats (CSV, JSON, ZIP, EPub) fall through to built-in converters
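
The CLI mutual exclusion exercised by these tests can be illustrated with the standard library's `argparse`. This is a minimal sketch, not the code in `__main__.py`: `build_parser` and the `markitdown-sketch` program name are hypothetical, and only the flag names are taken from this description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser showing only the CU / Doc Intel exclusion."""
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename", nargs="?")

    # Flags in one mutually exclusive group cannot be combined;
    # argparse reports an error and exits with status 2 if both appear.
    backend = parser.add_mutually_exclusive_group()
    backend.add_argument("--use-cu", action="store_true")
    backend.add_argument("--use-docintel", action="store_true")

    parser.add_argument("--cu-endpoint")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Passing both `--use-cu` and `--use-docintel` makes `parse_args` raise `SystemExit`, which is what a unit test can assert on without touching the network.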

Backward compatibility

No changes to existing converter behavior. The CU converter is only registered when cu_endpoint is explicitly provided. Core MarkItDown installs (without the [az-content-understanding] extra) are unaffected — the new module's imports are lazy and degrade to a clear MissingDependencyException only if a user instantiates the converter without installing the extra.

@chienyuanchang marked this pull request as ready for review May 7, 2026 21:11
@afourney merged commit a01d74d into microsoft:main May 22, 2026
3 checks passed