feat: Add Azure Content Understanding converter #1865
Merged
yungshinlintw approved these changes on May 7, 2026
Add Azure Content Understanding converter
Summary
Adds a new `ContentUnderstandingConverter` that integrates Azure Content Understanding into MarkItDown, enabling high-quality cloud-based conversion for documents, images, audio, and video with structured field extraction via YAML front matter.

Motivation
MarkItDown's built-in converters are offline and format-specific. Azure Content Understanding provides:
- Prebuilt analyzers (`prebuilt-documentSearch`, `prebuilt-videoSearch`, `prebuilt-audioSearch`)

This follows the same pattern as the existing Azure Document Intelligence integration.
What's included
- `converters/_cu_converter.py`
- `_markitdown.py`: registers the converter when `cu_endpoint` is provided (registered after Doc Intel so CU is tried first)
- `__main__.py`: CLI flags `--use-cu`, `--cu-endpoint`, `--cu-analyzer`, `--cu-file-types`
- `pyproject.toml`: the `[az-content-understanding]` extra
- `converters/__init__.py`: `ContentUnderstandingConverter` and `ContentUnderstandingFileType`
- `README.md`
- `tests/test_cu_converter.py`

Supported file types
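| File extensions | Default analyzer |
| --- | --- |
| `.pdf`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.txt`, `.md`, `.rtf`, `.xml` | `prebuilt-documentSearch` |
| `.eml`, `.msg` | `prebuilt-documentSearch` |
| `.jpg`, `.jpeg`, `.jpe`, `.png`, `.bmp`, `.tiff`, `.heif`, `.heic` | `prebuilt-documentSearch` |
| `.mp4`, `.m4v`, `.mov`, `.avi`, `.mkv`, `.webm`, `.flv`, `.wmv` | `prebuilt-videoSearch` |
| `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.aac`, `.wma` | `prebuilt-audioSearch` |

Key design decisions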
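The default routing listed under "Supported file types" can be read as a plain lookup from extension to analyzer. The following is a minimal sketch under that reading, not the converter's actual code: the names `EXTENSION_TO_ANALYZER` and `default_analyzer_for` are hypothetical, and only the extensions and analyzer IDs come from this PR description.

```python
from pathlib import PurePath
from typing import Optional

# Extensions grouped by default analyzer, transcribed from the PR description.
_EXTENSIONS_BY_ANALYZER = {
    "prebuilt-documentSearch": (
        ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".txt", ".md", ".rtf", ".xml",
        ".eml", ".msg",
        ".jpg", ".jpeg", ".jpe", ".png", ".bmp", ".tiff", ".heif", ".heic",
    ),
    "prebuilt-videoSearch": (
        ".mp4", ".m4v", ".mov", ".avi", ".mkv", ".webm", ".flv", ".wmv",
    ),
    "prebuilt-audioSearch": (
        ".wav", ".mp3", ".m4a", ".flac", ".ogg", ".aac", ".wma",
    ),
}

# Flatten into a single extension -> analyzer dictionary (hypothetical name).
EXTENSION_TO_ANALYZER = {
    ext: analyzer
    for analyzer, extensions in _EXTENSIONS_BY_ANALYZER.items()
    for ext in extensions
}


def default_analyzer_for(filename: str) -> Optional[str]:
    """Return the default analyzer ID for a filename, or None if unsupported."""
    ext = PurePath(filename).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_ANALYZER.get(ext)
```

A lookup like this covers only the default case; routing for a custom `cu_analyzer_id` depends on the analyzer's modality and is not modeled here.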
- When `cu_analyzer_id` is set, the converter resolves the analyzer's base modality once at init, using a built-in cache for known `prebuilt-*` names (no API call) and falling back to `get_analyzer()` for custom analyzers or unknown prebuilts. Only compatible file types route to the custom analyzer; incompatible modalities auto-route to default prebuilts. Image-specific analyzers (`prebuilt-image`, `prebuilt-imageSearch`) route only image file types.
- If `get_analyzer()` fails (e.g., typo in `cu_analyzer_id`, missing permissions), `MarkItDown(...)` construction raises `ValueError` immediately rather than failing on the first `convert()` call.
- `to_llm_input()` delegation: Output formatting uses the CU SDK's `to_llm_input()` helper, which produces YAML front matter + page-numbered Markdown.
- Non-canonical MIME types (e.g., `audio/x-wav`) are normalized to canonical types (e.g., `audio/wav`) before sending to the CU API. Extension-only inputs derive the content type from the file type.
- Credential resolution order: `cu_credential` → `AZURE_API_KEY` env var → `DefaultAzureCredential` (same as Doc Intel).
- Missing dependencies raise `MissingDependencyException` only at converter instantiation, so users without the `[az-content-understanding]` extra are unaffected.
- `--use-cu` and `--use-docintel` are placed in an argparse mutually exclusive group so users get a clear error instead of silent precedence if both are specified.
- When both `cu_endpoint` and `docintel_endpoint` are provided, CU is registered after Doc Intel so it appears first in the converter chain and takes precedence for overlapping formats (PDF, DOCX, images).

Usage
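On the command line, the mutual exclusion between `--use-cu` and `--use-docintel` described in the design decisions can be illustrated with a standalone `argparse` sketch. This is not the PR's `__main__.py`: the parser name, the `build_parser` helper, and the argument types are assumptions, and `--cu-analyzer` and `--cu-file-types` are left out because their value formats are not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser showing the mutually exclusive backend flags."""
    parser = argparse.ArgumentParser(prog="markitdown-sketch")

    # argparse rejects the command line if more than one flag in this
    # group is supplied, instead of silently preferring one of them.
    backend = parser.add_mutually_exclusive_group()
    backend.add_argument("--use-cu", action="store_true")
    backend.add_argument("--use-docintel", action="store_true")

    # Assumed to take a single string value (the service endpoint URL).
    parser.add_argument("--cu-endpoint")
    parser.add_argument("filename", nargs="?")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With a parser built this way, passing both `--use-cu` and `--use-docintel` makes `argparse` print a usage error and exit with status 2, rather than letting one flag quietly win.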
Testing
    pip install -e "packages/markitdown[az-content-understanding,dev]"
    python -m pytest packages/markitdown/tests/test_cu_converter.py -v

All 124 tests are unit tests using mocks; no Azure credentials or network calls are required. Coverage includes:
- `accepts()` for every supported and unsupported extension/MIME type, plus `cu_file_types` restriction
- Analyzer resolution (including the `get_analyzer()` fallback) and custom analyzers across all four modalities
- `convert()` mock paths for document, image, audio, and video inputs (verifying analyzer selection, content-type, and `to_llm_input()` delegation)
- CLI flags `--use-cu`, `--cu-endpoint`, `--cu-analyzer`, `--cu-file-types`, including the mutual exclusion with `--use-docintel`
- `MissingDependencyException` when the optional extra is not installed

Backward compatibility
No changes to existing converter behavior. The CU converter is only registered when `cu_endpoint` is explicitly provided. Core MarkItDown installs (without the `[az-content-understanding]` extra) are unaffected: the new module's imports are lazy and degrade to a clear `MissingDependencyException` only if a user instantiates the converter without installing the extra.