Clean up type hints and style for Chinese word segmentation by cary-rowen · Pull Request #20185 · nvaccess/nvda

cary-rowen · 2026-05-20T15:22:52Z

This PR addresses type hint and style issues found while reviewing PR #20183.

Completed:

Preserve initializerRegistry signatures.
Add missing type hints in word segmentation code.
Add missing type hints in related tests.
Add type hints for document navigation save hooks.
Add type hints for word segmentation labels.
Move a test import out of local scope.
Keep initializer registry state private.
Rename word segmentation locals to NVDA style.
Clean up cppjieba sconscript naming and formatting.
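
To make "preserve signatures" and "keep registry state private" concrete, here is a minimal sketch. It is not the code from this PR. The names `_initializers`, `registerInitializer`, `getRegisteredNames` and `initSegmenter` are hypothetical, and it assumes the registry is a module-level list filled by a decorator. The decorator is typed with a bound `TypeVar`, so a type checker still sees the decorated function's original parameters and return type.

```python
from collections.abc import Callable
from typing import TypeVar

# Bound TypeVar: whatever callable type goes in is the type that comes out.
_F = TypeVar("_F", bound=Callable[..., object])

# Module-private state (hypothetical name). The leading underscore signals
# that callers should use the functions below instead of touching the list.
_initializers: list[Callable[..., object]] = []


def registerInitializer(func: _F) -> _F:
    """Record func and return it unchanged, so its signature is preserved."""
    _initializers.append(func)
    return func


def getRegisteredNames() -> tuple[str, ...]:
    """Return an immutable snapshot of the registered function names."""
    return tuple(f.__name__ for f in _initializers)


@registerInitializer
def initSegmenter(dictDir: str, eager: bool = False) -> bool:
    """Hypothetical initializer that exists only for this illustration."""
    return bool(dictDir) and eager
```

Because the decorator returns the same function object, `initSegmenter("dicts", eager=True)` is still checked against `(dictDir: str, eager: bool = False) -> bool`. An untyped wrapper would have collapsed that to an opaque callable.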

seanbudd

…#20185)

This PR addresses type hint and style issues found while reviewing PR nvaccess#20183.

Completed:

- [x] Preserve `initializerRegistry` signatures.
- [x] Add missing type hints in word segmentation code.
- [x] Add missing type hints in related tests.
- [x] Add type hints for document navigation save hooks.
- [x] Add type hints for word segmentation labels.
- [x] Move a test import out of local scope.
- [x] Keep initializer registry state private.
- [x] Rename word segmentation locals to NVDA style.
- [x] Clean up cppjieba sconscript naming and formatting.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Squash commit of:

- #18548
- #18735
- #18865
- #19324
- #19747
- #20041
- #20055
- #20106
- #20162
- #20178
- #20185
- #20205
- #20233
- #20242
- #20227
- #20279
- #20278
- #20288

Previous try branch commit history: #19166

---

This pull request introduces Chinese word segmentation support in NVDA through the integration of the `cppjieba` library. It adds the `cppjieba` submodule, builds and links the library into the NVDA build process, exposes new C++ and Python APIs for word segmentation, and updates braille and text handling to take advantage of improved segmentation for Chinese text. Configuration options and documentation are also updated to reflect the new dependency and features.

**Integration of Chinese Word Segmentation (cppjieba):**

* Added `cppjieba` as a submodule (`include/cppjieba`), included its license, and documented its usage and commit in the project documentation and `copying.txt`.
* Implemented a SCons build script and build integration for `cppjieba`, ensuring the library and its dictionaries are built and installed as part of the NVDA build process.
* Added workflow improvements to ensure recursive submodule checkouts, so all dependencies (including nested ones) are fetched.

**C++ and Python API Additions:**

* Developed a thread-safe singleton wrapper and C API for `cppjieba` in `nvdaHelper/cppjieba`, exposing functions for initialization, segmentation, user word management, and memory management.
* Exposed new DLL and dictionary path properties in `NVDAState.py` for use by Python code.

**Braille and Text Handling Enhancements:**

* Updated braille output logic to use word segmentation for Chinese tables, applying a new `WordSegWithSeparatorOffsetConverter` to improve cursor and offset mapping.
* Modified edit text handling to enforce the use of Uniscribe for character and word segmentation, ensuring consistent behavior with the new segmentation logic.
* Updated copyright and contributor attributions.

**Configuration and Documentation Updates:**

* Added configuration options for eager initialization and selection of word segmentation standards in `configSpec.py`.
* Updated documentation to reflect the new dependency and its usage.

**Summary of Most Important Changes:**

**1. Integration of cppjieba for Chinese Word Segmentation**

- Added `cppjieba` as a submodule, updated `.gitmodules`, and documented its license and commit.
- Implemented SCons build scripts and build integration for `cppjieba`, including dictionary installation.
- Improved workflow to fetch all submodules recursively.

**2. C++/Python API and Library Exposure**

- Created thread-safe singleton and C API for `cppjieba`, exposing segmentation and user word management functions.
- Exposed DLL and dictionary paths in `NVDAState.py` for use by Python code.

**3. Braille and Text Handling Improvements**

- Enhanced braille output to use word segmentation for Chinese, improving offset and cursor mapping.
- Updated edit text handling to enforce Uniscribe for segmentation.

**4. Configuration and Documentation**

- Added configuration options for word segmentation initialization and standards.
- Updated documentation and copyright attributions.

This lays the groundwork for robust Chinese word navigation and segmentation in NVDA, improving accessibility for Chinese users.
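
The offset mapping mentioned for braille can be illustrated with a small sketch. This is a simplified assumption of how such a converter might work, not NVDA's `WordSegWithSeparatorOffsetConverter`. The class name `SeparatorOffsetConverter`, its methods, and the `wordEnds` input are all hypothetical, and the word boundaries are passed in by hand rather than produced by `cppjieba`. The idea is that when a separator is inserted between segmented words, every offset after a boundary shifts by the separator length, so cursor positions have to be translated in both directions.

```python
from bisect import bisect_right


class SeparatorOffsetConverter:
    """Map offsets between raw text and the same text with a separator
    inserted between consecutive words (hypothetical, simplified)."""

    def __init__(self, text: str, wordEnds: list[int], separator: str = " ") -> None:
        # wordEnds: exclusive end offset of each word in the raw text, ascending;
        # the last entry is expected to equal len(text).
        self.text = text
        self.separator = separator
        # A separator follows every word end except the last one.
        self._boundaries: list[int] = wordEnds[:-1]

    @property
    def separated(self) -> str:
        """The raw text with separators inserted at each word boundary."""
        parts: list[str] = []
        start = 0
        for end in [*self._boundaries, len(self.text)]:
            parts.append(self.text[start:end])
            start = end
        return self.separator.join(parts)

    def rawToSeparated(self, offset: int) -> int:
        # Each boundary at or before the offset puts one separator in front of it.
        count = bisect_right(self._boundaries, offset)
        return offset + count * len(self.separator)

    def separatedToRaw(self, offset: int) -> int:
        sepLen = len(self.separator)
        for i, boundary in enumerate(self._boundaries):
            sepStart = boundary + i * sepLen
            if offset < sepStart:
                # Inside a word that is preceded by i separators.
                return offset - i * sepLen
            if offset < sepStart + sepLen:
                # Inside a separator: snap to the start of the next word.
                return boundary
        # Past every separator.
        return offset - len(self._boundaries) * sepLen
```

For example, with the text `我爱北京天安门` and word ends `[1, 2, 4, 7]`, the separated text is `我 爱 北京 天安门`, and raw offset 2 (北) maps to separated offset 4. Snapping offsets that land inside a separator to the next word is a design choice made for this sketch only.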

cary-rowen requested a review from a team as a code owner May 20, 2026 15:22

cary-rowen requested review from SaschaCowley and removed request for a team May 20, 2026 15:22

cary-rowen force-pushed the try-chineseWordSegmentation-style-typehints branch from 2c4780d to c9f749e Compare May 20, 2026 15:40

Add word segmentation type hints and style cleanup

05882ba

cary-rowen force-pushed the try-chineseWordSegmentation-style-typehints branch from c9f749e to 05882ba Compare May 20, 2026 15:42

Pre-commit auto-fix

02e9625

seanbudd reviewed May 21, 2026

Comment thread source/textInfos/offsets.py Outdated

Comment thread source/textUtils/wordSeg/wordSegStrategy.py Outdated

cary-rowen and others added 4 commits May 21, 2026 10:23

Clarify Chinese segmenter fallback log

3ffa78a

Pre-commit auto-fix

cb773ec

Address Sean review comments

a9d65b5

Pre-commit auto-fix

029e79a

seanbudd approved these changes May 21, 2026

seanbudd merged commit 5373428 into nvaccess:try-chineseWordSegmentation-staging-2 May 21, 2026
4 checks passed

github-actions Bot added this to the 2026.3 milestone May 21, 2026

seanbudd mentioned this pull request May 21, 2026

Add Chinese Word Segmentation #20183

Merged

seanbudd merged 6 commits into nvaccess:try-chineseWordSegmentation-staging-2 from cary-rowen:try-chineseWordSegmentation-style-typehints
