Keep Chinese word segmentation helper modules private #20233
Merged
seanbudd merged 1 commit on May 29, 2026
Author (Contributor): cc @seanbudd
Member: Thanks this helps
c391112 into nvaccess:try-chineseWordSegmentation-staging-2
33 of 37 checks passed
seanbudd added a commit that referenced this pull request on Jun 12, 2026
Squash commit of:

- #18548
- #18735
- #18865
- #19324
- #19747
- #20041
- #20055
- #20106
- #20162
- #20178
- #20185
- #20205
- #20233
- #20242
- #20227
- #20279
- #20278
- #20288

Previous try branch commit history: #19166

---

This pull request introduces Chinese word segmentation support in NVDA through the integration of the `cppjieba` library. It adds the `cppjieba` submodule, builds and links the library into the NVDA build process, exposes new C++ and Python APIs for word segmentation, and updates braille and text handling to take advantage of improved segmentation for Chinese text. Configuration options and documentation are also updated to reflect the new dependency and features.

**Integration of Chinese Word Segmentation (cppjieba):**

* Added `cppjieba` as a submodule (`include/cppjieba`), included its license, and documented its usage and commit in the project documentation and `copying.txt`.
* Implemented a SCons build script and build integration for `cppjieba`, ensuring the library and its dictionaries are built and installed as part of the NVDA build process.
* Added workflow improvements to ensure recursive submodule checkouts, so all dependencies (including nested ones) are fetched.

**C++ and Python API Additions:**

* Developed a thread-safe singleton wrapper and C API for `cppjieba` in `nvdaHelper/cppjieba`, exposing functions for initialization, segmentation, user word management, and memory management.
* Exposed new DLL and dictionary path properties in `NVDAState.py` for use by Python code.

**Braille and Text Handling Enhancements:**

* Updated braille output logic to use word segmentation for Chinese tables, applying a new `WordSegWithSeparatorOffsetConverter` to improve cursor and offset mapping.
* Modified edit text handling to enforce the use of Uniscribe for character and word segmentation, ensuring consistent behavior with the new segmentation logic.
* Updated copyright and contributor attributions.

**Configuration and Documentation Updates:**

* Added configuration options for eager initialization and selection of word segmentation standards in `configSpec.py`.
* Updated documentation to reflect the new dependency and its usage.

**Summary of Most Important Changes:**

**1. Integration of cppjieba for Chinese Word Segmentation**

- Added `cppjieba` as a submodule, updated `.gitmodules`, and documented its license and commit.
- Implemented SCons build scripts and build integration for `cppjieba`, including dictionary installation.
- Improved workflow to fetch all submodules recursively.

**2. C++/Python API and Library Exposure**

- Created thread-safe singleton and C API for `cppjieba`, exposing segmentation and user word management functions.
- Exposed DLL and dictionary paths in `NVDAState.py` for use by Python code.

**3. Braille and Text Handling Improvements**

- Enhanced braille output to use word segmentation for Chinese, improving offset and cursor mapping.
- Updated edit text handling to enforce Uniscribe for segmentation.

**4. Configuration and Documentation**

- Added configuration options for word segmentation initialization and standards.
- Updated documentation and copyright attributions.

This lays the groundwork for robust Chinese word navigation and segmentation in NVDA, improving accessibility for Chinese users.
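The offset mapping mentioned above for braille output can be illustrated with a small, self-contained sketch. This is not NVDA's `WordSegWithSeparatorOffsetConverter`: the class name `SeparatorOffsetSketch`, its methods, and the pre-segmented word list are all hypothetical, and no `cppjieba` call is made. It only shows the general idea of translating cursor offsets between a text and the same text with a separator inserted between words, assuming every word is non-empty.

```python
from bisect import bisect_right


class SeparatorOffsetSketch:
	"""Hypothetical sketch, not NVDA's converter.

	Maps offsets between a text and the same text with a separator
	inserted between consecutive, already-segmented words.
	"""

	def __init__(self, words: list[str], separator: str = " "):
		self.separator = separator
		self.original = "".join(words)
		self.separated = separator.join(words)
		self._wordLengths = [len(word) for word in words]
		# Start offset of each word in the original text.
		self._wordStarts: list[int] = []
		pos = 0
		for length in self._wordLengths:
			self._wordStarts.append(pos)
			pos += length
		# Start offset of each word in the separated text.
		self._sepStarts = [
			start + index * len(separator)
			for index, start in enumerate(self._wordStarts)
		]

	def toSeparated(self, offset: int) -> int:
		"""Original offset -> offset in the separated text."""
		# Index of the word containing the offset equals the number of
		# separators that precede it.
		wordIndex = max(bisect_right(self._wordStarts, offset) - 1, 0)
		return offset + wordIndex * len(self.separator)

	def toOriginal(self, offset: int) -> int:
		"""Separated offset -> original offset.

		Offsets that fall inside a separator snap to the start of the
		following word.
		"""
		if not self._sepStarts:
			return 0
		wordIndex = max(bisect_right(self._sepStarts, offset) - 1, 0)
		within = offset - self._sepStarts[wordIndex]
		return self._wordStarts[wordIndex] + min(within, self._wordLengths[wordIndex])


# Usage with a hand-segmented sentence (the segmentation is supplied, not computed).
converter = SeparatorOffsetSketch(["我", "喜欢", "北京"])
print(converter.separated)
print(converter.toSeparated(3), converter.toOriginal(4))
```

Precomputing the word start offsets lets both directions use a binary search, so each lookup stays cheap even for long lines; a real converter would also need to handle whatever offset units the braille translation layer expects.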
This PR splits the private module rename out from the remaining Chinese word segmentation review feedback.
Summary
- `source/textUtils/wordSeg` to `source/textUtils/_wordSeg`.
- `source/textUtils/braille.py` to `source/textUtils/_braille.py`.

Details

The moved `_wordSeg` files are unchanged in this PR apart from their paths.

The moved `_braille.py` file is also unchanged apart from its path.

The updated references are internal NVDA references in:

- `source/braille.py`
- `source/core.py`
- `source/gui/settingsDialogs.py`
- `source/textInfos/offsets.py`
- `tests/unit/test_braille/test_routing.py`
- `tests/unit/test_wordSeg.py`

Rationale

The word segmentation implementation is new and is kept private by using the `_wordSeg` module name.

The braille offset-converter helper is also an internal helper used by `braille.py`, so it is kept private as `_braille`.

This PR is intended to be behavior-neutral.
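For readers carrying out-of-tree code that referenced the old paths, a rename like this amounts to a mechanical reference update. The following is a minimal sketch of such an update, not the tooling used for this PR: the `RENAMES` mapping and the `rewriteReferences` helper are hypothetical, and the sketch only handles fully dotted references such as `textUtils.wordSeg`, not forms like `from textUtils import wordSeg`.

```python
import re

# Hypothetical mapping from the old module paths to the private ones
# named in this PR.
RENAMES = {
	"textUtils.wordSeg": "textUtils._wordSeg",
	"textUtils.braille": "textUtils._braille",
}

# Match only whole dotted names, so that e.g. "textUtils.brailleTables"
# is left alone.
_PATTERN = re.compile(
	r"\b(" + "|".join(re.escape(old) for old in RENAMES) + r")\b",
)


def rewriteReferences(source: str) -> str:
	"""Return source with dotted references to the old modules renamed."""
	return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


print(rewriteReferences("from textUtils.wordSeg import something"))
```

Because the already-renamed paths no longer match the pattern, running the helper twice leaves the text unchanged, which is convenient when a rebase reapplies the same rewrite.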
Follow-up
#20227 will be rebased after this PR is merged into #20183 so that it only contains the remaining review feedback.