Sort Chinese text the way Chinese readers expect.
hanzi-sort is a Rust CLI and library for sorting Hanzi by Hanyu Pinyin or by stroke count, with deterministic tie-breaking, phrase-level override rules for polyphonic characters, and terminal-friendly tabular output.
Migration note
`pinyin-sort` has been renamed to `hanzi-sort`. This is a hard rename: there is no compatibility binary alias.
Unicode codepoint order is not Chinese sort order.
If you want useful output for names, glossaries, study lists, or publishing workflows, you usually need more than plain lexical comparison:
- pinyin order for alphabetic-style indexes
- stroke order for dictionary-style or teaching workflows
- phrase-level override rules for polyphonic characters like 重庆 or 银行
- stable tie-breaking so the same dataset always produces the same order
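To see why plain lexical comparison falls short, note that Rust's default string sort compares Unicode codepoints, which already diverges from pinyin order for common surnames:

```rust
// Plain string sort compares Unicode codepoints, not pinyin:
// 张 is U+5F20 and 汉 is U+6C49, so 张三 sorts before 汉字 by codepoint,
// even though pinyin order is 汉字 (han4 zi4) before 张三 (zhang1 san1).
fn main() {
    let mut names = vec!["汉字", "张三"];
    names.sort(); // lexicographic by codepoint
    assert_eq!(names, ["张三", "汉字"]); // not the order a Chinese reader expects
}
```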
Sort by pinyin:

```shell
hanzi-sort -t 汉字 张三 赵四
```

Sort by stroke count:

```shell
hanzi-sort -t 天 一 十 --sort-by strokes --columns 1 --entry-width 2 --blank-every 0
```

Resolve a polyphonic phrase with an override file:

```shell
hanzi-sort -t 重庆 银行 --config ./override.toml
```

Write the result to a file:

```shell
hanzi-sort -t 重庆 银行 -o ./sorted.txt
```

- Sort by `pinyin` or `strokes`
- Keep unknown characters in the comparison key instead of dropping them
- Break ties by original character so output stays deterministic
- Read repeated `--text` values or one non-blank record per line from `--file`
- Override single characters or full phrases with TOML
- Format output with configurable columns, alignment, padding, separators, and blank-line cadence
- Use the same core sorter from Rust via `PinyinCollator`, `StrokesCollator`, and the generic `Collator` trait
- Opt in to additional collators (Cantonese Jyutping, Mandarin Zhuyin, Kangxi Radical) via cargo features
hanzi-sort is 3.8×–4.8× faster than `icu_collator` 2.x with `zh-u-co-pinyin` on Chinese pinyin sort workloads (Apple Silicon, deterministic input):
| N | hanzi-sort | ICU zh-u-co-pinyin | speedup |
|---|---|---|---|
| 1,000 | 188 µs | 759 µs | 4.0× |
| 10,000 | 2.51 ms | 11.99 ms | 4.8× |
| 100,000 | 34.9 ms | 131.5 ms | 3.8× |
The win comes from hanzi-sort trading ICU's full-locale generality for a domain-specific compact representation: every primary pinyin syllable fits in a `u128` after byte-packed encoding, so per-character comparison is two integer compares instead of a multi-level collation-element (CE) table lookup. See BENCHMARKS.md for methodology, caveats (output identity is not preserved across the two collators), and reproduction steps.
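As a rough illustration of the idea (the helper and field layout below are invented for this sketch, not hanzi-sort's actual encoding), packing a tone-numbered syllable's ASCII bytes into a left-aligned `u128` turns syllable comparison into a single integer compare:

```rust
// Sketch of byte-packed syllable keys (illustrative, not hanzi-sort's
// real internal encoding): pack a tone-numbered syllable's ASCII bytes
// into a u128, left-aligned so shorter prefixes sort first.
fn pack_syllable(s: &str) -> u128 {
    debug_assert!(s.is_ascii() && s.len() <= 16, "syllables fit in 16 bytes");
    let mut key: u128 = 0;
    for &b in s.as_bytes() {
        key = (key << 8) | b as u128; // append each byte
    }
    // Left-align so "a" < "ai" holds as fixed-width integer keys.
    key << (8 * (16 - s.len()))
}

fn main() {
    // 'c' < 'q' decides in the very first byte, one integer compare total.
    assert!(pack_syllable("chong2") < pack_syllable("qing4"));
    assert!(pack_syllable("a") < pack_syllable("ai"));
}
```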
```shell
# default: pinyin + strokes
cargo install hanzi-sort

# enable extra collators selectively
cargo install hanzi-sort --features collator-jyutping
cargo install hanzi-sort --features collator-zhuyin
cargo install hanzi-sort --features collator-radical

# everything on
cargo install hanzi-sort --all-features
```

Prerequisites:
- Rust toolchain
- Python 3 and `pypinyin` if you need to regenerate `data/pinyin.csv`
Build:

```shell
cargo build --release
target/release/hanzi-sort -h
```

Or in the Nix dev shell:

```shell
nix develop
just build
```

Basic help:

```shell
hanzi-sort -h
```

Input rules:
- `--text` and `--file` are mutually exclusive
- `--file` reads one non-blank line per record (`-f -` reads from stdin)
- when neither `--file` nor `--text` is given and stdin is piped, hanzi-sort reads stdin (`cat names.txt | hanzi-sort` works)
- directory inputs are rejected
- success exits with `0`; invalid args, bad override files, and I/O failures exit non-zero
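The "one non-blank line per record" rule can be sketched as a small helper (hypothetical, for illustration only — not the crate's API):

```rust
// Hypothetical sketch of the documented record parsing: one record per
// line, blank and whitespace-only lines skipped.
fn records(input: &str) -> Vec<String> {
    input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(str::to_string)
        .collect()
}

fn main() {
    // Blank lines between records are ignored; order is preserved.
    assert_eq!(records("张三\n\n李四\n"), ["张三", "李四"]);
}
```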
Always available:
- `--sort-by pinyin`: default. Compares the primary tone3 pinyin for each mapped character, then falls back to the original character.
- `--sort-by strokes`: compares total stroke count per character, then falls back to the original character.
Opt-in via cargo features (see Install above):
- `--sort-by jyutping` (`--features collator-jyutping`): compares the primary Cantonese Jyutping reading per character (Unihan `kCantonese`).
- `--sort-by zhuyin` (`--features collator-zhuyin`): compares the Mandarin Zhuyin / Bopomofo reading per character (derived from the bundled pinyin data).
- `--sort-by radical` (`--features collator-radical`): compares the Kangxi radical index plus residual stroke count per character (Unihan `kRSUnicode`).
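All of these modes share the same two-level shape: compare per-character keys first, then fall back to the original characters for determinism. A hedged sketch of that shape (the helper and the toy stroke table below are assumptions for illustration, not the crate's API):

```rust
use std::cmp::Ordering;

// Sketch of the two-level ordering the sort modes describe: compare by a
// per-character key, then break ties on the original strings so the same
// input always yields the same output.
fn compare_by_key<K: Ord>(a: &str, b: &str, key: impl Fn(char) -> K) -> Ordering {
    let ka: Vec<K> = a.chars().map(&key).collect();
    let kb: Vec<K> = b.chars().map(&key).collect();
    ka.cmp(&kb).then_with(|| a.cmp(b)) // fallback keeps output deterministic
}

fn main() {
    // Toy stroke-count table (values chosen for this example).
    let strokes = |c: char| match c {
        '一' => 1u8,
        '十' => 2,
        '天' => 4,
        _ => u8::MAX, // unknown characters sort last instead of being dropped
    };
    let mut v = vec!["天", "一", "十"];
    v.sort_by(|a, b| compare_by_key(a, b, strokes));
    assert_eq!(v, ["一", "十", "天"]);
}
```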
- `-f, --file <FILE>`: input file path, can be repeated; `-` reads stdin
- `-t, --text <TEXT>...`: inline text input, can be repeated
- `-o, --output <PATH>`: write output to a file instead of stdout
- `-c, --config <PATH>`: TOML override file (pinyin only)
- `-r, --reverse`: reverse the sorted output
- `-u, --unique`: remove adjacent duplicates after sorting (like `sort -u`)
- `--sort-by <MODE>`: see "Sort modes" above
- `--columns <N>`: entries per row, must be greater than `0`
- `--blank-every <N>`: insert a blank line every `N` rows; use `0` to disable
- `--entry-width <N>`: target display width per entry, must be greater than `0`
- `--align <MODE>`: `left`, `center`, `right`, or `even`
- `--padding-char <CHAR>`: padding character, must have display width `1`
- `--separator <CHAR>`: separator between entries
- `--line-ending <CHAR>`: line ending character
`--config` accepts TOML with either or both sections:
```toml
[char_override]
'重' = "chong2"
'行' = "xing2"

[phrase_override]
"重庆" = ["chong2", "qing4"]
"银行" = ["yin2", "hang2"]
```

Rules:
- `phrase_override` takes precedence over `char_override`
- each `phrase_override` entry must provide exactly one pinyin syllable per character
- omitted sections default to empty maps
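The precedence rule can be sketched as a lookup chain (hypothetical helper and data shapes, for illustration only):

```rust
use std::collections::HashMap;

// Sketch of the documented precedence: a phrase override is consulted
// first; otherwise the character falls back to the char override, and
// finally to the default reading table.
fn reading_of(
    phrase: &str,
    idx: usize,
    phrase_over: &HashMap<String, Vec<String>>,
    char_over: &HashMap<char, String>,
    default: impl Fn(char) -> String,
) -> String {
    if let Some(syllables) = phrase_over.get(phrase) {
        return syllables[idx].clone(); // one syllable per character
    }
    let c = phrase.chars().nth(idx).expect("index within phrase");
    char_over.get(&c).cloned().unwrap_or_else(|| default(c))
}

fn main() {
    let mut phrase_over = HashMap::new();
    phrase_over.insert(
        "重庆".to_string(),
        vec!["chong2".to_string(), "qing4".to_string()],
    );
    let mut char_over = HashMap::new();
    char_over.insert('重', "zhong4".to_string());
    // The phrase override wins over the char override for 重 in 重庆.
    let r = reading_of("重庆", 0, &phrase_over, &char_over, |_| "?".to_string());
    assert_eq!(r, "chong2");
}
```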
The CLI is the primary product, but the sorter is available as a Rust library.
```rust
use hanzi_sort::{StrokesCollator, sort_strings_with};

let sorted = sort_strings_with(
    vec!["一".into(), "十".into(), "天".into()],
    &StrokesCollator,
);
```

```rust
use hanzi_sort::AnyCollator;

let sorted = AnyCollator::pinyin().sort(vec!["赵四".into(), "汉字".into()]);
```

Key APIs:
- `PinyinCollator::pinyin_of(&str) -> Vec<PinYinRecord>`
- `PinyinCollator::with_override(PinyinOverride) -> Result<PinyinCollator>`
- `StrokesCollator` (zero-sized, just construct it)
- `sort_strings_with<C: Collator>(Vec<String>, &C) -> Vec<String>`
- `AnyCollator::pinyin() / strokes() / pinyin_with_override(...)` and `AnyCollator::sort(...)`
- `scripts/convert_pinyin_to_csv.py` builds `data/pinyin.csv` from the vendored pinyin dataset
- `scripts/convert_strokes_to_csv.py` builds `data/strokes.csv` from Unicode `kTotalStrokes` data
- `build.rs` generates static lookup tables in `src/generated/`
- the build validates representative codepoints such as `〇`, `汉`, `重`, and `一`
```shell
cargo test
cargo clippy --all-targets --all-features -- -D warnings
```

Nix helpers:

```shell
nix develop
just prep-data
just build
```

AGPL-3.0-only. See LICENSE.