Here’s to the alphabet

I’ve been on GitHub for 17 years now and never bothered to upload or contribute anything. That changed last weekend, or at least I decided to upload one of my many projects. 😅

This one was quite a wild ride, actually. I wasn’t sure I needed the algorithm to begin with, but I thought, well, “nice to have.” In the beginning it looked super straightforward and easy, then for a long time it looked super complicated, and in the end the final version looks so simple that it could have been jotted down in a minute.

Four days and, I guess, 40 hours of work later, here’s my repo:

https://github.com/alfons/PinyinAbcSort

PinyinAbcSort – Sort Hànyǔ Pīnyīn in alphabetical order (fast)

Description:

This project implements sorting Pīnyīn words into alphabetical word order, based on the rules outlined by John DeFrancis in ABC Chinese-English Dictionary, Page xiii, Reader’s Guide, I. Arrangement of Entries.

The sorting algorithm compares words letter by letter, not syllable by syllable. This reflects the fact that Hànyǔ Pīnyīn is written using the Latin alphabet, which is the key insight and design choice behind this implementation.
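To make the difference concrete, here is a minimal Python sketch. It is not the code from the repo: the helper names (`letters_only`, `letter_key`, `syllable_key`) are made up for illustration, the words are assumed to be pre-split into syllables, and tone marks (and the dots on ü) are simply thrown away so that only the letter comparison is visible.

```python
import unicodedata


def letters_only(text):
    """Lowercase base letters only; tone marks and the diaeresis are dropped."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks are not alphabetic, so isalpha() filters them out.
    return "".join(ch for ch in decomposed if ch.isalpha()).lower()


def letter_key(syllables):
    """Letter by letter: the whole word is one string of letters."""
    return letters_only("".join(syllables))


def syllable_key(syllables):
    """Syllable by syllable: compare the first syllable, then the next, ..."""
    return [letters_only(s) for s in syllables]


# Hypothetical input: two words, already split into syllables.
words = [["bàn", "lǐ"], ["bāng", "zhù"]]

by_letter = sorted(words, key=letter_key)
by_syllable = sorted(words, key=syllable_key)

print(["".join(w) for w in by_letter])
print(["".join(w) for w in by_syllable])
```

Letter by letter, bāngzhù comes before bànlǐ because g sorts before l. Syllable by syllable, bànlǐ would come first because the syllable ban sorts before bang.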

The ordering rules are:

  1. Alphabetical order: Base characters (a–z), compared letter by letter
  2. u before ü, U before Ü
  3. Tones: 0 < 1 < 2 < 3 < 4
  4. Case: lowercase and mixed-case before uppercase
  5. Separators: apostrophe < hyphen < space
  6. Digits and other characters: no rules were given for the digits 0–9, so they sort first, before the letters. All other characters sort last, ordered by their Unicode value.
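The rules above can be read as a chain of tie-breakers, which maps nicely onto a tuple sort key. The following is a rough Python sketch under that assumption, not the implementation from the repo. The name `pinyin_sort_key` is hypothetical, and so are two simplifications: separators are ignored while letters are compared, and only their kind (not their position) is used as the final tie-breaker.

```python
import unicodedata

# Combining marks produced by NFD decomposition, mapped to tone numbers.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}
DIAERESIS = "\u0308"  # the two dots that turn u into ü
# Rule 5: apostrophe < hyphen < space (straight and curly apostrophe).
SEPARATOR_RANK = {"'": 0, "\u2019": 0, "-": 1, " ": 2}


def pinyin_sort_key(word):
    """Hypothetical key: one list per rule, compared in rule order."""
    letters, umlauts, tones, cases, separators = [], [], [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if ch in TONE_MARKS and tones:
            tones[-1] = TONE_MARKS[ch]      # rule 3: tone of the last letter
        elif ch == DIAERESIS and umlauts:
            umlauts[-1] = 1                 # rule 2: u (0) before ü (1)
        elif ch in SEPARATOR_RANK:
            separators.append(SEPARATOR_RANK[ch])
        else:
            lower = ch.lower()
            if "0" <= ch <= "9":
                group = 0                   # rule 6: digits first
            elif "a" <= lower <= "z":
                group = 1                   # rule 1: base letters
            else:
                group = 2                   # rule 6: everything else last
            letters.append((group, lower))
            umlauts.append(0)
            tones.append(0)                 # 0 = neutral tone, no mark
            cases.append(1 if ch.isupper() else 0)  # rule 4
    return (letters, umlauts, tones, cases, separators)


print(sorted(["mà", "ma", "mǎ", "má", "mā"], key=pinyin_sort_key))
# → ['ma', 'mā', 'má', 'mǎ', 'mà']
```

Because Python compares tuples element by element, a later rule only matters when all earlier ones tie. That is why lù sorts before lü in this sketch: u before ü is decided before the tones are even looked at.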

Credits:

  • John DeFrancis (1911–2009): Original Pīnyīn alphabetical word order, in passionate acknowledgment of the advocates of writing reform Lú Zhuàngzhāng (卢戆章, 1854–1928), Lǔ Xùn (鲁迅, 1881–1936), Máo Dùn (Shěn Yànbīng, 茅盾, 沈雁冰, 1896–1981), Wáng Lì (王力, 1900–1988), Lǚ Shūxiāng (吕叔湘, 1904–1998), and Zhōu Yǒuguāng (周有光, 1905–2017).
  • Mark Swofford of Banqiao, Taiwan: Summarised the rules on the internet and pointed out where to find them.
  • Alfons Grabher: Idea, concept, prompting, testing, and driving the development of pinyinAbcSort.
  • Grok (xAI), ChatGPT 4o: Coding the implementation with flair and precision.