I’m starting to be afraid of my project, and of myself. I just wanted to write a little tool that’s less annoying than Google Translate, ChatGPT, or DeepSeek when it comes to transcribing Chinese characters into Hànyǔ Pīnyīn. I just didn’t want to see the same rookie mistakes made over and over again, stubbornly, without any hope of improvement any time soon.
So I sat down to write this little tool. And with xAI’s Grok, a bit of ChatGPT, and a couple of amazing open-source libraries (such as RakutenMA and LibreTranslate), it’s actually quite convenient to lay down a bit of code.
But, alas. Lo and behold: what a monster I have created! Hundreds of hours of work, thousands of lines of code. Have a look at a snapshot of the user interface:
This is how it looks now. You drop your text written in Chinese characters on the top left and get Hànyǔ Pīnyīn on the right, along with all the tools you need to manipulate and rearrange the final text. Terribly beautiful. A complex task made easy.
There are still a few quirks, and quite a few rules to implement and improve before it is (almost) compliant with GB/T 16159-2012, but I already love using it, and prefer it over anything else.
However… it’s a lost cause from the start: modern text segmentation and tokenisation tools operate with 70 to 98% accuracy. Some sentences come out better than others before I fix the mistakes by hand, but it will never be 100%.
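To see why segmentation accuracy caps transcription quality, consider that many characters change their reading depending on the word they belong to, so a wrong word boundary yields the wrong Pīnyīn. Here is a toy sketch of the idea (the mini-lexicon and function names are my own illustration, not the tool’s actual code): a greedy longest-match segmenter with a word-level lookup correctly distinguishes the two readings of 重, which a per-character lookup never could.

```python
# Toy illustration (not the actual tool): word segmentation decides
# the Pinyin. The character 重 reads "zhòng" in 重要 ("important") but
# "chóng" in 重庆 ("Chongqing"); only word-level context disambiguates.

# Hypothetical mini-lexicon mapping whole words to their Pinyin.
LEXICON = {
    "重要": "zhòngyào",
    "重庆": "Chóngqìng",
    "很": "hěn",
}

def segment_and_transcribe(text: str) -> list:
    """Greedy longest-match segmentation, then word-level Pinyin lookup."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for j in range(len(text), i, -1):
            word = text[i:j]
            if word in LEXICON:
                result.append(LEXICON[word])
                i = j
                break
        else:
            # Unknown character: pass it through unchanged.
            result.append(text[i])
            i += 1
    return result

print(segment_and_transcribe("重庆很重要"))
# → ['Chóngqìng', 'hěn', 'zhòngyào']
```

Real segmenters replace the greedy match with statistical or neural models, which is exactly where the 70–98% accuracy figure comes in: every missed word boundary risks a wrong reading.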
And some years later, when Baidu, Alibaba, Google, and the rest finally jump on the Pīnyīn train, my software will be obsolete. Yet I started this fully aware of its transitoriness. A little bit crazy. But beautiful.