Python: introducing icu4py, bindings to the Unicode ICU library

I made a new package! Thank you to my client Rippling for inspiring and sponsoring its development.
ICU (International Components for Unicode) is Unicode’s official library for Unicode and Globalization tools. It’s a de-facto standard for handling text in a locale-aware way, used by many major projects, including Chrome, Firefox, macOS, VS Code, and so on. ICU comes in two flavours: ICU4C (for C/C++) and ICU4J (for Java). (There’s also the newer ICU4X, built in Rust, but it’s a bit more of a work-in-progress.)
My new package, icu4py, exposes some ICU4C functionality in Python, translating the C++ API into a more Pythonic one. It currently supports boundary analysis (breaking text into words, sentences, etc.) and message formatting (using the ICU MessageFormat syntax). I’m open to adding more ICU features in the future, depending on demand.
Batteries included
Here’s a summary of the two features included in icu4py right now.
Boundary analysis
ICU supports text boundary analysis: finding linguistic boundaries in text, such as words or sentences, based on per-language rules. This is useful for things like accurate word counts or wrapping text for display. icu4py exposes this feature through “breaker” classes. For example, to split text into sentences using SentenceBreaker:
>>> from icu4py.breakers import SentenceBreaker
>>> text = 'You asked "Why?". We answered "Why not?"'
>>> breaker = SentenceBreaker(text, "en_GB")
>>> list(breaker)
['You asked "Why?". ', 'We answered "Why not?"']
Note the quoted sentences are kept within their outer ones.
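If you need positions rather than substrings (say, to highlight a sentence in place), you can derive them from the segments. Here’s a minimal sketch that doesn’t import icu4py at all: it takes the list shown above as plain data and assumes the segments concatenate back to the original text, which holds for this sample. The helper name segment_spans is my own invention, not part of icu4py, and the offsets are ordinary Python string indices.

```python
from itertools import accumulate


def segment_spans(segments):
    """Return (start, end) string indices for each segment, in order.

    Assumes the segments join back into the original text with nothing
    dropped between them.
    """
    ends = list(accumulate(len(segment) for segment in segments))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


text = 'You asked "Why?". We answered "Why not?"'
# The output of list(SentenceBreaker(text, "en_GB")) shown above, pasted as data.
segments = ['You asked "Why?". ', 'We answered "Why not?"']

# Check the assumption before trusting the offsets.
assert "".join(segments) == text

for start, end in segment_spans(segments):
    print(start, end, repr(text[start:end]))
```

Slicing the original text with each span gives back the matching segment, so the spans can be stored or passed around instead of copies of the text.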
Brilliant translation with MessageFormat
ICU also provides a flexible translation tool called MessageFormat. This allows translators to write patterns that capture the nuances of different languages, like varying pluralization rules. icu4py exposes this API through its MessageFormat class:
>>> from icu4py.messageformat import MessageFormat
>>> pattern = "{count,plural,one {# file} other {# files}}"
>>> fmt = MessageFormat(pattern, "en_GB")
>>> fmt.format({"count": 0})
'0 files'
>>> fmt.format({"count": 1})
'1 file'
>>> fmt.format({"count": 5})
'5 files'
In this case, our pluralization varies the output based on whether the count variable is one or any other number.
French has a subtle but important difference—it treats zero as singular, unlike English:
>>> pattern = "{count,plural,one {# fichier} other {# fichiers}}"
>>> fmt = MessageFormat(pattern, "fr_FR")
>>> fmt.format({"count": 0})
'0 fichier'
>>> fmt.format({"count": 1})
'1 fichier'
>>> fmt.format({"count": 2})
'2 fichiers'
This example shows an important point: the meaning of the one category is defined by the locale. In French, the “one” category matches both 0 and 1 because they both use the singular form.
Korean has no plural distinction at all—the same form is used regardless of count:
>>> pattern = "{count,plural,other {# 파일}}"
>>> fmt = MessageFormat(pattern, "ko_KR")
>>> fmt.format({"count": 0})
'0 파일'
>>> fmt.format({"count": 1})
'1 파일'
>>> fmt.format({"count": 5})
'5 파일'
ICU picks the right plural category for each count based on the locale, so your code doesn’t need to hard-code the rules for each language.
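To make the category selection concrete, here’s a toy pure-Python sketch of what happens for the three locales above. It is not how ICU or icu4py is implemented: the rules are hand-simplified from the CLDR plural rules, cover non-negative integers only, and handle a single top-level plural argument with no nesting or explicit selectors such as =0. The function names plural_category and format_plural are hypothetical.

```python
import re


def plural_category(count: int, locale: str) -> str:
    """Pick a plural category for a non-negative integer (simplified rules)."""
    language = locale.split("_")[0]
    if language == "en":
        return "one" if count == 1 else "other"
    if language == "fr":
        # French puts both 0 and 1 in the "one" category.
        return "one" if count in (0, 1) else "other"
    if language == "ko":
        # Korean only has the "other" category.
        return "other"
    raise ValueError(f"no rules sketched for locale {locale!r}")


def format_plural(pattern: str, locale: str, values: dict) -> str:
    """Format a pattern shaped like {name,plural,one {...} other {...}}."""
    match = re.fullmatch(r"\{\s*(\w+)\s*,\s*plural\s*,(.*)\}", pattern, re.DOTALL)
    if match is None:
        raise ValueError("only a single top-level plural argument is handled")
    name, body = match.groups()
    # Map each category keyword to its sub-message, e.g. {"one": "# file"}.
    forms = dict(re.findall(r"(\w+)\s*\{([^{}]*)\}", body))
    count = values[name]
    # Fall back to "other" when the pattern has no form for the category.
    template = forms.get(plural_category(count, locale), forms["other"])
    return template.replace("#", str(count))


french = "{count,plural,one {# fichier} other {# fichiers}}"
print(format_plural(french, "fr_FR", {"count": 0}))  # 0 fichier
print(format_plural(french, "fr_FR", {"count": 2}))  # 2 fichiers
```

The point of the sketch is the split of responsibilities: the translator writes one sub-message per category, and the locale decides which category a given number falls into.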
MessageFormat version 2 is in the works, complete with its own site. It’s in technical preview in ICU4C, and I aim to expose it in icu4py soon, tracked in Issue #17.
Backstory
Rippling is a global HR, Payroll, IT, and Finance platform, with over 20,000 customers all over the world. They work hard to make their product available in many languages and locales. Their Data Globalization team found that ICU MessageFormat would be a great next step for handling complex localized messages in their product. Since they use Python (and Django), they needed to pick a Python library to integrate ICU.
This is where the challenge started. In a Slack thread, we surveyed the existing options, and found them all lacking, especially on two points:
- None provided compiled wheels, a strong necessity for smooth installation among Rippling’s 1,000+ engineers.
- They all depend on a system-based install of ICU, which could present different versions between environments.
In particular, we found these options:
- pyicu: impressively maintained since 2007, but poorly documented, with a clunky C++-style API, and lacking some key Python modernization. For example, its C++ extensions don’t use multi-phase initialization, per PEP 489, which is required to support sub-interpreters, as Rippling may wish to adopt in the future.
  Additionally, while I was working on this project, its self-hosted GitLab went down, and it hasn’t been restored for some weeks. This does not bode well for long-term maintenance!
- icupy: newer, modern, and better-maintained, but it still copies the C++ API directly, which is quite unwieldy.
- pyseeyou: a pure-Python implementation of MessageFormat. While this library is appealing in terms of ease of installation, it performs all parsing and formatting in Python, making it fairly slow. Additionally, the package does not provide any locale-specific formatting.
After this survey, I opted to try my hand at creating a proof-of-concept package for calling ICU4C’s C++ MessageFormat class from Python. Now, C++ is a beautifully haunted language, and I’m correspondingly uncomfortable with it. While I have developed some Python C extensions before, like time-machine, until this point I had not touched C++ since my university days.
Thankfully, we are living in the LLM era and this kind of “glue two things together” task is something that they excel at. Within a few hours, my mate Claude and I had a working prototype that could handle basic formatting:
from icupoc import MessageFormat
# English plurals
pattern = "{count, plural, one {# file} other {# files}}"
fmt = MessageFormat(pattern, "en_US")
print(fmt.format({"count": 1})) # "1 file"
print(fmt.format({"count": 5})) # "5 files"
(You can still see this prototype at icupoc.)
Happy with the PoC, Rippling gave me the go-ahead to build out a full ICU package.
Building
I continued to use Claude when building out icu4py proper, and I guess I was “agentic engineering” (or “vibe engineering”). While I used the LLM to generate code, I reviewed every line, making many edits before any step was ready. I also deployed my usual stack of tools to format code, check types, build robust documentation, etc.
Using an LLM essentially made this project feasible. A few years ago, I would have shied away from writing in C++ and it would probably have taken me a lot longer to get running. But instead of finding C++ big/scary, having many of the small details taken care of made the project fun!
Things weren’t all smooth sailing, though. Rippling have several “blessed” languages, including Rust, for which they have a monorepo with lots of tooling set up. Before deciding to build the open source package, I tried to build an internal Rust-based version. I made several different attempts to build Rust-to-C++ bindings, spending a silly amount of tokens and time going down rabbit holes to determine that various approaches don’t work or aren’t worth it. The LLM’s confidence and sycophancy led me to keep trying ideas for a long time before cutting my losses. If I had known more about C++ or Rust, I think I wouldn’t have tried using Rust to begin with!
I also learned that LLMs are really poor at using all of their embedded knowledge at once. Even when writing the proof-of-concept, I prompted Claude to improve its own code and it came up with many changes (commit). That said, re-applying this trick more times made it adopt advanced C++ abstractions, completely unnecessarily. It seems that to get high quality code, you need to loop an LLM around a few times on the same code while still using your judgement to know when to stop.
Documentation was critical for getting good results. I often copied sections or whole pages from Python or ICU’s documentation straight into the LLM chat. For ICU’s auto-generated class documentation, I used Jina Reader to convert its dense HTML to Markdown. Docs are king, and even more so now.
That said, using LLM-generated code did free up my time to write and edit the documentation more than I usually would. I made sure to include sensible links to the relevant parts of ICU’s documentation, as well as great examples.
icu4c-builds
A large part of building icu4py was compiling ICU4C itself. It quickly became clear that I couldn’t build icu4py wheels that depended on the system ICU library because binding to C++ classes within a dynamic library is very tricky. Therefore, I built a second repository, icu4c-builds, which builds ICU4C in the exact setup I need for icu4py.
Getting icu4c-builds working across Linux, macOS, and Windows and multiple architectures was a challenge with many long-running failed builds. Claude was very useful here, though, since it has a lot of embedded knowledge around small details like compiler flags. At some points, I was just shovelling logs between GitHub Actions and Claude, wondering why I was even there.
Anyway, now the build system seems to be in a good place, and since ICU is pretty stable, the process is unlikely to break in the future.
Future work
Right now, icu4py only has bindings to two ICU modules: boundary analysis and message formatting. There are 40+ more to bind to, providing all kinds of features likely to be useful in globalizing Python apps. If there are modules that you’d like available, please open an issue!
I’m pretty confident that with a solid foundation to copy from, LLMs can churn through building further bindings pretty quickly. That said, I still think it’s key for icu4py’s success to design an API that makes sense in Python, for which I think human taste still prevails.
On the Django side, I think there could be a place for writing an integration for icu4py with Django’s existing internationalization and localization framework. While there’s a lot of overlap, ICU has a lot of extra features that could be useful.
Fin
Please check out icu4py for your text boundary analysis and translation needs!
Thanks again to Rippling for paying for this work. ⭐ Working with them is a fun challenge—even small changes can have a large impact, thanks to their massive scale. If you’re looking for your next opportunity, and feel smart and ready to go after hard problems on day one, check out Rippling’s open roles.
—Adam
Tags: python