Tom’s Extelopedia

Gadget Views and Reviews

Unicode in Linux

After the long discussion of Unicode and Python (which ended, unfortunately in my opinion, with the Python 3.1 specification explicitly allowing the generation and use of malformed Unicode strings), I have been thinking a bit about where Unicode is going.  Here are my predictions for Unicode, Linux, and Python.
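To make the complaint concrete, here is a minimal sketch of the PEP 383 behaviour, assuming Python 3.1 or later. The file name is made up for illustration; the "surrogateescape" error handler is the one PEP 383 introduced.

```python
# Hypothetical file name stored as Latin-1 bytes; 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# The "surrogateescape" handler maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")
print(repr(name))  # 'caf\udce9.txt'

# The original bytes can be recovered with the same handler...
assert name.encode("utf-8", "surrogateescape") == raw

# ...but the string is not well-formed Unicode, so a strict encode fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```

The string round-trips, but any library that insists on well-formed Unicode will reject it.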

I think Unicode support in FOSS and Linux is in flux right now.  I expect that in 1-2 years, all Linux distributions will enforce UTF-8 file system encoding by default, and many “char*” interfaces will be redefined to take UTF-8 encoded strings, because that’s backwards compatible at the API level: the function signatures stay the same, and plain ASCII input is already valid UTF-8.

I suspect that for non-UTF-8 file systems, Linux desktop environments will probably sample the paths on a file system first to guess at its encoding before mounting it, maybe with a “safe” (in a certain sense) system-dependent fallback (ISO-8859-15 or so), so that things just work almost all the time.  Internally, libraries will probably use UCS-4 or UTF-8, depending on space/time tradeoffs.   PEP 383 will probably be pointless before Python 3 is even widespread.
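The sampling idea could look roughly like the sketch below. It is only an illustration of the guess described above, not anything a desktop environment actually ships: the function name guess_path_encoding and its parameters are hypothetical, and a real implementation would have to collect the raw path bytes from the file system itself.

```python
def guess_path_encoding(sample_paths, fallback="iso8859-15"):
    """Guess the encoding of raw path names (hypothetical helper).

    Returns "utf-8" if every sampled path decodes as strict UTF-8,
    otherwise the single-byte fallback encoding.
    """
    for raw in sample_paths:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # One path that is not valid UTF-8 settles the question.
            return fallback
    return "utf-8"


# Usage with made-up samples:
print(guess_path_encoding([b"notes.txt", "café".encode("utf-8")]))  # utf-8
print(guess_path_encoding([b"notes.txt", b"caf\xe9"]))              # iso8859-15
```

The fallback is “safe” only in the sense that decoding with it will not fail; the resulting text can still be wrong if the file system actually used a different legacy encoding.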

UTF-8 and UCS-4 conversion routines will vary, but I think many of them will be picky about any non-conformant inputs, because debugging Unicode issues is hard enough without having to deal with “lenient” implementations.  They’ll probably not only check on conversion, but also start checking whether any UTF-8 inputs they get are valid, even if they don’t decode them and just pass them on.  It’s good software engineering, and it reduces bug submissions (“you gave a bad UTF-8 string to the library” is better than “the library gave a bad UTF-8 string to some other library”).
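A picky pass-through check of the kind described above might be sketched like this. The helper name require_utf8 is hypothetical; the only thing it relies on is that Python 3’s strict UTF-8 decoder rejects truncated sequences, overlong forms, and encoded surrogates.

```python
def require_utf8(data: bytes) -> bytes:
    """Return data unchanged if it is valid UTF-8, else raise ValueError.

    Hypothetical boundary check: the bytes are validated but not kept
    in decoded form, so the caller's buffer is passed on as-is.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        raise ValueError(f"caller passed malformed UTF-8: {exc}") from None
    return data


# Usage: valid input passes through, an overlong encoding is rejected.
print(require_utf8("naïve".encode("utf-8")))
try:
    require_utf8(b"\xc0\x80")  # overlong encoding of U+0000
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

Failing at the boundary puts the blame on the caller that produced the bad bytes, which is the bug report you want.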

UTF-16 will probably just fall into disuse in the UNIX world.  Windows will be stuck with it, and 16 bit wchar_t, and a separate set of byte APIs, and it will probably also be stuck with a lot of programs that mistakenly treat UTF-16 like UCS-2.  Fortunately, that mistake only bites for characters outside the Basic Multilingual Plane, which in practice mostly means rare Chinese characters.
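The UTF-16 versus UCS-2 confusion comes down to counting code units as if they were characters. A small sketch, using U+20000 (a CJK Extension B ideograph) as an example of a character outside the Basic Multilingual Plane; the variable names are just for illustration:

```python
bmp_char = "\u4e2d"         # U+4E2D, inside the Basic Multilingual Plane
nonbmp_char = "\U00020000"  # U+20000, outside the BMP

# Each UTF-16 code unit is two bytes.
units_bmp = len(bmp_char.encode("utf-16-le")) // 2
units_nonbmp = len(nonbmp_char.encode("utf-16-le")) // 2
print(units_bmp, units_nonbmp)  # 1 2

# Code that assumes "one 16-bit unit = one character" (UCS-2 thinking)
# can cut a surrogate pair in half, leaving malformed UTF-16.
first_unit = nonbmp_char.encode("utf-16-le")[:2]
try:
    first_unit.decode("utf-16-le")
    split_failed = False
except UnicodeDecodeError:
    split_failed = True
print(split_failed)  # True
```

Programs that only ever see BMP text never hit the second case, which is why the bug goes unnoticed for so long.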

Written by extelopedia

2009-04-30 at 798

Posted in General
