Proposal: Convert .mo files to UTF-8
Egmont Koblinger
egmont at uhulinux.hu
Thu Mar 1 20:51:23 UTC 2007
Hi,
I'd like to propose a small change that involves many Fedora packages.
(First I thought I'd put it in bugzilla, but I don't know what the right
component would be.)
The proposed change is the following: when building RPM packages, let's
convert all .mo files (gettext translations) to UTF-8.
Why?
- As Fedora is a fully UTF-8 system, applications are likely to request
translations in UTF-8. (There might be a few applications that are
exceptions, and some users may have special setup or special wrappers to
run certain applications in some other charset, but in the vast majority
of cases gettext is required to return UTF-8 strings.)
If the .mo file is already in UTF-8, the gettext() call simply returns a
pointer into the memory area where the .mo file is mmap()ed. This is easy
to check with strace. In this case no run-time conversion happens and no
per-process memory is involved; translations are shared by all the
processes that use the same message catalog.
If, however, the .mo file uses a different encoding, gettext() has to
allocate memory for the converted string and perform the conversion. So if
several processes display the same localized string, each of them
allocates its own memory to store the UTF-8 version of the string and each
of them performs the charset conversion. Each of them also loads the
corresponding gconv module, which could be avoided, too.
To summarize, having all the .mo files in UTF-8 would save both memory and
CPU time.
- Currently the encoding of the .mo files is completely arbitrary; it is
always what the software developers or the translators happened to use.
With this change, it would be consistently UTF-8. This would make it
easier to find which package ships a particular translation. It often
happens that I want to locate which package a particular message comes
from. It might happen because a word is misspelled, or because the whole
message shouldn't appear and I'd like to fix the buggy package. The
obvious solution is to do a recursive grep on /usr/share/locale/<lang>. If
all the .mo files of the distribution are converted to UTF-8, I can do it
simply, without having to worry about accented characters. (grep in UTF-8
mode works fine and finds the matching UTF-8 .mo files even though they
are not fully valid UTF-8 files, because the UTF-8 strings are surrounded
by other binary data.) However, if multiple encodings are used, there is
no straightforward way to search for accented letters, and the job becomes
much harder.
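To illustrate the search described above, here is a rough Python sketch of
the same idea as the recursive grep. The function name catalogs_containing
and the example directory are made up for illustration. It assumes the
catalogs are already stored in UTF-8; a catalog in any other encoding is
simply missed whenever the text contains accented letters, which is exactly
the problem described above.

```python
import os


def catalogs_containing(locale_dir, text):
    """Return the .mo files under locale_dir that contain text as UTF-8 bytes."""
    needle = text.encode("utf-8")
    matches = []
    for dirpath, _dirnames, filenames in os.walk(locale_dir):
        for name in filenames:
            if not name.endswith(".mo"):
                continue
            path = os.path.join(dirpath, name)
            # .mo files are binary, so compare raw bytes instead of decoding.
            with open(path, "rb") as handle:
                if needle in handle.read():
                    matches.append(path)
    return sorted(matches)
```

For example, catalogs_containing("/usr/share/locale/hu", "Megnyitás") would
list the Hungarian catalogs that ship that string in UTF-8.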
How?
- Due to RPM's flexibility, none of the packages needs to be modified, only
the RPM macros. I recommend performing the conversion on the .mo files
after the '%install' step (in '%__install_post' or whatever it's called).
This way the whole thing is independent of the package's build procedure:
it doesn't matter whether it uses autotools or not, or whether it
re-generates the .mo files from .po or ships pre-built .mo files; there is
no need to worry about faulty and hence skipped .po files, or about
non-standard places of po/mo files within the source tree, etc.
- The only thing that needs to be done is an "msgunfmt" followed by "msgconv
-t UTF-8" and finally "msgfmt" for all the .mo files under the standard
locale directories.
- So, after all, it is _very_ easy to implement it.
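The three steps above could be sketched roughly like this in Python. This
is only an illustration, not the script we actually use: the function
names, the temporary file names and the idea of walking the whole tree with
rglob are my assumptions here. Only the three gettext tools and their -o
and -t options are real. In an RPM macro the same thing would more likely
be a few lines of shell.

```python
import subprocess
import tempfile
from pathlib import Path


def conversion_commands(mo_file, work_dir):
    """Build the msgunfmt / msgconv / msgfmt command lines for one catalog."""
    mo_file, work_dir = Path(mo_file), Path(work_dir)
    original_po = work_dir / "original.po"  # hypothetical temporary names
    utf8_po = work_dir / "utf8.po"
    return [
        # 1. Decompile the binary catalog back into a .po file.
        ["msgunfmt", "-o", str(original_po), str(mo_file)],
        # 2. Re-encode the .po file (header and strings) to UTF-8.
        ["msgconv", "-t", "UTF-8", "-o", str(utf8_po), str(original_po)],
        # 3. Compile it again, overwriting the original .mo file.
        ["msgfmt", "-o", str(mo_file), str(utf8_po)],
    ]


def convert_tree(locale_root):
    """Convert every .mo file below locale_root in place."""
    for mo_file in sorted(Path(locale_root).rglob("*.mo")):
        with tempfile.TemporaryDirectory() as work_dir:
            for command in conversion_commands(mo_file, work_dir):
                # check=True stops at the first tool that reports an error.
                subprocess.run(command, check=True)
```

In a package build, convert_tree would be pointed at the locale directory
inside the build root, so that only the files about to be packaged are
touched.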
Is it safe?
- The encoding inside the .mo files is completely transparent to the
applications as gettext() and its friends always convert the strings to
the charset requested by the application. So applications won't notice any
change.
- We performed this step when building all the packages of the UHU-Linux 2.0
distribution, which was released half a year ago, and so far no problems
have arisen. (During the test period there was only one package, namely
coreutils, where the converted .mo files were corrupted, but as it turned
out, this was caused by a bug in msgunfmt in gettext-0.15, which is already
fixed in gettext-0.16.)
Any drawbacks?
- None that I know of, except for a negligible growth in the packages' sizes.
Well, I hope you like my idea :-)
bye,
Egmont