Adding a C.UTF-8 locale to our glibc packages
by Mike FABIAN
This patch adds a C.UTF-8 locale as a folder /usr/lib/locale/C.utf8/
to our glibc packages.
This way, the locale is sort of “unremovable” because it is not
affected by the --install-langs option of build-locale-archive.
build-locale-archive completely ignores folders in /usr/lib/locale/
which do not have a “_” in their name.
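
As a rough illustration of the rule just described (this is a model of the
behaviour as stated above, not the actual build-locale-archive code; the
function name and the directory list are made up for the example):

```python
def considered_by_archive(dirname: str) -> bool:
    """Model of the rule described above: only names containing '_' count."""
    return "_" in dirname

# Hypothetical contents of /usr/lib/locale/
dirs = ["C.utf8", "en_US.utf8", "de_DE.utf8", "ja_JP.eucjp"]

kept = [d for d in dirs if considered_by_archive(d)]
skipped = [d for d in dirs if not considered_by_archive(d)]

print("considered:", kept)
print("ignored:   ", skipped)
```

Under this rule C.utf8 is never touched by the archive step, which is why
--install-langs cannot remove it.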
This is similar to how Debian did it.
Unlike the Debian source, I added the LC_* sections which are missing
there: when sections are missing, localedef prints a warning, and I
don’t want to use the “-c” option to force the output in spite of the
warnings.
This locale is very close in behaviour to C/POSIX, but LC_CTYPE simply
copies from glibc/localedata/locales/i18n (which we have just updated
for Unicode 8.0.0).
So this C.UTF-8 gives us a locale which is very much like the C locale
but uses UTF-8 encoding and tools like "ls" will display all printable
characters from Unicode instead of displaying question marks for
everything non-ASCII.
Sorting (LC_COLLATE) is done strictly via Unicode code point order which
gives the same sorting for the ASCII range as the traditional C locale.
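
A small sketch of what strict code point order means in practice. It uses
Python's own string comparison (which compares by code point) as a
stand-in; it does not call glibc's collation code, and the word list is
made up for the example:

```python
words = ["zebra", "Apple", "éclair", "apple", "Zoo", "日本", "_x", "1st"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Bytewise comparison of the UTF-8 encoding preserves code point order.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# For pure ASCII strings this is the same as a plain byte comparison,
# which is how the traditional C locale orders them.
ascii_words = [w for w in words if w.isascii()]
ascii_by_code_point = sorted(ascii_words)
ascii_by_bytes = sorted(ascii_words, key=lambda w: w.encode("ascii"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)
print(ascii_by_code_point == ascii_by_bytes)
```

Uppercase letters sort before "_" and lowercase letters, and everything
non-ASCII sorts after the ASCII range.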
Debian defines it like this:
LC_COLLATE
order_start forward
<U0000>
<U0001>
all code points listed individually leaving out the unassigned ranges
<U10FFFE>
<U10FFFF>
UNDEFINED
order_end
END LC_COLLATE
(more than 300000 lines of code points listed).
I used this instead:
LC_COLLATE
order_start forward
<U0000>
..
<UFFFF>
<U10000>
..
<U1FFFF>
<U20000>
..
<U2FFFF>
<UE0000>
..
<UEFFFF>
<UF0000>
..
<UFFFFF>
<U100000>
..
<U10FFFF>
UNDEFINED
order_end
END LC_COLLATE
This makes the source much shorter and more readable. The result
in the binary is the same, and the size of the binary is the same as
well: the complete binary locale needs about 1.8M, almost all of it
because of LC_COLLATE (same on Debian).
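
For illustration, the compact section above can be generated from the six
ranges with a few lines of Python. This is only a sketch: the helper
names are invented for the example, and it is not how the locale source
was actually produced.

```python
# The six ranges used in the LC_COLLATE section above.
RANGES = [
    (0x0000, 0xFFFF),
    (0x10000, 0x1FFFF),
    (0x20000, 0x2FFFF),
    (0xE0000, 0xEFFFF),
    (0xF0000, 0xFFFFF),
    (0x100000, 0x10FFFF),
]

def symbol(cp: int) -> str:
    """Format a code point as a <Uxxxx> symbol (at least 4 hex digits)."""
    return f"<U{cp:04X}>"

def collate_section(ranges) -> str:
    """Build an LC_COLLATE section using '..' ellipsis ranges."""
    lines = ["LC_COLLATE", "order_start forward"]
    for first, last in ranges:
        lines += [symbol(first), "..", symbol(last)]
    lines += ["UNDEFINED", "order_end", "END LC_COLLATE"]
    return "\n".join(lines)

# Number of code points covered by the ranges vs. the full code space.
covered = sum(last - first + 1 for first, last in RANGES)
total = 0x10FFFF + 1

print(collate_section(RANGES))
print(covered, "of", total, "code points listed")
```

The ranges cover 393216 of the 1114112 possible code points in 23 lines
of source.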
Not skipping the ranges currently unassigned in Unicode would make
the locale about 6.5M big. That seems theoretically better to me,
but there is probably little benefit in sorting unassigned code points
by code point order as well.
Actually I think this should be enough:
LC_COLLATE
order_start forward
UNDEFINED
order_end
END LC_COLLATE
because of:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
opengroup> The symbol UNDEFINED shall be interpreted as including all coded
opengroup> character set values not specified explicitly or via the ellipsis
opengroup> symbol. Such characters shall be inserted in the character collation
opengroup> order at the point indicated by the symbol, and in ascending order
opengroup> according to their coded character set values. If no UNDEFINED symbol is
opengroup> specified, and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a warning message and
opengroup> place such characters at the end of the character collation order.
But unfortunately it does not work like that in glibc, which is probably
a bug.
I tried to look at the code to find out why UNDEFINED does not work as
specified in the standard but have not figured it out yet.
UNDEFINED currently does not work as specified at all: most locale
sources have UNDEFINED somewhere, but the characters not specified
explicitly do not get inserted where UNDEFINED is; instead they end up
right at the top (before all other characters).
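
To make the quoted wording concrete, here is a toy model of where the
standard says unlisted characters should go. It only models the text
quoted above on a tiny made-up character set; it is not glibc code, it
does not reproduce the glibc behaviour described here, and it omits the
warning the standard requires when UNDEFINED is absent.

```python
UNDEFINED = "UNDEFINED"  # marker entry, as in the locale source

def collation_order(entries, charset):
    """Return the collation order the quoted text describes.

    entries: explicitly listed characters, optionally containing UNDEFINED
    charset: all characters of the (toy) coded character set
    """
    listed = [e for e in entries if e != UNDEFINED]
    # Unlisted characters, in ascending order of their coded values.
    rest = sorted(set(charset) - set(listed))
    if UNDEFINED not in entries:
        # No UNDEFINED symbol: unlisted characters go at the end.
        return listed + rest
    i = entries.index(UNDEFINED)
    # Unlisted characters are inserted at the point of the symbol.
    return entries[:i] + rest + entries[i + 1:]

charset = "abcde"
print(collation_order(["c", UNDEFINED, "a"], charset))
print(collation_order([UNDEFINED], charset))
print(collation_order(["c", "a"], charset))
```

With only UNDEFINED listed, the result is plain ascending order of coded
values, which is why the five-line section should be enough according to
the standard.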
--
Mike FABIAN <mfabian(a)redhat.com>
☏ Office: +49-69-365051027, internal 8875027
睡眠不足はいい仕事の敵だ。
8 years, 2 months
Tunables in rawhide?
by Siddhesh Poyarekar
Carlos, Florian,
How adventurous are you guys about me adding the tunables patches to
rawhide? It is relatively harmless since it does not have any ABI
implications and the patches can be backed out if they cause problems.
The only implication is that of API; we haven't decided on the method
to get input from users (GLIBC_TUNABLES vs envvar for each tunable vs
some other method) and that could change as discussions progress
upstream. I think we could manage that by explicitly announcing on
the fedora-devel mailing list that it could change in the future.
Or I could just add the first patch to mass-smoke-test it in rawhide
and hold on to the second patch till there is consensus upstream.
Thoughts?
Siddhesh
8 years, 3 months