On Mon, Feb 02, 2004 at 07:04:22PM +0000, Luciano Miguel Ferreira Rocha wrote:
> On Mon, Feb 02, 2004 at 01:15:01PM -0500, Behdad Esfahbod wrote:
> > This is totally expected :). For what you want try LANG=C echo [A-Z]* ... In LANG=en_US.UTF-8 which is the default, both 'A' and 'a' sort before 'B' and 'b'.
> But 'A' is *after* 'a'? And I thought utf-8 was supposed to be compatible with ascii... Well, if that's the standard, then it isn't broken...
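UTF-8 is an encoding scheme that packs Unicode characters into a variable number of bytes, so that the most common characters take the fewest bytes. It is compatible with ASCII at the byte level: every ASCII character, 0x00 included, is encoded as the same single byte it has in ASCII. (Some variants, such as Java's "modified UTF-8", encode 0x00 in two bytes, but standard UTF-8 does not.) So the encoding is not what changes the sort order.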
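To make the encoding point concrete, here is a small Python sketch. The sample strings are my own and nothing here depends on the current locale; it only shows that ASCII text produces the same bytes under both encodings, while a character outside ASCII needs more than one byte in UTF-8.

```python
# ASCII-only text encodes to identical bytes under ASCII and UTF-8.
text = "Hello, A-Z a-z"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# e with an acute accent is outside ASCII: one code point, two UTF-8 bytes.
e_acute = "\u00e9"
code_point = ord(e_acute)
e_acute_utf8 = e_acute.encode("utf-8")

# NUL (0x00) is still a single byte in standard UTF-8.
nul_utf8 = "\x00".encode("utf-8")

print(same_bytes, code_point, e_acute_utf8, nul_utf8)
```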
LANG really sets the locale information, which is a different beast, although a related one. The locale defines character classes and the relationships between characters, including their sort (collation) order. For example, should e with an accent be outside the range a-z, just because its Unicode value is outside the range 97-122? With LANG=C, e with an accent is outside the range a-z, just as people have been forced to live with for years. With LANG=en_US.UTF-8, e with an accent may be defined within the range a-z. As an added benefit, the sort order is more pleasing to an English user: case is largely ignored, so 'a' and 'A' sort next to each other. This is far better than what we lived with in previous decades, where one had to look in both places if one didn't know which case a filename was written in...
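As a rough illustration of the two orderings, the Python sketch below sorts a few made-up file names twice: once by raw code point (approximately what LANG=C does) and once with a hand-rolled key that ignores accents and case except as tie-breakers. The `dictionary_key` helper is hypothetical and only approximates the idea; it is not the collation table that en_US.UTF-8 actually uses.

```python
import unicodedata

# Made-up file names for illustration.
names = ["b.txt", "B.txt", "f.txt", "a.txt", "A.txt", "\u00e9.txt"]

# Code point order: all uppercase first, accented letters last.
c_order = sorted(names)

# By code point, e-acute (233) is not inside the range a-z (97-122).
in_codepoint_range = "a" <= "\u00e9" <= "z"

def dictionary_key(name):
    """Hypothetical approximation of a dictionary-style collation key."""
    decomposed = unicodedata.normalize("NFD", name)
    # Drop combining marks so that e-acute compares like plain e.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Primary: letters without accent or case. Tie-breakers: unaccented
    # before accented, then lowercase before uppercase (via swapcase).
    return (base.casefold(), decomposed != base, name.swapcase())

dict_order = sorted(names, key=dictionary_key)

print(c_order)
print(dict_order)
```

With the second key, 'a' and 'A' end up adjacent and the accented name lands between 'b' and 'f', which is the behaviour described above.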
I've never played with this stuff in-depth, so my terminology may be inaccurate.
Cheers, mark