Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
Summary: glibc: wcscoll/strcoll needs to check if the char is out of the locale collation.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
Summary: glibc: wcscoll/strcoll needs to check if the char is out of the locale collation. Product: Fedora Version: rawhide Platform: All OS/Version: Linux Status: NEW Severity: medium Priority: low Component: glibc AssignedTo: schwab@redhat.com ReportedBy: tfujiwar@redhat.com QAContact: extras-qa@fedoraproject.org CC: jakub@redhat.com, fedora-i18n-bugs@redhat.com, schwab@redhat.com Estimated Hours: 0.0 Classification: Fedora Target Release: ---
Created an attachment (id=369656) --> (https://bugzilla.redhat.com/attachment.cgi?id=369656) Example test case
Currently I'm thinking how to sort UTF-8 strings on GNOME/GDM.
GDM uses g_utf8_collate() to sort the language names. http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c
g_utf8_collate() uses wcscoll() internally.
---------------------------- 70:gint 71: g_utf8_collate (const gchar *str1, 72: const gchar *str2) 73:{ ... 109: result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm); ... 154:} ----------------------------
However if the chars are not defined in the locale collation, the returned value is not correct. I'm attaching the test program. You could see the returned value is not correct on ja_JP.UTF-8.
Now I think it's better to set errno if the char is not defined in the collation. Regarding to WCSCOLL(3P):
---------------------------- RETURN VALUE On error, wcscoll() shall set errno, but no return value is reserved to indicate an error.
ERRORS The wcscoll() function may fail if:
EINVAL The ws1 or ws2 arguments contain wide-character codes outside the domain of the collating sequence. ----------------------------
If wcscoll/strcoll/wscxfrm/strxfrm would set errno, I could enhance g_utf8_collate(_key) later.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
--- Comment #1 from fujiwara tfujiwar@redhat.com 2009-11-16 03:57:23 EDT --- Created an attachment (id=369658) --> (https://bugzilla.redhat.com/attachment.cgi?id=369658) Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c
The patch sets EINVAL if the value is out of the table.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
Andreas Schwab schwab@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Flag| |needinfo?(tfujiwar@redhat.c | |om)
--- Comment #3 from Andreas Schwab schwab@redhat.com 2009-11-17 08:12:44 EDT --- In which way is the value not correct?
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
fujiwara tfujiwar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Flag|needinfo?(tfujiwar@redhat.c | |om) |
--- Comment #4 from fujiwara tfujiwar@redhat.com 2009-11-17 19:41:13 EDT --- If the char is not defined in the locale collation, wcscoll() doesn't return the right value.
E.g. if you run the attachment 369656 on ja_JP.UTF-8, it compares Korean char and ASCII char on ja_JP.UTF-8 but the Korean char is not defined in the ja collation table and then wcscoll() returns the minus value but the returned value is not correct because the Korean char value is out of the ja collation table. Actually I would expect the returned value is a plus value in this case if the char value is not defined in the collation table.
As I noted comment #0, the man of wcscoll explains to set EINVAL if the char is not defined in the locale collating sequence.
I'd like to know whether the char is defined in the collation table or not but currently wcscoll()/strcoll() doesn't set errno. My understanding is, __collidx_table_lookup() checks if the char is defined in the collation table so my suggestion is to set errno if the char is not defined in the table.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
--- Comment #5 from fujiwara tfujiwar@redhat.com 2009-11-17 21:01:05 EDT --- (In reply to comment #4)
Actually I would expect the returned value is a plus value in this case if the char value is not defined in the collation table.
Correction: Actually I would expect the returned value is a plus value in this case if the char value is defined in the collation table.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
Andreas Schwab schwab@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution| |NOTABUG
--- Comment #6 from Andreas Schwab schwab@redhat.com 2009-11-18 05:46:47 EDT --- The ja_JP.UTF-8 collation table contains all UTF-8 characters, so wcscoll cannot return an error.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
fujiwara tfujiwar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Keywords| |Reopened Status|CLOSED |ASSIGNED Resolution|NOTABUG |
--- Comment #7 from fujiwara tfujiwar@redhat.com 2009-11-18 07:02:36 EDT --- (In reply to comment #6)
The ja_JP.UTF-8 collation table contains all UTF-8 characters, so wcscoll cannot return an error.
I don't think so. E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C . Reopening the bug.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
fujiwara tfujiwar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEW
--- Comment #8 from fujiwara tfujiwar@redhat.com 2009-11-18 07:14:13 EDT --- If you think ja_JP should define all UTF-8 collation, this would be a bug of glibc/localedata/locales/ja_JP.
Please note: I don't say ja_JP only but all locales. I need to know the way to get whether the char is defined in the locale collation or not with C API.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
Andreas Schwab schwab@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution| |NOTABUG
--- Comment #9 from Andreas Schwab schwab@redhat.com 2009-11-18 07:28:04 EDT --- All character not explicitly mentioned are collated after all others.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
--- Comment #10 from fujiwara tfujiwar@redhat.com 2009-11-18 07:56:46 EDT --- (In reply to comment #9)
All character not explicitly mentioned are collated after all others.
I don't understand your mention. Do you mean if a Korean char is not defined in ja_JP, the char is collated after all Japanese chars?
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
fujiwara tfujiwar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|CLOSED |ASSIGNED Resolution|NOTABUG |
--- Comment #11 from fujiwara tfujiwar@redhat.com 2009-11-18 08:23:18 EDT --- Reset the status since I don't get the reply above.
If the answer of my question is yes, the actual result is different from your mention, i.e. I think this bus is still valid. Your reply would be too short for me. Can we talk with chat or email?
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
--- Comment #12 from fujiwara tfujiwar@redhat.com 2009-11-20 04:21:31 EDT --- Created an attachment (id=372439) --> (https://bugzilla.redhat.com/attachment.cgi?id=372439) Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c
This patch is another solution.
if a char is not defined in glib/localedata/locales/ja_JP, __collidx_table_lookup() returns 0. If we could use __collseq_table_lookup() instead, it can return the max value for the undefined char. But I think we need to use __collidx_table_lookup() for wcscoll() since the size of locale collation is unclear.
But the problem is when we receive 0, U+0 is actually defined in glib/localedata/locales/ja_JP LC_COLLATION and the result is undefined chars are always collated before defined chars in wcscoll().
E.g. If I think a is ASCII char, b is a Japanese char, c is a Korean char, the collation would be c < a < b on ja_JP.UTF-8 since U+0 is defined in ja_JP file.
But if you look at ja_JP file, the file also defines "UNDEFINED" in LC_COLLATE. UNDEFINED char should be collated at last. But the word "UNDEFINED" seems to be used in localedef program only. If we run wcscoll(), we don't know which index of weight[] is the UNDEFINED value. So my solution is, if wcscoll() receives 0 from findidx(), wcscoll() use USTRING_MAX instead of weight[].
If I see zh_CN file, U+0 is not defined. The undefined chars are always collated before defined chars in wcscoll() because the following line effects the result in wcscoll():
result = seq1len == 0 ? -1 : 1;
seq1len is 0 but the string is not shorter than the other in this case. The string is not defined in the locale collation in this case actually.
I'd modified this part.
Probably it's good for wcscoll() to follow the 'UNDEFINED' keyword in the locale collation file and I think 'UNDEFINED' should be put in the last of the LC_COLLATE.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
fujiwara tfujiwar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #372439|0 |1 is obsolete| |
--- Comment #13 from fujiwara tfujiwar@redhat.com 2009-11-20 04:24:12 EDT --- Created an attachment (id=372440) --> (https://bugzilla.redhat.com/attachment.cgi?id=372440) Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c
Correct the file names in the patch slightly.
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
fujiwara tfujiwar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEW
Please do not reply directly to this email. All additional comments should be made in the comments box of this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=537753
fujiwara tfujiwar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution| |NOTABUG
--- Comment #14 from fujiwara tfujiwar@redhat.com 2010-07-01 00:58:58 EDT --- During the discussion in glibc community, I found there are several problems if we fix this in glibc side. Currently I'm discussing this issue in the upper layer application so closing this bug at the moment.
i18n-bugs@lists.fedoraproject.org