Please do not reply directly to this email. All additional comments should be made in the comments box of this bug report.
Summary: glibc or perl incorrect locale LC_CTYPE data
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166478
jvdias@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CLOSED
         Resolution|                            |NOTABUG

------- Additional Comments From jvdias@redhat.com 2005-11-02 11:45 EST -------
Sorry, I submitted my previous comment before finishing it, and then my machine rebooted (that's another story).
As I was saying in Comment #2 :
This version of your program shows the issue:
---
#!/usr/bin/perl -w -C
use strict;
use utf8;
use locale;
use Encode qw(decode);

my $str = decode('utf-8', "\xc3\x81\xc4\x8c"); # U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
# In list context a failed match returns an empty list, so nothing is printed for it.
print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is posix word: ', $str =~ /^[[:word:]]+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";
---
With the "en_US.UTF-8" locale in effect (the default on Red Hat systems), this prints:

$ ./test.pl
Is UTF-8:1 is word: is posix word: 1 is UTF-8 word: 1 str:ÁČ
The point is that \p{IsWord} or [[:word:]] matches UTF-8 word characters, while \w / \W do not.
As the perlre man-page states:

"  The following equivalences to Unicode \p{} constructs and equivalent
   backslash character classes (if available), will hold:

       [:...:]     \p{...}     backslash
       ...
       word        IsWord
       ...
"

i.e. the [:word:] / \p{IsWord} classes are NOT equivalent to \w.
As I said, I don't particularly agree with the way the upstream perl developers have done this, but this is intended behaviour.
RE: your comment #3:
Your statement that "\w matches any ASCII word char" is not true. See perlre(1):

  [...] If "use locale" is in effect, the list of alphabetic characters
  generated by "\w" is taken from the current locale.
Yes, that's alphabetic characters, not Unicode sequences.
To match Unicode sequences in the word class, you must use \p{IsWord} or [[:word:]].
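
As an illustrative sketch (not part of the original report), the three classes can be tested one at a time against the same decoded string. The helper name classify is hypothetical, and "use locale" is deliberately left out so that the result does not depend on the environment. The value reported for \w may differ between perl versions, and again once "use locale" is added.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# U+00C1 and U+010C, given as UTF-8 bytes and decoded to a character string.
my $str = decode('UTF-8', "\xc3\x81\xc4\x8c");

# Hypothetical helper: match each class against the whole string and
# return name => 1/0 pairs, in a fixed order.
sub classify {
    my ($s) = @_;
    return (
        'backslash \w' => ($s =~ /^\w+$/         ? 1 : 0),
        '[[:word:]]'   => ($s =~ /^[[:word:]]+$/ ? 1 : 0),
        '\p{IsWord}'   => ($s =~ /^\p{IsWord}+$/ ? 1 : 0),
    );
}

my @results = classify($str);
while (my ($name, $matched) = splice(@results, 0, 2)) {
    print "$name: $matched\n";
}
```

Reporting an explicit 1 or 0 avoids the empty field that a failed match produces in list context, which makes a non-matching \w easier to spot.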
So the question is why Perl (or libc) on FreeBSD considers U+00C1 to be a word character under the UTF-8 locale, while the same perl with glibc on Linux doesn't.
Possibly because the default locale on Red Hat systems is UTF-8 enabled?
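
One way to compare the two platforms is to run the same small script on each and look at which LC_CTYPE locale perl actually adopted, together with the \w result with and without "use locale". This is only a sketch under the assumption that the locale comes from the environment (LANG, LC_ALL or LC_CTYPE). The variable names are illustrative, and no particular output is implied for either system.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use Encode qw(decode);

# Adopt LC_CTYPE from the environment. setlocale returns a false value
# if the requested locale is not available on this system.
my $locale = setlocale(LC_CTYPE, '') || '(could not be set)';

my $str = decode('UTF-8', "\xc3\x81\xc4\x8c");   # U+00C1, U+010C

my ($with_locale, $without_locale);
{
    use locale;   # lexically scoped: applies only inside this block
    $with_locale = ($str =~ /^\w+$/) ? 1 : 0;
}
$without_locale = ($str =~ /^\w+$/) ? 1 : 0;

print "LC_CTYPE:              $locale\n";
print "\\w with use locale:    $with_locale\n";
print "\\w without use locale: $without_locale\n";
```

Running it as, for example, LC_ALL=C ./check.pl and LC_ALL=en_US.UTF-8 ./check.pl (check.pl being a hypothetical file name) shows whether the locale setting changes the \w result on a given system.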
perl-devel@lists.fedoraproject.org