Please do not reply directly to this email. All additional comments should be made in the comments box of this bug report.
Summary: glibc or perl incorrect locale LC_CTYPE data
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166478
------- Additional Comments From jvdias@redhat.com 2005-11-01 20:29 EST -------
One has to carefully analyse the perl man-pages to find out what is going on here.
I don't particularly agree with the way the upstream perl maintainers have done this, but this is not a bug - it is the way perl is meant to behave.
The point is that /\w/ matches any ASCII word char, and /\W/ matches any ASCII non-word char.
To match a UTF-8 word character, you have to use \p{IsWord}.
The \w escape is a synonym for the POSIX-style character class [:word:] (which is a Perl extension to the POSIX classes).
So this version of your program:
---
#!/usr/bin/perl -w -C
use strict;
use utf8;
use locale;
use Encode qw(decode);
# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word:', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";
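
The difference between an ASCII-only word match and a Unicode word match can also be seen in isolation. The following is only a sketch, and it assumes Perl 5.14 or later, where the /a regex modifier (not available when this comment was written) explicitly restricts \w to ASCII. It deliberately leaves out "use locale", so it does not reproduce the locale-dependent behaviour discussed in this bug. The variable names $ascii_word and $unicode_word are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the UTF-8 bytes for U+00C1 and U+010C into a character string.
my $str = decode('UTF-8', "\xc3\x81\xc4\x8c");

# Assumption: Perl 5.14+. The /a modifier restricts \w to the ASCII
# word characters [A-Za-z0-9_], so neither character can match.
my $ascii_word = ($str =~ /^\w+$/a) ? 1 : 0;

# \p{IsWord} matches any Unicode word character, including both of these.
my $unicode_word = ($str =~ /^\p{IsWord}+$/) ? 1 : 0;

print "ASCII word: $ascii_word, Unicode word: $unicode_word\n";
```

Under those assumptions the ASCII-restricted match should fail and the \p{IsWord} match should succeed for the same string.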
perl-devel@lists.fedoraproject.org