[Bug 166478] glibc or perl incorrect locale LC_CTYPE data

bugzilla@xxxxxxxxxx · Tue, 1 Nov 2005 20:29:34 -0500

Summary: glibc or perl incorrect locale LC_CTYPE data

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166478

------- Additional Comments From jvdias@xxxxxxxxxx  2005-11-01 20:29 EST -------
One has to carefully analyse the perl man-pages to find out what is going on here.

I don't particularly agree with the way the upstream perl maintainers have done
this, but this is not a bug - it is the way perl is meant to behave.

The point is that /\w/ matches any ASCII word char, and /\W/ matches any ASCII
non-word char.

To match a UTF-8 word character, you have to use \p{IsWord}.

The \w escape is equivalent to the bracketed character class [:word:], which
is a Perl extension to the POSIX character classes.
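
As a minimal sketch of the distinction between an ASCII word character and a
Unicode word character: the snippet below assumes Perl 5.14 or later, where the
/a regex modifier (an assumption here, since it is not used in the original
report) explicitly restricts \w to ASCII. It does not use "use locale", so it
shows only the ASCII-versus-Unicode difference, not locale-dependent behaviour.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Same two characters as in the report: U+00C1 and U+010C,
# decoded from their UTF-8 byte sequence into a character string.
my $str = decode('UTF-8', "\xc3\x81\xc4\x8c");

# With /a (Perl 5.14+, assumed here), \w matches ASCII word characters only.
my $ascii_word = ($str =~ /^\w+$/a) ? 1 : 0;

# \p{IsWord} matches any Unicode word character.
my $unicode_word = ($str =~ /^\p{IsWord}+$/) ? 1 : 0;

print "ASCII-only \\w matches: $ascii_word\n";
print "\\p{IsWord} matches:    $unicode_word\n";
```

Neither character is ASCII, so the /a match fails while the \p{IsWord} match
succeeds.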

So this version of your program:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
        # U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)

print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word:', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";
