Summary: glibc or perl incorrect locale LC_CTYPE data

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166478

jvdias@xxxxxxxxxx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CLOSED
         Resolution|                            |NOTABUG

------- Additional Comments From jvdias@xxxxxxxxxx  2005-11-02 11:45 EST -------
Sorry, I submitted my previous comment before finishing it, and then my machine
rebooted (that's another story).

As I was saying in comment #2:

This version of your program shows the issue:
---
#!/usr/bin/perl -w -C
use strict;
use utf8;
use locale;
use Encode qw(decode);
# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is posix word: ', $str =~ /^[[:word:]]+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:', $str,
      "\n";
---
With the "en_US.UTF-8" locale in effect (the default on Red Hat systems), this
prints:
$ ./test.pl
Is UTF-8:1 is word: is posix word: 1 is UTF-8 word: 1 str:ÁČ

The point is that \p{IsWord} or [[:word:]] matches UTF-8 word characters, while
\w / \W do not.

As the perlre man page states:
"
   The following equivalences to Unicode \p{} constructs and equivalent
   backslash character classes (if available), will hold:

       [:...:]     \p{...}     backslash
       ...
       word        IsWord
       ...
"
i.e. the [:word:] / \p{IsWord} classes are NOT equivalent to \w.

As I said, I don't particularly agree with the way the upstream perl developers
have done this, but this is intended behaviour.

RE: your comment #3:
> Your statement that "\w matches any ASCII word char" is not true.
> See perlre(1):
> [...] If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale.

Yes, that's alphabetic characters, not Unicode sequences.
To match Unicode sequences in the word class, you must use \p{IsWord} or
[:word:].

> So the question is why Perl (or libc) in FreeBSD does consider U+00C1 to be a
> character under the UTF-8 locale, while the same perl with glibc on Linux
> doesn't.

Possibly because the default locale for Red Hat systems is UTF-8 enabled?
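
For illustration only, here is a minimal sketch of the \p{IsWord} test applied
to the same two characters, once as raw UTF-8 bytes and once after decoding.
The helper name is_unicode_word is hypothetical and is not part of the test
program above. The sketch deliberately leaves out "use locale" and \w, since
their behaviour depends on the perl version and the locale in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (an assumption, not from the bug report): returns 1 if
# every character of $str has the Unicode Word property, 0 otherwise.
sub is_unicode_word {
    my ($str) = @_;
    return $str =~ /^\p{IsWord}+$/ ? 1 : 0;
}

my $bytes = "\xc3\x81\xc4\x8c";        # UTF-8 encoding of U+00C1 U+010C
my $chars = decode('UTF-8', $bytes);   # decoded two-character string

# The raw bytes are seen as four code points (0xC3, 0x81, 0xC4, 0x8C);
# 0x81 and 0x8C are control characters, so the match fails.
printf "bytes: length=%d word=%d\n", length($bytes), is_unicode_word($bytes);

# After decoding, both characters are letters, so the match succeeds.
printf "chars: length=%d word=%d\n", length($chars), is_unicode_word($chars);
```

This should print "bytes: length=4 word=0" followed by "chars: length=2 word=1",
which is why the string has to be decoded before the \p{} classes are applied.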