[Bug 166478] glibc or perl incorrect locale LC_CTYPE data

bugzilla@xxxxxxxxxx · Wed, 2 Nov 2005 11:45:39 -0500

Summary: glibc or perl incorrect locale LC_CTYPE data

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166478

jvdias@xxxxxxxxxx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CLOSED
         Resolution|                            |NOTABUG

------- Additional Comments From jvdias@xxxxxxxxxx  2005-11-02 11:45 EST -------
Sorry, I submitted my previous comment before finishing it, and
then my machine rebooted (that's another story).

As I was saying in Comment #2:

This version of your program shows the issue:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
        # U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)

print 'Is UTF-8:',utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is posix word: ', $str =~ /^[[:word:]]+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:',$str, "\n";
---

With the "en_US.UTF-8" locale in effect (the default on Red Hat systems),
this prints:
$ ./test.pl 
Is UTF-8:1 is word: is posix word: 1 is UTF-8 word: 1 str:ÁČ

The point is that \p{IsWord} or [[:word:]] matches UTF-8 word characters, 
while \w / \W do not.

As the perlre man-page states:

"
       The following equivalences to Unicode \p{} constructs and equivalent
       backslash character classes (if available), will hold:

           [:...:]     \p{...}         backslash
       ...
           word        IsWord
       ...
"
i.e. the "word" row has no entry in the backslash column, so the
[:word:] / \p{IsWord} classes are NOT equivalent to \w.

As I said, I don't particularly agree with the way the upstream perl
developers have done this, but this is intended behaviour.

RE: your comment #3: 
>  Your statement that "\w matches any ASCII word char" is not true. 
>  See perlre(1):
>       [...] If "use locale" is in effect, the list of alphabetic characters 
>             generated by "\w" is taken from the current locale.

Yes, that's alphabetic characters, not Unicode sequences.

To match Unicode sequences in the word class, you must use \p{IsWord} or
[:word:].
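
A minimal sketch of that workaround, assuming the input bytes are UTF-8 and are
decoded to Perl characters before matching. The helper name is_unicode_word is
made up for illustration; it is not part of perl or of the reporter's program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true (1) if every character of an already-decoded
# string has the Unicode Word property, false (0) otherwise.
sub is_unicode_word {
    my ($str) = @_;
    return $str =~ /^\p{IsWord}+$/ ? 1 : 0;
}

# Same test string as in the program above: U+00C1 and U+010C.
my $str = decode('UTF-8', "\xc3\x81\xc4\x8c");

binmode(STDOUT, ':encoding(UTF-8)');
print "is UTF-8 word: ", is_unicode_word($str), " str: $str\n";
```

This deliberately leaves out "use locale", so the match depends only on the
Unicode property and not on the LC_CTYPE data of the current locale.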

> So the question is why Perl (or libc) in FreeBSD does consider U+00C1 to be a
> character under the UTF-8 locale, while the same perl with glibc on Linux 
> doesn't.

Possibly because the default locale for Red Hat systems is UTF-8 enabled?
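
One way to check that guess on both systems is to print the LC_CTYPE locale and
codeset that perl actually picks up from the environment. This is only a
sketch: ctype_info is a made-up helper name, and it assumes the POSIX and
I18N::Langinfo modules are available on both platforms.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: adopt LC_CTYPE from the environment (LANG / LC_*)
# and return the locale name and the codeset the C library reports for it.
sub ctype_info {
    my $name    = setlocale(LC_CTYPE, '');   # '' means "use the environment"
    my $codeset = langinfo(CODESET);
    return ($name, $codeset);
}

my ($name, $codeset) = ctype_info();
printf "LC_CTYPE=%s codeset=%s\n",
       defined $name ? $name : '(setlocale failed)', $codeset;
```

Running it on the FreeBSD and the Linux machine with the same environment
settings would show whether the two are really using the same locale.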
