On Wed, 5 Apr 2023 at 20:32, Paul Eggert <eggert@xxxxxxxxxxx> wrote:
>
> On 2023-04-04 12:31, Junio C Hamano wrote:
>
> > My personal inclination is to let Perl folks decide
> > and follow them (even though I am skeptical about the wisdom of
> > letting '\d' match anything other than [0-9])
>
> I looked into what pcre2grep does. It has always done only 8-bit
> processing unless you use the -u or --utf option, so plain "pcre2grep
> '\d'" matches only ASCII digits.
>
> Although this causes pcre2grep to mishandle Unicode characters:
>
>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>    Ævar
>
> it mimics Perl 5.36:
>
>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>    Ævar
>
> so this seems to be what Perl users expect, despite its infelicities.

Actually no, I think you have misunderstood what is happening at the
different layers involved here.

Your terminal is rendering ß as a glyph. But it is almost certainly
actually the octets C3 9F (which is the UTF8 canonical representation
of the codepoint U+DF). So the code you provided to perl is close to
the equivalent of

    echo 'Ævar' | perl -ne 'print $_ if /[Ss\x{C3}\x{9F}]/'

And if you check, you will see that U+C6 "Æ" in utf8 is represented as
the octets C3 86. So what you have done is the equivalent of:

    perl -le'print "\x{C3}\x{86}"' | perl -ne'print $_ if /[Ss\x{C3}\x{9F}]/'

which of course matches. \x{C3} matches \x{C3} always and everywhere.

What you should have done is something like this:

    $ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
    $ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{DF}]/u'
    baß
    $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{C6}]/u'
    Ævar
    $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{e6}]/ui'
    Ævar

The "utf8::decode($_)" tells perl to decode the input string as though
it contained utf8 (which in this case it does). The /u suffix tells the
regex engine that you want Unicode semantics.
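
The same byte-versus-codepoint effect can be reproduced outside Perl.
What follows is only an illustrative sketch in Python, not something
taken from this thread; the helper names matches_as_bytes and
matches_as_text are made up for the example. The first helper imitates
a tool that never decodes its input, the second one matches on decoded
code points.

```python
import re

# "\u00c6var" is "Ævar" and "\u00df" is "ß"; escapes are used so the
# example does not depend on the encoding of the source file.
AE_VAR = "\u00c6var"
BASS = "ba\u00df"
CLASS = "[Ss\u00df]"


def matches_as_bytes(pattern: str, subject: str) -> bool:
    """Match UTF-8 octets against UTF-8 octets, with no decoding.

    A multi-byte character inside [...] turns into several independent
    octets, so the class effectively becomes [Ss\\xC3\\x9F].
    """
    raw_pattern = pattern.encode("utf-8")
    raw_subject = subject.encode("utf-8")
    return re.search(raw_pattern, raw_subject) is not None


def matches_as_text(pattern: str, subject: str) -> bool:
    """Match code points against code points (input already decoded)."""
    return re.search(pattern, subject) is not None


if __name__ == "__main__":
    # "Æ" is C3 86 in UTF-8, so its first octet is accepted by the
    # byte-level class even though "Ævar" contains no S, s or ß.
    print("bytes, Ævar:", matches_as_bytes(CLASS, AE_VAR))
    # On decoded text the class holds only S, s and U+00DF.
    print("text,  Ævar:", matches_as_text(CLASS, AE_VAR))
    print("text,  baß: ", matches_as_text(CLASS, BASS))
```

The byte-level call reports a match for 'Ævar' purely because of the
shared C3 octet, while the decoded call matches only 'baß'.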

I believe that the same thing is true of your pcre2grep example. You
simply aren't checking what you think you are checking. Your terminal
renders UTF8 as glyphs, but the programs you are feeding those glyphs
to aren't seeing glyphs, they are seeing UTF8 sequences as distinct
octets, and are not decoding their input back as codepoints.

You could have checked your assumptions by using the -Mre=debug option
to perl:

    $ echo 'Ævar' | perl -Mre=debug -ne 'print $_ if /[Ssß]/'
    Compiling REx "[Ss%x{c3}%x{9f}]"
    Final program:
       1: ANYOF[Ss\x9F\xC3] (11)
      11: END (0)
    stclass ANYOF[Ss\x9F\xC3] minlen 1
    Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"
    Matching stclass ANYOF[Ss\x9F\xC3] against "%x{c3}%x{86}var%n" (6 bytes)
       0 <> <%x{c3}>             |   0| 1:ANYOF[Ss\x9F\xC3](11)
       1 <%x{c3}> <%x{86}var>    |   0| 11:END(0)
    Match successful!
    Ævar
    Freeing REx: "[Ss%x{c3}%x{9f}]"

The line:

    Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"

basically says it all. Perl has not decoded the UTF8 into U+C6, and it
has not decoded the UTF8 for U+DF either. Instead you have asked it
whether the UTF8 sequence that represents U+C6 contains any of the same
octets as the UTF8 representations of U+53, U+73 and U+DF, and it does:
the common octet is \x{c3}.

> For better Unicode handling one can use pcre2grep's -u or --utf option,
> which causes pcre2grep to behave more like GNU grep -P and git grep -P:
> "echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is
> what most people would expect (unless they're Perl users :-).

It is what Perl users would expect also, assuming you actually wrote
the character class [Ss\x{DF}] and asked for unicode semantics. \x{DF}
is in the Latin-1 codepoint range, so perl will assume that you meant
ASCII semantics unless you tell it otherwise.

Basically these tests you have quoted here are just examples of garbage
in garbage out.

Perl has been working together with the Unicode consortium for over 20 years.
Afaik we were and are the reference implementation for the spec on
regular expression matching in Unicode and we have a long history of
working together with the Unicode consortium to refine and implement
the spec. You should assume that if Perl seems to have made a gross
error in how it does Unicode matching, you are simply using it wrong;
we take a great deal of pride in having the best Unicode support there
is.

https://unicode.org/reports/tr18/

FWIW, I think this email nicely illustrates the issues with git and
regular expressions. To do regular expressions properly you need to
know a) what semantics you expect, and b) how to decode the text you
are matching against. If you want unicode semantics you need to have a
way to ask for it. If you want to match against Unicode data then you
need a way to determine which of the 6 possible encodings[1] of Unicode
data you are using. If you get either wrong you will not get the
results you expect.

You may even want to deal with cases where you want Unicode semantics,
but to match against non-unicode data. For instance Latin-1. In Latin-1
the codepoint U+DF is the *octet* 0xDF. Maybe you want that octet to
match "ss" case-insensitively, as a German speaker would expect and as
Unicode specifies is correct. Or vice versa, maybe you are like some of
the posters to this thread who seem to expect that \d should not match
U+16B51 (as a Hmong speaker might expect).

Perl resolves these problems at the pattern level by supporting the
suffixes /a and /u (for ascii and unicode), and at the string level it
supports two types of string: unicode strings and binary/ASCII strings.
By default input is the latter, but there are a variety of ways of
saying that a file handle should decode to Unicode instead.

cheers,
Yves

[1] UTF-EBCDIC, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"