On Tue, 4 Apr 2023 at 21:31, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Paul Eggert <eggert@xxxxxxxxxxx> writes:
>
> > This is an evolving area. Git master is fiddling with flags and
> > options, and so is GNU grep master, and so is PCRE2, and there are
> > bugs. If you're running bleeding-edge versions of this code you'll get
> > different behavior than if you're running grep 3.8, pcregrep 8.45,
> > Perl 5.36, and git 2.39.2 (which is what Fedora 37 has).
> >
> > What I'm fearing is that we may evolve into mutually incompatible
> > interpretations of how Perl regular expressions deal with UTF-8
> > text. That'd be a recipe for confusion down the road.
>
> Nicely said. My personal inclination is to let Perl folks decide
> and follow them (even though I am skeptical about the wisdom of
> letting '\d' match anything other than [0-9]), but even in Git
> circle there would be different opinions, so I am glad that the
> discussion is visible on the list to those who are intrested.

Perl matches Unicode text according to the rules specified by the
Unicode Consortium. It is the reference implementation for Unicode
regular expression matching.

Unicode specifies that \d match any digit in any script that it
supports. Thus \d matches far more codepoints than \p{PosixDigit} or
[0-9] would. Be aware that Unicode contains both numbers and digits
and keeps them distinct. For example, \x{1EC9E} represents a lakh,
which is used in many Indian languages for 100,000, but which is not
considered a *digit*, since it does not stand for a single decimal
place value.

FWIW, someone mentioned [[:digit:]], which matches the same as \d does
on Unicode strings and under the /u matching flag for regexes in Perl.
Arguably this was a mistake: [[:digit:]] is a POSIX character class,
and POSIX doesn't support Unicode, so it should have matched [0-9] or
\p{PosixDigit}.
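To make the digit-versus-number distinction concrete, here is a
minimal sketch in Python rather than Perl. Python's `re` module is
only a stand-in here: on str patterns it treats \d as "any Unicode
decimal digit" by default and restricts it to [0-9] under `re.ASCII`,
but it has no \p{...} or [[:digit:]] syntax, so it cannot show the
Perl-specific classes. The sample table and the `classify` helper are
made up for illustration.

```python
import re

# Hypothetical sample characters, chosen for illustration only.
samples = {
    "ASCII digit": "7",
    "Arabic-Indic digit": "\u0663",    # ARABIC-INDIC DIGIT THREE
    "Devanagari digit": "\u096B",      # DEVANAGARI DIGIT FIVE
    "Indic Siyaq lakh": "\U0001EC9E",  # a number (100,000), not a digit
}

unicode_digit = re.compile(r"\d")          # Unicode semantics (default for str)
ascii_digit = re.compile(r"\d", re.ASCII)  # legacy semantics, same as [0-9]


def classify(ch):
    """Return (matches Unicode \\d, matches ASCII-only \\d) for one character."""
    return (
        unicode_digit.fullmatch(ch) is not None,
        ascii_digit.fullmatch(ch) is not None,
    )


for name, ch in samples.items():
    print(f"{name:20} U+{ord(ch):04X} {classify(ch)}")
```

The digits from other scripts match only under Unicode semantics, and
the lakh matches under neither, because it is a number and not a
decimal digit.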
But historically \d and [[:digit:]] in Perl were the same, and when \d
was extended to meet the Unicode specification, [[:digit:]] came along
for the ride, likely inadvertently. Thus \p{PosixDigit} is equivalent
to [0-9], but \p{XPosixDigit} is equivalent to \d and [[:digit:]].

I notice that other posts in this thread have moved the conversation
on, and covered most of the points I wanted to make here. However, I
wanted to say that there seem to be two different issues here.

The first is "what semantics do I expect from my regular expressions":
Unicode or legacy ASCII. Mostly this relates to case-insensitive
matching, but things like \d also surface discrepancies.

The second is "what encodings does the regular expression engine
understand". Unfortunately, on *nix there is no tradition of using
BOMs to distinguish the 6 different possible encodings of Unicode
(UTF-8, UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE), and there
seems to be some level of desire to match with Unicode semantics
against files that are not uniformly encoded in one of these formats.

So the questions that come up are: A) how do you tell the regular
expression engine what semantics you want, and B) how does the regular
expression library identify the encoding of the file, and how does it
handle malformed content in that file? For instance, if I have a file
which contains snippets of UTF-8 encoded data *and* snippets of data
that is illegal in UTF-8, what should the regular expression engine do
if it is asked to do a case-insensitive match against that file?

cheers,
yves
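
PS: a rough sketch of questions A and B, in Python as a stand-in for
whatever engine is actually in use. It is one possible policy, not a
recommendation: look for a BOM, fall back to UTF-8 when there is none,
turn each malformed byte sequence into U+FFFD, and let the caller pick
Unicode or ASCII case folding. The names `sniff_encoding` and
`search_ignore_case` are hypothetical. UTF-EBCDIC is left out because,
as far as I know, Python's standard library has no codec for it.

```python
import codecs
import re

# Longest BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data, default="utf-8"):
    """Question B: guess the encoding from a BOM; assume `default` otherwise.

    Returns (encoding name, number of BOM bytes to skip).
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def search_ignore_case(pattern, data, ascii_only=False):
    """Case-insensitive search over raw bytes that may be partly malformed.

    Question A: `ascii_only` selects legacy ASCII case folding instead of
    Unicode case folding. Malformed input is replaced by U+FFFD, so it can
    never match ordinary text; this is an assumed policy, not the only one.
    """
    encoding, skip = sniff_encoding(data)
    text = data[skip:].decode(encoding, errors="replace")
    flags = re.IGNORECASE | (re.ASCII if ascii_only else 0)
    return re.search(pattern, text, flags)


# Two bytes that are illegal in UTF-8, followed by valid UTF-8 text.
mixed = b"\xff\xfa \xc3\x89COLE"
print(search_ignore_case("\u00e9cole", mixed))
print(search_ignore_case("\u00e9cole", mixed, ascii_only=True))
```

With Unicode semantics the lowercase pattern finds the uppercase word
despite the junk bytes in front of it; with ASCII-only semantics it
does not, because the accented letter is no longer case-folded.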