Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)

demerphq <demerphq@xxxxxxxxx> · Thu, 2 Feb 2023 17:22:37 +0100

On Thu, 2 Feb 2023 at 00:03, Jeff King <peff@xxxxxxxx> wrote:
>
> On Wed, Feb 01, 2023 at 05:09:33PM +0100, demerphq wrote:
>
> > > Failure (using Zsh to produce the characters; I think there's a Bash
> > > equivalent):
> > > ```
> > > # git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
> > > fatal : invalid regular expression: [¿-ˇ][Ä-ø]+
> > > ```
> >
> > FWIW that looks pretty weird to me, like the escapes in the charclass
> > were interpolated before being fed to the regex engine. Are you sure
> > you tested the right thing?
>
> I think the point is that he is feeding a raw \xc0 byte (not the escape
> sequence) to the regex engine, which is bogus UTF8. And the internal
> userdiff drivers do the same thing. They contain "[\xc0-\xff]", and
> those "\x" will be interpolated by the compiler into their actual bytes.

Thanks, that was the bit that threw me off. I had completely forgotten
that C supports \x escapes :-(. The Perl internals and regex engine are
where I do most of my C hacking and they use octal exclusively AFAIK.
(I guess it uses octal because of the "where does the escape end"
problem that C seems to have with hex escapes). So I had assumed that
something else, or the regex engine itself was interpolating them. I
appreciate that you took the time to set me straight.

> So the regex engine is complaining that it is getting bytes with high
> bits set, but that are not part of a multi-byte character. I.e., it is
> not happy to do bytewise matching, but really wants valid UTF8 in the
> expression.

Yeah, that was my first thought too, but as I said above the hex
escapes threw me off.
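
To make that concrete, here is a minimal Python sketch (Python is only a
stand-in here, it is nothing git uses, and the helper name is_valid_utf8
is made up for the illustration). It shows why the interpolated pattern
is not valid UTF-8 while the spelling with literal backslashes is plain
ASCII:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# What the engine sees once the C compiler (or $'...') has expanded the escapes:
raw_pattern = b"[\xc0-\xff]"
# What it sees when the backslashes are passed through literally:
escaped_pattern = rb"[\xc0-\xff]"

print(is_valid_utf8(raw_pattern))      # False: 0xc0 and 0xff are stray high-bit bytes
print(is_valid_utf8(escaped_pattern))  # True: every byte is ASCII
```

An engine that insists on valid UTF-8 in the expression would reject the
first form and accept the second, provided it understands "\x" itself.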

> glibc's regex engine seems OK with this. Try:
>
>   git grep $'[\xc0-\xff]'
>
> in git.git, and it will find lots of multi-byte characters. But pcre,
> for example, is not:
>
>   $ git grep -P $'[\xc0-\xff]'
>   fatal: command line, '[<C0>-<FF>]': UTF-8 error: byte 2 top bits not 0x80

I expect that has something to do with how you are configuring PCRE,
and that with a slightly different config it would be fine with this.
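
As an analogy only (this is Python's re module, not PCRE, and I have not
checked which PCRE options git sets): an engine that works in byte mode
takes the same raw class without complaint, and it matches the lead byte
of each multi-byte character.

```python
import re

# A bytes pattern makes Python's re match byte by byte, with no UTF-8 validation.
lead_byte = re.compile(b"[\xc0-\xff]")

# "naive cafe" with i-diaeresis and e-acute; each accented letter is two bytes in UTF-8.
text = "na\u00efve caf\u00e9".encode("utf-8")

hits = lead_byte.findall(text)
print(hits)  # [b'\xc3', b'\xc3']
```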

> There you really want to feed the literal escapes (obviously dropping
> the $'...' shell interpolation is a better solution, but for the sake of
> illustration):
>
>   git grep -P $'[\\xc0-\\xff]'
>
> But I don't think we can rely on the libc BRE supporting "\x" in
> character classes. Glibc certainly doesn't. I'm not sure what the
> portable solution is.

I've been lurking watching some of the regex discussion on the list
and personally I think it is asking for trouble to use "whatever regex
engine is traditional in a given environment" instead of just choosing
a good open source engine and using it consistently everywhere.  I
don't really buy the arguments I have seen to justify a policy of "use
the standard library version"; regex engines vary widely in
performance and implementation and feature set, and even the really
good ones do not entirely agree on every semantic[1], so if you don't
standardize you will be forever dealing with bugs related to those
differences.

I think the git project should choose the feature set[2] it thinks is
important, and then choose a regex engine that provides those features
and is well supported, and then use it consistently everywhere that
git needs to do regex based matching. Anything else is asking for
trouble at some level or another.

Cheers,
yves

[1] Leaving aside advanced features, even something as simple as
alternation can vary by engine.  Consider "foo"=~/f|fo|foo/. Some
regex engines will match "foo", and some will match "f", depending on
whether they implement "leftmost longest" matching (as POSIX and most
DFA-based engines do), or "leftmost first" matching, where the
alternatives are tried in order (as Perl and other backtracking
engines do).
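
A small Python sketch of the two behaviours, under the assumption that
Python's re module is representative of backtracking engines. The
longest-match side is emulated by hand, since re has no such mode, and
both function names are made up for the illustration:

```python
import re

ALTERNATIVES = ["f", "fo", "foo"]


def first_alternative(subject: str) -> str:
    """Backtracking behaviour: the first alternative that matches wins."""
    m = re.match("|".join(ALTERNATIVES), subject)
    return m.group(0) if m else ""


def longest_alternative(subject: str) -> str:
    """Emulated longest-match behaviour: try every alternative, keep the longest."""
    candidates = [alt for alt in ALTERNATIVES if subject.startswith(alt)]
    return max(candidates, key=len, default="")


print(first_alternative("foo"))    # f
print(longest_alternative("foo"))  # foo
```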

[2] Personally I think that features like recursive patterns, named
capture, negative and positive lookahead and lookbehind and branch
reset are so useful that it would be wise to choose an engine that
supports them, but some might argue for other priorities, performance
being a likely candidate.

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"