Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)

Jeff King <peff@xxxxxxxx> · Wed, 1 Feb 2023 18:03:55 -0500

On Wed, Feb 01, 2023 at 05:09:33PM +0100, demerphq wrote:

> > Failure (using Zsh to produce the characters; I think there's a Bash
> > equivalent):
> > ```
> > # git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
> > fatal: invalid regular expression: [<C0>-<FF>][<80>-<BF>]+
> > ```
> 
> FWIW that looks pretty weird to me, like the escapes in the charclass
> were interpolated before being fed to the regex engine. Are you sure
> you tested the right thing?

I think the point is that he is feeding a raw \xc0 byte (not the escape
sequence) to the regex engine, which is bogus UTF-8. And the internal
userdiff drivers do the same thing. They contain "[\xc0-\xff]", and the
C compiler turns those "\x" escapes into the actual bytes.
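
To make the distinction concrete, here is a small sketch that dumps what
actually reaches the regex engine in each case. The helper name hex_bytes
is made up for illustration, and octal printf escapes stand in for
$'\xc0' so that it runs in a plain POSIX shell:

```shell
# Print the bytes of a string as hex, with no separators.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \t\n'
}

# Raw bytes 0xc0 and 0xff inside the brackets (octal 300 and 377).
raw_pattern=$(printf '[\300-\377]')

# The same pattern as plain ASCII text, escapes left for the engine to parse.
escaped_pattern='[\xc0-\xff]'

hex_bytes "$raw_pattern"; echo       # 5bc02dff5d
hex_bytes "$escaped_pattern"; echo   # 5b5c7863302d5c7866665d
```

The first form is five bytes, two of them with the high bit set. The
second is eleven bytes of pure ASCII, and it only means anything if the
engine itself understands "\x".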

So the regex engine is complaining that it is getting bytes with the
high bit set that are not part of a valid multi-byte character. I.e., it
is not happy to do bytewise matching, but really wants valid UTF-8 in
the expression.
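
One way to see why a strict engine objects is to run the pattern bytes
through a UTF-8 validator. This is only a sketch: it assumes iconv is
installed and exits non-zero on invalid input, and is_valid_utf8 is a
hypothetical helper, not anything in Git:

```shell
# Succeed only if the argument is well-formed UTF-8.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# First argument: "é" as the two bytes 0xc3 0xa9.
# Second argument: the raw character class with bytes 0xc0 and 0xff.
for s in "$(printf '\303\251')" "$(printf '[\300-\377]')"; do
    if is_valid_utf8 "$s"; then echo valid; else echo invalid; fi
done
```

The character class should be rejected: 0xc0 is not followed by a
continuation byte, and 0xff never appears in UTF-8 at all.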

glibc's regex engine seems OK with this. Try:

  git grep $'[\xc0-\xff]'

in git.git, and it will find lots of multi-byte characters. But pcre,
for example, is not:

  $ git grep -P $'[\xc0-\xff]'
  fatal: command line, '[<C0>-<FF>]': UTF-8 error: byte 2 top bits not 0x80

There you really want to feed the literal escapes (obviously dropping
the $'...' shell interpolation is a better solution, but for the sake of
illustration):

  git grep -P $'[\\xc0-\\xff]'

But I don't think we can rely on the libc BRE supporting "\x" in
character classes. glibc's certainly doesn't. I'm not sure what the
portable solution is.

-Peff