On Wed, Feb 01, 2023 at 05:09:33PM +0100, demerphq wrote:

> > Failure (using Zsh to produce the characters; I think there's a Bash
> > equivalent):
> > ```
> > # git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
> > fatal : invalid regular expression: [¿-ˇ][Ä-ø]+
> > ```
>
> FWIW that looks pretty weird to me, like the escapes in the charclass
> were interpolated before being fed to the regex engine. Are you sure
> you tested the right thing?

I think the point is that he is feeding a raw \xc0 byte (not the escape
sequence) to the regex engine, which is bogus UTF8. And the internal
userdiff drivers do the same thing. They contain "[\xc0-\xff]", and
those "\x" will be interpolated by the compiler into their actual bytes.

So the regex engine is complaining that it is getting bytes with high
bits set, but that are not part of a multi-byte character. I.e., it is
not happy to do bytewise matching, but really wants valid UTF8 in the
expression.

glibc's regex engine seems OK with this. Try:

  git grep $'[\xc0-\xff]'

in git.git, and it will find lots of multi-byte characters. But pcre,
for example, is not:

  $ git grep -P $'[\xc0-\xff]'
  fatal: command line, '[<C0>-<FF>]': UTF-8 error: byte 2 top bits not 0x80

There you really want to feed the literal escapes (obviously dropping
the $' shell interpolation is a better solution, but for the sake of
illustration):

  git grep -P $'[\\xc0-\\xff]'

But I don't think we can rely on the libc BRE supporting "\x" in
character classes. Glibc certainly doesn't.

I'm not sure what the portable solution is.

-Peff
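
For illustration, the difference between the two quoting forms can be seen at the byte level without involving git or any regex engine. This is only a sketch: it uses printf with octal escapes (\300 is 0xc0, \377 is 0xff) in place of the shell's $'...' quoting so that it also runs in a plain POSIX sh, and the variable names are made up for the example.

```shell
# Bytes the regex engine receives when the escapes are interpolated first,
# as with $'[\xc0-\xff]': two raw high-bit bytes inside the brackets.
raw=$(printf '[\300-\377]' | od -An -tx1 | tr -d ' \n')

# Bytes it receives when the escapes are passed through literally,
# as with $'[\\xc0-\\xff]': plain ASCII backslash-x sequences.
esc=$(printf '%s' '[\xc0-\xff]' | od -An -tx1 | tr -d ' \n')

echo "raw:     $raw"   # 5bc02dff5d
echo "escaped: $esc"   # 5b5c7863302d5c7866665d
```

The first form contains the lone bytes c0 and ff, which do not form valid UTF-8 on their own. The second is pure ASCII, and it is then up to the regex engine to understand "\x" inside a character class.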