Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)

"D. Ben Knoble" <ben.knoble@xxxxxxxxx> · Tue, 7 Feb 2023 17:27:14 -0500

CC'ing Jonathan Nieder

On Tue, Feb 7, 2023 at 1:23 PM Jeff King <peff@xxxxxxxx> wrote:
>
> On Sun, Feb 05, 2023 at 02:51:05PM -0500, D. Ben Knoble wrote:
>
> > Any thoughts on some sort of stop-gap measure to fix --word-diff while
> > Git decides how to handle the regex engine incompatibilities? How
> > important is the sequence of bytes at the end of --word-diff regexes
> > in userdiff.c?
>
> It comes from 664d44ee7f (userdiff: simplify word-diff safeguard,
> 2011-01-11). So presumably we'd want to figure out a way to accomplish
> the same thing in a portable way. I'm not sure that's possible, though,
> without making assumptions about the regex engine.

If "use the safeguard portably" implies "make assumptions about the
regex engine," that sounds like an argument for Git to ship its own
engine with exactly the necessary features. If that implementation
includes proper locale and UTF-8 support alongside support for the
high-byte character classes, I think we would be all set…

OTOH, perhaps there is a way to express the safeguard character
classes portably?

Jonathan, can you provide more context for the safeguard? I've read
this message several times

> git's diff-words support has a detail that can be a little dangerous:
> any text not matched by a given language's tokenization pattern is
> treated as whitespace and changes in such text would go unnoticed.
> Therefore each of the built-in regexes allows a special token type
> consisting of a single non-whitespace character [^[:space:]].
>
> To make sure UTF-8 sequences remain human readable, the builtin
> regexes also have a special token type for runs of bytes with the high
> bit set.  In English, non-ASCII characters are usually isolated so
> this is analogous to the [^[:space:]] pattern, except it matches a
> single _multibyte_ character despite use of the C locale.
>
> Unfortunately it is easy to make typos or forget entirely to include
> these catch-all token types when adding support for new languages (see
> v1.7.3.5~16, userdiff: fix typo in ruby and python word regexes,
> 2010-12-18).  Avoid this by including them automatically within the
> PATTERNS and IPATTERN macros.
>
> While at it, change the UTF-8 sequence token type to match exactly one
> non-ASCII multi-byte character, rather than an arbitrary run of them.

and I can hardly make heads or tails of it.
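
My best guess at what it describes, which may well be wrong: each
language's word regex gets two extra alternatives appended, one for a
single non-ASCII multibyte character and one for any single
non-whitespace character. The sketch below is my own illustration in
Python, not code from userdiff.c. LANG_WORD is a made-up stand-in for a
language pattern, and the byte ranges are my assumption about how "one
non-ASCII multi-byte character" is spelled (a lead byte followed by
continuation bytes). Python's re takes the first alternative that
matches, whereas POSIX regexec takes the longest, so I list the
multibyte alternative first.

```python
import re

# Hypothetical stand-in for a language's word regex (not from userdiff.c).
LANG_WORD = rb"[A-Za-z_][A-Za-z0-9_]*"

# Assumed spelling of the two catch-all token types from the commit message:
# one UTF-8 lead byte plus its continuation bytes, and any one non-space byte.
MULTIBYTE = rb"[\xc0-\xff][\x80-\xbf]+"
SINGLE_NON_SPACE = rb"[^\s]"


def tokens(line, with_safeguard):
    """Split a line of bytes into word-diff tokens.

    Bytes that no alternative matches are simply skipped, which models
    "treated as whitespace" in the commit message.
    """
    pattern = LANG_WORD
    if with_safeguard:
        pattern = pattern + b"|" + MULTIBYTE + b"|" + SINGLE_NON_SPACE
    return re.findall(pattern, line)


if __name__ == "__main__":
    line = "x = café + 1".encode("utf-8")
    # Without the catch-alls, "=", "é", "+" and "1" never become tokens,
    # so changes to them would go unnoticed.
    print(tokens(line, with_safeguard=False))
    # With the catch-alls, every non-space byte lands in some token and
    # the two bytes of "é" stay together.
    print(tokens(line, with_safeguard=True))
```

If that reading is right, the byte-range alternative only does its job
when the engine compares raw bytes, which would be where the
assumptions about the regex engine come in.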

-- 
D. Ben Knoble