Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)

Jeff King <peff@xxxxxxxx> · Fri, 3 Feb 2023 12:01:37 -0500

On Thu, Feb 02, 2023 at 05:22:37PM +0100, demerphq wrote:

> I've been lurking watching some of the regex discussion on the list
> and personally I think it is asking for trouble to use "whatever regex
> engine is traditional in a given environment" instead of just choosing
> a good open source engine and using it consistently everywhere.  I
> don't really buy the arguments I have seen to justify a policy of "use
> the standard library version"; regex engines vary widely in
> performance and implementation and feature set, and even the really
> good ones do not entirely agree on every semantic[1], so if you don't
> standardize you will be forever dealing with bugs related to those
> differences.

I think this is a perennial question for portable software: is it better
to be consistent across platforms (by shipping our own regex engine), or
consistent with other programs on the same platform (by using the system
regex)?

I don't have a strong opinion either way. The main concern I'd have is
handling dependencies. I like pcre a lot, but I'm not sure that I would
want building Git to require pcre on every platform. If there's an
engine we can ship as a vendored dependency that builds everywhere, that
helps. We have the engine imported from gawk in compat/regex. That
_probably_ builds everywhere (though we don't really know, because any
platform that doesn't set NO_REGEX has been happily using the system
routines). But it also may not be the best choice; avoiding its
multi-byte handling was the reason behind 1819ad327 in the first place.

> I think the git project should choose the feature set[2] it thinks are
> important, and then choose a regex engine that provides those features
> and is well supported, and then use it consistently everywhere that
> git needs to do regex based matching. Anything else is asking for
> trouble at some level or another.

IMHO the biggest issue here is that the built-in userdiff regexes are
doing something a bit questionable, which is embedding high-bit
characters directly into the regex. If we can avoid that, then having
consistency in multi-byte handling across platforms becomes a lot less
important.
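
To make that concrete, here is a minimal sketch of the kind of check I
mean. It is not code from git.git: has_high_bit() and try_compile() are
names made up for illustration, and the only library calls are POSIX
regcomp() and regfree().

```c
#include <regex.h>

/* Return 1 if the pattern contains any byte with the high bit set. */
static int has_high_bit(const char *pattern)
{
	const unsigned char *p = (const unsigned char *)pattern;

	for (; *p; p++)
		if (*p & 0x80)
			return 1;
	return 0;
}

/*
 * Try to compile the pattern as a POSIX extended regex with whatever
 * regcomp() the platform provides. Returns 0 on success, or the
 * regcomp() error code otherwise.
 */
static int try_compile(const char *pattern)
{
	regex_t re;
	int err = regcomp(&re, pattern, REG_EXTENDED);

	if (!err)
		regfree(&re);
	return err;
}
```

An all-ASCII pattern such as "[a-zA-Z_][a-zA-Z0-9_]*" should compile
with any POSIX regcomp(). For an illustrative pattern with raw high-bit
bytes, say "[\xc0-\xff][\x80-\xbf]+", has_high_bit() returns 1, and
whether try_compile() succeeds depends on the platform's engine and the
current locale. That dependence is the part I would like us to avoid.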

-Peff