On Thu, Feb 02, 2023 at 05:22:37PM +0100, demerphq wrote:

> I've been lurking watching some of the regex discussion on the list
> and personally I think it is asking for trouble to use "whatever regex
> engine is traditional in a given environment" instead of just choosing
> a good open source engine and using it consistently everywhere. I
> don't really buy the arguments I have seen to justify a policy of "use
> the standard library version"; regex engines vary widely in
> performance and implementation and feature set, and even the really
> good ones do not entirely agree on every semantic[1], so if you don't
> standardize you will be forever dealing with bugs related to those
> differences.

I think this is a perennial question for portable software: is it
better to be consistent across platforms (by shipping our own regex
engine), or consistent with other programs on the same platform (by
using the system regex). I don't have a strong opinion either way.

The main concern I'd have is handling dependencies. I like pcre a lot,
but I'm not sure that I would want building Git to require pcre on
every platform. If there's an engine we can ship as a vendored
dependency that builds everywhere, that helps.

We have the engine imported from gawk in compat/regex. That _probably_
builds everywhere (though we don't really know, because any platform
that doesn't set NO_REGEX has been happily using the system routines).
But it also may not be the best choice; avoiding its multi-byte
handling was the reason behind 1819ad327 in the first place.

> I think the git project should choose the feature set[2] it thinks are
> important, and then choose a regex engine that provides those features
> and is well supported, and then use it consistently everywhere that
> git needs to do regex based matching. Anything else is asking for
> trouble at some level or another.

IMHO the biggest issue here is that the built-in userdiff regexes are
doing something a bit questionable, which is embedding high-bit
characters directly into the regex. If we can avoid that, then having
consistency in multi-byte handling across platforms becomes a lot less
important.

-Peff