On Wed, Nov 17 2021, Carlo Arenas wrote: > On Wed, Nov 17, 2021 at 1:01 PM Ævar Arnfjörð Bjarmason > <avarab@xxxxxxxxx> wrote: >> >> PCRE2_UTF will also matter for literal patterns. Try to peel apart the >> two bytes in "é" and match them under -i with/without PCRE_UTF. > > Is there a real use case for why someone would do that? and how is > that "literal" valid UTF to warrant setting PCRE2_UTF? We don't check the validity of the input pattern as correct UTF-8 before feeding it to PCRE. We don't even check if you're under a UTF-8 locale, a call to is_utf8_locale() is one of the conditions we'll try (but check how loose we are under NO_GETTEXT=Y even then), but it's not a necessary one. So in practice we'll likely have users pasting say Big5, UTF-32, JIS encoding or whatever into "git grep" and then grepping arbitrary binary data. I don't really have a specific use-case in mind. I'm just trying to counter the apparent recurring misconception that's popped up more than once in these recent threads that the UTF-8 *mode* is only something we need to worry about if the pattern contains non-ASCII. The implications of UTF-8 as a matching mode go much deeper than that in Perl's and PCRE's regex engines, e.g. in this case what's considered a character boundary. > I would expect that someone including random bytes in the expression > is really more interested in binary matching anyway and the use of -i > with it probably should be an error. > > Indeed I suspect the fact that pcre2_compile lets it through might be > a bug in PCRE2 > > $ pcre2test > PCRE2 version 10.39 2021-10-29 > re> /\303/utf,caseless > data> \303 > 0: \x{c3} > data> é > No match It's really not "random bytes". We're not talking about someone doing a git grep where the argument is fed from a "cat" of /dev/urandom. Just plain old boring natural language encodings in common use can and will conflict with what's considered a character boundary in UTF-8, Anyway, I haven't had time to re-page this topic at large into my wetware and really don't know what we should be doing exactly at this point to move forward, except my previous suggestion to either revert the recent changes, or at least to narrow them down so that we get the old behavior for everything except the revision.c "git log" caller. I think a really good next step aside from dealing with that immediate problem would be to harden some of our test data in a way similar to what we've got in t7816-grep-binary-pattern.sh, but for partial matches, mixtures of valid/invalid/partial character patterns and subjects etc. We might be able to lift some tests from existing test suites, perl.git's t/re/re_tests and pcre2.git's RunGrepTest (And testdata/ directory) are probably good places to start looking. We don't need to test e.g. PCRE itself independently, but it is worth testing how whatever special-sauce we add on top (if and when to turn on its flags based on heuristics) impacts things. We've also got the problem that whatever code we write will target multiple PCREv2 versions, which isn't something PCREv2 itself needs to deal with.