Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 18 Nov 2021 01:00:26 +0100

On Wed, Nov 17 2021, Carlo Arenas wrote:

> On Wed, Nov 17, 2021 at 1:01 PM Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>>
>> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
>> two bytes in "é" and match them under -i with/without PCRE_UTF.
>
> Is there a real use case for why someone would do that? and how is
> that "literal" valid UTF to warrant setting PCRE2_UTF?

We don't check the validity of the input pattern as correct UTF-8 before
feeding it to PCRE. We don't even check if you're under a UTF-8 locale,
a call to is_utf8_locale() is one of the conditions we'll try (but check
how loose we are under NO_GETTEXT=Y even then), but it's not a necessary
one.

So in practice we'll likely have users pasting say Big5, UTF-32, JIS
encoding or whatever into "git grep" and then grepping arbitrary binary
data.

I don't really have a specific use-case in mind. I'm just trying to
counter the apparent recurring misconception that's popped up more than
once in these recent threads that the UTF-8 *mode* is only something we
need to worry about if the pattern contains non-ASCII.

The implications of UTF-8 as a matching mode go much deeper than that in
Perl's and PCRE's regex engines, e.g. in this case what's considered a
character boundary.

> I would expect that someone including random bytes in the expression
> is really more interested in binary matching anyway and the use of -i
> with it probably should be an error.
>
> Indeed I suspect the fact that pcre2_compile lets it through might be
> a bug in PCRE2
>
> $ pcre2test
> PCRE2 version 10.39 2021-10-29
>   re> /\303/utf,caseless
> data> \303
>  0: \x{c3}
> data> é
> No match

It's really not "random bytes".

We're not talking about someone doing a git grep where the argument is
fed from a "cat" of /dev/urandom. Just plain old boring natural language
encodings in common use can and will conflict with what's considered a
character boundary in UTF-8,

Anyway, I haven't had time to re-page this topic at large into my
wetware and really don't know what we should be doing exactly at this
point to move forward, except my previous suggestion to either revert
the recent changes, or at least to narrow them down so that we get the
old behavior for everything except the revision.c "git log" caller.

I think a really good next step aside from dealing with that immediate
problem would be to harden some of our test data in a way similar to
what we've got in t7816-grep-binary-pattern.sh, but for partial matches,
mixtures of valid/invalid/partial character patterns and subjects etc.

We might be able to lift some tests from existing test suites,
perl.git's t/re/re_tests and pcre2.git's RunGrepTest (And testdata/
directory) are probably good places to start looking.

We don't need to test e.g. PCRE itself independently, but it is worth
testing how whatever special-sauce we add on top (if and when to turn on
its flags based on heuristics) impacts things.

We've also got the problem that whatever code we write will target
multiple PCREv2 versions, which isn't something PCREv2 itself needs to
deal with.