Re: [PATCH] grep: skip UTF8 checks explicitally

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 25 Jul 2019 11:48:16 +0200 (CEST)

Hi,

On Wed, 24 Jul 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
> > The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
> > UTF-8 string so you, PCRE, don't need to re-check it".
>
> OK, in short, barfing and stopping is a problem, but that flag is
> not the right knob to tweak.  And the right knob ...
>
> >  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
> >     reported here. I.e. there's no reason I can think of for why a
> >     fixed-string pattern should need PCRE2_UTF set when not combined
> >     with --ignore-case. We can just not do that, but maybe I'm missing
> >     something there.
> >
> >  2) We can do "try utf8, and fallback". A more advanced version of this
> >     is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
> >     does. I was thinking something closer to just carrying two compiled
> >     patterns, and falling back on the ~PCRE2_UTF one if we get a
> >     PCRE2_ERROR_UTF8_* error.
>
> ... lies somewhere along that line.  I think that is very sensible.

I am glad that everybody agrees with my original comment on ab/no-kwset,
where I suggested that we should use our knowledge of the encoding of
the haystack and convert it to UTF-8 if we detect that the pattern is
UTF-8 encoded, and then pass the PCRE2_UTF flag only when applicable
(i.e. when we know that either the needle or the haystack is non-ASCII,
while making sure that we convert to UTF-8 whenever necessary).

Okay, that came across as a bit more sarcastic than I originally
intended, but if you filter that out, I think it is still a better
solution than papering over the issue.

After all, PCRE2_MATCH_INVALID_UTF is only marginally better than what
Carlo suggested. _Marginally_. Not really worth considering, in my mind,
even.
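
To make the "only when applicable" part of the suggestion concrete, here
is a rough sketch of the decision. The helper names has_non_ascii() and
want_utf() are made up for illustration and are not existing Git
functions. The sketch assumes both buffers are already in UTF-8; the
re-encoding of the haystack is not shown.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if any byte in the buffer has the
 * high bit set, i.e. the buffer is not pure ASCII.
 */
static int has_non_ascii(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((unsigned char)buf[i] & 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: returns 1 if the caller should pass PCRE2_UTF
 * when compiling the pattern, i.e. if either the needle or the
 * haystack contains non-ASCII bytes. Assumes both have already been
 * converted to UTF-8 where necessary.
 */
static int want_utf(const char *needle, size_t needle_len,
		    const char *haystack, size_t haystack_len)
{
	return has_non_ascii(needle, needle_len) ||
	       has_non_ascii(haystack, haystack_len);
}
```

A caller would then OR PCRE2_UTF into the compile options only when
want_utf() returns 1, so that pure-ASCII searches never trigger PCRE2's
UTF-8 validity check in the first place.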

> Let's make sure this gets sorted out soonish.

I agree,
Dscho