Re: [PATCH] grep: skip UTF8 checks explicitally

Hi,

On Wed, 24 Jul 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
> > The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
> > UTF-8 string so you, PCRE, don't need to re-check it".
>
> OK, in short, barfing and stopping is a problem, but that flag is
> not the right knob to tweak.  And the right knob ...
>
> >  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
> >     reported here. I.e. there's no reason I can think of for why a
> >     fixed-string pattern should need PCRE2_UTF set when not combined
> >     with --ignore-case. We can just not do that, but maybe I'm missing
> >     something there.
> >
> >  2) We can do "try utf8, and fallback". A more advanced version of this
> >     is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
> >     does. I was thinking something closer to just carrying two compiled
> >     patterns, and falling back on the ~PCRE2_UTF one if we get a
> >     PCRE2_ERROR_UTF8_* error.
>
> ... lies somewhere along that line.  I think that is very sensible.

I am glad that everybody agrees with my original comment on ab/no-kwset
where I suggested that we should use our knowledge of the encoding of
the haystack and convert it to UTF-8 if we detect that the pattern is
UTF-8 encoded, and then pass the PCRE2_UTF flag only when applicable
(i.e. when we know that either needle or haystack is non-ASCII, and then
making sure that we convert to UTF-8 whenever necessary).
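
To make the "only when applicable" part concrete, here is a minimal
sketch in C. The helpers has_non_ascii() and want_utf_flag() are
hypothetical names made up for illustration, not functions in git's
grep.c, and the sketch covers only the decision whether to request
PCRE2_UTF, not the re-encoding to UTF-8:

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of git): return 1 if any byte in
 * buf[0..len) has its high bit set, i.e. the buffer is not pure ASCII.
 */
static int has_non_ascii(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((unsigned char)buf[i] & 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical helper (not part of git): UTF mode is only worth
 * requesting when the needle or the haystack contains non-ASCII
 * bytes. A caller would add PCRE2_UTF to its compile options only
 * when this returns 1, after making sure both sides are UTF-8.
 */
static int want_utf_flag(const char *needle, size_t needle_len,
			 const char *haystack, size_t haystack_len)
{
	return has_non_ascii(needle, needle_len) ||
	       has_non_ascii(haystack, haystack_len);
}
```

With pure-ASCII needle and haystack this returns 0, so the UTF validity
check (and the error it can raise on invalid input) would never come
into play for that case.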

Okay, that came across a bit more sarcastic than I originally intended,
but if you filter that out, I think it is still a better solution than
papering over the issue.

After all, PCRE2_MATCH_INVALID_UTF is only marginally better than what
Carlo suggested. _Marginally_. In my mind, it is not even really worth
considering.

> Let's make sure this gets sorted out soonish.

I agree,
Dscho
