Hi,

On Wed, 24 Jul 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
> > The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
> > UTF-8 string so you, PCRE, don't need to re-check it".
>
> OK, in short, barfing and stopping is a problem, but that flag is
> not the right knob to tweak. And the right knob ...
>
> > 1) We're oversupplying PCRE2_UTF now, and one such case is what's being
> >    reported here. I.e. there's no reason I can think of for why a
> >    fixed-string pattern should need PCRE2_UTF set when not combined
> >    with --ignore-case. We can just not do that, but maybe I'm missing
> >    something there.
> >
> > 2) We can do "try utf8, and fallback". A more advanced version of this
> >    is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
> >    does. I was thinking something closer to just carrying two compiled
> >    patterns, and falling back on the ~PCRE2_UTF one if we get a
> >    PCRE2_ERROR_UTF8_* error.
>
> ... lies somewhere along that line. I think that is very sensible.

I am glad that everybody agrees with my original comment on ab/no-kwset,
where I suggested that we should use our knowledge of the encoding of the
haystack and convert it to UTF-8 if we detect that the pattern is UTF-8
encoded, and then pass the PCRE2_UTF flag only when applicable (i.e. when
we know that either needle or haystack is non-ASCII, and then make sure
that we convert to UTF-8 whenever necessary).

Okay, that came across a bit more sarcastic than I originally intended, but
if you try to filter that out, I think that is still a better solution
than papering over the issue.

After all, PCRE2_MATCH_INVALID_UTF is only marginally better than what
Carlo suggested. _Marginally_. Not really worth considering, in my mind,
even.

> Let's make sure this gets sorted out soonish.

I agree,
Dscho
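
To make the "pass the UTF flag only when applicable" idea concrete, here is
a rough sketch. It is not Git code and does not use PCRE2: it uses Python's
`re` module as a stand-in, where matching on `str` plays the role of UTF
mode and matching on `bytes` plays the role of the non-UTF fallback. The
function names (`has_non_ascii`, `is_valid_utf8`, `grep_line`) are
hypothetical, and the sketch only checks UTF-8 validity rather than
converting from a known haystack encoding.

```python
import re


def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte is outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def grep_line(pattern: bytes, line: bytes, ignore_case: bool = False) -> bool:
    """Match pattern against line, using "UTF mode" only when applicable.

    "UTF mode" here means decoding both sides and matching on str. It is
    chosen only if needle or haystack contains non-ASCII bytes AND both
    are valid UTF-8. Otherwise we match on raw bytes, so invalid UTF-8
    in the haystack never raises an error.
    """
    flags = re.IGNORECASE if ignore_case else 0
    use_utf = (
        (has_non_ascii(pattern) or has_non_ascii(line))
        and is_valid_utf8(pattern)
        and is_valid_utf8(line)
    )
    if use_utf:
        return re.search(pattern.decode("utf-8"), line.decode("utf-8"), flags) is not None
    # Byte-mode fallback: case folding only applies to ASCII letters here.
    return re.search(pattern, line, flags) is not None
```

With this sketch, a haystack containing invalid UTF-8 such as
`b"\xff foo \xfe"` is still searched in byte mode instead of aborting,
while a case-insensitive search for "é" in "CAFÉ" (both valid UTF-8) takes
the UTF path, where non-ASCII case folding works.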