On Thu, Jul 25 2019, Johannes Schindelin wrote: > Hi Junio, > > On Thu, 25 Jul 2019, Junio C Hamano wrote: > >> Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes: >> >> >> OK, in short, barfing and stopping is a problem, but that flag is >> >> not the right knob to tweak. And the right knob ... >> >> >> >> > 1) We're oversupplying PCRE2_UTF now, and one such case is what's being >> >> > reported here. I.e. there's no reason I can think of for why a >> >> > fixed-string pattern should need PCRE2_UTF set when not combined >> >> > with --ignore-case. We can just not do that, but maybe I'm missing >> >> > something there. >> >> > >> >> > 2) We can do "try utf8, and fallback". A more advanced version of this >> >> > is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread) >> >> > does. I was thinking something closer to just carrying two compiled >> >> > patterns, and falling back on the ~PCRE2_UTF one if we get a >> >> > PCRE2_ERROR_UTF8_* error. >> >> >> >> ... lies somewhere along that line. I think that is very sensible. >> > >> > I am glad that everybody agrees with my original comment on ab/no-kwset >> > where I suggested that we should use our knowledge of the encoding of >> > the haystack and convert it to UTF-8 if we detect that the pattern is >> > UTF-8 encoded,... >> >> Please do not count me among "everybody", then. I did not think >> that Ævar meant to iconv the haystack when I wrote the message you >> are responding to, but if that was what he meant, I would not have >> said "very sensible". > > Okay, but in that case I cannot agree with your assessment that it is > very sensible. FWIW what I meant was not that we'd run around and iconv() things, it wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8 valid", which presumably would be the end result of something like that. Rather that this model of assuming that a UTF-8 pattern means we can consider everything in the repo UTF-8 in git-grep doesn't make sense. My kwset patches *revealed* that problem in a painful way, but it was there already. I'm not sure what a real fix for that is. Part of it is probably 8/8 in the series I mention below, but more generally we'd need to be more encoding aware at a much higher callsite than "grep". So e.g. we'd know that we match "binary" data as not-UTF-8. Now we just throw arbitrary bytes around and hope something sticks. > If we're already deciding to paper over things, I'd much rather prefer > the simpler patch, i.e. Carlo's. As I noted upthread PCRE's own docs promise undefined behavior and fire and brimstone if that patch is applied. Those last two not guaranteed. So we need another solution. I've submitted https://public-inbox.org/git/20190726150818.6373-1-avarab@xxxxxxxxx/ just now. See what you think about it.