On Mon, Jul 01 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
>> This v3 has a new patch (3/10) that I believe fixes the regression on
>> MinGW Johannes noted in
>> https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@xxxxxxxxxxxxxxxxx/
>>
>> As noted in the updated commit message in 10/10 I believe just
>> skipping this test & documenting this in a commit message is the least
>> amount of suck for now. It's really an existing issue with us doing
>> nothing sensible when the log/grep haystack encoding doesn't match the
>> needle encoding supplied via the command line.
>
> Is that quite the case? If they do not match, not finding the match
> is the right answer, because we are byte-for-byte matching/searching
> IIUC.
>
>> We swept that under the carpet with the kwset backend, but PCRE v2
>> exposes it.
>
> Is it exposing, or just showing the limitation of the rewritten
> implementation where it cannot do byte-for-byte matching/searching
> as we used to be able to?
>
> Without having a way to know what encoding is used on the command
> line, there is no sensible way to reencode them to match the
> haystack encoding (even when it is known), so "you got to feed the
> strings in the same encoding, as we are going to match/search
> byte-for-byte" is the only sensible way to work, given the design
> space, I would think.
>
> Not that it is all that useful to be able to match/search
> byte-for-byte, of course, so I am OK if we punt with these tests,
> but I'd prefer to see us admit we are punting when we do ;-).

I'm guilty as charged of punting on this larger encoding issue. As it
pertains to this patch series it unearths an obscure case I think
nobody cares about in practice, and I'd like to move on with the
"remove kwset" optimization.
But I strongly believe that the new behavior with the PCRE v2
optimization is the only sane thing to do, and to the extent we have
anything left to do (#leftoverbits) it's that we should modify git
more generally (aside from string searching) to do the same thing
where appropriate.

Remember, this only happens if the user has set a UTF-8 locale and
thus promised that they're going to give us UTF-8. We then take that
promise and make e.g. "æ" match "Æ" under --ignore-case.

Just falling back on raw byte matching isn't going to cut it, because
then "æ<invalid utf8>" won't match "Æ<same invalid utf8>" under
--ignore-case, and there are other cases like that with matching word
boundaries & other Unicode gotchas. The best that can be hoped for at
that point is some "loose UTF-8" mode. I see both perl & GNU grep seem
to support that (although I'm sure it falls apart at some point).

GNU grep will also die in the same way that we now die with
--perl-regexp (since it also uses PCRE). I think that's saner: if the
user thinks they're feeding us UTF-8 but they're not, I think they'd
like to know, rather than having the string matching library fall
back.
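As a rough illustration of the difference being argued here (a sketch
in Python rather than git's C, using the stdlib `re` module as a
stand-in for PCRE v2's UTF mode; this is not git's actual code):

```python
import re

# Unicode-aware case folding, as PCRE v2 does in UTF mode: the needle
# "æ" (U+00E6) matches the haystack "Æ" (U+00C6) case-insensitively.
assert re.search("æ", "Æ", re.IGNORECASE) is not None

# Byte-for-byte matching with ASCII-only case folding, roughly what the
# old kwset backend was limited to: the raw UTF-8 bytes of "æ"
# (0xC3 0xA6) can never match those of "Æ" (0xC3 0x86).
assert re.search("æ".encode(), "Æ".encode(), re.IGNORECASE) is None
```

The same gap shows up for word boundaries and other Unicode-aware
constructs: a byte-wise fallback simply has no notion of them.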