Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

> FWIW what I meant was not that we'd run around and iconv() things, it
> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
> valid", which presumably would be the end result of something like that.
>
> Rather that this model of assuming that a UTF-8 pattern means we can
> consider everything in the repo UTF-8 in git-grep doesn't make sense. My
> kwset patches *revealed* that problem in a painful way, but it was there
> already.

We already do assume that pathnames are UTF-8 (pathspecs on macOS are
converted and then they are matched assuming that property).
Further, with the same mechanism, I think there is an assumption
that anything that comes from the command line is UTF-8 (and if I
recall correctly, doesn't the Windows port of Git force us to use
the same assumption? I recall we needed to tweak tests for that).

In the much longer term, I do not think we would want to keep the
assumption that the text encoding of blobs is always UTF-8, and it
would be nice to extend the system so that blob data could be
marked in some way to say "I'm in Big-5, and not in UTF-8, so please
treat me as such", and magically the needle and the haystack can be
made to agree by running iconv() on either one of them.

But I do not think the current topic, which is to fix the
immediate/imminent breakage, should be distracted by that. Let's
keep assuming that any blob, when it is text, is UTF-8.

And from that point of view, I think the two ideas in your earlier
message do make sense. We can try to match as binary most of the
time, as UTF-8 would not let a valid UTF-8 needle match in the
haystack starting in the middle of a character.

When the user is trying to match case-insensitively, we know the
haystack in which the user is interested in finding the needle is
text, even though there may be non-text blobs as well. For example,
"git grep -i 'foo' t/" may find a few png files under the t/
directory. We do not care if they happen to contain Foo, and we do
not mind whether they appear in the result or not. The only two
things we care about are (1) foo, Foo, FOO are found in the text
files under t/ and (2) the command does not die in the middle,
before processing all the files, only because a png file it found
was not valid UTF-8.
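
To illustrate the behaviour I am describing, here is a rough sketch
in Python (not how grep.c does it, and match_lines() is a made-up
name). It matches bytes when the search is case-sensitive, and for
-i it tries UTF-8 first and falls back to an ASCII-only fold when a
blob is not valid UTF-8, instead of dying. The fallback policy is
just one assumption about what "do not die" could look like.

```python
def match_lines(needle: str, blob: bytes, ignore_case: bool = False) -> list[int]:
    """Return 1-based numbers of the lines in blob that contain needle.

    Hypothetical helper for illustration only; not Git's implementation.
    """
    raw = needle.encode("utf-8")

    if not ignore_case:
        # Byte-wise match. UTF-8 is self-synchronizing, so a valid
        # UTF-8 needle cannot match starting in the middle of a
        # character of a valid UTF-8 haystack.
        return [n for n, line in enumerate(blob.split(b"\n"), 1) if raw in line]

    try:
        text = blob.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (e.g. a png): keep going rather than die.
        # bytes.lower() folds ASCII letters only, which is all we can
        # sensibly do without knowing the encoding.
        folded = raw.lower()
        return [
            n for n, line in enumerate(blob.split(b"\n"), 1)
            if folded in line.lower()
        ]

    # Valid UTF-8 text: do a proper Unicode case-insensitive match.
    folded_needle = needle.casefold()
    return [
        n for n, line in enumerate(text.split("\n"), 1)
        if folded_needle in line.casefold()
    ]
```

With something like this, a text blob gets foo, Foo and FOO matched
under -i, and a blob that starts with the png signature bytes merely
produces whatever the byte-wise fallback happens to find.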