Re: [PATCH] grep: skip UTF8 checks explicitally

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 26 Jul 2019 09:19:46 -0700

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

> FWIW what I meant was not that we'd run around and iconv() things, it
> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
> valid", which presumably would be the end result of something like that.
>
> Rather that this model of assuming that a UTF-8 pattern means we can
> consider everything in the repo UTF-8 in git-grep doesn't make sense. My
> kwset patches *revealed* that problem in a painful way, but it was there
> already.

We already do assume that pathnames are UTF-8 (pathspecs on macOS
are converted and then matched assuming that property).
Further, with the same mechanism, I think there is an assumption
that anything that comes from the command line is UTF-8 (and if I
recall correctly, doesn't the Windows port of Git force us to use
the same assumption?  I recall we needed to tweak tests for that).

In the much longer term, I do not think we would want to keep
the assumption that the text encoding of blobs is always UTF-8, and
it would be nice to extend the system so that blob data could be
marked in some way to say "I'm in Big-5, and not in UTF-8, so please
treat me as such", and magically the needle and the haystack can be
made to agree by running iconv() on either one of them.
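
To make that long-term idea concrete, here is a rough sketch in Python
of bringing the haystack over to the needle's side.  The helper name
and the blob_encoding argument are hypothetical (they stand in for
whatever marking such a system would provide), and Python's codecs
stand in for iconv().

```python
def find_in_marked_blob(needle: str, blob: bytes, blob_encoding: str) -> int:
    """Return the character index of needle in blob, or -1 if absent.

    blob_encoding is a hypothetical per-blob marker ("I'm in Big-5").
    The haystack is converted so that both sides agree before matching.
    """
    try:
        text = blob.decode(blob_encoding)
    except UnicodeDecodeError:
        # The marker was wrong or the blob is not text: report no match
        # rather than failing.
        return -1
    return text.find(needle)


needle = "中文"
blob = "這是中文測試".encode("big5")  # a blob stored as Big-5, not UTF-8

# Searching for the UTF-8 bytes of the needle in the raw Big-5 blob.
raw_hit = needle.encode("utf-8") in blob
# Searching after the haystack has been converted.
converted_hit = find_in_marked_blob(needle, blob, "big5")
```

Converting the haystack rather than the needle is a choice made for
this sketch only; the text above leaves open which side gets converted.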

But I do not think the current topic to fix the immediate/imminent
breakage should be distracted by that.  Let's keep assuming that
any blob, when it is text, is UTF-8.

And from that point of view, I think the two ideas in your
earlier message do make sense.  We can try to match as binary most
of the time, as UTF-8 would not let a valid UTF-8 needle match in
the haystack starting in the middle of a character.  When the user
is trying to match case-insensitively, we know the haystack in which
the user is interested in finding the needle is text, even though
there may be non-text blobs as well.
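
The property relied on above (a valid UTF-8 needle cannot match
starting in the middle of a character of a valid UTF-8 haystack) can
be checked with a small sketch in Python.  The helper names and the
sample strings are made up for illustration; this is not how git-grep
is implemented.

```python
def char_boundary_offsets(text: str) -> set:
    """Byte offsets at which a character starts in the UTF-8 form of text."""
    offsets, pos = set(), 0
    for ch in text:
        offsets.add(pos)
        pos += len(ch.encode("utf-8"))
    return offsets


def bytewise_find_all(needle: bytes, haystack: bytes) -> list:
    """Every byte offset where needle occurs in haystack, with no decoding."""
    hits, start = [], haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


haystack_text = "naïve café, Ævar Arnfjörð, 日本語のテキスト, café"
needle_text = "café"

# Match as binary: both sides are plain bytes, no UTF-8 validation is done.
hits = bytewise_find_all(needle_text.encode("utf-8"),
                         haystack_text.encode("utf-8"))

# Every hit lands on a character boundary, because a valid needle begins
# with an ASCII or lead byte and never with a continuation byte.
boundaries = char_boundary_offsets(haystack_text)
on_boundary = all(offset in boundaries for offset in hits)
```

This only covers case-sensitive matching; folding case for non-ASCII
characters is where the matcher needs to know the haystack is text.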

For example, "git grep -i 'foo' t/" may find a few png files under
the t/ directory.  We do not care if they happen to contain Foo and
we do not mind if they appear or do not appear in the result.  The
only two things we care about are (1) foo, Foo, FOO are found in the
text files under t/ and (2) the command does not die in the middle,
before processing all the files, only because a png file it found
was not valid UTF-8.
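
As a rough illustration of those two requirements, here is a sketch in
Python.  The function name, the fallback strategy (ASCII-only case
folding on blobs that fail to decode), and the sample paths and
contents are all assumptions made for this example, not a description
of what git-grep or PCRE does.

```python
def grep_i(needle: str, blobs: dict) -> list:
    """Return the paths of blobs that contain needle, ignoring case.

    blobs maps a path to raw bytes.  Invalid UTF-8 never aborts the run.
    """
    folded = needle.casefold()
    ascii_needle = needle.encode("utf-8").lower()  # folds ASCII letters only
    matches = []
    for path, data in blobs.items():
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8 (for example a png): fall back to a bytewise
            # match with ASCII-only folding and keep going.
            if ascii_needle in data.lower():
                matches.append(path)
            continue
        if folded in text.casefold():
            matches.append(path)
    return matches


blobs = {
    "t/README": b"see Foo for details\n",
    "t/notes.txt": "FOO and caf\u00e9\n".encode("utf-8"),
    "t/test.png": b"\x89PNG\r\n\x1a\n\xff\xfe\x00binary",
    "t/other.txt": b"nothing here\n",
}
result = grep_i("foo", blobs)
```

Whether the png shows up in the result is incidental here; what matters
is that the text files are searched case-insensitively and that the
loop reaches every file.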