Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 29 Nov 2022 20:32:58 +0900

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

>> or (if chinese can not be displayed correctly)
>>
>> -  <E4><B8><BA>1
>> +  <E4><B8><BA>2
>>
>> Actual result of "git diff --color-words"
>>
>> <E4><B8>[-<BA>1-]{+<BA>2+}
>> ...
> I think we could provide new ways to do per-language diffs, right now
> you can use --word-diff-regex, but it would be handy to e.g. have a
> built-in collection of those (or other non-regex boundary algorithms)
> for Chinese etc.

I think you are making this more complicated than it needs to be.

The only thing that needs noticing in the above example, I think, is
that the three-byte sequence E4-B8-BA is supposed to be a single
Unicode character, and the actual result depicted can happen only if
we (incorrectly) chop that single character in the middle.

No matter what language we are using, we shouldn't do that.

I suspect that the "--word-diff" internals are not even aware of what
a character is, but if you assume UTF-8 (precomposed), then you should
be able to tell where the character boundaries are just by looking at
the high-bit pattern of each byte, and so avoid producing such output.
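
To illustrate the high-bit check, here is a rough sketch. It is not
what the word-diff code does today; the function names are made up for
this example, and it assumes the input is valid UTF-8. In UTF-8, every
continuation byte has the form 10xxxxxx, so a cut point that lands on
such a byte sits in the middle of a character and can be moved back to
the byte that starts it.

```python
def is_utf8_continuation(byte: int) -> bool:
    """Return True for bytes of the form 10xxxxxx.

    Such bytes never start a character in valid UTF-8.
    """
    return (byte & 0xC0) == 0x80


def snap_to_char_start(buf: bytes, pos: int) -> int:
    """Move a proposed cut point left until it sits on a character start.

    Hypothetical helper: assumes buf holds valid UTF-8. A pos equal to
    len(buf) (cut at the very end) is returned unchanged.
    """
    while 0 < pos < len(buf) and is_utf8_continuation(buf[pos]):
        pos -= 1
    return pos


# The bytes from the report: E4 B8 BA followed by an ASCII digit.
old = b"\xe4\xb8\xba1"

# Offset 2 points at BA, a continuation byte, so cutting there splits
# the character. Snapping moves the cut back to offset 0, where E4
# begins the three-byte sequence.
cut = snap_to_char_start(old, 2)
```

With the cut moved to offset 0, the whole E4-B8-BA sequence ends up on
one side of the boundary, regardless of which language the text is in.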