Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

>> or (if chinese can not be displayed correctly)
>>
>> -  <E4><B8><BA>1
>> +  <E4><B8><BA>2
>>
>> Actual result of "git diff --color-words"
>>
>> <E4><B8>[-<BA>1-]{+<BA>2+}
>> ...
> I think we could provide new ways to do per-language diffs, right now
> you can use --word-diff-regex, but it would be handy to e.g. have a
> built-in collection of those (or other non-regex boundary algorithms)
> for Chinese etc.

I think you are making this unnecessarily complex.

The only thing that needs noticing in the above example, I think, is
that the three-byte sequence E4-B8-BA is supposed to be a single
Unicode character, and the actual result depicted can happen only if
we (incorrectly) split that single character in the middle.

No matter what language we are using, we shouldn't do that.

I suspect that the "--word-diff" internals are not even aware of what
a character is, but if you assume UTF-8 (precomposed), then you should
be able to tell where each character boundary is just by looking at
the high-bit pattern of each byte (continuation bytes always look like
10xxxxxx), and so avoid producing such output.
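
To illustrate the high-bit check described above, here is a minimal sketch in Python. It is not how git's word-diff code is written; the function names (is_continuation, snap_to_char_start, split_chars) are hypothetical, and the input is assumed to be valid UTF-8. The sample bytes are the ones from the report: E4 B8 BA followed by "1" or "2".

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes, whose top two bits are 10."""
    return (byte & 0xC0) == 0x80


def snap_to_char_start(data, pos):
    """Move pos backwards until it no longer points into the middle of a character."""
    while 0 < pos < len(data) and is_continuation(data[pos]):
        pos -= 1
    return pos


def split_chars(data):
    """Split a UTF-8 byte string into one bytes object per character."""
    chars = []
    start = 0
    for i in range(1, len(data) + 1):
        # A character ends where the input ends or where the next
        # byte is not a continuation byte.
        if i == len(data) or not is_continuation(data[i]):
            chars.append(data[start:i])
            start = i
    return chars


old = b"\xe4\xb8\xba1"  # E4 B8 BA 31
new = b"\xe4\xb8\xba2"  # E4 B8 BA 32

# Offset 2 is where the reported output split the text; it lies inside
# the three-byte character, so it snaps back to offset 0.
boundary = snap_to_char_start(old, 2)

old_chars = split_chars(old)
new_chars = split_chars(new)
```

With the input split this way, the three-byte character is one unit that is identical on both sides, so only the trailing "1" and "2" would differ.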


