Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese

Jeff King <peff@xxxxxxxx> · Tue, 29 Nov 2022 13:23:27 -0500

On Tue, Nov 29, 2022 at 08:32:58PM +0900, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
> 
> >> or (if chinese can not be displayed correctly)
> >>
> >> -  <E4><B8><BA>1
> >> +  <E4><B8><BA>2
> >>
> >> Actual result of "git diff --color-words"
> >>
> >> <E4><B8>[-<BA>1-]{+<BA>2+}
> >> ...
> > I think we could provide new ways to do per-language diffs, right now
> > you can use --word-diff-regex, but it would be handy to e.g. have a
> > built-in collection of those (or other non-regex boundary algorithms)
> > for Chinese etc.
> 
> I think you are thinking about it with unnecessary complexity.
> 
> The only thing that needs noticing in the above example, I think is,
> that the three-byte sequence E4-B8-BA in the example is supposed to
> be a single unicode character, and the actual result depicted can
> happen only if we (incorrectly) chomp that single character in the
> middle.
> 
> No matter what language we are using, we shouldn't do that.
> 
> I suspect that "--word-diff" internal is not even aware what a
> character is, but if you assume UTF-8 (precomposed), then you should
> be able to tell where the character boundary is by only looking at
> the high-bit patterns to avoid producing such an output.

Agreed that we should probably avoid breaking characters. But what
puzzles me more is that we break it between B8 and BA, and not
elsewhere. Why not between E4 and B8? Why not between BA and "1"?

If the rule is "break on ascii whitespace", then I'd have expected the
whole four-byte sequence to be taken as a unit. In other words, it
should not have to care what a character is, as long as the bytes
for space characters cannot appear inside other characters (which is
true of UTF-8).
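
To make that property concrete, here is a rough sketch in Python. It is
not git's word-diff code, and the helper names are made up for
illustration. It only shows that a byte-wise split on ASCII whitespace
keeps E4-B8-BA-"1" together, because every byte of a multi-byte UTF-8
character has its high bit set and so can never equal an ASCII space,
tab, or newline.

```python
ASCII_WHITESPACE = b" \t\n\r\x0b\x0c"


def split_on_ascii_whitespace(data: bytes) -> list[bytes]:
    """Split raw bytes into 'words' without knowing anything about characters.

    bytes.split() with no argument splits on runs of ASCII whitespace only.
    """
    return data.split()


def is_continuation_byte(b: int) -> bool:
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


def has_high_bit(b: int) -> bool:
    """True for any byte that is part of a multi-byte UTF-8 character."""
    return b >= 0x80


# The two lines from the report, as raw bytes: E4 B8 BA followed by "1" or "2".
old_line = b"  \xe4\xb8\xba1\n"
new_line = b"  \xe4\xb8\xba2\n"

old_words = split_on_ascii_whitespace(old_line)
new_words = split_on_ascii_whitespace(new_line)

# Each line yields exactly one four-byte word; nothing is cut between B8 and BA.
print(old_words, new_words)
```

With a whitespace-only rule like this, the diff would be between the two
whole four-byte words, so the split between B8 and BA in the report
presumably comes from somewhere else.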

-Peff