Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese

Jeff King <peff@xxxxxxxx> · Tue, 29 Nov 2022 13:54:21 -0500

On Tue, Nov 29, 2022 at 01:23:27PM -0500, Jeff King wrote:

> > I suspect that the "--word-diff" internals are not even aware what a
> > character is, but if you assume UTF-8 (precomposed), then you should
> > be able to tell where the character boundary is by only looking at
> > the high-bit patterns to avoid producing such an output.
> 
> Agreed that we should probably avoid breaking characters. But what
> puzzles me more is that we break it between B8 and BA, and not
> elsewhere. Why not between E4 and B8? Why not between BA and "1"?
> 
> If the rule is "break on ascii whitespace", then I'd have expected the
> whole four-byte sequence to be taken as a unit. In other words, it
> should not have to care what a character is, as long as the bytes
> for space characters cannot appear inside other characters (which is
> true of UTF-8).
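
To make the high-bit argument concrete, here is a rough sketch that
classifies each byte of the example by its top bits. It is not taken
from git's word-diff code; the helper name byte_kind is made up for
illustration, and it does not try to reject invalid lead bytes.

```sh
# Hypothetical helper: classify one byte (two hex digits) by its
# high-bit pattern, assuming the input is UTF-8.
byte_kind() {
    b=$((0x$1))
    if [ $((b & 0x80)) -eq 0 ]; then
        echo ascii          # 0xxxxxxx: a complete one-byte character
    elif [ $((b & 0xC0)) -eq $((0x80)) ]; then
        echo continuation   # 10xxxxxx: never the start of a character
    else
        echo lead           # 11xxxxxx: starts a multi-byte character
    fi
}

# The four bytes from the example: E4 B8 BA followed by "1" (0x31).
for byte in e4 b8 ba 31; do
    printf '%s %s\n' "$byte" "$(byte_kind "$byte")"
done
# e4 lead
# b8 continuation
# ba continuation
# 31 ascii
```

Every byte of a multi-byte character has its high bit set, so none of
them can be mistaken for an ASCII space, tab, or newline. A split
between B8 and BA lands between two continuation bytes, which a
boundary check like this one would rule out.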

Even more puzzling is that it produces the expected output for me:

  [note that \x is a bash-ism]
  $ printf '\xe4\xb8\xba1' >one
  $ printf '\xe4\xb8\xba2' >two
  $ git diff --no-index --word-diff one two
  diff --git a/one b/two
  index 9ae469fc41..576e6e32d8 100644
  --- a/one
  +++ b/two
  @@ -1 +1 @@
  [-为1-]{+为2+}

I wonder if OP has diff.wordRegex config (or attributes triggering a
diff.*.wordRegex) that is doing something else.
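
Something like the following might help check for that. It is only a
sketch: the function name show_word_regex_config is made up, and the
check-attr step assumes it is run inside the repository that holds
the file in question.

```sh
# Hypothetical helper: list settings that can change how --word-diff
# splits words. Prints nothing if none are set.
show_word_regex_config() {
    # diff.wordRegex from any config file, along with the file it
    # came from.
    git config --show-origin --get-all diff.wordRegex

    # Per-driver settings, i.e. diff.<driver>.wordRegex. Keys are
    # matched in canonical form, where the variable name is lowercase.
    git config --show-origin --get-regexp '^diff\..*\.wordregex$'

    # The diff driver (if any) that gitattributes assigns to the path
    # given as the first argument.
    if [ -n "$1" ]; then
        git check-attr diff -- "$1"
    fi
    return 0
}
```

Running "show_word_regex_config one" in the affected repository should
show whether a custom regex or a diff driver applies to that file. If
check-attr reports a driver name, the matching diff.<driver>.wordRegex
line is the one to look at.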

-Peff