On Tue, Nov 29, 2022 at 01:23:27PM -0500, Jeff King wrote:

> > I suspect that "--word-diff" internal is not even aware what a
> > character is, but if you assume UTF-8 (precomposed), then you should
> > be able to tell where the character boundary is by only looking at
> > the high-bit patterns to avoid producing such an output.
>
> Agreed that we should probably avoid breaking characters. But what
> puzzles me more is that we break it between B8 and BA, and not
> elsewhere. Why not between E4 and B8? Why not between BA and "1"?
>
> If the rule is "break on ascii whitespace", then I'd have expected the
> whole four-byte sequence to be taken as a unit. In other words, it
> should not have to care what a character is, as long as the bytes
> for space characters cannot appear inside other characters (which is
> true of utf8).

Even more puzzling is that it produces the expected output for me:

  [note that \x is a bash-ism]
  $ printf '\xe4\xb8\xba1' >one
  $ printf '\xe4\xb8\xba2' >two
  $ git diff --no-index --word-diff one two
  diff --git a/one b/two
  index 9ae469fc41..576e6e32d8 100644
  --- a/one
  +++ b/two
  @@ -1 +1 @@
  [-为1-]{+为2+}

I wonder if OP has diff.wordRegex config (or attributes triggering a
diff.*.wordRegex) that is doing something else.

-Peff
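
For illustration, here is a rough Python sketch of the two byte-level points from the quoted text: that UTF-8 character boundaries can be found from the high-bit patterns alone, and that ASCII whitespace bytes cannot occur inside a multi-byte character. It is not how git's word-diff code works; the helper names is_continuation and char_boundaries are made up for this example.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, whose top two bits are 10."""
    return (byte & 0xC0) == 0x80


def char_boundaries(data: bytes) -> list:
    """Offsets where a split does not land inside a multi-byte character."""
    # A character starts at every byte that is not a continuation byte.
    starts = [i for i, b in enumerate(data) if not is_continuation(b)]
    return starts + [len(data)]


# Same bytes as the "one" test file: U+4E3A followed by ASCII "1".
one = b"\xe4\xb8\xba1"

# Offset 2 (between B8 and BA) is not a boundary; 0, 3 and 4 are.
print(char_boundaries(one))  # [0, 3, 4]

# Every byte of a multi-byte sequence has the high bit set, so an ASCII
# whitespace byte (all below 0x80) can never appear inside one.
assert all(b >= 0x80 for b in "为".encode("utf-8"))
```

Under this view, a break between B8 and BA falls between two continuation bytes, so neither a whitespace-based rule nor a character-aware one should produce it on its own.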