On Tue, Nov 29 2022, Ping Yin wrote:

> Result of "git diff"
>
> - 为1
> + 为2
>
> or (if chinese can not be displayed correctly)
>
> - <E4><B8><BA>1
> + <E4><B8><BA>2
>
> Actual result of "git diff --color-words"
>
> <E4><B8>[-<BA>1-]{+<BA>2+}
>
> Expected result of "git diff --color-words"
>
> 为[-1-]{+2+}
>
> or (if chinese can not be displayed correctly)

I think we could provide new ways to do per-language diffs. Right now
you can use --word-diff-regex, but it would be handy to e.g. have a
built-in collection of those (or other non-regex boundary algorithms)
for Chinese etc.

But as for considering this a bug, or changing the existing behavior, I
think we'd need to deal with:

 * We (approximately) split on space now, which is certainly
   ASCII-biased, but outside of CJK fairly universal.

 * If we're going to split on "real words" in some cross-language-aware
   way, are we going to run into conflicts between what different
   languages would consider sensible rules?

 * We probably don't want to make the "diff" dependent on the user's
   locale, but e.g. saying "I want a Chinese diff" via a CLI option
   would be OK.

 * Even for, say, Chinese, there are probably interesting edge cases
   when it's combined with other languages or character sets
   (e.g. Chinese + HTML).
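To make the "expected result" above concrete, here is a minimal sketch of one possible per-language boundary rule, written in Python rather than as a Git patch. It is not how Git splits words; the names TOKEN_RE, tokenize and word_diff are hypothetical, and the rule itself (one token per CJK ideograph in the U+4E00..U+9FFF range, one token per run of ASCII word characters) is only an assumption about what a "Chinese diff" might want.

```python
import difflib
import re

# Hypothetical boundary rule (an assumption, not Git's behavior):
#   - each CJK ideograph in U+4E00..U+9FFF is its own token
#   - a run of ASCII letters, digits or underscores is one token
#   - any other non-whitespace character is its own token
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9_]+|[^\s]")


def tokenize(text):
    """Split text into tokens on code points, never inside a UTF-8 sequence."""
    return TOKEN_RE.findall(text)


def word_diff(old, new):
    """Render a token-level diff using the [-removed-]{+added+} notation.

    Whitespace between tokens is dropped, which is fine for this
    illustration but would not be for a real tool.
    """
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i1 < i2:  # tokens present only in the old text
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if j1 < j2:  # tokens present only in the new text
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


print(word_diff("为1", "为2"))  # 为[-1-]{+2+}
```

Because the tokenizer works on decoded text, the three bytes <E4><B8><BA> always stay together as one character. The conflicts raised in the list above still apply: a rule like this says nothing about multi-character Chinese words, and it would need further thought for mixed content such as Chinese + HTML.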