Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

>> or (if chinese can not be displayed correctly)
>>
>> - <E4><B8><BA>1
>> + <E4><B8><BA>2
>>
>> Actual result of "git diff --color-words"
>>
>> <E4><B8>[-<BA>1-]{+<BA>2+}
>> ...
> I think we could provide new ways to do per-language diffs, right now
> you can use --word-diff-regex, but it would be handy to e.g. have a
> built-in collection of those (or other non-regex boundary algorithms)
> for Chinese etc.

I think you are making this unnecessarily complex.

The only thing that needs noticing in the above example, I think, is
that the three-byte sequence E4-B8-BA is supposed to be a single
Unicode character, and the actual result depicted can happen only if
we (incorrectly) chomp that single character in the middle. No matter
what language we are using, we shouldn't do that.

I suspect that the "--word-diff" internals are not even aware of what a
character is, but if you assume UTF-8 (precomposed), then you should be
able to tell where the character boundaries are just by looking at the
high-bit patterns, and so avoid producing such output.
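
To illustrate the high-bit check, here is a minimal sketch in Python.
It is not Git's code, and the function names are made up for this
example. It assumes the buffer is valid UTF-8, where every continuation
byte has the bit pattern 10xxxxxx, so a split point is safe only if the
byte that follows it is not a continuation byte.

```python
def is_utf8_boundary(buf: bytes, pos: int) -> bool:
    """Return True if splitting buf at byte offset pos does not cut a character.

    Hypothetical helper; assumes buf holds valid UTF-8.
    """
    if pos <= 0 or pos >= len(buf):
        # The start and the end of the buffer are always safe split points.
        return True
    # Continuation bytes look like 10xxxxxx; lead bytes and ASCII do not.
    return (buf[pos] & 0xC0) != 0x80


def snap_to_boundary(buf: bytes, pos: int) -> int:
    """Move pos backwards to the nearest character boundary (hypothetical helper)."""
    while not is_utf8_boundary(buf, pos):
        pos -= 1
    return pos


# The "old" side of the example above: E4 B8 BA followed by ASCII "1".
old = b"\xe4\xb8\xba1"

# The reported output split between B8 and BA, i.e. at offset 2.
print(is_utf8_boundary(old, 2))   # False: BA is a continuation byte
print(snap_to_boundary(old, 2))   # 0: back up to the start of the character
print(is_utf8_boundary(old, 3))   # True: "1" starts a new character
```

With a check like this, the differing region in the example would be
widened to cover the whole of E4-B8-BA plus the digit, instead of
starting in the middle of the character.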