Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese

Jeff King <peff@xxxxxxxx> · Thu, 1 Dec 2022 15:06:48 -0500

On Thu, Dec 01, 2022 at 02:51:29PM +0000, Phillip Wood wrote:

> On 01/12/2022 07:33, Ping Yin wrote:
> > > > If the rule is "break on ascii whitespace",
> > 
> > Is there a way to achieve this: break English by word, and break
> > Chinese by UTF-8 character?
> 
> You could extend your current regex so that it also matches whole UTF-8
> multibyte sequences, which is what git does for the builtin userdiff
> regexes. I've not tested it, but I think
> 
>   git config --global diff.wordregex \
>     "[[:alnum:]_]+|[^[:space:]]|$(printf '[\xc0-\xff][\x80-\xbf]+')"
> 
> should work. The downside is that you end up with a .gitconfig that is not
> valid utf-8. Perhaps someone else has a clever idea to get around that.
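
To make the byte-level idea concrete, here is a rough Python sketch of how
a pattern like that splits mixed text. It is only an approximation, not
what Git runs: the POSIX classes are replaced with ASCII-only equivalents,
and the multibyte alternative is moved ahead of the single-byte catch-all
because Python's re takes the first alternative that matches, whereas POSIX
regexes prefer the longest match. The names WORD_RE and split_words are
made up for the example.

```python
import re

# Approximation of the suggested word regex, applied to raw UTF-8 bytes.
# Assumptions: [[:alnum:]] and [[:space:]] are replaced by ASCII-only
# classes, and the multibyte alternative comes before the catch-all
# because Python's re is first-match, not longest-match.
WORD_RE = re.compile(rb"[A-Za-z0-9_]+|[\xc0-\xff][\x80-\xbf]+|[^ \t\n\r\f\v]")


def split_words(text):
    """Return the word tokens the byte-level pattern finds in text."""
    data = text.encode("utf-8")
    # Each match is either an ASCII word, one complete multibyte
    # sequence (lead byte plus continuation bytes), or one other byte.
    return [m.group().decode("utf-8") for m in WORD_RE.finditer(data)]


print(split_words("hello 世界, ok"))
```

With that ordering, each Chinese character comes out as its own token
while the ASCII words stay whole.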

I think more advanced regular expression engines let you match code
point ranges like "[\x{4e00}-\x{9fcc}]", or even script properties like
"\p{Han}". But I don't know that the stock libc regex is capable of
anything like this, even with EREs. The libc engine is the only one Git
supports for word regexes today, but in theory we could support libpcre.
We can already optionally build against it; we would just need
config/plumbing to get it into diff.c:find_word_boundaries().
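
For a feel of what code-point-aware matching buys you, here is a small
sketch using Python's re module on decoded text. This is just an
illustration of the idea in a different engine, not anything Git or
libpcre does; TOKEN_RE and tokens are hypothetical names. The range is the
U+4E00..U+9FCC one from above. Python's stdlib re has no \p{Han}, so the
sketch sticks to the explicit range.

```python
import re

# Works on decoded text (str), so the class is expressed in code points
# rather than bytes. Alternatives, tried in order:
#   1. one character in U+4E00..U+9FCC
#   2. a run of ASCII word characters
#   3. any other single non-whitespace character
TOKEN_RE = re.compile(r"[\u4e00-\u9fcc]|[A-Za-z0-9_]+|\S")


def tokens(text):
    """Split text into per-character Han tokens and ASCII words."""
    return TOKEN_RE.findall(text)


print(tokens("git 差异 v2!"))
```

No byte-level tricks are needed here, and the pattern itself stays plain
ASCII.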

-Peff