On Thu, Dec 01, 2022 at 02:51:29PM +0000, Phillip Wood wrote:

> On 01/12/2022 07:33, Ping Yin wrote:
> >
> > If the rule is "break on ascii whitespace",
> >
> > Is there a way to achieve this: break english by word, and break
> > chinese by utf-8 character
>
> You could extend your current regex so that it matches whole utf-8
> codepoints which is what git does for the builtin userdiff regexes. I've not
> tested it but I think
>
>     git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf '[\xc0-\xff][\x80-\xbf]+')"
>
> should work. The downside is that you end up with a .gitconfig that is not
> valid utf-8. Perhaps someone else has a clever idea to get around that.

I think in more advanced regular expression engines you can do stuff
like matching "[\x{4e00}-\x{9fcc}]", or even "\p{Han}". But I don't know
that the stock libc regex is capable of anything like this, even with
EREs.

That's the only option Git provides for matching word regexes, but in
theory we could support libpcre. We already can optionally build against
it; we would just need config/plumbing to get it into
diff.c:find_word_boundaries().

-Peff
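
For illustration only, here is a rough Python sketch of what the
byte-level pattern quoted above does when applied to UTF-8 input. It is
an approximation, not Git code: Python's re module has no POSIX classes
such as [[:alnum:]], so plain ASCII ranges stand in for them, and since
Python picks the first matching alternative rather than the longest one
(as POSIX EREs do), the multi-byte branch is listed before the
single-byte fallback. The names WORD_RE and split_words are made up for
this example.

```python
import re

# Byte-level approximation of the word regex suggested above.
# Assumption: ASCII ranges replace the POSIX classes, which Python lacks.
WORD_RE = re.compile(
    rb"[A-Za-z0-9_]+"             # run of ASCII word characters
    rb"|[\xc0-\xff][\x80-\xbf]+"  # UTF-8 lead byte plus its continuation bytes
    rb"|[^\s]"                    # any other single non-whitespace byte
)

def split_words(text):
    """Split text into 'words' by matching the pattern against UTF-8 bytes."""
    data = text.encode("utf-8")
    return [m.group().decode("utf-8") for m in WORD_RE.finditer(data)]

print(split_words("hello 世界 foo_bar!"))
# ['hello', '世', '界', 'foo_bar', '!']
```

English is split by word and each Chinese character becomes its own
token, because every multi-byte sequence is consumed as a single unit
instead of being broken into individual bytes.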