UTF-8-safe way for char-level-diff

Git has a diff.wordRegex config that allows the user to specify a
regex that defines a word. Setting diff.wordRegex to "." works well
for a char-level diff for ASCII chars, but not for UTF-8 chars.

For example, if a file (encoded in UTF-8) with text "一人" is changed to
"丁人", "git diff --word-diff=color" shows "<E4><B8><80><81>人" (where
"<80>" is red and "<81>" is green) instead of the desired "一丁人" (where
"一" is red and "丁" is green). The regex "." matches one byte at a time,
so the diff splits the multi-byte characters. This is very annoying when
diffing files containing CJK chars.
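
To show where those bytes come from, here is a small Python sketch (not
part of Git, just an illustration) that prints the UTF-8 encoding of the
two characters from the example. They share the first two bytes and differ
only in the third, which is the byte a byte-level diff highlights.

```python
# Illustration only: compare the UTF-8 bytes of the two characters.
old, new = "一", "丁"

old_bytes = old.encode("utf-8")
new_bytes = new.encode("utf-8")

print(old_bytes.hex(" "))  # e4 b8 80
print(new_bytes.hex(" "))  # e4 b8 81

# Byte positions at which the two encodings differ.
differing = [i for i, (a, b) in enumerate(zip(old_bytes, new_bytes)) if a != b]
print(differing)  # [2]
```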

Git's diff.wordRegex seems to implement a very basic regex that doesn't
support matching a char range by byte value, such as "\x41" for "A". Is
there a way to make the char-level diff work correctly? If not, maybe
we should implement a way to allow it.
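
To make the idea concrete, here is a rough Python sketch of what a
byte-range pattern could do if it were available. It uses Python's re
module, not Git's regex engine, and says nothing about what
diff.wordRegex accepts. The names UTF8_CHAR, split_chars and split_bytes
are made up for this example, and the pattern does not validate UTF-8; it
only groups a lead byte with the continuation bytes that follow it.

```python
import re

# Hypothetical pattern: a single ASCII byte, or a UTF-8 lead byte
# (0xC0-0xFF) followed by one or more continuation bytes (0x80-0xBF).
UTF8_CHAR = re.compile(rb"[\x00-\x7f]|[\xc0-\xff][\x80-\xbf]+")


def split_chars(data):
    """Split bytes into tokens, one per UTF-8 encoded character."""
    return UTF8_CHAR.findall(data)


def split_bytes(data):
    """Split bytes the way a byte-level "." does: one token per byte."""
    return re.findall(rb".", data, re.DOTALL)


text = "一人".encode("utf-8")
print(len(split_bytes(text)))                            # 6
print([t.decode("utf-8") for t in split_chars(text)])    # ['一', '人']
```

With tokens like these, "一" and "丁" would be compared as whole
characters rather than as three separate bytes each.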



