On Mar 30, 2015, at 15:16, Jeff King wrote:

> Yeah, I agree the current output is not ideal, and this should address
> the problem. I was worried that multi-byte splitting would make things
> slower, but in my tests, it actually speeds things up!
[...]
> Unfortunately, String::Multibyte is not a standard module, and is not
> even packed for Debian systems (I got mine from CPAN). Can we make this
> a conditional include (e.g., 'eval "require String::Multibyte"' in
> get_mbcs, and return undef if that fails?). Then people without it can
> still use the script.
[...]
> Yuck. This is a lot more intimate with String::Multibyte's
> implementation than I'd like to be.

So I was curious about this and played with it and was able to reproduce
the problem as described.

Here's an alternate fix that should work for everyone with Perl 5.8 or
later.

-Kyle

-- 8< --
Subject: [PATCH v2] diff-highlight: do not split multibyte characters

When the input is UTF-8 and Perl is operating on bytes instead of
characters, a diff that changes one multibyte character to another
that shares an initial byte sequence will result in a broken diff
display as the common byte sequence prefix will be separated from
the rest of the bytes in the multibyte character.

For example, if a single line contains only the Unicode character
U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that line is then
changed to the Unicode character U+C9C0 (encoded as UTF-8 0xEC,
0xA7, 0x80), when operating on bytes diff-highlight will show only
the single byte change from 0x84 to 0x80 thus creating invalid UTF-8
and a broken diff display.

Fix this by putting Perl into character mode when splitting the line
and then back into byte mode after the split is finished.

While the utf8::xxx functions are built-in and do not require any
'use' statement, the utf8::is_utf8 function did not appear until Perl
5.8.1, but is identical to the Encode::is_utf8 function which is
available in 5.8 so we use that instead of utf8::is_utf8.

Reported-by: Yi EungJun <semtlenori@xxxxxxxxx>
Signed-off-by: Kyle J. McKay <mackyle@xxxxxxxxx>
---
 contrib/diff-highlight/diff-highlight | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 08c88bbc..8e9b5ada 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,7 @@
 
 use warnings FATAL => 'all';
 use strict;
+use Encode ();
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -164,8 +165,10 @@ sub highlight_pair {
 
 sub split_line {
 	local $_ = shift;
-	return map { /$COLOR/ ? $_ : (split //) }
-	       split /($COLOR*)/;
+	utf8::decode($_);
+	return map { utf8::encode($_) if Encode::is_utf8($_); $_ }
+		map { /$COLOR/ ? $_ : (split //) }
+		split /($COLOR*)/;
 }
 
 sub highlight_line {
---
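
For anyone who wants to see the effect outside of diff-highlight, here
is a minimal standalone sketch of the difference between the two
splitting strategies, using the U+C9C4 / U+C9C0 example from the commit
message. The helper names split_bytes and split_chars are hypothetical
and exist only for this illustration; they are not part of
diff-highlight, and the sketch leaves out the $COLOR handling that the
real split_line does.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: split a byte string into single bytes, which is
# roughly what the old split_line did for non-color text.
sub split_bytes {
    my ($bytes) = @_;
    return split //, $bytes;
}

# Hypothetical helper mirroring the approach in the patch: decode to
# characters, split, then re-encode each piece back to UTF-8 bytes.
sub split_chars {
    my ($bytes) = @_;
    utf8::decode($bytes);    # string is left as-is if it is not valid UTF-8
    return map { utf8::encode($_) if Encode::is_utf8($_); $_ }
           split //, $bytes;
}

my $old = "\xEC\xA7\x84";    # U+C9C4 as UTF-8 bytes
my $new = "\xEC\xA7\x80";    # U+C9C0 as UTF-8 bytes

my @old_bytes = split_bytes($old);
my @old_chars = split_chars($old);

# Byte mode yields three pieces, so a pairwise comparison with $new would
# treat 0xEC 0xA7 as a common prefix; character mode yields one piece.
printf "bytes: %d pieces, chars: %d piece(s)\n",
    scalar @old_bytes, scalar @old_chars;
```

With byte splitting the first two pieces of $old and $new compare equal,
so only the last byte would be highlighted; with character splitting each
line is a single piece and the whole character is treated as changed.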