Re: [PATCH] diff-highlight: Fix broken multibyte string

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mar 30, 2015, at 15:16, Jeff King wrote:

> Yeah, I agree the current output is not ideal, and this should address
> the problem. I was worried that multi-byte splitting would make things
> slower, but in my tests, it actually speeds things up!

[...]

> Unfortunately, String::Multibyte is not a standard module, and is not
> even packed for Debian systems (I got mine from CPAN). Can we make
> this
> a conditional include (e.g., 'eval "require String::Multibyte"' in
> get_mbcs, and return undef if that fails?). Then people without it can
> still use the script.

[...]

> Yuck. This is a lot more intimate with String::Multibyte's
> implementation than I'd like to be.

So I was curious about this and played with it and was able to
reproduce the problem as described.

Here's an alternate fix that should work for everyone with Perl 5.8
or later.

-Kyle

-- 8< --
Subject: [PATCH v2] diff-highlight: do not split multibyte characters

When the input is UTF-8 and Perl is operating on bytes instead
of characters, a diff that changes one multibyte character to
another that shares an initial byte sequence will result in a
broken diff display as the common byte sequence prefix will be
separated from the rest of the bytes in the multibyte character.

For example, if a single line contains only the unicode
character U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that
line is then changed to the unicode character U+C9C0 (encoded as
UTF-8 0xEC, 0xA7, 0x80), when operating on bytes diff-highlight
will show only the single byte change from 0x84 to 0x80 thus
creating invalid UTF-8 and a broken diff display.

Fix this by putting Perl into character mode when splitting the
line and then back into byte mode after the split is finished.

While the utf8::xxx functions are built-in and do not require
any 'use' statement, the utf8::is_utf8 function did not appear
until Perl 5.8.1, but is identical to the Encode::is_utf8
function which is available in 5.8 so we use that instead of
utf8::is_utf8.

Reported-by: Yi EungJun <semtlenori@xxxxxxxxx>
Signed-off-by: Kyle J. McKay <mackyle@xxxxxxxxx>
---
 contrib/diff-highlight/diff-highlight | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 08c88bbc..8e9b5ada 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,7 @@
 
 use warnings FATAL => 'all';
 use strict;
+use Encode ();
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -164,8 +165,10 @@ sub highlight_pair {
 
 sub split_line {
 	local $_ = shift;
-	return map { /$COLOR/ ? $_ : (split //) }
-	       split /($COLOR*)/;
+	utf8::decode($_);
+	return map { utf8::encode($_) if Encode::is_utf8($_); $_ }
+		map { /$COLOR/ ? $_ : (split //) }
+		split /($COLOR*)/;
 }
 
 sub highlight_line {
---
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]