Re: [PATCH] contrib/diff-highlight: multibyte characters diff

Jeff King <peff@xxxxxxxx> · Wed, 12 Feb 2014 15:59:48 -0500

On Tue, Feb 11, 2014 at 06:09:10PM +0900, Yoshihiro Sugi wrote:

> diff-highlight split each hunks and compare them as byte sequences.
> it causes problems when diff hunks include multibyte characters.
> This change enable to work on such cases by decoding inputs and encoding output as utf8 string.

Thanks for looking at this. I didn't consider multibyte characters at
all when I wrote the original.

Sadly, applying your patch seems to cause diff-highlight to take about
1.5x as much CPU (I just tried it on the complete output of "git log
--all -p" in my git.git repository, which went from ~60s to ~90s). That
is not necessarily a deal-breaker, as it is more important to be
correct. But I wonder if there is any way we can optimize it. From my
understanding, it may not be the encoding/decoding itself, but rather
that the utf8-aware text routines are slower. More on that below.

>  # Highlight by reversing foreground and background. You could do
>  # other things like bold or underline if you prefer.
> @@ -15,8 +16,9 @@ my @added;
>  my $in_hunk;
>  
>  while (<>) {
> +	$_ = decode_utf8($_);

What happens when the diff content is not utf8 at all? The default
failure mode for decode_utf8 seems to be to replace non-utf8 sequences
with a substitution character. Which means we would corrupt something
like iso8859-1, which works fine in the current code.

>  	if (!$in_hunk) {
> -		print;
> +		print encode_utf8($_);

And we have the opposite question here. We know that what we have in the
string is perl's internal format. But how do we know that what the
user's terminal wants is utf8?

I think in the most general case, we would need to auto-detect the
encoding for each string (since there is no guarantee that a file even
retains the same encoding through its history). And then probably punt on
highlighting when comparing two lines in different encodings, and if
they are in the same encoding, treat them as a sequence of code points
rather than octets when comparing.

That's probably way too slow and complicated to worry about. And
frankly, I am OK with making the tool work _best_ with UTF-8, as long as
it can fail gracefully for other encodings (i.e., if you can turn off
the utf8 magic when you have another encoding, and we will pass it
through blindly as octets).

Would it be enough to do something like the patch below? The specific
things I am shooting for are:

  - the only part of the code that cares about the sequence is the
    split/highlight bit, and most lines do not get highlighted at all.
    Therefore it tries to lazily do the decode step, which gets back our
    lost performance (I measured ~12% slowdown).

  - we try the utf8 decode and if it fails, fall back to treating it
    like a sequence of bytes. That doesn't help people with other
    multibyte encodings, but at least lets single-byte encodings besides
    utf8 continue to work.

I constructed a very simple test of:

  git init &&
  printf 'cont\xc3\xa9nt\n' >file &&
  git add file &&
  printf 'cont\xc3\xb6nt\n' >file &&
  git diff

and it seemed to work. Can you confirm that it does the right thing on
your more complicated cases?

-Peff

---

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index c4404d4..9447ba2 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,7 @@
 
 use warnings FATAL => 'all';
 use strict;
+use Encode qw(decode_utf8 encode_utf8);
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -73,13 +74,23 @@ sub show_hunk {
 
 	my @queue;
 	for (my $i = 0; $i < @$a; $i++) {
-		my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
-		print $rm;
-		push @queue, $add;
+		my ($a_dec, $encode_rm) = decode($a->[$i]);
+		my ($b_dec, $encode_add) = decode($b->[$i]);
+		my ($rm, $add) = highlight_pair($a_dec, $b_dec);
+		print $encode_rm->($rm);
+		push @queue, $encode_add->($add);
 	}
 	print @queue;
 }
 
+sub decode {
+	my $orig = shift;
+	my $decoded = eval { decode_utf8($orig, Encode::FB_CROAK) };
+	return defined $decoded ?
+	       ($decoded, sub { encode_utf8(shift) }) :
+	       ($orig, sub { shift });
+}
+
 sub highlight_pair {
 	my @a = split_line(shift);
 	my @b = split_line(shift);
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html