Re: [PATCH v2 7/8] gitweb: Highlight interesting parts of diff

Jakub Narebski <jnareb@xxxxxxxxx> · Thu, 29 Mar 2012 20:42:20 +0100

On Fri, 23 Mar 2012, Michał Kiedrowicz wrote:

> Reading diff output is sometimes very hard, even if it's colored,
> especially if lines differ only in few characters.  This is often true
> when a commit fixes a typo or renames some variables or functions.
> 
> This commit teaches gitweb to highlight characters that are different
> between old and new line with a light green/red background.  This should
> work in the similar manner as in Trac or GitHub.
> 
> The algorithm that compares lines is based on contrib/diff-highlight.
> Basically, it works by determining common prefix/suffix of corresponding
> lines and highlightning only the middle part of lines.  For more
> information, see contrib/diff-highlight/README.
>
Nice description.

> Combined diffs are not supported but a following commit will change it.
> 
O.K.

> Two changes in format_diff_line() were needed to allow diff refinement
> highlightning to work.  Firstly, format_diff_line() was taught to not
> esc_html() line that was passed as a reference.  This was needed because
> refining the highlight must be performed on unescaped lines and it uses
> a <span> element to mark interesting parts of the line.

Well, actually we just need to make format_diff_line to not esc_html
passed line which is already esc_html'ed or esc_html_hl_regions'ed,
i.e. to avoid double escaping.  Passing line as reference if it is
to be treated as HTML is one possibility, and I think it is a good
convention to start to use.

>                                                         Secondly, the 
> lines are untabified before comparing because calling untabify()
> after inserting <span>'s could improperly convert tabs to spaces.

Well, it is actually because untabify() must work on original text to
work correctly, i.e. to convert tabs to spaces according to tab stops.
It must precede esc_html'ing, and even more esc_html_hl_regions'ing.

> 
> Signed-off-by: Michał Kiedrowicz <michal.kiedrowicz@xxxxxxxxx>

Anyway this is a good change, much cleaner than previous version

  Acked-by: Jakub Narębski <jnareb@xxxxxxxxx>

> ---
>  gitweb/gitweb.perl       |   84 ++++++++++++++++++++++++++++++++++++++++++----
>  gitweb/static/gitweb.css |    8 ++++
>  2 files changed, 85 insertions(+), 7 deletions(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index db32588..872ba12 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -2426,14 +2426,14 @@ sub format_cc_diff_chunk_header {
>  }
>  
>  # process patch (diff) line (not to be used for diff headers),
> -# returning HTML-formatted (but not wrapped) line
> +# returning HTML-formatted (but not wrapped) line.
> +# If the line is already esc_html()'ed, pass it as a reference.

Errr... I think the explanation here must be in reverse:

  +# If the line is passed as a reference, it is treated as HTML,
  +# and not esc_html()'ed.

>  sub format_diff_line {
>  	my ($line, $diff_class, $from, $to) = @_;
>  
> -	chomp $line;
> -	$line = untabify($line);
> -
> -	if ($from && $to && $line =~ m/^\@{2} /) {
> +	if (ref($line)) {
> +		$line = $$line;
> +	} elsif ($from && $to && $line =~ m/^\@{2} /) {
>  		$line = format_unidiff_chunk_header($line, $from, $to);
>  	} elsif ($from && $to && $line =~ m/^\@{3}/) {
>  		$line = format_cc_diff_chunk_header($line, $from, $to);
> @@ -5054,10 +5054,80 @@ sub print_inline_diff_lines {
>  	print foreach (@$add);
>  }
>  
> +# Format removed and added line, mark changed part and HTML-format them.
> +# Impementation is based on contrib/diff-highlight
> +sub format_rem_add_line {

Wouldn't a better name be format_rem_add_pair() or format_rem_add_lines(),
or format_rem_add_lines_pair()?

> +	my ($rem, $add) = @_;
> +	my @rem = split(//, $rem);
> +	my @add = split(//, $add);
> +	my ($esc_rem, $esc_add);
> +	# Ignore +/- character, thus $prefix_len is set to 1.
> +	my ($prefix_len, $suffix_len) = (1, 0);

I wonder if it wouldn't be slightly cleaner to count $prefix_len from
+/- rather than start of line.  It means

  +	# Prefix does not count initial +/- character.
  +	my ($prefix_len, $suffix_len) = (0, 0);

Perhaps even remove initial +/- from @add and @rem.

  +	shift @rem;
  +	shift @add;

Ehh... now I see that starting $prefix_len with 1 might be not so bad
idea after all.

> +	my ($prefix_is_space, $suffix_is_space) = (1, 1);

It is not entirely true: $prefix_is_space is really $prefix_is_space_or_empty.
It might be a better idea to use

  +	my ($prefix_has_nonspace, $suffix_has_nonspace);

using the fact that undefined value is false.

> +
> +	while ($prefix_len < @rem && $prefix_len < @add) {
> +		last if ($rem[$prefix_len] ne $add[$prefix_len]);
> +
> +		$prefix_is_space = 0 if ($rem[$prefix_len] !~ /\s/);

  +		$prefix_has_nonspace = 1 if ($rem[$prefix_len] =~ /\s/);

> +		$prefix_len++;
> +	}
> +
> +	my $shorter = (@rem < @add) ? @rem : @add;
> +	while ($prefix_len + $suffix_len < $shorter) {
> +		last if ($rem[-1 - $suffix_len] ne $add[-1 - $suffix_len]);
> +
> +		$suffix_is_space = 0 if ($rem[-1 - $suffix_len] !~ /\s/);

  +		$suffix_has_nonspace = 1 if ($rem[-1 - $suffix_len] =~ /\s/);

> +		$suffix_len++;
> +	}
> +
> +	# Mark lines that are different from each other, but have some common
> +	# part that isn't whitespace.  If lines are completely different, don't
> +	# mark them because that would make output unreadable, especially if
> +	# diff consists of multiple lines.
> +	if (($prefix_len == 1 && $suffix_len == 0) ||
> +	    ($prefix_is_space && $suffix_is_space)) {

Actually ($prefix_is_space && $suffix_is_space) is enough, but cryptic.
With $prefix_is_space (== $prefix_is_space_or_empty) -> $prefix_has_nonspace
it would be

  +	if (not ($prefix_has_nonspace || $suffix_has_nonspace)) {

Anyway, it is not as if it is not clear enough.

> +		$esc_rem = esc_html($rem, -nbsp=>1);
> +		$esc_add = esc_html($add, -nbsp=>1);
> +	} else {
> +		$esc_rem = esc_html_hl_regions($rem, 'marked', [$prefix_len, @rem - $suffix_len], -nbsp=>1);
> +		$esc_add = esc_html_hl_regions($add, 'marked', [$prefix_len, @add - $suffix_len], -nbsp=>1);

  +		$esc_rem = esc_html_hl_regions($rem, 'marked', [$prefix_len+1, @rem+1 - $suffix_len], -nbsp=>1);
  +		$esc_add = esc_html_hl_regions($add, 'marked', [$prefix_len+1, @add+1 - $suffix_len], -nbsp=>1);

So probably not worth it.

> +	}
> +
> +	return format_diff_line(\$esc_rem, 'rem'),
> +	        format_diff_line(\$esc_add, 'add');

  +	return format_diff_line(\$esc_rem, 'rem'),
  +	       format_diff_line(\$esc_add, 'add');

> +}
> +
> +# HTML-format diff context, removed and added lines.
> +sub format_ctx_rem_add_lines {
> +	my ($ctx, $rem, $add, $is_combined) = @_;
> +	my (@new_ctx, @new_rem, @new_add);
> +
> +	# Highlight if every removed line has a corresponding added line.
> +	# Combined diffs are not supported ATM.

ATM == at this moment?  Please say so.

> +	if (!$is_combined && @$add > 0 && @$add == @$rem) {
> +		for (my $i = 0; $i < @$add; $i++) {
> +			my ($line_rem, $line_add) = format_rem_add_line(
> +			        $rem->[$i], $add->[$i]);

  +			my ($line_rem, $line_add) =
  +			        format_rem_add_line($rem->[$i], $add->[$i]);

> +			push @new_rem, $line_rem;
> +			push @new_add, $line_add;
> +		}
> +	} else {
> +		@new_rem = map { format_diff_line($_, 'rem') } @$rem;
> +		@new_add = map { format_diff_line($_, 'add') } @$add;
> +	}
> +
> +	@new_ctx = map { format_diff_line($_, 'ctx') } @$ctx;
> +
> +	return (\@new_ctx, \@new_rem, \@new_add);
> +}

Nice.

> +
>  # Print context lines and then rem/add lines.
>  sub print_diff_lines {
>  	my ($ctx, $rem, $add, $diff_style, $is_combined) = @_;
>  
> +	($ctx, $rem, $add) = format_ctx_rem_add_lines($ctx, $rem, $add,
> +		$is_combined);
> +

  +	($ctx, $rem, $add) = format_ctx_rem_add_lines($ctx, $rem, $add,	$is_combined);
  +

Unless the line is too long.

>  	if ($diff_style eq 'sidebyside' && !$is_combined) {
>  		print_sidebyside_diff_lines($ctx, $rem, $add);
>  	} else {
> @@ -5089,11 +5159,11 @@ sub print_diff_chunk {
>  	foreach my $line_info (@chunk) {
>  		my ($class, $line) = @$line_info;
>  
> -		$line = format_diff_line($line, $class, $from, $to);
> +		$line = untabify($line);

O.K. you move untabify() out of format_diff_line(), and must now
ensure that caller uses it before calling format_diff_line() or 
format_ctx_rem_add_lines() (which uses format_diff_line() itself).

I wonder if leaving untabify() (and chomp) in format_diff_line(),
but not running it if passed reference, and running untabify() in
format_ctx_rem_add_lines() wouldn't be a better, less fragile code.

>  
>  		# print chunk headers
>  		if ($class && $class eq 'chunk_header') {
> -			print $line;
> +			print format_diff_line($line, $class, $from, $to);

O.K., only 'chunk_header' are not formatted.

>  			next;
>  		}

[I have to take a look how final version of print_diff_lines() looks
 like in this commit; the patch alone is not enough]

Right, so we format just before printing, and print_* do formatting
itself before printing.  Nice and clean.

>  
> diff --git a/gitweb/static/gitweb.css b/gitweb/static/gitweb.css
> index c530355..3c4a3c9 100644
> --- a/gitweb/static/gitweb.css
> +++ b/gitweb/static/gitweb.css
> @@ -438,6 +438,10 @@ div.diff.add {
>  	color: #008800;
>  }
>  
> +div.diff.add span.marked {
> +	background-color: #77ff77;
> +}
> +
>  div.diff.from_file a.path,
>  div.diff.from_file {
>  	color: #aa0000;
> @@ -447,6 +451,10 @@ div.diff.rem {
>  	color: #cc0000;
>  }
>  
> +div.diff.rem span.marked {
> +	background-color: #ff7777;
> +}
> +
>  div.diff.chunk_header a,
>  div.diff.chunk_header {
>  	color: #990099;
> -- 

Nice styling.

-- 
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html