Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Thu, 16 Jun 2016 23:10:57 +0200

On 06/16/2016 07:46 PM, Stefan Beller wrote:
> The compaction heuristic for diffs was deemed quite good, but in 5580b27
> we have an example demonstrates a common case, that fails with the existing
> heuristic. That is why we disabled the heuristic in the v2.9.0 release.
> 
> With this patch a diff looks like:
> 
>          def bar
>                  do_bar_stuff()
> 
>                  common_ending()
>          end
> 
> +        def bal
> +                do_bal_stuff()
> +
> +                common_ending()
> +        end
> +
>          def baz
>                  do_baz_stuff()
> 
>                  common_ending()
>          end
> 
> whereas before we had:
> 
>   WIP (TODO: ask peff to provide an example that actually triggers, I seem to be
>        unable to write one, though I thought the above was one)
> 
> 
> The way we do it, is by inspecting the neighboring lines and see how
> much indent there is and we choose the line that has the shortest amount
> of blanks in the neighboring lines.
> 
> (TODO: update comments in the file)
> 
> Signed-off-by: Stefan Beller <sbeller@xxxxxxxxxx>
> ---
> 
> Hi Jeff, hi Michael,
> 
> thanks for pointing out the flaw in this ruby code base before the 2.9 release.
> I think this patch would fix your finding, though I cannot reproduce it.
> Can you provide an example repo/patch that looks ugly on current origin/master,
> so I can be sure this fixes the issue?
> 
> This can also be a start of a discussion if this is a too short-sighted enhancement
> of the heuristic. Essentially we'd want to prefer 
> 
> +        end
> +
> +        def bal
> 
> over:
> 
> +                do_bal_stuff()
> +
> +                common_ending()
> 
> because there is less space between line start and {end, def bal}
> than for {do_bal_stuff, common_ending}.

It's really cool that you are trying to implement the indentation
heuristic. I'm curious how it works in practice (something we can
probably only determine by applying it to a corpus of diffs to see how
often it outperforms/underperforms the other possible approaches).

The heuristic I proposed was

* Prefer to start and end hunks *following* lines with the least
  indentation.

* Define the "indentation" of a blank line to be the indentation of
  the previous non-blank line minus epsilon.

* In the case of a tie, prefer to slide the hunk down as far as
  possible.

If that's what you are trying to implement, then notice that the first
rule says that the hunk should start *following* the line with the least
indentation. So the "score" for

> +        end
> +
> +        def bal

would be the indentation of the line preceding "end", which is larger
than the indentation of "end". And the score for

> +                do_bal_stuff()
> +
> +                common_ending()

would come from the line preceding "do_bal_stuff()", namely "def bal()",
which is indented the same as "end". So the latter would be preferred.

But actually, I don't understand how your example is meaningful. I think
this heuristic is only used to slide hunks around; i.e., when the line
following the hunk is the same as the first lines of the hunk, or when
the line preceding the hunk is the same as the last line of the hunk.
Right? So your snippets would never compete against each other. Let's
take the full example. The five possible placements for the `+`
characters are marked with the digits 1 through 5 here:

>              def bar
>                      do_bar_stuff()
> 1
> 12                   common_ending()
> 123          end
> 1234
> 12345        def bal
> 12345                do_bal_stuff()
>  2345
>   345                common_ending()
>    45        end
>     5
>              def baz
>                      do_baz_stuff()
>
>                      common_ending()
>              end

Of the lines preceding the candidate hunks, the blank line preceding the
hunk marked "5" has the smallest indent, namely "epsilon" less than the
indentation of the "end" line preceding it. So placement "5" would be
chosen.

I don't know enough about this area of the code to review your patch in
detail, but I did note below a typo that jumped out at me.

Michael

>  xdiff/xdiffi.c | 41 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index b3c6848..58adc74 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> [...]
> @@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  			 * the group.
>  			 */
>  			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
> -				blank_lines += is_blank_line(recs, ix, flags);
> +				if (is_blank_line(recs, ix, flags)) {
> +					unsigned int bl_neigh_indent =
> +						surrounding_leading_blank(recs, ix, flags, nrec);
> +					if (min_bl_neigh_indent > bl_neigh_indent)
> +						min_bl_neigh_indent = min_bl_neigh_indent;

The line above has no effect (same variable on both sides of the =).

> +					blank_lines++;
> +				}
>  
>  				rchg[ixs++] = 0;
>  				rchg[ix++] = 1;
> @@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>  			while (ixs > 0 &&
>  			       !is_blank_line(recs, ix - 1, flags) &&
> +			       surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent &&
>  			       recs_match(recs, ixs - 1, ix - 1, flags)) {
>  				rchg[--ixs] = 1;
>  				rchg[--ix] = 0;
> 

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html