Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm

Phillip Wood <phillip.wood123@xxxxxxxxx> · Wed, 15 Feb 2023 14:44:59 +0000

Hi Peff

Thanks for the counter example

On 11/02/2023 01:59, Jeff King wrote:
> Just a small counter-point, since I happened to be looking at myers vs
> patience for something elsewhere in the thread, but:
>
>    git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c
>
> looks slightly better to me with myers, even though it is 2 lines
> longer. The issue is that patience and histogram are very eager to use
> blank lines as anchor points, so a diff like:
>
>    -some words
>    -
>    -and some more
>    +unrelated content
>    +
>    +but it happens to also be two paragraphs
>
> in myers becomes:
>
>    -some words
>    +unrelated content
>
>    -and some more
>    +but it happens to also be two paragraphs
>
> in patience (here I'm using single lines, but in practice these may be
> paragraphs, or stanzas of code). I think that's also the _strength_ of
> patience in many cases, but it really depends on the content.

Indeed. Ironically, as there are no unique context lines in that example,
the blank lines are being matched by the patience implementation falling
back to the myers algorithm. Normally the myers implementation tries to 
avoid matching common context lines between two blocks of changed lines 
but I think because in this case it is only called on a small part of 
the file the blank lines are not common enough to trigger that 
heuristic. I've got a patch[1] that stops the patience implementation 
falling back to the myers algorithm and just trims any leading and 
trailing context. On the whole I think it gives more readable diffs,
but I've not got any systematic data to back that up. I also suspect
there are pathological cases, such as each line in the file being
duplicated, where falling back to the myers algorithm gives a much
better result.
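
A rough Python sketch of the two ideas above, for illustration only. It
is not git's xdiff code and not a transcription of the patch in [1]. The
function names are hypothetical, and the trimming step assumes that
"trims any leading and trailing context" means dropping the lines that
match at the start and end of the two sides.

```python
from collections import Counter


def unique_common_lines(old, new):
    """Lines occurring exactly once on each side: patience anchor candidates."""
    old_counts, new_counts = Counter(old), Counter(new)
    return [line for line in old
            if old_counts[line] == 1 and new_counts[line] == 1]


def trim_common_context(old, new):
    """Drop matching leading and trailing lines; return the changed middles."""
    start = 0
    while start < len(old) and start < len(new) and old[start] == new[start]:
        start += 1
    end = 0
    while (end < len(old) - start and end < len(new) - start
           and old[-1 - end] == new[-1 - end]):
        end += 1
    return old[start:len(old) - end], new[start:len(new) - end]


# Toy hunk from the quoted example: the one blank line is unique on both
# sides, so it qualifies as an anchor.
old = ["some words", "", "and some more"]
new = ["unrelated content", "", "but it happens to also be two paragraphs"]
anchors = unique_common_lines(old, new)

# With a second blank line nothing is unique any more. This is the
# situation in which the existing implementation falls back to myers.
old_repeated = old + ["", "a third stanza"]
new_repeated = new + ["", "another third stanza"]
no_anchors = unique_common_lines(old_repeated, new_repeated)

# Trimming removes only the shared first and last lines and leaves the
# blank line in the middle unmatched.
trimmed = trim_common_context(["head", "x", "", "y", "tail"],
                              ["head", "p", "", "q", "tail"])
```

Here anchors holds just the blank line, no_anchors is empty, and
trimmed keeps the three middle lines of each side.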

> Replacing
> a multi-stanza block with another one may be the best explanation for
> what happened. Or the two stanzas may be independent, and showing the
> change for each one may be better.
>
> I'm not sure which one happens more often. And you'd probably want to
> weight it by how good/bad the change is. In the example I showed I don't
> find patience very much worse, since it's already a pretty ugly diff.
> But in cases where patience shines, it may be making things
> significantly more readable.

I agree that having some data would be useful if we're going to change
the default, but collecting it would entail quite a bit of work and, as
the scoring is subjective, we'd want a few people doing it. It's great
that someone has done that for the histogram algorithm in the paper
Elijah cited.

> I don't have a super strong opinion, but I just wanted to chime in that
> it is not clear to me that patience/histogram is always a win over myers
> (yes, I know your examples were comparing patience vs histogram, but the
> larger thread is discussing the other).

Agreed, there are definitely cases where myers gives more readable
diffs. I think if we're going to change the default, the question we need
to answer is which algorithm gives the best result most of the time.

Best Wishes

Phillip

[1] https://github.com/phillipwood/git/commits/pure-patience-diff