Hi Peff
Thanks for the counter-example.
On 11/02/2023 01:59, Jeff King wrote:
> Just a small counter-point, since I happened to be looking at myers vs
> patience for something elsewhere in the thread, but:
>
>   git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c
>
> looks slightly better to me with myers, even though it is 2 lines
> longer. The issue is that patience and histogram are very eager to use
> blank lines as anchor points, so a diff like:
>
>   -some words
>   -
>   -and some more
>   +unrelated content
>   +
>   +but it happens to also be two paragraphs
>
> in myers becomes:
>
>   -some words
>   +unrelated content
>
>   -and some more
>   +but it happens to also be two paragraphs
>
> in patience (here I'm using single lines, but in practice these may be
> paragraphs, or stanzas of code). I think that's also the _strength_ of
> patience in many cases, but it really depends on the content.
Indeed. Ironically, as there are no unique context lines in that
example, the blank lines are being matched by the patience
implementation falling back to the myers algorithm. Normally the myers
implementation tries to avoid matching common context lines between two
blocks of changed lines, but I think that because in this case it is
only called on a small part of the file, the blank lines are not common
enough to trigger that heuristic. I've got a patch[1] that stops the
patience implementation falling back to the myers algorithm and just
trims any leading and trailing context. On the whole I think it gives
more readable diffs, but I've not got any systematic data to back that
up. I also suspect there are pathological cases, such as each line in
the file being duplicated, where falling back to the myers algorithm
gives a much better result.
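To make the two ideas above concrete, here is a rough sketch in Python.
It is not git's xdiff code and not the code from the patch[1]; the
function names and the toy input are made up for illustration. It only
shows (a) what "no unique context lines" means for a block and (b) what
trimming leading and trailing context leaves behind when there is no
fallback to myers.

```python
from collections import Counter


def unique_common_lines(old, new):
    """Return lines that occur exactly once in old and exactly once in new.

    These are the only lines a patience-style diff can use as anchors.
    """
    old_counts, new_counts = Counter(old), Counter(new)
    return [line for line in old
            if old_counts[line] == 1 and new_counts.get(line, 0) == 1]


def trim_common_context(old, new):
    """Strip matching leading and trailing lines from both sides.

    Returns (prefix_len, old_middle, new_middle, suffix_len).
    """
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return (prefix,
            old[prefix:len(old) - suffix],
            new[prefix:len(new) - suffix],
            suffix)


# Hypothetical block: the blank line is repeated, so it is not unique.
old = ["", "some words", "", "and some more", ""]
new = ["", "unrelated content", "",
       "but it happens to also be two paragraphs", ""]

anchors = unique_common_lines(old, new)
trimmed = trim_common_context(old, new)
print(anchors)   # no anchors: the only common line is not unique
print(trimmed)   # one blank line trimmed at each end, the rest is one hunk
```

In this toy block the middle blank line is left inside the changed
region, so the two paragraphs would be shown as a single removal
followed by a single addition rather than being split at the blank
line. A file in which every line is duplicated would likewise have no
anchors at all, which is the pathological case mentioned above.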
> Replacing a multi-stanza block with another one may be the best
> explanation for what happened. Or the two stanzas may be independent,
> and showing the change for each one may be better.
>
> I'm not sure which one happens more often. And you'd probably want to
> weight it by how good/bad the change is. In the example I showed I don't
> find patience very much worse, since it's already a pretty ugly diff.
> But in cases where patience shines, it may be making things
> significantly more readable.
I agree that having some data would be useful if we're going to change
the default, but collecting it would entail quite a bit of work and, as
the scoring is subjective, we'd want a few people doing it. It's great
that someone has done that for the histogram algorithm in the paper
Elijah cited.
> I don't have a super strong opinion, but I just wanted to chime in that
> it is not clear to me that patience/histogram is always a win over myers
> (yes, I know your examples were comparing patience vs histogram, but the
> larger thread is discussing the other).
Agreed, there are definitely cases where myers gives more readable
diffs. I think if we're going to change the default, the question we
need to answer is which algorithm gives the best result most of the
time.
Best Wishes
Phillip
[1] https://github.com/phillipwood/git/commits/pure-patience-diff