Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm

Hi Peff

Thanks for the counter-example.

On 11/02/2023 01:59, Jeff King wrote:
> Just a small counter-point, since I happened to be looking at myers vs
> patience for something elsewhere in the thread, but:
>
>    git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c
>
> looks slightly better to me with myers, even though it is 2 lines
> longer. The issue is that patience and histogram are very eager to use
> blank lines as anchor points, so a diff like:
>
>    -some words
>    -
>    -and some more
>    +unrelated content
>    +
>    +but it happens to also be two paragraphs
>
> in myers becomes:
>
>    -some words
>    +unrelated content
>
>    -and some more
>    +but it happens to also be two paragraphs
>
> in patience (here I'm using single lines, but in practice these may be
> paragraphs, or stanzas of code). I think that's also the _strength_ of
> patience in many cases, but it really depends on the content.

Indeed. Ironically, as there are no unique context lines in that example, the blank lines are being matched by the patience implementation falling back to the myers algorithm. Normally the myers implementation tries to avoid matching common context lines between two blocks of changed lines, but I think that because in this case it is only called on a small part of the file, the blank lines are not common enough to trigger that heuristic.

I've got a patch[1] that stops the patience implementation falling back to the myers algorithm and just trims any leading and trailing context. On the whole I think it gives more readable diffs, but I've not got any systematic data to back that up. I also suspect there are pathological cases, such as each line in the file being duplicated, where falling back to the myers algorithm gives a much better result.
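
As a rough illustration of the two ideas in the paragraph above (anchoring only on unique lines, and trimming leading and trailing context instead of falling back to myers), here is a minimal Python sketch. It is not git's xdiff code and not the patch in [1]; the helper names `unique_common_lines` and `trim_common_context` and the sample file contents are made up for this example, and it assumes a changed region in which the blank line occurs more than once.

```python
from collections import Counter


def unique_common_lines(old, new):
    """Lines occurring exactly once in old and exactly once in new.

    In a patience-style diff only such lines can serve as anchors.
    """
    old_counts = Counter(old)
    new_counts = Counter(new)
    return [line for line in old
            if old_counts[line] == 1 and new_counts.get(line) == 1]


def trim_common_context(old, new):
    """Strip identical leading and trailing lines.

    Returns (prefix_len, suffix_len, old_middle, new_middle).
    """
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    return (prefix, suffix,
            old[prefix:len(old) - suffix],
            new[prefix:len(new) - suffix])


# Hypothetical contents: three stanzas are replaced, so the blank
# separator appears twice inside the changed region.
old = ["header", "", "some words", "", "and some more", "",
       "a third stanza", "", "footer"]
new = ["header", "", "unrelated content", "", "also a paragraph", "",
       "yet another one", "", "footer"]

prefix, suffix, old_mid, new_mid = trim_common_context(old, new)
anchors = unique_common_lines(old_mid, new_mid)
```

With these inputs the trimmed region has no unique common line, so a patience diff that never falls back would presumably report it as one block of removals followed by one block of additions. In the two-stanza toy example from the quoted text the single blank line is unique on both sides, so it would still be picked as an anchor.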

> Replacing a multi-stanza block with another one may be the best
> explanation for what happened. Or the two stanzas may be independent,
> and showing the change for each one may be better.
>
> I'm not sure which one happens more often. And you'd probably want to
> weight it by how good/bad the change is. In the example I showed I don't
> find patience very much worse, since it's already a pretty ugly diff.
> But in cases where patience shines, it may be making things
> significantly more readable.

I agree that having some data would be useful if we're going to change the default, but collecting it would entail quite a bit of work and, as the scoring is subjective, we'd want a few people doing it. It's great that someone has done that for the histogram algorithm in the paper Elijah cited.

> I don't have a super strong opinion, but I just wanted to chime in that
> it is not clear to me that patience/histogram is always a win over myers
> (yes, I know your examples were comparing patience vs histogram, but the
> larger thread is discussing the other).

Agreed, there are definitely cases where myers gives more readable diffs. I think that if we're going to change the default, the question we need to answer is which algorithm gives the best result most of the time.

Best Wishes

Phillip

[1] https://github.com/phillipwood/git/commits/pure-patience-diff


