Re: [PATCH] Documentation/diff-options: explain different diff algorithms

Stefan Beller <sbeller@xxxxxxxxxx> · Tue, 24 Jul 2018 10:38:03 -0700

On Mon, Jul 23, 2018 at 9:41 PM Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
>
> Hi,
>
> Stefan Beller wrote:
>
> > As a user I wondered what the diff algorithms are about. Offer at least
> > a basic explanation on the differences of the diff algorithms.
>
> Interesting.  Let's see.
>
> [...]
> > --- a/Documentation/diff-options.txt
> > +++ b/Documentation/diff-options.txt
> > @@ -94,16 +94,34 @@ diff" algorithm internally.
> >       Choose a diff algorithm. The variants are as follows:
> >  +
> >  --
> > -`default`, `myers`;;
> > -     The basic greedy diff algorithm. Currently, this is the default.
> >  `minimal`;;
> > +     A diff as produced by the basic greedy algorithm described in
>
> Why this reordering?

because Myers (below) is constructed from minimal + heuristic.
Before this patch we say Myers is myers and minimal "spends
extra cycles", which is true if you read the code, as it just is

    for (...) {
        test_path
        if (need_min)
            continue;
        if (heuristic_good_enough())
            break;
    }

https://github.com/git/git/blob/master/xdiff/xdiffi.c#L152

So the "minimal" algorithm is the basic myers algorithm,
and the "myers" algorithm is the extended version (extended
with a heuristic to be fast in corner cases, still producing good
enough diffs).

>
> > +     link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]
>
> This URL doesn't seem likely to stay stable.  Are appearances
> deceiving?  (Maybe so, since that's the same host as libxdiff's
> homepage <http://www.xmailserver.org/xdiff-lib.html>.)  It might be
> worth mentioning the author and date just in case.

I though about it and did not mention it, as the name
"An O(ND) Difference Algorithm and its Variations" is already
enough to find that paper quickly.

> > -     Spend extra time to make sure the smallest possible diff is
> > -     produced.
> > +`default`, `myers`;;
> > +     The same algorithm as `minimal`, extended with a heuristic to
> > +     reduce extensive searches. Currently, this is the default.
>
> Amusing --- this means that the Myers diff is spelled as "minimal"
> instead of "myers".
>
> >  `patience`;;
> > -     Use "patience diff" algorithm when generating patches.
> > +     Use "patience diff" algorithm when generating patches. This
> > +     matches the longest common subsequence of unique lines on
> > +     both sides, recursively. It obtained its name by the way the
> > +     longest subsequence is found, as that is a byproduct of the
> > +     patience sorting algorithm. If there are no unique lines left
> > +     it falls back to `myers`. Empirically this algorithm produces
> > +     a more readable output for code, but it does not garantuee
> > +     the shortest output.
>
> Probably also worth mentioning that this was invented by Bram Cohen
> for the bazaar vcs.

I tried to balance the part that reads like a history lesson and the
immediately useful part (why is this better, when should I use it?)

Mentioning bazaar probably makes sense. Not sure about the author,
but I'll include him in a reroll of this patch.

> >  `histogram`;;
> > -     This algorithm extends the patience algorithm to "support
> > -     low-occurrence common elements".
> > +     This algorithm re-implements the `patience` algorithm with
>
> What does reimplements mean here?  Do you mean that it is a variation
> on patience?

It is supposed to be a variation of patience, but as it comes with its
own implementation which I would not fully trust (as it was ported
from JGit, and reading the comments over there, a misunderstanding
of LCS made it possible to have only one midpoint before recursing)

So it is not just taking the patience algorithm and adding support for
"low-occurrence common elements"  (what does that even mean?
patience already distinguishes uniques and non-uniques), but
its own implementation, hence it is not bug-for-bug compatible.

> > +     "support of low-occurrence common elements" and only picks
> > +     one element of the LCS for the recursion. It is often the
>
> Does LCS mean longest common subsequence or longest common substring
> here?  Probably best to spell it out to avoid confusion.

When writing the patch I meant to refer to longest common subsequence
from above, but by picking just one element, this algorithm understands it
as longest common string.

>
> > +     fastest, but in cornercases (when there are many longest
>
> s/cornercases/corner cases/
>
> > +     common subsequences of the same length) it produces bad
> > +     results as seen in:
> > +
> > +             seq 1 100 >one
> > +             echo 99 > two
> > +             seq 1 2 98 >>two
> > +             git diff --no-index --histogram one two
>
> I wonder if these details (about all the algorithms) should go in a
> separate Diff Algorithms section and this section could focus on
> what use cases warrant using each, referring to that section for
> details.  What do you think?

Yeah I thought about putting together a
Documentation/technical/diffs.txt that talk about all kinds of diffing
(basic diff algorithms, options, how to diff trees), but considered
rolling this as a band aid until that document fully materializes.

Thanks,
Stefan