On Mon, Jul 23, 2018 at 9:41 PM Jonathan Nieder <jrnieder@xxxxxxxxx> wrote: > > Hi, > > Stefan Beller wrote: > > > As a user I wondered what the diff algorithms are about. Offer at least > > a basic explanation on the differences of the diff algorithms. > > Interesting. Let's see. > > [...] > > --- a/Documentation/diff-options.txt > > +++ b/Documentation/diff-options.txt > > @@ -94,16 +94,34 @@ diff" algorithm internally. > > Choose a diff algorithm. The variants are as follows: > > + > > -- > > -`default`, `myers`;; > > - The basic greedy diff algorithm. Currently, this is the default. > > `minimal`;; > > + A diff as produced by the basic greedy algorithm described in > > Why this reordering? because Myers (below) is constructed from minimal + heuristic. Before this patch we say Myers is myers and minimal "spends extra cycles", which is true if you read the code, as it just is for (...) { test_path if (need_min) continue; if (heuristic_good_enough()) break; } https://github.com/git/git/blob/master/xdiff/xdiffi.c#L152 So the "minimal" algorithm is the basic myers algorithm, and the "myers" algorithm is the extended version (extended with a heuristic to be fast in corner cases, still producing good enough diffs). > > > + link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations] > > This URL doesn't seem likely to stay stable. Are appearances > deceiving? (Maybe so, since that's the same host as libxdiff's > homepage <http://www.xmailserver.org/xdiff-lib.html>.) It might be > worth mentioning the author and date just in case. I though about it and did not mention it, as the name "An O(ND) Difference Algorithm and its Variations" is already enough to find that paper quickly. > > - Spend extra time to make sure the smallest possible diff is > > - produced. > > +`default`, `myers`;; > > + The same algorithm as `minimal`, extended with a heuristic to > > + reduce extensive searches. Currently, this is the default. > > Amusing --- this means that the Myers diff is spelled as "minimal" > instead of "myers". > > > `patience`;; > > - Use "patience diff" algorithm when generating patches. > > + Use "patience diff" algorithm when generating patches. This > > + matches the longest common subsequence of unique lines on > > + both sides, recursively. It obtained its name by the way the > > + longest subsequence is found, as that is a byproduct of the > > + patience sorting algorithm. If there are no unique lines left > > + it falls back to `myers`. Empirically this algorithm produces > > + a more readable output for code, but it does not garantuee > > + the shortest output. > > Probably also worth mentioning that this was invented by Bram Cohen > for the bazaar vcs. I tried to balance the part that reads like a history lesson and the immediately useful part (why is this better, when should I use it?) Mentioning bazaar probably makes sense. Not sure about the author, but I'll include him in a reroll of this patch. > > `histogram`;; > > - This algorithm extends the patience algorithm to "support > > - low-occurrence common elements". > > + This algorithm re-implements the `patience` algorithm with > > What does reimplements mean here? Do you mean that it is a variation > on patience? It is supposed to be a variation of patience, but as it comes with its own implementation which I would not fully trust (as it was ported from JGit, and reading the comments over there, a misunderstanding of LCS made it possible to have only one midpoint before recursing) So it is not just taking the patience algorithm and adding support for "low-occurrence common elements" (what does that even mean? patience already distinguishes uniques and non-uniques), but its own implementation, hence it is not bug-for-bug compatible. > > + "support of low-occurrence common elements" and only picks > > + one element of the LCS for the recursion. It is often the > > Does LCS mean longest common subsequence or longest common substring > here? Probably best to spell it out to avoid confusion. When writing the patch I meant to refer to longest common subsequence from above, but by picking just one element, this algorithm understands it as longest common string. > > > + fastest, but in cornercases (when there are many longest > > s/cornercases/corner cases/ > > > + common subsequences of the same length) it produces bad > > + results as seen in: > > + > > + seq 1 100 >one > > + echo 99 > two > > + seq 1 2 98 >>two > > + git diff --no-index --histogram one two > > I wonder if these details (about all the algorithms) should go in a > separate Diff Algorithms section and this section could focus on > what use cases warrant using each, referring to that section for > details. What do you think? Yeah I thought about putting together a Documentation/technical/diffs.txt that talk about all kinds of diffing (basic diff algorithms, options, how to diff trees), but considered rolling this as a band aid until that document fully materializes. Thanks, Stefan