Re: [PATCH] Documentation/diff-options: explain different diff algorithms

Jonathan Nieder <jrnieder@xxxxxxxxx> · Mon, 23 Jul 2018 21:40:52 -0700

Hi,

Stefan Beller wrote:

> As a user I wondered what the diff algorithms are about. Offer at least
> a basic explanation on the differences of the diff algorithms.

Interesting.  Let's see.

[...]
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -94,16 +94,34 @@ diff" algorithm internally.
>  	Choose a diff algorithm. The variants are as follows:
>  +
>  --
> -`default`, `myers`;;
> -	The basic greedy diff algorithm. Currently, this is the default.
>  `minimal`;;
> +	A diff as produced by the basic greedy algorithm described in

Why this reordering?

> +	link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]

This URL doesn't seem likely to stay stable.  Are appearances
deceiving?  (Maybe so, since that's the same host as libxdiff's
homepage <http://www.xmailserver.org/xdiff-lib.html>.)  It might be
worth mentioning the author and date just in case.

> -	Spend extra time to make sure the smallest possible diff is
> -	produced.
> +`default`, `myers`;;
> +	The same algorithm as `minimal`, extended with a heuristic to
> +	reduce extensive searches. Currently, this is the default.

Amusing --- this means that the Myers diff is spelled as "minimal"
instead of "myers".

>  `patience`;;
> -	Use "patience diff" algorithm when generating patches.
> +	Use "patience diff" algorithm when generating patches. This
> +	matches the longest common subsequence of unique lines on
> +	both sides, recursively. It obtained its name by the way the
> +	longest subsequence is found, as that is a byproduct of the
> +	patience sorting algorithm. If there are no unique lines left
> +	it falls back to `myers`. Empirically this algorithm produces
> +	a more readable output for code, but it does not garantuee
> +	the shortest output.

Probably also worth mentioning that this was invented by Bram Cohen
for the bazaar vcs.

>  `histogram`;;
> -	This algorithm extends the patience algorithm to "support
> -	low-occurrence common elements".
> +	This algorithm re-implements the `patience` algorithm with

What does reimplements mean here?  Do you mean that it is a variation
on patience?

> +	"support of low-occurrence common elements" and only picks
> +	one element of the LCS for the recursion. It is often the

Does LCS mean longest common subsequence or longest common substring
here?  Probably best to spell it out to avoid confusion.

> +	fastest, but in cornercases (when there are many longest

s/cornercases/corner cases/

> +	common subsequences of the same length) it produces bad
> +	results as seen in:
> +
> +		seq 1 100 >one
> +		echo 99 > two
> +		seq 1 2 98 >>two
> +		git diff --no-index --histogram one two

I wonder if these details (about all the algorithms) should go in a
separate Diff Algorithms section and this section could focus on
what use cases warrant using each, referring to that section for
details.  What do you think?

Thanks and hope that helps,
Jonathan