Re: [PATCH] Documentation/diff-options: explain different diff algorithms

Jonathan Nieder <jrnieder@xxxxxxxxx> · Mon, 6 Aug 2018 16:18:43 -0700

Hi,

Stefan Beller wrote:
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience
>  diff" algorithm internally.
>  
>  --diff-algorithm={patience|minimal|histogram|myers}::
> -	Choose a diff algorithm. The variants are as follows:
> +	Choose a diff algorithm. See the discussion of DIFF ALGORITHMS
> +ifndef::git-diff[]
> +	in linkgit:git-diff[1]
> +endif::git-diff[]
> +	. The variants are as follows:

This means outside of git-diff(1), I'd see

	See the discussion of DIFF ALGORITHMS in git-diff(1) .

And in git-diff(1), I'd see

	See the discussion of DIFF ALGORITHMS .

Both don't seem quite right, since they have an extra space before the
period.  The git-diff(1) seems especially not quite right --- does it
intend to say something like "See the DIFF ALGORITHMS section for more
discussion"?

[...]
> --- a/Documentation/git-diff.txt
> +++ b/Documentation/git-diff.txt
> @@ -119,6 +119,40 @@ include::diff-options.txt[]
>  
>  include::diff-format.txt[]
>  
> +DIFF ALGORITHMS
> +---------------

Please add some introductory words about what the headings refer to.

> +`Myers`
> +
> +A diff as produced by the basic greedy algorithm described in
> +link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations].
> +with a run time of O(M + N + D^2). It employs a heuristic to allow for
> +a faster diff at the small cost of diff size.
> +The `minimal` algorithm has that heuristic turned off.
> +
> +`Patience`
> +
> +This algorithm by Bram Cohen matches the longest common subsequence
> +of unique lines on both sides, recursively. It obtained its name by
> +the way the longest subsequence is found, as that is a byproduct of
> +the patience sorting algorithm. If there are no unique lines left
> +it falls back to `myers`. Empirically this algorithm produces
> +a more readable output for code, but it does not garantuee

nit: s/garantuee/guarantee/

> +the shortest output.

Trivia: the `minimal` variant of Myers doesn't guarantee shortest
output, either: what it minimizes is the number of lines marked as
added or removed.  If you want to minimize context lines too, then
that would be a new variant. ;-)

[...]
> +`Histogram`
> +
> +This algorithm finds the longest common substring and recursively
> +diffs the content before and after the longest common substring.

optional: may be worth a short aside in the text about the distinction
between LCS and LCS. ;-)

It would be especially useful here, since the alphabet used in these
strings is *lines* instead of characters, so the first-time reader
could probably use some help in building their intuition.

> +If there are no common substrings left, fallback to `myers`.

nit: fallback is the noun, fall back is the verb.

> +This is often the fastest, but in corner cases (when there are
> +many common substrings of the same length) it produces bad

Can you clarify what "bad" means?  E.g. would "unexpected", or "poorly
aligned", match what you mean?

> +results as seen in:
> +
> +	seq 1 100 >one
> +	echo 99 > two
> +	seq 1 2 98 >>two
> +	git diff --no-index --histogram one two
> +
>  EXAMPLES
>  --------

Thanks,
Jonathan