Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm

Elijah Newren <newren@xxxxxxxxx> · Thu, 9 Feb 2023 00:26:44 -0800

Hi John,

On Mon, Feb 6, 2023 at 12:02 PM John Cai <johncai86@xxxxxxxxx> wrote:
>
> Hi Phillip,
>
> On 6 Feb 2023, at 11:27, Phillip Wood wrote:
>
> > Hi John
> >
> > On 05/02/2023 03:46, John Cai via GitGitGadget wrote:
> >> From: John Cai <johncai86@xxxxxxxxx>
> >>
> >> It can be useful to specify diff algorithms per file type. For example,
> >> one may want to use the minimal diff algorithm for .json files, another
> >> for .c files, etc.
> >
> > Have you got any examples of why this is useful? I find myself occasionally changing the algorithm when the default gives a sub-optimal diff but I've not really noticed any pattern with respect to file types.
>
> At $DAYJOB, there has been a discussion and request for a feature like this [1].
> One use case that came up was to be able to set a different diff algorithm for
> .json files.
>
> 1. https://gitlab.com/gitlab-org/gitaly/-/issues/2591

A couple points:

First, there seems to be a misunderstanding in that issue.  In
particular, the merge algorithm does call into the xdiff library to do
the three-way content merge of individual files, and when it does so,
it has to specify the diff algorithm (or take the default, currently
myers).  merge-recursive allows the diff algorithm to be specified by
the user (there are
-Xdiff-algorithm={histogram,minimal,patience,myers} flags to
merge/rebase for it), while merge-ort uses histogram (though it uses
the same parser as merge-recursive and thus gets the variables set
from the -Xdiff-algorithm flag, it just ignores those values and
hardcodes histogram).

Second, I also think the user request got converted to a particular
solution without looking at the wider problem space:  The idea seemed
to assume "myers" is default for a good reason, and thus asked for an
option to use something else.  I'm not sure the assumption is valid; I
think "myers" is default for historical reasons and histogram is
better not just for special Salesforce xml files, but code files too.
The output makes more sense to users.  So much so that even though my
simple testing suggested it had a 2% performance penalty compared to
myers, I forced ort to use it[1] even though I designed  everything
else in that algorithm around eking out maximum performance.  Others
who have tested the diff algorithms have also found histogram has very
similar performance to myers, and oftentimes even beats it[2][3].
Also, worries about invalidating rerere caches[4] was real, but we
already paid that price when we switched to ort.  And if performance
is still a worry, [3] gives me reason to believe we can make our
histogram implementation faster.  Finally, for the period of time when
Palantir was letting me make an internal git distribution (mostly for
testing ort), I also carried a patch that changed the default diff
algorithm to histogram (not just for ort, but for diff/log/etc. as
well).  Never had any complaints from the users from it.  Perhaps you
could do the same in your local version of git used by gitaly?

[1] See c8017176ac ("merge-ort: use histogram diff", 2020-12-13)
[2] From 85551232b5 ("perf: compare diff algorithms", 2012-03-06):
"This does indeed show that histogram diff slightly beats Myers, while
patience is much slower than the others."
[3] https://github.com/pascalkuthe/imara-diff
[4] https://lore.kernel.org/git/20120307114714.GA14990@xxxxxxxxxxxxxxxxxxxxx/