Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm

Elijah Newren <newren@xxxxxxxxx> · Tue, 14 Feb 2023 18:35:00 -0800

On Fri, Feb 10, 2023 at 5:59 PM Jeff King <peff@xxxxxxxx> wrote:
>
> On Thu, Feb 09, 2023 at 02:44:15PM +0000, Phillip Wood wrote:
>
> > To see the differences between the output of patience and histogram
> > algorithms I diffed the output of "git log -p --no-merges
> > --diff-algorithm=patience" and "git log -p --no-merges
> > --diff-algorithm=histogram". The first three differences are
> >
> > - 6c065f72b8 (http: support CURLOPT_PROTOCOLS_STR, 2023-01-16)
> >   In get_curl_allowed_protocols() the patience algorithm shows the
> >   change in the return statement more clearly
> >
> > - 47cfc9bd7d (attr: add flag `--source` to work with tree-ish, 2023-01-14)
> >    The histogram algorithm shows read_attr_from_index() being moved
> >    whereas the patience algorithm does not making the diff easier to
> >    follow.
> >
> > - b0226007f0 (fsmonitor: eliminate call to deprecated FSEventStream
> > function, 2022-12-14)
> >   In fsm_listen__stop_async() the histogram algorithm shows
> >   data->shutdown_style = SHUTDOWN_EVENT;
> >   being moved, which is not as clear as the patience output which
> >   shows it as a context line.
>
> Just a small counter-point, since I happened to be looking at myers vs
> patience for something elsewhere in the thread, but:
>
>   git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c

"fatal: bad object 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f"

Is that a local commit of yours?

> looks slightly better to me with myers, even though it is 2 lines
> longer. The issue is that patience and histogram are very eager to use
> blank lines as anchor points, so a diff like:
>
>   -some words
>   -
>   -and some more
>   +unrelated content
>   +
>   +but it happens to also be two paragraphs
>
> in myers becomes:
>
>   -some words
>   +unrelated content
>
>   -and some more
>   +but it happens to also be two paragraphs
>
> in patience (here I'm using single lines, but in practice these may be
> paragraphs, or stanzas of code). I think that's also the _strength_ of
> patience in many cases, but it really depends on the content. Replacing
> a multi-stanza block with another one may be the best explanation for
> what happened. Or the two stanzas may be independent, and showing the
> change for each one may be better.
>
> I'm not sure which one happens more often. And you'd probably want to
> weight it by how good/bad the change is. In the example I showed I don't
> find patience very much worse, since it's already a pretty ugly diff.
> But in cases where patience shines, it may be making things
> significantly more readable.
>
> I don't have a super strong opinion, but I just wanted to chime in that
> it is not clear to me that patience/histogram is always a win over myers
> (yes, I know your examples were comparing patience vs histogram, but the
> larger thread is discussing the other).

Oh, I agree histogram is not always a win over myers.  I just feel it
is the majority of the time.  But if you want more than "feels",
here's some solid data to back that up...

I found a study on the subject over at
https://link.springer.com/article/10.1007/s10664-019-09772-z.  They
were particularly interested in whether other academic studies could
have been affected by git's different diff algorithms, and came away
with the answer that it did.  They looked at a few hundred thousand
commits across two dozen different repositories and found (note that
they only looked at myers and histogram, ignoring patience and
minimal):

   * 92.4% - 98.6% of the diffs (depending on repo) are identical
whether you use myers or histogram
   * 93.8% - 99.2% of the diffs (depending on repo) have the same
number of added/deleted lines with myers and histogram
   * Of the >20k diffs that were not the identical, they selected a
random sample of 377 diffs (taking care to make sure they were
statistically representative)
   * They divided the 377 diffs into "code" and "non-code" diffs, i.e.
those modifying source code and those modifying other textual files
   * They had two people annotating the diffs and independently
scoring them, and then checked for agreement between their answers
afterwards.  (No, they didn't always agree, but they did have
substantial agreement.)

For the (again, non-identical) diffs modifying non-code, they found
(see table 11) that:
   * 14.9% of the myers diffs are better
   * 13.4% of the histogram diffs are better
   * 71.6% of the diffs have equal quality

For the (non-identical) diffs modifying code, they found (again, see
table 11) that:
   * 16.9% of the myers diffs are better
   * 62.6% of the histogram diffs are better
   * 20.6% of the diffs have equal quality

A ratio of 4 to 1 for histogram being better on code diffs is pretty
weighty to me.

It's possible these results would have been even better were it not
for a couple of bugs in the histogram code (ported from the original
in jgit).  Phillip pointed me to a problematic testcase that Stefan
Beller found, and in attempting to fix it (I'm on fix #4 or so), I
believe I found another issue.  However, I don't want to go into too
much detail yet, as I found problems with some of my previous fixes
and already invalidated things I told Phillip just last week.