On Fri, Aug 31, 2018 at 2:39 PM Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
>
> Hi,
>
> On Thu, 30 Aug 2018, Stefan Beller wrote:
>
> > On Thu, Aug 30, 2018 at 12:20 PM Jeff King <peff@xxxxxxxx> wrote:
> > >
> > > [...] Myers does not promise to find the absolute minimal diff. [...]
> >
> > The `Myers` (our default) diff algorithm is really the Myers algorithm +
> > a heuristic that cuts off the long tail when it is very costly to
> > compute the minimal diff.
>
> IIRC Myers (the original, which we spell `minimal`) promises minimal
> diffs only for -U0. As soon as you add context, Myers might in fact end
> up with a suboptimal diff, even without that heuristic.

Ah, the last four words made it clear.

I debated the cost function for diffs some time ago and came to the
conclusion that charging a flat price of 1 per added or removed line in
the search for the 'lowest cost diff' is pretty good: it does an okay
job in the broad world and does not 'overfit' assumptions onto a
problem.

For example, in some patches I pay more attention to the lines removed
than to the lines added, or vice versa, so assigning different costs to
added and removed lines would make sense. (It depends on the type of
patch, but that cannot be deduced at the low level of the diff
machinery.)

Starting a new hunk (i.e. adding a cost for -U<n> with n > 0) could be
made costly. In fact we have an after-Myers optimization in the xdiff
code that checks whether hunks can be coalesced; if we could account for
that cost at diff computation time instead, this might already make for
better diffs.

Another example is the number of alternations between added, removed and
context parts. If I have to review only one large added part and one
large removed part, that may be more understandable than interleaved
adds and removes. (I think --patience goes in that direction, but it
optimizes for a different metric.)

Stefan
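P.S.: To make the cost discussion concrete, here is a toy sketch of a
weighted line-diff cost. It is purely hypothetical and has nothing to do
with git's actual xdiff code: the knobs `del_cost`, `add_cost` and
`hunk_open` are made-up parameters that model the ideas above
(asymmetric add/remove weights, and a penalty charged once per
contiguous changed run, so interleaved changes cost more than one big
hunk):

```python
from functools import lru_cache

def diff_cost(a, b, del_cost=1.0, add_cost=1.0, hunk_open=0.0):
    """Minimal total cost to turn line list a into line list b.

    Toy model: each removed line costs del_cost, each added line costs
    add_cost, and hunk_open is charged once at the start of every
    contiguous run of changes. With the defaults this degenerates to
    the flat 'cost 1 per changed line' model discussed above.
    """
    a, b = tuple(a), tuple(b)

    @lru_cache(maxsize=None)
    def go(i, j, in_hunk):
        # (i, j) = how much of a and b is already consumed;
        # in_hunk = whether the previous step was an add/remove,
        # so hunk_open is only paid when a new changed run starts.
        if i == len(a) and j == len(b):
            return 0.0
        best = float("inf")
        if i < len(a) and j < len(b) and a[i] == b[j]:
            best = go(i + 1, j + 1, False)      # context line, free
        open_pen = 0.0 if in_hunk else hunk_open
        if i < len(a):                          # remove a line of a
            best = min(best, del_cost + open_pen + go(i + 1, j, True))
        if j < len(b):                          # add a line of b
            best = min(best, add_cost + open_pen + go(i, j + 1, True))
        return best

    return go(0, 0, False)
```

With hunk_open > 0, two changes separated by context cost more than the
same number of changed lines in a single run, which is exactly the
"fewer, larger hunks" preference sketched above.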