Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames

Elijah Newren <newren@xxxxxxxxx> · Mon, 8 Feb 2021 15:52:45 -0800

On Mon, Feb 8, 2021 at 3:43 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Elijah Newren <newren@xxxxxxxxx> writes:
>
> > I'm sorry, but I'm not following you.  As best I can tell, you seem to
> > be suggesting that if we were to use a higher similarity bar for
> > checking same-basename files, that such a difference would end up not
> > accelerating the diffcore-rename algorithm at all?
>
> No.  If we assume we use the minimum similarity threshold in the
> new middle step that considers only the files that were moved across
> directories without changing their names, and the last "full matrix"
> step sees a src that did *not* pair with a dst of the same name in a
> different directory surviving, we know that the pair would not be
> similar enough (because we are using the same "minimum similarity"
> in the middle step and the full matrix step) without comparing them
> again.  But if we used higher similarity in the middle step, the
> fact that such a src/dst pair survived the middle step without
> producing a match only means that the pair was not similar enough
> with a raised bar used in the middle, and the full-matrix step needs
> to consider the possibility that they may still be similar enough
> when using the "minimum similarity" used for all the other pairs.
>
> And because I was assuming that requiring higher similarity in the
> middle step would be a prudent thing to do to avoid false matches
> that discard better matches elsewhere, my conclusion was that it
> would not be a useful optimization to do in the final full-matrix
> step to see if a pair is something that was a candidate in the
> middle step but did not match well enough (because the fact that the
> pair did not compare well enough with higher bar does not mean it
> would not compare well to pass the lower "minimum" bar).

Ah, gotcha!  Thanks for clarifying.  Yes, yet another reason to not
even try to avoid "redoing" the O(N) spanhash comparisons.
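
For what it's worth, here is a rough sketch of that reasoning in Python.
It is not the actual diffcore-rename code: the function names, the
similarity table, and the threshold values are all made up for
illustration.  It only shows when a same-basename pair that failed the
middle step could safely be skipped by the full-matrix step.

```python
def basename_step(pairs, similarity, basename_threshold):
    """Split same-basename candidate pairs into matched and failed lists.

    `similarity` is a hypothetical lookup table of precomputed scores in
    [0, 1]; it stands in for the real content comparison.
    """
    matched, failed = [], []
    for pair in pairs:
        if similarity[pair] >= basename_threshold:
            matched.append(pair)
        else:
            failed.append(pair)
    return matched, failed


def skippable_in_full_matrix(failed, basename_threshold, min_threshold):
    """Return the failed pairs that the full-matrix step need not recheck.

    A failed pair is only known to score below basename_threshold.  That
    rules out a full-matrix match only when the middle step's bar is no
    higher than the minimum bar used by the full matrix.
    """
    if basename_threshold <= min_threshold:
        return set(failed)
    return set()


# Illustrative data: two files moved across directories, names unchanged.
MIN_THRESHOLD = 0.5
similarity = {
    ("a/foo.c", "b/foo.c"): 0.6,
    ("a/bar.c", "b/bar.c"): 0.3,
}
pairs = list(similarity)

# Same bar in both steps: anything that failed can be skipped later.
same_matched, same_failed = basename_step(pairs, similarity, MIN_THRESHOLD)
same_skip = skippable_in_full_matrix(same_failed, MIN_THRESHOLD, MIN_THRESHOLD)

# Raised bar in the middle step: failures tell us nothing about the
# minimum bar, so nothing can be skipped.
high_matched, high_failed = basename_step(pairs, similarity, 0.8)
high_skip = skippable_in_full_matrix(high_failed, 0.8, MIN_THRESHOLD)
```

With the raised bar, a/foo.c -> b/foo.c fails the middle step even though
its score still clears the minimum, so the full-matrix step has to look
at it again.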