On Mon, Feb 8, 2021 at 3:43 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Elijah Newren <newren@xxxxxxxxx> writes:
>
> > I'm sorry, but I'm not following you.  As best I can tell, you seem to
> > be suggesting that if we were to use a higher similarity bar for
> > checking same-basename files, that such a difference would end up not
> > accelerating the diffcore-rename algorithm at all?
>
> No.  If we assume we use the minimum similarity threshold in the
> new middle step that considers only the files that were moved across
> directories without changing their names, and the last "full matrix"
> step sees a src that did *not* pair with a dst of the same name in a
> different directory surviving, we know that the pair would not be
> similar enough (because we are using the same "minimum similarity"
> in the middle step and the full matrix step) without comparing them
> again.  But if we used higher similarity in the middle step, the
> fact that such a src/dst pair survived the middle step without
> producing a match only means that the pair was not similar enough
> with a raised bar used in the middle, and the full-matrix step needs
> to consider the possibility that they may still be similar enough
> when using "minimum similarity" used for all the other pairs.
>
> And because I was assuming that requiring higher similarity in the
> middle step would be a prudent thing to do to avoid false matches
> that discard better matches elsewhere, my conclusion was that it
> would not be a useful optimization to do in the final full-matrix
> step to see if a pair is something that was a candidate in the
> middle step but did not match well enough (because the fact that the
> pair did not compare well enough with higher bar does not mean it
> would not compare well to pass the lower "minimum" bar).

Ah, gotcha!  Thanks for clarifying.  Yes, yet another reason to not
even try to avoid "redoing" the O(N) spanhash comparisons.
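
To make the threshold argument concrete, here is a toy Python sketch.
It is not Git's diffcore-rename code: the function names, file names,
and similarity scores are all made up for illustration, and similarity
is treated as a precomputed number rather than a spanhash comparison.
It only shows which pairs a full-matrix step would wrongly drop if it
skipped everything that had already failed the same-basename step.

```python
# Toy model only: every name and score here is hypothetical and is not
# taken from Git's diffcore-rename implementation.

MIN_SCORE = 0.50  # assumed "minimum similarity" used by the full-matrix step


def basename_step(pairs, threshold):
    """Split same-basename candidate pairs into (matched, failed).

    `pairs` maps (src, dst) to a precomputed similarity in [0, 1].
    """
    matched = {p for p, score in pairs.items() if score >= threshold}
    failed = set(pairs) - matched
    return matched, failed


def wrongly_skipped(pairs, basename_threshold, min_score=MIN_SCORE):
    """Pairs the full-matrix step would miss if it skipped every pair
    that had already failed the basename step."""
    _, failed = basename_step(pairs, basename_threshold)
    return {p for p in failed if pairs[p] >= min_score}


pairs = {
    ("a/util.c", "b/util.c"): 0.90,
    ("a/main.c", "b/main.c"): 0.60,
    ("a/test.c", "b/test.c"): 0.20,
}

# Same bar in both steps: a failed pair is below MIN_SCORE by definition,
# so skipping it in the full-matrix step loses nothing.
same_bar = wrongly_skipped(pairs, basename_threshold=MIN_SCORE)

# Raised bar in the basename step: a/main.c -> b/main.c fails at 0.75 but
# still clears MIN_SCORE, so skipping it would lose a valid rename.
higher_bar = wrongly_skipped(pairs, basename_threshold=0.75)

print(same_bar)
print(higher_bar)
```

With the same bar in both steps nothing is lost by skipping, whereas
with the raised bar the 0.60 pair has to be compared again, which is
the case described above.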