Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity

Elijah Newren <newren@xxxxxxxxx> · Sat, 13 Feb 2021 17:24:50 -0800

On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Elijah Newren <newren@xxxxxxxxx> writes:
>
> > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > 0% similar to the src/main.c in the new commit, we match the old
> > src/main.c to the new src/main.c despite being far more similar
> > src/foo.c.  Unless break detection is turned on, we do not allow
> > content similarity to trump (full) filename equality.
>
> Absolutely.  And we are talking about a new optimization that kicks
> in only when there is no break or no copy detection going on, no?

Yes, precisely, we are only considering cases without break
detection...and thus we are considering cases where for the last 15
years or more, sufficiently large filename similarity (an exact
fullname match) trumps any level of content similarity.  I think it is
useful to note that while my optimization is adding more
considerations that can overrule maximal content similarity, it is not
the first such code choice to do that.

But let me back up a bit...

When I submitted the series, you and Stolee went into a long
discussion about an optimization that I didn't submit, one that feels
looser on "matching" than anything I submitted, and which I think
might counter-intuitively reduce performance rather than aid it.  (The
performance side only comes into view in combination with later
series, but it was why I harped so much since then on only comparing
against at most one other file in the steps before full inexact rename
detection.)  I was quite surprised by the diversion, but it made it
clear to me that my descriptions and commit messages were far too
vague and could be read to imply a completely different algorithm than
I intended.  So, I tried to be far more careful in subsequent
iterations by adding wider context and contrasts.

Further, after I wrote various things to try to clarify the
misunderstandings, I noticed that Stolee picked out one thing and
stated that "This idea of optimizing first for 100% filename
similarity is a good perspective on Git's rename detection algorithm."
(see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@xxxxxxxxx/)
 So, that particular point seemed to help him understand more, and
thus might be useful extra context for others reading along now or in
the future.

Given all the above, I was trying to address earlier misunderstandings
and provide more context.  Perhaps I swung the pendulum too far and
talked too much about other cases, or perhaps I just worded things
poorly again.  All I was attempting to do in the commit message was
point out the multiple basic rules with filename and content
similarity, to lay the groundwork for new rules that do alternative
weightings.

Anyway, I've added a few more tweaks to try to improve the wording for
the next round I'll submit today.  Given my track record so far, it
would not be surprising if it still needed more tweaks.