Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity

Junio C Hamano <gitster@xxxxxxxxx> · Sat, 13 Feb 2021 17:32:40 -0800

I do not consider "the same file changed in place" the same as "we
seem to have lost a file in the old tree, ah, we found one that has
the same basename in a different directory" at all, so your argument
still does not make any sense to me, sorry.

On Sat, Feb 13, 2021 at 17:25, Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
> >
> > Elijah Newren <newren@xxxxxxxxx> writes:
> >
> > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > 0% similar to the src/main.c in the new commit, we match the old
> > > src/main.c to the new src/main.c despite it being far more similar to
> > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > content similarity to trump (full) filename equality.
> >
> > Absolutely.  And we are talking about a new optimization that kicks
> > in only when there is no break or no copy detection going on, no?
>
> Yes, precisely, we are only considering cases without break
> detection...and thus we are considering cases where for the last 15
> years or more, sufficiently large filename similarity (an exact
> fullname match) trumps any level of content similarity.  I think it is
> useful to note that while my optimization is adding more
> considerations that can overrule maximal content similarity, it is not
> the first such code choice to do that.
>
> But let me back up a bit...
>
> When I submitted the series, you and Stolee went into a long
> discussion about an optimization that I didn't submit, one that feels
> looser on "matching" than anything I submitted, and which I think
> might counter-intuitively reduce performance rather than aid it.  (The
> performance side only comes into view in combination with later
> series, but it was why I harped so much since then on only comparing
> against at most one other file in the steps before full inexact rename
> detection.)  I was quite surprised by the diversion, but it made it
> clear to me that my descriptions and commit messages were far too
> vague and could be read to imply a completely different algorithm than
> I intended.  So, I tried to be far more careful in subsequent
> iterations by adding wider context and contrasts.
>
> Further, after I wrote various things to try to clarify the
> misunderstandings, I noticed that Stolee picked out one thing and
> stated that "This idea of optimizing first for 100% filename
> similarity is a good perspective on Git's rename detection algorithm."
> (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@xxxxxxxxx/)
>  So, that particular point seemed to help him understand more, and
> thus might be useful extra context for others reading along now or in
> the future.
>
> Given all the above, I was trying to address earlier misunderstandings
> and provide more context.  Perhaps I swung the pendulum too far and
> talked too much about other cases, or perhaps I just worded things
> poorly again.  All I was attempting to do in the commit message was
> point out the multiple basic rules with filename and content
> similarity, to lay the groundwork for new rules that do alternative
> weightings.
>
> Anyway, I've added a few more tweaks to try to improve the wording for
> the next round I'll submit today.  Given my track record so far, it
> would not be surprising if it still needed more tweaks.
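
For readers following along, the existing rule described in the quoted
message (without break detection, a path present in both trees is paired
with itself regardless of content, and only the leftover paths are
compared by content) can be sketched roughly as below.  This is a toy
Python model, not Git's diffcore code: pair_files(), similarity(), the
0.5 threshold and the use of difflib as the similarity measure are all
assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy content-similarity score in [0, 1]; Git uses its own measure."""
    return SequenceMatcher(None, a, b).ratio()


def pair_files(old: dict, new: dict, threshold: float = 0.5) -> dict:
    """Map old paths to new paths (hypothetical helper, not a Git API).

    Step 1: a path present in both trees is paired with itself, however
            dissimilar the contents are (i.e. no break detection).
    Step 2: only the leftover old paths are compared by content against
            the leftover new paths.
    """
    pairs = {path: path for path in old if path in new}
    remaining_new = [p for p in new if p not in old]
    for src in old:
        if src in pairs:
            continue
        # Pick the most similar unmatched destination, if there is one.
        best = max(
            remaining_new,
            key=lambda dst: similarity(old[src], new[dst]),
            default=None,
        )
        if best is not None and similarity(old[src], new[best]) >= threshold:
            pairs[src] = best
            remaining_new.remove(best)
    return pairs


# The old src/main.c is identical to the new src/foo.c and unrelated to
# the new src/main.c, mirroring the example in the quoted message.
old_tree = {"src/main.c": "int main(void) { return run(); }\n"}
new_tree = {
    "src/main.c": "# totally unrelated text\n",
    "src/foo.c": "int main(void) { return run(); }\n",
}
result = pair_files(old_tree, new_tree)
```

In this model the old src/main.c is still paired with the new
src/main.c, and src/foo.c is left over as an addition; content
similarity only gets a say for paths that have no exact-name match.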