Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity

Elijah Newren <newren@xxxxxxxxx> · Sat, 13 Feb 2021 19:14:09 -0800

On Sat, Feb 13, 2021 at 5:32 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> I do not consider "the same file changed in place" the same as "we
> seem to have lost a file in the old tree, ah, we found one that has
> the same basename in a different directory" at all, so your argument
> still does not make any sense to me, sorry.

I'm not set on the commit message wording, you asked why I had used
the terms I did and I tried to explain.  I also explained how the
wording seemed to have helped Stolee understand.

If you'd like to suggest an alternative commit message, I'm happy to take it.

> 2021年2月13日(土) 17:25 Elijah Newren <newren@xxxxxxxxx>:
> >
> > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
> > >
> > > Elijah Newren <newren@xxxxxxxxx> writes:
> > >
> > > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > > 0% similar to the src/main.c in the new commit, we match the old
> > > > src/main.c to the new src/main.c despite being far more similar
> > > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > > content similarity to trump (full) filename equality.
> > >
> > > Absolutely.  And we are talking about a new optimization that kicks
> > > in only when there is no break or no copy detection going on, no?
> >
> > Yes, precisely, we are only considering cases without break
> > detection...and thus we are considering cases where for the last 15
> > years or more, sufficiently large filename similarity (an exact
> > fullname match) trumps any level of content similarity.  I think it is
> > useful to note that while my optimization is adding more
> > considerations that can overrule maximal content similarity, it is not
> > the first such code choice to do that.
> >
> > But let me back up a bit...
> >
> > When I submitted the series, you and Stolee went into a long
> > discussion about an optimization that I didn't submit, one that feels
> > looser on "matching" than anything I submitted, and which I think
> > might counter-intuitively reduce performance rather than aid it.  (The
> > performance side only comes into view in combination with later
> > series, but it was why I harped so much since then on only comparing
> > against at most one other file in the steps before full inexact rename
> > detection.)  I was quite surprised by the diversion, but it made it
> > clear to me that my descriptions and commit messages were far too
> > vague and could be read to imply a completely different algorithm than
> > I intended.  So, I tried to be far more careful in subsequent
> > iterations by adding wider context and contrasts.
> >
> > Further, after I wrote various things to try to clarify the
> > misunderstandings, I noticed that Stolee picked out one thing and
> > stated that "This idea of optimizing first for 100% filename
> > similarity is a good perspective on Git's rename detection algorithm."
> > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@xxxxxxxxx/)
> >  So, that particular point seemed to help him understand more, and
> > thus might be useful extra context for others reading along now or in
> > the future.
> >
> > Given all the above, I was trying to address earlier misunderstandings
> > and provide more context.  Perhaps I swung the pendulum too far and
> > talked too much about other cases, or perhaps I just worded things
> > poorly again.  All I was attempting to do in the commit message was
> > point out the multiple basic rules with filename and content
> > similarity, to lay the groundwork for new rules that do alternative
> > weightings.
> >
> > Anyway, I've added a few more tweaks to try to improve the wording for
> > the next round I'll submit today.  Given my track record so far, it
> > would not be surprising if it still needed more tweaks.