On Wed, Feb 3, 2021 at 3:37 PM Jeff King <peff@xxxxxxxx> wrote:
>
> On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote:
>
> > > In an early attempt, I tried to retire rename_src[j], once
> > > rename_dst[i] has been found to be a "good enough" match for it,
> > > from the pool of rename src candidates to find a good match for
> > > rename_dst[k] for i < k, but naive implementation of it would not
> > > work well for obvious reasons---rename_src[j] may match a lot
> > > better with rename_dst[k] than rename_dst[i] but we do not know
> > > that until we try to estimate similarity with rename_dst[k].
> >
> > You may really like the next two series I submit. I have a smarter
> > way to find a "good enough" match (comparing to exactly one other
> > file and often finding sufficient similarity), and one that'll make
> > intuitive sense to users.
>
> Here's a really old thread with an approach that may or may not be
> similar to what you're thinking of:
>
>   https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@xxxxxxxxxxxxxx/
>
> Though maybe start with this summary message:
>
>   https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@xxxxxxxxxxxxxx/

Interesting thread; thanks for the link. It's not remotely similar to
what I have done, but a brief glance through it reminds me of the ideas
at https://github.com/gitgitgadget/git/issues/519.

> I remember experimenting some with it at the time, but never making
> things faster. It's entirely possible (likely, even) that I was
> simply not doing it well enough.
>
> It's also been long enough since I looked at the rename code that I'm
> not sure how different it is in practice. It still has a quadratic
> matrix in the end. We basically do a similar strategy of
> rolling-hash-fingerprint-and-see-where-things-collide, but I think we
> may end up with more work during the quadratic part (again, it's been
> a while, so I may just be totally off-base).

I'm not sure if I should spoil the surprise for the patchsets I haven't
yet submitted, but... when I get done, for the testcases I have looked
at, rename detection is no longer the slowest part of
merging/rebasing/cherry-picking -- not even when there are tens of
thousands of renames on one side of history. And I didn't achieve that
by making other parts of the code slower.

If someone can come up with a real-world testcase where rename
detection is still really slow in a merge/rebase/cherry-pick with my
implemented optimizations, I've got at least one more optimization that
I hadn't bothered implementing because everything was fast enough for
me already. And the idea you linked above and the ideas in the
gitgitgadget issue linked above all have more possibilities that are
complementary to what I have done and might help further.

> I've also wondered if something similar might be helpful for delta
> compression (which is now done with an O(m*n) sliding window).
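
As an aside, for anyone skimming this thread who hasn't looked at the
rename code recently: a toy sketch of the "fingerprint, then score the
quadratic matrix" shape being discussed above is below. The helper
names (fingerprint_file(), similarity()), the line-based hashing, and
the hard-coded file contents are all made up for illustration; this is
not diffcore-rename, just the rough idea.

/* toy-rename-sketch.c: illustrative only, not the actual git code */
#include <stdio.h>
#include <stdlib.h>

#define MAX_CHUNKS 1024

struct file_fp {
	const char *name;
	unsigned int hash[MAX_CHUNKS]; /* per-chunk fingerprints */
	int nr;
};

/*
 * Hash each line of the buffer as a cheap stand-in for a real
 * rolling-hash chunking scheme.
 */
static void fingerprint_file(struct file_fp *fp, const char *name,
			     const char *buf)
{
	fp->name = name;
	fp->nr = 0;
	while (*buf && fp->nr < MAX_CHUNKS) {
		unsigned int h = 5381;
		while (*buf && *buf != '\n')
			h = h * 33 + (unsigned char)*buf++;
		if (*buf == '\n')
			buf++;
		fp->hash[fp->nr++] = h;
	}
}

static int cmp_uint(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;
	return (x > y) - (x < y);
}

/*
 * Similarity in percent: fingerprints that "collide" (appear in both
 * files), measured against the larger of the two files.
 */
static int similarity(struct file_fp *a, struct file_fp *b)
{
	int i = 0, j = 0, shared = 0;
	int max_nr = a->nr > b->nr ? a->nr : b->nr;

	qsort(a->hash, a->nr, sizeof(unsigned int), cmp_uint);
	qsort(b->hash, b->nr, sizeof(unsigned int), cmp_uint);
	while (i < a->nr && j < b->nr) {
		if (a->hash[i] < b->hash[j])
			i++;
		else if (a->hash[i] > b->hash[j])
			j++;
		else {
			shared++;
			i++;
			j++;
		}
	}
	return max_nr ? 100 * shared / max_nr : 0;
}

int main(void)
{
	struct file_fp src[2], dst[2];
	int i, j;

	fingerprint_file(&src[0], "deleted/a.c",
			 "int main(void)\n{\n\treturn 0;\n}\n");
	fingerprint_file(&src[1], "deleted/b.txt", "hello\nworld\n");
	fingerprint_file(&dst[0], "added/a.c",
			 "int main(void)\n{\n\treturn 0;\n}\n");
	fingerprint_file(&dst[1], "added/b.txt", "hello\nthere\nworld\n");

	/*
	 * The quadratic part: every deleted (src) file is scored against
	 * every added (dst) file; rename detection then keeps the pairs
	 * whose score clears the similarity threshold.
	 */
	for (i = 0; i < 2; i++)
		for (j = 0; j < 2; j++)
			printf("%s -> %s: %d%% similar\n",
			       src[i].name, dst[j].name,
			       similarity(&src[i], &dst[j]));
	return 0;
}

The point of the sketch is only that everything before the doubly
nested loop at the end is cheap per file; the O(src * dst) loop is the
quadratic matrix being discussed.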