Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames

Jeff King <peff@xxxxxxxx> · Wed, 3 Feb 2021 18:36:59 -0500

On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote:

> >    In an early attempt, I tried to retire rename_src[j], once
> >    rename_dst[i] has been found to be a "good enough" match for it,
> >    from the pool of rename src candidates to find a good match for
> >    rename_dst[k] for i < k, but naive implementation of it would not
> >    work well for obvious reasons---rename_src[j] may match a lot
> >    better with rename_dst[k] than rename_dst[i] but we do not know
> >    that until we try to estimate similarity with rename_dst[k].
> 
> You may really like the next two series I submit.  I have a smarter
> way to find a "good enough" match (comparing to exactly one other file
> and often finding sufficient similarity), and one that'll make
> intuitive sense to users.

Here's a really old thread with an approach that may or may not be
similar to what you're thinking of:

  https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@xxxxxxxxxxxxxx/

Though maybe start with this summary message:

  https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@xxxxxxxxxxxxxx/

I remember experimenting some with it at the time, but never making
things faster. It's entirely possible (likely, even) that I was simply
not doing it well enough.

It's also been long enough since I looked at the rename code that I'm
not sure how different it is in practice. It still has a quadratic
matrix in the end. We basically do a similar strategy of
rolling-hash-fingerprint-and-see-where-things-collide, but I think we
may end up with more work during the quadratic part (again, it's been
a while, so I may just be totally off-base).

I've also wondered if something similar might be helpful for delta
compression (which is now done with an O(m*n) sliding window).

-Peff