On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote: > > In an early attempt, I tried to retire rename_src[j], once > > rename_dst[i] has been found to be a "good enough" match for it, > > from the pool of rename src candidates to find a good match for > > rename_dst[k] for i < k, but naive implementation of it would not > > work well for obvious reasons---rename_src[j] may match a lot > > better with rename_dst[k] than rename_dst[i] but we do not know > > that until we try to estimate similarity with rename_dst[k]. > > You may really like the next two series I submit. I have a smarter > way to find a "good enough" match (comparing to exactly one other file > and often finding sufficient similarity), and one that'll make > intuitive sense to users. Here's a really old thread with an approach that may or may not be similar to what you're thinking of: https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@xxxxxxxxxxxxxx/ Though maybe start with this summary message: https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@xxxxxxxxxxxxxx/ I remember experimenting some with it at the time, but never making things faster. It's entirely possible (likely, even) that I was simply not doing it well enough. It's also been long enough since I looked at the rename code that I'm not sure how different it is in practice. It still has a quadratic matrix in the end. We basically do a similar strategy of rolling-hash-fingerprint-and-see-where-things-collide, but I think we may end up with more work during the quadratic part (again, it's been a while, so I may just be totally off-base). I've also wondered if something similar might be helpful for delta compression (which is now done with an O(m*n) sliding window). -Peff