Elijah Newren <newren@xxxxxxxxx> writes:

> idea is still possible. For example, A.txt could have been compared
> to source/some-module/A.txt. And I don't do anything in the final
> "full matrix" stage to avoid re-comparing those two files again.
> However, it is worth noting that A.txt will have been compared to at
> most one other file, not N files.

Sorry, but where does this "at most one other file" come from? "It
is rare to remove source/some-other-module/A.txt at the same time
while the above is happening"? If so, yes, that sounds like a
sensible thing.

> 1) The most expensive comparison is the first one,...

Yes, we keep the spanhash table across comparisons.

> 2) This would only save us from at most N comparisons in the N x M
> matrix (since no file in this optimization is compared to more than
> one other)

True, but don't rename_src[] and rename_dst[] entries have the
original pathname, where you can see A.txt and some-module/A.txt
share the same filename part cheaply? Is that more expensive than
comparing spanhash tables?

Having asked these, I do think it is not worth pursuing, especially
because I agree with Derrick that this "we see a new file whose name
is the same as the one deleted from a different directory, so if
they are similar enough, let's declare victory and not bother
finding a better match" needs to be used with a higher similarity
bar than the normal one. If -M60 says "only consider pairs with at
least 60% similarity index", finding one at 60% similarity and
stopping at it only because the pair looks like it moves a file from
one directory to another while retaining the same name, rejecting
other pairings, feels like a bit too crude a heuristic.

And if we require higher similarity levels to short-circuit, the
later full-matrix stage won't be helped by "we must have already
rejected" logic. A.txt and some-module/A.txt may not have been
similar enough to short-circuit and reject others in the earlier
part, but the full-matrix part works at a lower bar, which may
consider the pair good enough to keep as match candidates.

Thanks.
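
To make the two-bar concern above concrete, here is a minimal Python
sketch. It is not Git's implementation: difflib.SequenceMatcher stands
in for the spanhash-based similarity estimate, the greedy second stage
is a simplification, and detect_renames, min_score and basename_score
are hypothetical names chosen only for illustration.

```python
import difflib
import os


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; Git uses spanhash tables instead."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def detect_renames(deleted, added, min_score=0.6, basename_score=0.8):
    """Pair deleted paths with added paths (both are path -> content dicts).

    Stage 1 short-circuits on a unique same-basename candidate, but only
    at the stricter basename_score. Stage 2 is a simplified full-matrix
    pass at the normal min_score (think -M60).
    """
    pairs = {}
    remaining = dict(added)

    # Index added paths by basename so candidates are found cheaply.
    by_base = {}
    for dst in added:
        by_base.setdefault(os.path.basename(dst), []).append(dst)

    # Stage 1: same basename, exactly one candidate, strict bar.
    for src, text in deleted.items():
        candidates = by_base.get(os.path.basename(src), [])
        if len(candidates) != 1:
            continue
        dst = candidates[0]
        if dst in remaining and similarity(text, added[dst]) >= basename_score:
            pairs[src] = dst
            del remaining[dst]

    # Stage 2: everything left is compared at the lower bar. A pair that
    # failed the strict bar in stage 1 is scored again here, because that
    # failure says nothing about whether it clears min_score.
    for src, text in deleted.items():
        if src in pairs:
            continue
        best, best_score = None, min_score
        for dst, dst_text in remaining.items():
            score = similarity(text, dst_text)
            if score >= best_score:
                best, best_score = dst, score
        if best is not None:
            pairs[src] = best
            del remaining[best]
    return pairs


old = {"A.txt": "aaaaaaabbb"}
new = {"some-module/A.txt": "aaaaaaaccc", "B.txt": "aaaaaaabbb"}

# Strict bar for the shortcut: the full matrix still finds the best match.
careful = detect_renames(old, new)
# Shortcut at the normal bar: the same-name pair wins and B.txt is ignored.
crude = detect_renames(old, new, basename_score=0.6)
```

In this toy input the same-name pair scores between the two bars, so
`careful` pairs A.txt with the identical B.txt while `crude` stops at
some-module/A.txt. It also shows why a stage-1 rejection cannot be
reused to skip work in stage 2.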