Jeff King <peff@xxxxxxxx> writes: > On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote: > >> - As entries in rename cache that record high scores have names of >> "similar" blobs, pack-objects may be able to take advantage of >> this information. > > Yeah, although I suspect it is not as big a win as you might hope. In > practice, you're going to diff a lot fewer objects than pack-objects > would consider, so most of the pack-objects candidate pairs will not > have a cache entry. Which means that you really need to not rely on the > cache (since it is very likely that the best candidate is still to be > found), and you still do lots of computation. > > You could cache the result of comparisons done by pack-objects, but that > cache ends up being large. You might remember that one of my very first > patches was a cache for recording negative delta finds (so we try to > delta A against B and find that it is not a good match, and record that > information). Even that cache was huge, and we ended up discarding it in > favor of Linus's much more sensible "if two things are in the same pack > and not delta'd, then either they are not a good match, or something > else is in better" heuristic. But one with full-on similarity scores > would be even bigger. Yeah, I forgot about your negative cache. Also the older companion heuristics "if a delta and its base are in an existing pack, use that delta" works well enough to make a positive cache unnecessary (and the "rename similarity cache" can only be a subset of such a cache). C.f. http://thread.gmane.org/gmane.comp.version-control.git/16223/focus=16267 >> - If you declare blobs A and B are similar, it is likely that blobs >> C, D, E, ... that are created by making a series of small tweaks >> to B are also similar. Would it make more sense to introduce a >> concept of "set of similar blobs" instead of recording pairwise >> scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...? If so, >> the body of per-dst loop in diffcore_rename() may become: > > Yes, this is the transitive bit I mentioned elsewhere. I think there are > dragons lurking there, as you end up with the edges of the set not > _really_ being that close to each other. You'd probably want some kind > of weighted association, like if A->B had 80% similarity, and B->C had > 90% similarity, then A->C would be 72%. I recall we discussed the transitivity when we were designing the basic rename thing, and rejected it after we realized that it was unworkable. You may have a blob A, perform a large-ish edit to produce B, and A and B may be similar by only 60%. You then may start from the same A, perform another large-ish edit to produce C, and A and C may be similar by only 55%. Depending on the nature of these two large-ish changes, B and C may be totally dissimilar, or if the changes are more or less in the same "direction", they may be 95% similar. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html