Jeff King <peff@xxxxxxxx> writes:

> @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src,
>  	if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
>  		return 0;
>
> +	hashcpy(pair.one, src->sha1);
> +	hashcpy(pair.two, dst->sha1);
> +	if (rename_cache_get(&pair, &score))
> +		return score;
> +

Random thoughts.

Even though your "rename cache" could be used to reject a pairing to
which the similarity estimator would otherwise give a high score, I
would imagine that in practice people would always use the mechanism
to boost the similarity score of a desired pairing.

This conjecture has a few interesting implications.

 - As we keep track of only the top NUM_CANDIDATE_PER_DST rename src
   candidates for each dst (see record_if_better()), you should be
   able to first check whether pairs with that dst exist in your
   rename cache, iterate over those <src,dst> pairs, and fill m[]
   with the srcs that appear in this particular invocation of diff.

 - If you find NUM_CANDIDATE_PER_DST srcs in your rename cache, you
   would not have to run estimate_similarity() at all, but that is
   very unlikely.  We could however declare that a user-configured
   similarity boost always wins over a computed one, and skip the
   estimation for any dst for which you find an entry in the rename
   cache.

 - As entries in the rename cache that record high scores name
   "similar" blobs, pack-objects may be able to take advantage of
   this information.

 - If you declare that blobs A and B are similar, it is likely that
   blobs C, D, E, ... that are created by making a series of small
   tweaks to B are also similar.  Would it make more sense to
   introduce a concept of a "set of similar blobs" instead of
   recording pairwise scores for (A,B), (A,C), (A,D), ... (B,C),
   (B,D), ...?  If so, the body of the per-dst loop in
   diffcore_rename() may become:

	if (we know where dst came from)
		continue;
	if (dst belongs to a known blob family) {
		for (each src in rename_src[]) {
			if (src belongs to the same blob family as dst)
				record it in m[];
		}
	}
	if (the above didn't record anything in m[]) {
		... existing estimate_similarity() code ...
	}

Regarding your rename-and-tweak-exif photo sets, is the issue that
there are too many rename src/dst candidates and filling a large
matrix takes a lot of time, or that tweaking the exif makes the
contents unnecessarily dissimilar and causes the similarity
detection to fail?

As we still have the pathname in this codepath, I am wondering
whether we would benefit from a custom "content hash" that knows the
nature of the payload better than the built-in similarity estimator
does, driven by the attribute mechanism (if the latter is the case,
that is).
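
To make the "blob family" idea above a bit more concrete, here is a
throwaway, standalone sketch (this is not diffcore code, and every
name in it is invented for illustration) that keeps the families in a
tiny union-find, so that declaring A~B and B~C is enough for A and C
to end up in the same family without ever recording a pairwise score
between them:

    /*
     * Standalone illustration of the "set of similar blobs" idea:
     * instead of recording a score for every pair, record family
     * membership with a small union-find.  The "object names" are
     * plain strings and all helper names are made up.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_BLOBS 64

    static char ids[MAX_BLOBS][41];   /* stand-ins for blob object names */
    static int parent[MAX_BLOBS];     /* union-find forest */
    static int nr_blobs;

    static int blob_index(const char *id)
    {
        int i;
        for (i = 0; i < nr_blobs; i++)
            if (!strcmp(ids[i], id))
                return i;
        if (nr_blobs >= MAX_BLOBS) {
            fprintf(stderr, "too many blobs\n");
            exit(1);
        }
        strncpy(ids[nr_blobs], id, 40);
        ids[nr_blobs][40] = '\0';
        parent[nr_blobs] = nr_blobs;
        return nr_blobs++;
    }

    static int find_family(int i)
    {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];   /* path halving */
            i = parent[i];
        }
        return i;
    }

    /* a hypothetical "declare A and B similar" would boil down to this */
    static void declare_similar(const char *a, const char *b)
    {
        parent[find_family(blob_index(a))] = find_family(blob_index(b));
    }

    static int in_same_family(const char *a, const char *b)
    {
        return find_family(blob_index(a)) == find_family(blob_index(b));
    }

    int main(void)
    {
        declare_similar("blobA", "blobB");
        declare_similar("blobB", "blobC");  /* C is a further tweak of B */

        /* A and C were never declared similar directly, yet share a family */
        printf("A~C: %s\n", in_same_family("blobA", "blobC") ? "yes" : "no");
        printf("A~D: %s\n", in_same_family("blobA", "blobD") ? "yes" : "no");
        return 0;
    }

In diffcore_rename() the per-dst loop would then only need the moral
equivalent of in_same_family(src, dst) to decide whether to record
the pair in m[] before falling back to estimate_similarity().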