Jeff King <peff@xxxxxxxx> writes:

> @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src,
>  	if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
>  		return 0;
>
> +	hashcpy(pair.one, src->sha1);
> +	hashcpy(pair.two, dst->sha1);
> +	if (rename_cache_get(&pair, &score))
> +		return score;
> +

Random thoughts.

Even though your "rename cache" could be used to reject a pairing to
which the similarity estimator would otherwise give a high score, I
would imagine that in practice people would always use the mechanism
to boost the similarity score of a desired pairing.

This conjecture has a few interesting implications.

 - As we keep track of only the top NUM_CANDIDATE_PER_DST rename src
   candidates for each dst (see record_if_better()), you should be
   able to first check whether pairs with that dst exist in your
   rename cache, iterate over those <src,dst> pairs, and fill m[]
   with the srcs that appear in this particular invocation of diff.

 - If you find NUM_CANDIDATE_PER_DST srcs in your rename cache, you
   would not have to run estimate_similarity() at all, but that is
   very unlikely.  We could however declare that a user-configured
   similarity boost always wins over a computed one, and skip the
   estimation for any dst for which you find an entry in the rename
   cache.

 - As entries in the rename cache that record high scores name
   "similar" blobs, pack-objects may be able to take advantage of
   this information.

 - If you declare that blobs A and B are similar, it is likely that
   blobs C, D, E, ... that are created by making a series of small
   tweaks to B are also similar.  Would it make more sense to
   introduce a concept of a "set of similar blobs" instead of
   recording pairwise scores for (A,B), (A,C), (A,D), ... (B,C),
   (B,D), ...?  If so, the body of the per-dst loop in
   diffcore_rename() may become:

	if (we know where dst came from)
		continue;
	if (dst belongs to a known blob family) {
		for (each src in rename_src[]) {
			if (src belongs to the same blob family as dst)
				record it in m[];
		}
	}
	if (the above didn't record anything in m[]) {
		... existing estimate_similarity() code ...
	}

Regarding your rename-and-tweak-exif photo sets, is the issue that
there are too many rename src/dst candidates and filling a large
matrix takes a lot of time, or that tweaking the exif makes the
contents unnecessarily dissimilar and causes the similarity
detection to fail?

As we still have the pathname in this codepath, I am wondering
whether we would benefit from a custom "content hash" that knows the
nature of the payload better than the built-in similarity estimator
does, driven by the attribute mechanism (if the latter is the case,
that is).
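
To make the "blob family" idea above a bit more concrete, here is a
throwaway, standalone sketch (this is not diffcore code, and every
name in it is invented for illustration) that keeps the families in a
tiny union-find, so that declaring A~B and B~C is enough for A and C
to end up in the same family without ever recording a pairwise score
between them:

    /*
     * Standalone illustration of the "set of similar blobs" idea:
     * instead of recording a score for every pair, record family
     * membership with a small union-find.  The "object names" are
     * plain strings and all helper names are made up.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_BLOBS 64

    static char ids[MAX_BLOBS][41];   /* stand-ins for blob object names */
    static int parent[MAX_BLOBS];     /* union-find forest */
    static int nr_blobs;

    static int blob_index(const char *id)
    {
        int i;
        for (i = 0; i < nr_blobs; i++)
            if (!strcmp(ids[i], id))
                return i;
        if (nr_blobs >= MAX_BLOBS) {
            fprintf(stderr, "too many blobs\n");
            exit(1);
        }
        strncpy(ids[nr_blobs], id, 40);
        ids[nr_blobs][40] = '\0';
        parent[nr_blobs] = nr_blobs;
        return nr_blobs++;
    }

    static int find_family(int i)
    {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];   /* path halving */
            i = parent[i];
        }
        return i;
    }

    /* a hypothetical "declare A and B similar" would boil down to this */
    static void declare_similar(const char *a, const char *b)
    {
        parent[find_family(blob_index(a))] = find_family(blob_index(b));
    }

    static int in_same_family(const char *a, const char *b)
    {
        return find_family(blob_index(a)) == find_family(blob_index(b));
    }

    int main(void)
    {
        declare_similar("blobA", "blobB");
        declare_similar("blobB", "blobC");  /* C is a further tweak of B */

        /* A and C were never declared similar directly, yet share a family */
        printf("A~C: %s\n", in_same_family("blobA", "blobC") ? "yes" : "no");
        printf("A~D: %s\n", in_same_family("blobA", "blobD") ? "yes" : "no");
        return 0;
    }

In diffcore_rename() the per-dst loop would then only need the moral
equivalent of in_same_family(src, dst) to decide whether to record
the pair in m[] before falling back to estimate_similarity().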