Re: [WIP PATCH] Manual rename correction

Jeff King <peff@xxxxxxxx> · Wed, 1 Aug 2012 17:54:14 -0400

On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote:

> > @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src,
> >  	if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
> >  		return 0;
> >  
> > +	hashcpy(pair.one, src->sha1);
> > +	hashcpy(pair.two, dst->sha1);
> > +	if (rename_cache_get(&pair, &score))
> > +		return score;
> > +
> 
> Random thoughts.
> 
> Even though your "rename cache" could be used to reject pairing that
> the similarity estimator would otherwise give high score, I would
> imagine that in practice, people would always use the mechanism to
> boost the similarity score of desired pairing.  This conjecture has
> a few interesting implications.

I agree that is the probable use. But low scores might also have a
purpose in preventing useless work from being done (IOW, storing a "0"
score is unlikely to happen manually, but it does let us avoid repeating
the calculation).

As you might have guessed, I am really more interested in caching and
performance gains than in manual overrides.

>  - As we track of only the top NUM_CANDIDATE_PER_DST rename src for
>    each dst (see record_if_better()), you should be able to first
>    see if pairs that have dst exist in your rename cache, and
>    iterate over the <src,dst> pairs, filling m[] with srcs that
>    appear in this particular invocation of diff.

Yes, although I don't know if it makes a big difference. We still run
estimate_similarity on the candidates to see if they make it into
record_if_better, and that is the expensive part. I don't think you want
to change that either (e.g., by assuming that cached scores will always
be in the top N; it would depend on the particular tree that the cached
scores came from).

Of course, that changes if it is not a cache at all, and simply a manual
override. Or if manual overrides are stored separately and take effect
first.

But I do not think you even want to care about NUM_CANDIDATE_PER_DST for
manual overrides. You simply want to take the highest scored one,
consider the it done, and not run estimate_similarity at all. Even if
you only had 1 (and NUM_CANDIDATE_PER_DST were 4), there is still no
point in looking at the others. The manual override should beat it,
regardless of score.

>  - If you find NUM_CANDIDATE_PER_DST srcs from your rename cache,
>    you wouldn't have to run estimate_similarity() at all, but that
>    is very unlikely.  We could however declare that user configured
>    similarity boost always wins computed ones, and skip estimation
>    for a dst for which you find an entry in the rename cache.

Right. See above.

>  - As entries in rename cache that record high scores have names of
>    "similar" blobs, pack-objects may be able to take advantage of
>    this information.

Yeah, although I suspect it is not as big a win as you might hope.  In
practice, you're going to diff a lot fewer objects than pack-objects
would consider, so most of the pack-objects candidate pairs will not
have a cache entry.  Which means that you really need to not rely on the
cache (since it is very likely that the best candidate is still to be
found), and you still do lots of computation.

You could cache the result of comparisons done by pack-objects, but that
cache ends up being large. You might remember that one of my very first
patches was a cache for recording negative delta finds (so we try to
delta A against B and find that it is not a good match, and record that
information). Even that cache was huge, and we ended up discarding it in
favor of Linus's much more sensible "if two things are in the same pack
and not delta'd, then either they are not a good match, or something
else is in better" heuristic. But one with full-on similarity scores
would be even bigger.

>  - If you declare blobs A and B are similar, it is likely that blobs
>    C, D, E, ... that are created by making a series of small tweaks
>    to B are also similar.  Would it make more sense to introduce a
>    concept of "set of similar blobs" instead of recording pairwise
>    scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...?  If so,
>    the body of per-dst loop in diffcore_rename() may become:

Yes, this is the transitive bit I mentioned elsewhere. I think there are
dragons lurking there, as you end up with the edges of the set not
_really_ being that close to each other. You'd probably want some kind
of weighted association, like if A->B had 80% similarity, and B->C had
90% similarity, then A->C would be 72%. But even that is not accurate;
it is just a lower bound. The differences between A->B and B->C might
overlap, and A->C might be much more similar.

You can't know the true value without actually finding the score for
A->C.  And since we output the scores, they should probably be accurate.

> Regarding your rename-and-tweak-exif photo sets, is the issue that
> there are too many rename src/dst candidates and filling a large
> matrix takes a lot of time, or tweaking exif makes the contents
> unnecessarily dissimilar and causes the similarity detection to
> fail?

The former. We do find the renames; it is just that there are a lot of
them (and worse, they are large blobs, so estimate_similarity is slow to
run). I have to use "-l0" just to get the rename to run at all.

> As we still have the pathname in this codepath, I am wondering if we
> would benefit from custom "content hash" that knows the nature of
> payload than the built-in similarity estimator, driven by the
> attribute mechanism (if the latter is the case, that is).

Hmm. Interesting. But I don't think that attributes are a good fit here.
They are pathname based, so how do I apply anything related to
similarity of a particular version by pathname? IOW, how does it apply
in one tree but not another?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html