Re: [WIP PATCH] Manual rename correction

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 01 Aug 2012 22:33:44 -0700

Jeff King <peff@xxxxxxxx> writes:

> On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote:
>
>>  - As entries in rename cache that record high scores have names of
>>    "similar" blobs, pack-objects may be able to take advantage of
>>    this information.
>
> Yeah, although I suspect it is not as big a win as you might hope.  In
> practice, you're going to diff a lot fewer objects than pack-objects
> would consider, so most of the pack-objects candidate pairs will not
> have a cache entry.  Which means that you really need to not rely on the
> cache (since it is very likely that the best candidate is still to be
> found), and you still do lots of computation.
>
> You could cache the result of comparisons done by pack-objects, but that
> cache ends up being large. You might remember that one of my very first
> patches was a cache for recording negative delta finds (so we try to
> delta A against B and find that it is not a good match, and record that
> information). Even that cache was huge, and we ended up discarding it in
> favor of Linus's much more sensible "if two things are in the same pack
> and not delta'd, then either they are not a good match, or something
> else is in better" heuristic. But one with full-on similarity scores
> would be even bigger.

Yeah, I forgot about your negative cache.  Also the older companion
heuristics "if a delta and its base are in an existing pack, use
that delta" works well enough to make a positive cache unnecessary
(and the "rename similarity cache" can only be a subset of such a
cache).

C.f. http://thread.gmane.org/gmane.comp.version-control.git/16223/focus=16267

>>  - If you declare blobs A and B are similar, it is likely that blobs
>>    C, D, E, ... that are created by making a series of small tweaks
>>    to B are also similar.  Would it make more sense to introduce a
>>    concept of "set of similar blobs" instead of recording pairwise
>>    scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...?  If so,
>>    the body of per-dst loop in diffcore_rename() may become:
>
> Yes, this is the transitive bit I mentioned elsewhere. I think there are
> dragons lurking there, as you end up with the edges of the set not
> _really_ being that close to each other. You'd probably want some kind
> of weighted association, like if A->B had 80% similarity, and B->C had
> 90% similarity, then A->C would be 72%.

I recall we discussed the transitivity when we were designing the
basic rename thing, and rejected it after we realized that it was
unworkable.  You may have a blob A, perform a large-ish edit to
produce B, and A and B may be similar by only 60%.  You then may
start from the same A, perform another large-ish edit to produce C,
and A and C may be similar by only 55%.  Depending on the nature of
these two large-ish changes, B and C may be totally dissimilar, or
if the changes are more or less in the same "direction", they may be
95% similar.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html