Re: [WIP PATCH] Manual rename correction

Jeff King <peff@xxxxxxxx> · Tue, 31 Jul 2012 15:23:42 -0400

On Tue, Jul 31, 2012 at 09:32:49AM -0700, Junio C Hamano wrote:

> Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes:
> 
> > The above output is done with "git diff --manual-rename=foo A B"
> > and "foo" contains (probably not in the best format though)
> >
> > -- 8< --
> > attr.c dir.c
> > dir.c attr.c
> > -- 8< --
> > ...
> > Comments?
> 
> It is a good direction to go in, I would think, to give users a way
> to explicitly tell that "in comparison between these two trees, I
> know path B in the postimage corresponds to path A in the preimage".

I do not think that is the right direction. Let's imagine that I have a
commit "A" and I annotate it (via notes or whatever) to say "between
A^^{tree} and A^{tree}, foo.c became bar.c". That will help me when
doing "git show" or "git log". But it will not help me when I later try
to merge "A" (or its descendant). In that case, I will compute the diff
between "A" and the merge-base (or worse, some descendant of "A" and the
merge-base), and I will miss this hint entirely.

A much better hint is to annotate pairs of sha1s, to say "do not bother
doing inexact rename correlation on this pair; I promise that they have
rename score N". Then rename detection will find that pair no matter
which trees or commits are being diffed, and it will do so relatively
inexpensively[1].

That is not fool-proof, of course. You might have a manual rename from
sha1 X to sha1 Y, and then a slight modification to Y to make Z. So you
would want some kind of transitivity to notice that X and Z correlate.
I think you could model it as a graph problem; sha1s are nodes, and each
"this is a rename" pair of annotated sha1s has an edge between them.
They are the "same file" if there is a path.
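
To make the graph model concrete, here is a minimal sketch in Python. It
is not git code; the names build_graph and same_file are made up for
illustration, and it assumes the Y-to-Z step is recorded as an edge just
like the manual X-to-Y rename.

```python
from collections import defaultdict, deque


def build_graph(pairs):
    """Build an undirected graph from annotated (sha1, sha1) rename pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def same_file(graph, src, dst):
    """Return True if a path of rename edges connects src and dst."""
    if src == dst:
        return True
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        # .get() avoids creating empty entries for unknown sha1s
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Placeholder names stand in for real sha1s.
graph = build_graph([("X", "Y"), ("Y", "Z")])
print(same_file(graph, "X", "Z"))
```

A breadth-first search is enough here because the only question is
whether any path exists, not how good it is.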

Of course that gives you bizarre and counter-intuitive results, because
X and Z might not actually be that similar. And that is why we have
rename detection in the first place. The idea of file identity (which
this fundamentally is) leads to these sorts of weird results.

I'm sure you could get better results by weakening the transitivity
according to the rename score, or something like that. But now you are
getting pretty complex.
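
One possible way to weaken the transitivity, sketched in Python. This is
purely an assumption for illustration: each edge carries a rename score
normalized to the range (0, 1], the score of a path is the product of its
edge scores, and two sha1s correlate only if the best path stays above a
threshold. The name best_path_score is hypothetical.

```python
import heapq


def best_path_score(edges, src, dst):
    """Return the highest product of edge scores over any path src..dst.

    edges maps (sha1_a, sha1_b) tuples to a score in (0, 1].
    Returns 0.0 when no path exists.
    """
    graph = {}
    for (a, b), score in edges.items():
        graph.setdefault(a, []).append((b, score))
        graph.setdefault(b, []).append((a, score))

    # Dijkstra variant: scores never exceed 1, so a path's product can
    # only shrink as it grows, and the first pop of dst is optimal.
    best = {src: 1.0}
    heap = [(-1.0, src)]
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if node == dst:
            return score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, ()):
            candidate = score * weight
            if candidate > best.get(nxt, 0.0):
                best[nxt] = candidate
                heapq.heappush(heap, (-candidate, nxt))
    return 0.0


# Placeholder names stand in for real sha1s; scores are invented.
edges = {("X", "Y"): 1.0, ("Y", "Z"): 0.9, ("Z", "W"): 0.5}
threshold = 0.5  # assumed cut-off, not a git default
print(best_path_score(edges, "X", "Z") >= threshold)
print(best_path_score(edges, "X", "W") >= threshold)
```

With these invented scores, X and Z still correlate but X and W fall
below the cut-off, which is the kind of decay the paragraph above hints
at.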

-Peff

[1] We could actually cache rename results by storing pairs of sha1s
    along with their rename score, and should be able to get a good
    speedup (we are still src*dst in comparing, but now the comparison
    is a simple table lookup rather than loading the blobs and computing
    the differences). If we had such a cache, then manually marking a
    rename would just be a matter of priming the cache with your manual
    entries.
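
A rough sketch of such a cache in Python, to show how manual entries
would fit in. The class RenameCache and the compute_score callback are
hypothetical; the callback stands in for the expensive step of loading
two blobs and computing their similarity.

```python
class RenameCache:
    """Map (src_sha1, dst_sha1) pairs to a rename score."""

    def __init__(self, compute_score):
        # compute_score(src, dst) is the slow path, called on a miss.
        self._compute = compute_score
        self._scores = {}

    def prime(self, src, dst, score):
        """Record a manually asserted score for a pair."""
        self._scores[(src, dst)] = score

    def score(self, src, dst):
        """Return the cached score, computing and storing it on a miss."""
        key = (src, dst)
        if key not in self._scores:
            self._scores[key] = self._compute(src, dst)
        return self._scores[key]


def slow_similarity(src, dst):
    # Stand-in for the real blob comparison; the value is invented.
    return 0.1


cache = RenameCache(slow_similarity)
cache.prime("aaa", "bbb", 1.0)  # manual rename: skips the slow path
print(cache.score("aaa", "bbb"))
```

The src*dst loop stays the same; only the per-pair cost changes, since a
primed or previously computed pair never reaches the slow path.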