Re: Following renames

Paul Jakma <paul@xxxxxxxx> · Mon, 27 Mar 2006 07:00:45 +0100 (IST)

On Sun, 26 Mar 2006, Jakub Narebski wrote:

> I think one of the better ideas/suggestions about *recording* filenames was
> in the "impure renames / history tracking" thread
> http://marc.theaimsgroup.com/?l=git&m=114122175216489&w=2
> <Pine.LNX.4.64.0603011343170.13612@xxxxxxxxxxxxxxx>

For the record, the responses I received were educational ;).
Sufficiently so that I no longer think renames should be recorded. At
least, definitely not as renames.

I now grok the reasoning for doing it by 'similarity' - it is indeed
a *much* more useful concept. (E.g. the 'pickaxe' idea people keep
alluding to sounds amazingly useful).

So the question really is what, if any, weaknesses does the current 
similarity estimation have, and how to solve them. I can think of two 
weaknesses:

1. the similarity algorithms can potentially be expensive, and they
   essentially get run a lot with the same inputs, to produce the
   same results - over and over as one works with a git repo. (There
   was a thread on this a while ago, I think.)

2. Some 'similarities' are just not deducible by current software
   state of the art. E.g. where some code is rewritten in another
   language:

	foo.X -> foo.Y

   The high-level algorithms may remain the exact same, but the code
   may be unrecognisable as similar except to a human. However,
   tracking history back across this rewrite probably would still be
   valuable to the human.

So I think what /might/ be interesting is to have a 'similarity
cache', which would help 1, and to allow for manual injection of such
hints (into a separate and stronger cache most likely) - which would
help 2.

Something to record the following information:

(tree1,tree2)[1]:
	Id1 <-> Id1'
	.
	.
	.
	Idn <-> Idn'
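
A minimal sketch of what such a record might look like in memory,
assuming object IDs are plain strings and the estimator is some
function returning a list of (Id, Id') pairs. SimilarityCache,
toy_estimator and the ID strings are hypothetical names for
illustration only, not anything git provides:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types; none of this is git's actual API.
TreePair = Tuple[str, str]        # (tree1, tree2) object IDs
Pairing = List[Tuple[str, str]]   # [(Id1, Id1'), ..., (Idn, Idn')]


class SimilarityCache:
    """Remembers the estimator's result for each (tree1, tree2) key."""

    def __init__(self, estimator: Callable[[str, str], Pairing]) -> None:
        self._estimator = estimator
        self._entries: Dict[TreePair, Pairing] = {}
        self.estimator_calls = 0  # counts how often the slow path runs

    def pairs(self, tree1: str, tree2: str) -> Pairing:
        key = (tree1, tree2)
        if key not in self._entries:
            # Slow path: run the estimation once and keep the result.
            self.estimator_calls += 1
            self._entries[key] = self._estimator(tree1, tree2)
        return self._entries[key]

    def clear(self) -> None:
        # The cache is throw-away: drop it when a better heuristic arrives.
        self._entries.clear()


def toy_estimator(tree1: str, tree2: str) -> Pairing:
    # Stand-in for the real (expensive) similarity estimation.
    return [(f"{tree1}:blob-a", f"{tree2}:blob-a'")]


cache = SimilarityCache(toy_estimator)
first = cache.pairs("t1", "t2")
second = cache.pairs("t1", "t2")  # served from the cache, no re-estimation
```

The second lookup for the same tree pair never reaches the estimator,
which is the whole point of weakness 1 above.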

That would allow:

1. The performance cost of similarity estimation to be paid once and
   cached thereafter. (This is throw-away information: if a better
   similarity estimation heuristic comes along, you can rebuild the
   cache.)

2. The user to inject their own 'hints' into similarity estimation,
   particularly for cases that just aren't obvious and probably never
   will be to software estimators (e.g. the rewrite cases), but where
   the user sees value in being able to follow back the history.

Avoids:

- encoding anything permanently into the repository (which was
  something I was thinking of, and others before me apparently, but
  which I now accept would be an awful idea ;) ).
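
To illustrate how user hints could sit in a "stronger" store next to
the disposable cache, here is a rough sketch. It assumes hints simply
win over cached estimates on lookup, and it simplifies the <->
relation to a one-way old-to-new mapping. HintedSimilarity and all
the ID strings are made up for the example:

```python
from typing import Dict, Optional, Tuple

TreePair = Tuple[str, str]  # hypothetical (tree1, tree2) key


class HintedSimilarity:
    def __init__(self) -> None:
        # Computed by an estimator; safe to throw away and rebuild.
        self.cache: Dict[TreePair, Dict[str, str]] = {}
        # Injected by the user; kept separately and never auto-dropped.
        self.hints: Dict[TreePair, Dict[str, str]] = {}

    def remember(self, tree1: str, tree2: str, old: str, new: str) -> None:
        self.cache.setdefault((tree1, tree2), {})[old] = new

    def add_hint(self, tree1: str, tree2: str, old: str, new: str) -> None:
        self.hints.setdefault((tree1, tree2), {})[old] = new

    def successor(self, tree1: str, tree2: str, old: str) -> Optional[str]:
        key = (tree1, tree2)
        hinted = self.hints.get(key, {}).get(old)
        if hinted is not None:
            return hinted  # a manual hint overrides any estimate
        return self.cache.get(key, {}).get(old)

    def drop_cache(self) -> None:
        # Rebuilding after a heuristic change leaves the hints alone.
        self.cache.clear()


store = HintedSimilarity()
store.remember("t1", "t2", "id-foo.X", "id-other")  # estimator's guess
store.add_hint("t1", "t2", "id-foo.X", "id-foo.Y")  # user knows better
before = store.successor("t1", "t2", "id-foo.X")
store.drop_cache()
after = store.successor("t1", "t2", "id-foo.X")
```

Neither store lives inside the repository objects themselves, so
nothing here is encoded permanently in history.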

[1] I'm not sure whether it should be indexed by (commit ID) or by a
    (tree1,tree2) tuple.

regards,
--
Paul Jakma	paul@xxxxxxxx	paul@xxxxxxxxx	Key ID: 64A2FF6A
Fortune:
Men take only their needs into consideration -- never their abilities.
		-- Napoleon Bonaparte