On Sun, 26 Mar 2006, Jakub Narebski wrote:
> I think one of the better ideas/suggestions about *recording* filenames was
> in the "impure renames / history tracking" thread
> http://marc.theaimsgroup.com/?l=git&m=114122175216489&w=2
> <Pine.LNX.4.64.0603011343170.13612@xxxxxxxxxxxxxxx>
For the record, the responses I received were educational ;).
Sufficiently so that I no longer think renames should be recorded. At
least, definitely not as renames.
I now grok the reasoning for doing it by 'similarity' - it is indeed
a *much* more useful concept. (E.g. the 'pickaxe' idea people keep
alluding to sounds amazingly useful.)
So the question really is what, if any, weaknesses does the current
similarity estimation have, and how to solve them. I can think of two
weaknesses:
1. The similarity algorithms can potentially be expensive, and they
essentially get run many times with the same inputs, producing the
same results, over and over as one works with a git repo (there
was a thread on this a while ago, I think).
2. Some 'similarities' are just not deducible by the current state of
the art in software, e.g. where some code is rewritten in another
language:
foo.X -> foo.Y
The high-level algorithms may remain the exact same, but the code
may be unrecognisable as similar except to a human. However,
tracking history back across this rewrite probably would still be
valuable to the human.
So I think what /might/ be interesting is to have a 'similarity
cache', which would help with 1, and to allow for manual injection of
such hints (most likely into a separate and stronger cache), which
would help with 2.
Something to record the following information:
(tree1,tree2)[1]:
    Id1 <-> Id1'
     .
     .
     .
    Idn <-> Idn'
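To make the shape of that record concrete, here is a minimal sketch in
Python. Everything in it is hypothetical and not part of git: the
SimilarityStore name, the choice of (tree1, tree2) as the key, and the
two-table layout in which user hints take precedence over computed
results.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical key: a pair of tree object IDs. Indexing by commit ID
# instead would only change this alias.
TreePair = Tuple[str, str]


@dataclass
class SimilarityStore:
    # Throw-away results produced by a similarity estimator.
    computed: Dict[TreePair, Dict[str, str]] = field(default_factory=dict)
    # User-injected pairings; the "stronger" cache, consulted first.
    hints: Dict[TreePair, Dict[str, str]] = field(default_factory=dict)

    def record(self, trees: TreePair, old_id: str, new_id: str,
               manual: bool = False) -> None:
        """Store the pairing old_id <-> new_id for the given tree pair."""
        table = self.hints if manual else self.computed
        table.setdefault(trees, {})[old_id] = new_id

    def lookup(self, trees: TreePair, old_id: str) -> Optional[str]:
        """Return the paired ID, preferring manual hints, or None."""
        for table in (self.hints, self.computed):
            new_id = table.get(trees, {}).get(old_id)
            if new_id is not None:
                return new_id
        return None

    def drop_computed(self) -> None:
        """Discard estimator results only; manual hints survive."""
        self.computed.clear()
```

Keeping the hints in their own table means the computed half can be
thrown away and rebuilt without losing anything the user entered by
hand.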
That would allow:
1. The performance cost of similarity estimation to be paid once and
cached thereafter. (This is throw-away information: if a better
similarity estimation heuristic comes along, you can rebuild the
cache.)
2. The user to inject their own 'hints' into similarity estimation,
particularly for cases that just aren't obvious and probably never
will be to software estimators (e.g. the rewrite cases), but where
the user sees value in being able to follow back the history.
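The one-time cost in point 1 could be sketched roughly as below. This is
only an illustration under stated assumptions: difflib's ratio stands in
for the real estimator (it is not git's algorithm), and CachedEstimator,
read_blob and heuristic_version are hypothetical names.

```python
import difflib
from typing import Callable, Dict, Tuple


def stand_in_similarity(a: str, b: str) -> float:
    # Placeholder heuristic, NOT git's similarity estimation.
    return difflib.SequenceMatcher(None, a, b).ratio()


class CachedEstimator:
    def __init__(self, estimate: Callable[[str, str], float],
                 heuristic_version: str) -> None:
        self.estimate = estimate
        self.heuristic_version = heuristic_version
        self.cache: Dict[Tuple[str, str], float] = {}
        self.calls = 0  # how often the expensive estimator actually ran

    def similarity(self, id_a: str, id_b: str,
                   read_blob: Callable[[str], str]) -> float:
        """Score two blobs by ID, running the estimator once per pair."""
        key = (id_a, id_b)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.estimate(read_blob(id_a), read_blob(id_b))
        return self.cache[key]

    def upgrade(self, estimate: Callable[[str, str], float],
                heuristic_version: str) -> None:
        """Switch to a better heuristic and throw the old results away."""
        self.estimate = estimate
        self.heuristic_version = heuristic_version
        self.cache.clear()


# Usage: blob contents are looked up by ID; a dict stands in for the
# object database here.
blobs = {"id1": "int main(void) { return 0; }\n",
         "id2": "int main(void) { return 1; }\n"}
est = CachedEstimator(stand_in_similarity, "v1")
first = est.similarity("id1", "id2", blobs.__getitem__)
second = est.similarity("id1", "id2", blobs.__getitem__)  # served from cache
```

Because blob IDs name content, the same pair of IDs always has the same
inputs, so a cached score stays valid until the heuristic itself
changes.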
Avoids:
- encoding anything permanently into the repository (which was
something I was thinking of, and others before me apparently, but
which I now accept would be an awful idea ;) ).
[1] I'm not sure whether it should be indexed by commit ID or by the
(tree1,tree2) tuple.
regards,
--
Paul Jakma paul@xxxxxxxx paul@xxxxxxxxx Key ID: 64A2FF6A
Fortune:
Men take only their needs into consideration -- never their abilities.
-- Napoleon Bonaparte