This is an idea I have had for quite some time, and I wonder why it's not used in git tools as a general rule. The idea is simple: the git object database has two features that are very interesting for this discussion: its delta-compressed storage, which is _very_ efficient, and the reflog.

I wonder if it would be possible to write some git porcelains (and a builtin API too) that would be more "map" oriented. I mean, we could use a reference as a pointer to a given tree that would be the map (where keys have a path form, which is nice).

When I say that, I'm thinking about git-svn, which even with the recent improvements to its .rev_db's still eats a lot of space with the unhandled.log _and_ the indexes it stores for _each_ svn branch/tag. This way, instead of many:

    foo/index
    foo/.rev_map.6ef976f9-4de5-0310-a40d-91cae572ec18
    foo/unhandled.log

we would just have a special refs/db/git-svn/foo reference that would be a tree with three blobs in it: index, rev_map.xxxx, unhandled.log (or index would probably even be a tree, but that's another matter). This way, all the unhandled.log files, which share a lot of common content, would be nicely compressed by the delta-compression algorithms, with negligible overhead (git-svn is _very_ slow because of svn anyway, so we don't really care if it needs to read a blob's contents instead of opening a flat file).

Another nifty usage we could have is memoization databases that don't require a specific tool to expire them, but use the reflog expiration for that. I remember that quite some time ago we discussed the idea of annotating objects. We could use such annotations to link some objects to memoized data, under a different namespace for each caching scheme involved, and with one reference per namespace that would have in its reflog each of the linked objects created over time for caching. Good candidates for that are the rr-cache, or git-annotate/blame caching.
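To make the refs/db/git-svn/foo layout a bit more concrete, here is a rough Python sketch on top of existing plumbing (`git hash-object`, `git mktree`, `git update-ref`, `git cat-file`). The helper names (`git`, `mktree_input`, `store_map`, `load_value`) are made up for illustration, it only handles flat keys (a key containing "/" would need nested trees), and it is not how git-svn stores anything today.

```python
import subprocess


def git(repo, args, stdin=None):
    """Run one git plumbing command inside `repo`; return its stdout as bytes."""
    proc = subprocess.run(
        ["git", "-C", repo, *args],
        input=stdin,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout


def mktree_input(blob_ids):
    """Format {key: blob id} as input lines for `git mktree` (flat keys only)."""
    return "".join(
        f"100644 blob {oid}\t{key}\n" for key, oid in sorted(blob_ids.items())
    )


def store_map(repo, ref, mapping):
    """Write each value as a blob, collect them in one tree, point `ref` at it."""
    blob_ids = {
        key: git(repo, ["hash-object", "-w", "--stdin"], value).decode().strip()
        for key, value in mapping.items()
    }
    tree = git(repo, ["mktree"], mktree_input(blob_ids).encode()).decode().strip()
    # Refs outside the usual namespaces get no reflog by default, so ask for one.
    git(repo, ["update-ref", "--create-reflog", ref, tree])
    return tree


def load_value(repo, ref, key):
    """Read one value of the map back through the reference."""
    return git(repo, ["cat-file", "blob", f"{ref}:{key}"])
```

Something like `store_map(".", "refs/db/git-svn/foo", {"index": b"...", "unhandled.log": b"..."})` would then replace the three flat files, and each later call leaves the previous tree reachable from the reflog of that reference until it expires.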
Of course, that would require writing a tool that removes weak annotations pointing to objects that no longer exist. We could also cache the rename/copies/… detection results, and make those really, really cheap to use[0].

I know that some will say something about hammers, problems and nails, but it would allow us to develop quite efficient tools with a generic and easy-to-use API that could directly benefit from the infrastructure already existing in git. I mean, it's silly to write yet-another cache expirer when you have the reflog. Or, to speak about git-svn again, it could even version its state per branch the way I propose: it would end up using less disk than it does now, with the immediate gain that it would be fully clone-able[1] (which would be a _really_ nice feature).

So am I having crazy thoughts and should I throw my crack-pipe away? Or do parts of this mumbling make any sense to someone?

PS: It's late, and I'm tired, hence my English is probably very clumsy, and I hope I'm understandable enough. I'd be glad to rephrase the parts that need it.

[0] and if the copy/rename/… detection algorithm gets smarter, we just need to change its memoization namespace to throw the old cache away at once.

[1] and the really nice part here is that even if you don't create one new step per svn revision you import, but do macro-steps with hundreds of svn revisions at a time, the merge of two different git-svn states from two clones of the _same_ svn repository has a trivial exact merge: the one that knows the biggest svn revision is the new state to use.

-- 
·O· Pierre Habouzit
··O madcoder@xxxxxxxxxx
OOO http://www.madism.org