Linus Torvalds wrote:
On Sat, 10 Jun 2006, Rogan Dawes wrote:
Here's an idea. How about separating trees and commits from the actual blobs
(e.g. in separate packs)? My reasoning is that the commits and trees should
only be a small portion of the overall repository size, and should not be that
expensive to transfer. (Of course, this is only a guess, and needs some
numbers to back it up.)
The trees in particular are actually a pretty big part of the history.
More importantly, the blobs compress horribly badly in the absence of
history - a _lot_ of the compression in git packing comes very much from
the fact that we do a good job at delta-compression.
So if you get all of the commit/tree history, but none of the blob
history, you're actually not going to win that much space. As already
discussed, the _whole_ history packed with git is usually not insanely
bigger than just the whole unpacked tree (with no history at all).
So you'd think that getting just the top version of the tree would be a
much bigger space-saving than it actually is. If you _also_ get all the
tree and commit objects, the space saving is even less.
One possibility, given that the full commit and tree history is so
large, is simply to get the HEAD commit and the trees that the commit
depends directly on, rather than fetching them all up front.
I actually suspect that the most realistic way to handle this is to use
the "fetch.c" logic (ie the incremental fetcher used by http), and add
some mode to the git daemon where you fetch literally one object at a time
(ie this would be totally _separate_ from the pack-file thing: you'd not
ask for "git-upload-pack", you'd ask for something like
"git-serve-objects" instead).
The fetch.c logic really does allow for on-demand object fetching, and is
thus much more suitable for incomplete repositories.
HOWEVER. The fetch.c logic - by necessity - works on an object-by-object
level. That means that you'd get no delta compression AT ALL, and I
suspect that the downside of that would be a factor of ten expansion or
more, which means that it would really not work that well in practice.
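To get a rough feel for the cost of losing cross-object compression, here is a minimal sketch. It does not use git's delta format: it uses plain zlib, and compressing several versions as one stream stands in for delta compression, since zlib can then refer back to the previous version. The file contents and the helper names (make_versions, size_one_at_a_time, size_as_one_stream) are made up for illustration, and the ratio it produces says nothing about the factor of ten estimated above.

```python
import random
import zlib


def make_versions(n_versions=20, n_lines=200, seed=0):
    """Build successive versions of a hypothetical file, each changing one line."""
    rng = random.Random(seed)
    lines = ["line %d: %d" % (i, rng.getrandbits(64)) for i in range(n_lines)]
    versions = []
    for v in range(n_versions):
        # Overwrite one randomly chosen line, as a small commit would.
        lines[rng.randrange(n_lines)] = "changed in v%d: %d" % (v, rng.getrandbits(64))
        versions.append("\n".join(lines).encode())
    return versions


def size_one_at_a_time(objects):
    """Total size when every object is compressed on its own (no shared history)."""
    return sum(len(zlib.compress(obj)) for obj in objects)


def size_as_one_stream(objects):
    """Size when all objects share one zlib stream, a crude stand-in for deltas."""
    return len(zlib.compress(b"".join(objects)))


if __name__ == "__main__":
    versions = make_versions()
    print("one object at a time:", size_one_at_a_time(versions), "bytes")
    print("as a single stream:  ", size_as_one_stream(versions), "bytes")
```

Each version here is small enough to fit in zlib's 32 KB window, which is why the single stream can reuse the previous version. Real pack files find similar objects explicitly instead of relying on adjacency.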
Would it be possible to add a mode where fetch.c is given a list of
desired objects, and returns a list of pointers to those objects? Then
callers that already have such a list could be modified to pass the
whole list at once, allowing at least SOME compression, and optimisation
of round trips, etc? There would be a tradeoff in memory use, though, I
guess.
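A minimal sketch of the batched mode asked about above, assuming an in-memory stand-in for the server. Nothing here is an existing git interface: ObjectServer, serve_one, serve_many and fetch_many are hypothetical names, the ids are a plain SHA-1 of the raw bytes (not git's real object ids, which hash a type/size header as well), and the length-prefixed framing is invented for the example.

```python
import hashlib
import zlib


class ObjectServer:
    """Hypothetical in-memory object store that counts round trips."""

    def __init__(self):
        self.store = {}
        self.requests = 0

    def add(self, data):
        # Simplified id: SHA-1 of the raw bytes only.
        oid = hashlib.sha1(data).hexdigest()
        self.store[oid] = data
        return oid

    def serve_one(self, oid):
        """Object-by-object mode: one request and one zlib stream per object."""
        self.requests += 1
        return zlib.compress(self.store[oid])

    def serve_many(self, oids):
        """Batched mode: one request, all objects framed into one zlib stream."""
        self.requests += 1
        payload = b"".join(
            len(self.store[oid]).to_bytes(4, "big") + self.store[oid]
            for oid in oids
        )
        return zlib.compress(payload)


def fetch_many(server, oids):
    """Ask for a whole list of ids at once and return {id: data}."""
    raw = zlib.decompress(server.serve_many(oids))
    objects, pos = {}, 0
    for oid in oids:
        size = int.from_bytes(raw[pos:pos + 4], "big")
        pos += 4
        objects[oid] = raw[pos:pos + size]
        pos += size
    return objects
```

Note that fetch_many holds the whole decompressed batch in memory before splitting it, which is the memory tradeoff mentioned above. A caller that only knows one id at a time gains nothing from this mode.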
Rogan