Re: Continue git clone after interruption

Nicolas Pitre <nico@xxxxxxx> · Tue, 18 Aug 2009 13:56:16 -0400 (EDT)

On Tue, 18 Aug 2009, Tomasz Kontusz wrote:

> Ok, so it looks like it's not implementable without some kind of cache
> server-side, so the server would know what the pack it was sending
> looked like.
> But here's my idea: make server send objects in different order (the
> newest commit + whatever it points to first, then next one,then
> another...). Then it would be possible to look at what we got, tell
> server we have nothing, and want [the newest commit that was not
> complete]. I know the reason why it is sorted the way it is, but I think
> that the way data is stored after clone is clients problem, so the
> client should reorganize packs the way it wants.

That won't buy you much.  You should realize that a pack is made of:

1) Commit objects.  Yes they're all put together at the front of the pack,
   but they roughly are the equivalent of:

	git log --pretty=raw | gzip | wc -c

   For the Linux repo as of now that is around 32 MB.

2) Tree andblob objects.  Those are the bulk of the content for the top 
   commit.  The top commit is usually not delta compressed because we 
   want fast access to the top commit, and that is used as the base for 
   further delta compression for older commits.  So the very first 
   commit is whole at the front of the pack right after the commit 
   objects.  you can estimate the size of this data with:

	git archive --format=tar HEAD | gzip | wc -c

   On the same Linux repo this is currently 75 MB.

3) Delta objects.  Those are making the rest of the pack, plus a couple 
   tree/blob objects that were not found in the top commit and are 
   different enough from any object in that top commit not to be 
   represented as deltas.  Still, the majority of objects for all the 
   remaining commits are delta objects.

So... if we reorder objects, all that we can do is to spread commit 
objects around so that the objects referenced by one commit are all seen 
before another commit object is included.  That would cut on that 
initial 32 MB.

However you still have to get that 75 MB in order to at least be able to 
look at _one_ commit.  So you've only reduced your critical download 
size from 107 MB to 75 MB.  This is some improvement, of course, but not 
worth the bother IMHO.  If we're to have restartable clone, it has to 
work for any size.

And that's where the real problem is.  I don't think having servers to 
cache pack results for every fetch requests is sensible as that would be 
an immediate DoS attack vector.

And because the object order in a pack is not defined by the protocol, 
we cannot expect the server to necessarily always provide the same 
object order either.  For example, it is already undefined in which 
order you'll receive objects as threaded delta search is non 
deterministic and two identical fetch requests may end up with slightly 
different packing.  Or load balancing may redirect your fetch requests 
to different git servers which might have different versions of zlib, or 
even git itself, affecting the object packing order and/or size.

Now... What _could_ be done, though, is some extension to the 
git-archive command.  One thing that is well and strictly defined in git 
is the file path sort order.  So given a commit SHA1, you should always 
get the same files in the same order from git-archive.  For an initial 
clone, git could attempt fetching the top commit using the remote 
git-archive service and locally reconstruct that top commit that way.  
if the transfer is interrupted in the middle, then the remote 
git-archive could be told how to resume the transfer by telling it how 
many files and how many bytes in the current file to skip.  This way the 
server doesn't need to perform any sort of caching and remains 
stateless.

You then end up with a pretty shallow repository.  The clone process 
could then fall back to the traditional native git transfer protocol to 
deepen the history of that shallow repository.  And then that special 
packing sort order to distribute commit objects would make sense since 
each commit would then have a fairly small set of new objects, and most 
of them would be deltas anyway, making the data size per commit really 
small and any interrupted transfer much less of an issue.

Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html