Re: Continue git clone after interruption

Nicolas Pitre <nico@xxxxxxx> · Tue, 18 Aug 2009 16:01:52 -0400 (EDT)

On Tue, 18 Aug 2009, Jakub Narebski wrote:

> Nicolas Pitre <nico@xxxxxxx> writes:
> 
> > On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> > 
> > > Ok, so it looks like it's not implementable without some kind of cache
> > > server-side, so the server would know what the pack it was sending
> > > looked like.
> > > But here's my idea: make server send objects in different order (the
> > > newest commit + whatever it points to first, then next one, then
> > > another...). Then it would be possible to look at what we got, tell
> > > server we have nothing, and want [the newest commit that was not
> > > complete]. I know the reason why it is sorted the way it is, but I think
> > > that the way data is stored after clone is the client's problem, so the
> > > client should reorganize packs the way it wants.
> > 
> > That won't buy you much.  You should realize that a pack is made of:
> > 
> > 1) Commit objects.  Yes they're all put together at the front of the pack,
> >    but they roughly are the equivalent of:
> > 
> > 	git log --pretty=raw | gzip | wc -c
> > 
> >    For the Linux repo as of now that is around 32 MB.
> 
> For my clone of Git repository this gives 3.8 MB
>  
> > 2) Tree and blob objects.  Those are the bulk of the content for the top 
> >    commit.  The top commit is usually not delta compressed because we 
> >    want fast access to the top commit, and that is used as the base for 
> >    further delta compression for older commits.  So the very first 
> >    commit is whole at the front of the pack right after the commit 
> > >    objects.  You can estimate the size of this data with:
> > 
> > 	git archive --format=tar HEAD | gzip | wc -c
> > 
> >    On the same Linux repo this is currently 75 MB.
> 
> On the same Git repository this gives 2.5 MB

Interesting to see that the commit history is larger than the latest 
source tree.  Probably that would be the same with the Linux kernel as 
well if all versions since the beginning with adequate commit logs were 
included in the repo.
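
For anyone who wants to repeat the two measurements on their own clone, 
here is a minimal Python sketch of the same pipelines.  The helper names 
(gzipped_size, gzipped_output_size) are made up for illustration, and 
the whole command output is held in memory, so this is only suitable for 
modest repositories.  Python's gzip output will differ by a few bytes 
from the command-line tool, so treat the numbers as estimates.

```python
import gzip
import subprocess
import sys
from typing import Sequence


def gzipped_size(data: bytes) -> int:
    """Byte count after gzip, roughly `... | gzip | wc -c` (level 6, gzip's default)."""
    return len(gzip.compress(data, compresslevel=6))


def gzipped_output_size(cmd: Sequence[str], cwd: str = ".") -> int:
    """Run cmd in cwd and return the gzipped size of its standard output."""
    result = subprocess.run(list(cmd), cwd=cwd, check=True, stdout=subprocess.PIPE)
    return gzipped_size(result.stdout)


# The two commands quoted in the thread.
COMMITS_CMD = ["git", "log", "--pretty=raw"]
TOP_TREE_CMD = ["git", "archive", "--format=tar", "HEAD"]

if __name__ == "__main__" and len(sys.argv) > 1:
    repo = sys.argv[1]  # path to a local clone, given on the command line
    for label, cmd in (("commits", COMMITS_CMD), ("top tree", TOP_TREE_CMD)):
        size_mb = gzipped_output_size(cmd, cwd=repo) / 1e6
        print(f"{label}: {size_mb:.1f} MB gzipped")
```

Run it with the path to a clone as the only argument.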

> > 3) Delta objects.  Those are making the rest of the pack, plus a couple 
> >    tree/blob objects that were not found in the top commit and are 
> >    different enough from any object in that top commit not to be 
> >    represented as deltas.  Still, the majority of objects for all the 
> >    remaining commits are delta objects.
> 
> You forgot that delta chains are bound by pack.depth limit, which
> defaults to 50.  You would have then additional full objects.

Sure, but that's probably not significant.  The delta chain depth is 
limited, but not the width.  A given base object can have unlimited 
delta "children", and so on at each depth level.
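
To make the depth/width distinction concrete, here is a toy Python model 
(not how git stores packs): a dict maps each delta object to its base, 
and the chain depth is the number of hops back to a full object.  The 
object names and the two example layouts are invented for illustration.

```python
def chain_depth(obj, base_of):
    """Number of delta hops from obj back to a full (non-delta) object."""
    depth = 0
    while obj in base_of:
        obj = base_of[obj]
        depth += 1
    return depth


def max_depth(base_of):
    """Deepest delta chain in the mapping."""
    return max(chain_depth(obj, base_of) for obj in base_of)


# Wide: 1000 deltas that all use the same full object as their base.
wide = {f"delta{i}": "base" for i in range(1000)}

# Deep: a single chain d50 -> d49 -> ... -> d1 -> base.
deep = {f"d{i}": (f"d{i - 1}" if i > 1 else "base") for i in range(1, 51)}

print(len(wide), "deltas, max depth", max_depth(wide))
print(len(deep), "deltas, max depth", max_depth(deep))
```

The wide layout holds twenty times as many deltas as the deep one while 
staying at depth 1, which is why a depth limit alone forces few extra 
full objects.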

> The single packfile for this (just gc'ed) Git repository is 37 MB.
> Much more than 3.8 MB + 2.5 MB = 6.3 MB.

What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to 
be occupied by deltas.
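
Spelled out in Python with the figures quoted above (the variable names 
are mine, and the result is only an upper bound on the delta share, since 
the remainder also holds some non-delta objects):

```python
commits_mb = 3.8    # gzipped commit log, from the thread
top_tree_mb = 2.5   # gzipped archive of the top commit, from the thread
pack_mb = 37.0      # size of the gc'ed packfile, from the thread

remainder_mb = pack_mb - (commits_mb + top_tree_mb)
remainder_share = remainder_mb / pack_mb

print(f"{remainder_mb:.1f} MB left over, {remainder_share:.0%} of the pack")
```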

> [cut]
> 
> There is another way which we can go to implement resumable clone.
> Let git first try to clone the whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?).

We currently fail.  Seems that no one ever had a problem with that so 
far. We'd have to split the pack stream into multiple packs on the 
receiving end.  But frankly, if you have a repository large enough to 
bust your filesystem's file size limit then maybe you should seriously 
reconsider your choice of development environment.

> If it fails, the client first asks for the first half of the
> repository (half as in bisect, but it is the server that has to
> calculate it).  If that downloads, it will ask the server for the rest
> of the repository.  If it fails, it would halve the size again, and
> ask for 1/4 of the repository in a packfile first.

People with slow links won't be helped at all by this.  
What if the network connection breaks only after 49% of the 
transfer, and that took 3 hours to download?  You'll then attempt a 25% 
transfer, which would take about 1.5 hours, even though you already 
spent that much time downloading that first 1/4 of the repository.  
And what if you're unlucky and the network craps out on 
you after 23% of that second attempt?
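
A rough Python model of that scenario, under stated assumptions: 
transfer time is proportional to the fraction of the repository sent, 
a broken attempt leaves nothing usable behind, and the 49% break 
happened during a request for half the repository.  The function name 
and the (requested, received) encoding are invented for this sketch.

```python
def wasted_hours(attempts, hours_per_full_clone):
    """Return (hours spent on broken attempts, fraction of the repo kept).

    attempts is a list of (requested_fraction, received_fraction) pairs,
    both expressed as fractions of the whole repository.  An attempt that
    received less than it requested is treated as a total loss.
    """
    wasted = 0.0
    kept = 0.0
    for requested, received in attempts:
        hours = received * hours_per_full_clone
        if received < requested:
            wasted += hours      # connection broke: nothing can be reused
        else:
            kept += requested    # attempt completed
    return wasted, kept


# 49% of the repository took 3 hours, so a full clone is about 6.1 hours.
HOURS_PER_FULL = 3.0 / 0.49

# Half requested, broken at 49%; then a quarter requested, broken at 23%.
scenario = [(0.50, 0.49), (0.25, 0.23)]
wasted, kept = wasted_hours(scenario, HOURS_PER_FULL)
print(f"{wasted:.1f} hours spent, {kept:.0%} of the repository kept")
```

Under these assumptions more than four hours go by with nothing to show 
for it, and the next attempt would be for an even smaller slice.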

I think it is better to "prime" the repository with the content of the 
top commit in the most straightforward manner, using git-archive, which 
has the potential to be fully restartable at any point with little 
complexity on the server side.
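
Why that can be restartable is easiest to see in a toy model.  The 
Python sketch below is not git's protocol and uses no git code.  It 
assumes the server can regenerate byte-identical archive data for the 
same commit and start sending from any offset, so the only state needed 
is the byte count the client already has.  make_server, serve and 
download are hypothetical names.

```python
def make_server(archive: bytes):
    """Stateless toy server: returns up to `limit` bytes starting at `offset`."""
    def serve(offset: int, limit: int) -> bytes:
        return archive[offset:offset + limit]
    return serve


def download(serve, received: bytearray, total_size: int,
             chunk_size: int, fail_after_chunks=None) -> bool:
    """Append to `received` until complete; optionally simulate a broken link.

    Returns True when the whole archive has been received, False if the
    simulated connection dropped first.  Whatever was received is kept.
    """
    chunks = 0
    while len(received) < total_size:
        if fail_after_chunks is not None and chunks >= fail_after_chunks:
            return False
        # Resume point is simply how many bytes we already hold.
        received.extend(serve(len(received), chunk_size))
        chunks += 1
    return True


archive = bytes(range(256)) * 40            # stand-in for the archive stream
serve = make_server(archive)
received = bytearray()

first_ok = download(serve, received, len(archive), 1000, fail_after_chunks=3)
bytes_after_break = len(received)           # partial data survives the break
second_ok = download(serve, received, len(archive), 1000)
print(first_ok, bytes_after_break, second_ok, bytes(received) == archive)
```

The second call picks up where the first one stopped, and the server 
keeps no per-client cache.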

Nicolas