Re: Continue git clone after interruption

On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>> Nicolas Pitre <nico@xxxxxxx> writes:

>>> That won't buy you much.  You should realize that a pack is made of:
>>> 
>>> 1) Commit objects.  Yes they're all put together at the front of the pack,
>>>    but they roughly are the equivalent of:
>>> 
>>> 	git log --pretty=raw | gzip | wc -c
>>> 
>>>    For the Linux repo as of now that is around 32 MB.
>> 
>> For my clone of Git repository this gives 3.8 MB
>>  
>>> 2) Tree and blob objects.  Those are the bulk of the content for the top 
>>>    commit. [...]  You can estimate the size of this data with:
>>> 
>>> 	git archive --format=tar HEAD | gzip | wc -c
>>> 
>>>    On the same Linux repo this is currently 75 MB.
>> 
>> On the same Git repository this gives 2.5 MB
> 
> Interesting to see that the commit history is larger than the latest 
> source tree.  Probably that would be the same with the Linux kernel as 
> well if all versions since the beginning with adequate commit logs were 
> included in the repo.

Note that using reflogs and/or a patch management interface like StGit,
and frequently reworking commits (e.g. with rebase), means more commit
objects in the repository.

Also, the Git repository has 3 independent branches ('man', 'html' and
'todo') whose objects are not included in "git archive HEAD".

> 
>>> 3) Delta objects.  Those are making the rest of the pack, plus a couple 
>>>    tree/blob objects that were not found in the top commit and are 
>>>    different enough from any object in that top commit not to be 
>>>    represented as deltas.  Still, the majority of objects for all the 
>>>    remaining commits are delta objects.
>> 
>> You forgot that delta chains are bound by pack.depth limit, which
>> defaults to 50.  You would have then additional full objects.
> 
> Sure, but that's probably not significant.  The delta chain depth is 
> limited, but not the width.  A given base object can have unlimited 
> delta "children", and so on at each depth level.

You can probably get the number of delta and non-delta (base) objects in
the packfile, and the space each kind takes, somehow.  Neither
"git verify-pack -v <packfile>" nor contrib/stats/packinfo.pl helped me
arrive at this data.
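
For what it's worth, here is a rough sketch of how such numbers might be
derived.  It assumes the "git verify-pack -v" output format described in
its documentation: five fields (SHA-1, type, size, size-in-packfile,
offset) for a non-delta object, plus two more (depth, base SHA-1) for a
delta.  The function name pack_delta_stats is made up for illustration.

```shell
# Summarize "git verify-pack -v" output read from stdin.
# Assumption: 5 fields = non-delta object, 7 fields = delta object,
# and field 4 is the object's size inside the packfile.
pack_delta_stats() {
    awk '
        $2 ~ /^(commit|tree|blob|tag)$/ && NF == 5 { base_n++;  base_bytes  += $4 }
        $2 ~ /^(commit|tree|blob|tag)$/ && NF == 7 { delta_n++; delta_bytes += $4 }
        END {
            printf "non-delta: %d objects, %d bytes\n", base_n, base_bytes
            printf "delta: %d objects, %d bytes\n", delta_n, delta_bytes
        }
    '
}

# Possible usage inside a repository (not run here):
#   git verify-pack -v .git/objects/pack/pack-*.idx | pack_delta_stats
```

The trailing summary lines that verify-pack prints ("non delta: ...",
"chain length = ...") are skipped because their second field is not an
object type.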

>> The single packfile for this (just gc'ed) Git repository is 37 MB.
>> Much more than 3.8 MB + 2.5 MB = 6.3 MB.
> 
> What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to 
> be occupied by deltas.

True.
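
To make the arithmetic above repeatable, here is a small sketch.  The
helper name pack_remainder is hypothetical, and the byte counts are just
the rounded figures quoted in this thread, so treat the result as a rough
estimate only.

```shell
# Hypothetical helper: bytes of the pack not accounted for by the gzip'ed
# commit log and the gzip'ed top-commit snapshot, plus that share in percent
# (integer division, so the percentage is rounded down).
pack_remainder() {
    commits=$1 snapshot=$2 pack=$3
    rest=$((pack - commits - snapshot))
    echo "$rest bytes, $((rest * 100 / pack))% of pack"
}

# In a real repository the inputs could come from something like:
#   commits=$(git log --pretty=raw | gzip | wc -c)
#   snapshot=$(git archive --format=tar HEAD | gzip | wc -c)
#   pack=$(wc -c < .git/objects/pack/pack-*.pack)   # assumes a single pack
pack_remainder 3800000 2500000 37000000
```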
 
>> [cut]
>> 
>> There is another way we could implement resumable clone.
>> Let git first try to clone the whole repository (single pack; BTW what
>> happens if this pack is larger than the file size limit for the given
>> filesystem?).
> 
> We currently fail.  Seems that no one ever had a problem with that so 
> far. We'd have to split the pack stream into multiple packs on the 
> receiving end.  But frankly, if you have a repository large enough to 
> bust your filesystem's file size limit then maybe you should seriously 
> reconsider your choice of development environment.

Do we fail gracefully (with an error message), or does git crash then?

If I remember correctly FAT28^W FAT32 has a maximum file size of 4 GB
(2^32 - 1 bytes), and some implementations stop at 2 GB.  FAT is often
used on flash storage, e.g. on USB drives.  Although if you have a
packfile that large, you are doing something wrong, or UGFWIINI (Using
Git For What It Is Not Intended).
 
>> If it fails, the client first asks for the first half of the
>> repository (half as in bisect, but it is the server that has to
>> calculate it).  If that downloads, it will ask the server for the rest
>> of the repository.  If it fails, it would halve the size again, and
>> ask for 1/4 of the repository in a packfile first.
> 
> The problem people with slow links have won't be helped at all by this.  
> What if the network connection gets broken only after 49% of the 
> transfer and that took 3 hours to download?  You'll attempt a 25% size 
> transfer which would take 1.5 hours despite the fact that you already 
> spent that much time downloading that first 1/4 of the repository.  
> And yet what if you're unlucky and now the network craps on 
> you after 23% of that second attempt?

A modification then.

First try an ordinary clone.  If it fails because the network is
unreliable, check how much we did download, and ask the server for a
packfile of slightly smaller size; this means that we are asking the
server for an approximate pack size limit, not for a bisect-like
partitioning of the revision list.
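
A toy sketch of the client-side policy, just to illustrate the idea.
Everything here is an assumption: the names next_pack_limit and
simulate_clone, the 90% factor standing in for "slightly smaller", and
above all a server that can honor an approximate size limit, which does
not exist today.  The simulation also assumes every limited request
succeeds.

```shell
# Hypothetical policy: after a failed attempt that received $1 bytes,
# ask for a pack limited to 90% of that amount (the 90% is an assumption).
next_pack_limit() {
    received=$1
    echo $((received * 9 / 10))
}

# Toy simulation: the full clone of $1 bytes broke after $2 bytes, so the
# client keeps requesting packs of at most next_pack_limit bytes until the
# whole repository has been transferred.
simulate_clone() {
    total=$1 drop_after=$2
    remaining=$total requests=0
    limit=$(next_pack_limit "$drop_after")
    while [ "$remaining" -gt 0 ]; do
        chunk=$limit
        [ "$chunk" -gt "$remaining" ] && chunk=$remaining
        remaining=$((remaining - chunk))
        requests=$((requests + 1))
    done
    echo "$requests limited requests after the failed full clone"
}

# Example: 37 MB repository, link dropped after about 20 MB.
simulate_clone 37000000 20000000
```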

> I think it is better to "prime" the repository with the content of the 
> top commit in the most straightforward manner using git-archive which 
> has the potential to be fully restartable at any point with little 
> complexity on the server side.

But doesn't that make only a 2.5 MB part of the 37 MB packfile fully
restartable?

A question about pack protocol negotiation.  If the client presents some
objects as "have", the server can and does assume that the client has all
prerequisites for such objects: for a tree object, all objects for the
files and directories inside that tree; for a commit, all its ancestors
and all objects in its snapshot (the top tree and its prerequisites).
Do I understand this correctly?

If we have a partial packfile whose download was interrupted, can we
extract some full objects (including blobs) from it?  Can we pass
tree and blob objects as "have" to the server, and are they taken into
account?  Perhaps instead of a separate step of resumably downloading the
objects of the top commit (the snapshot), we can tell the server what we
did download in full?


BTW, because of compression it might be more difficult to resume
archive creation in the middle, I think...

-- 
Jakub Narebski
Poland