On Tue, 18 Aug 2009, Jakub Narebski wrote:

> Nicolas Pitre <nico@xxxxxxx> writes:
>
> > On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> >
> > > Ok, so it looks like it's not implementable without some kind of cache
> > > server-side, so the server would know what the pack it was sending
> > > looked like.
> > > But here's my idea: make server send objects in different order (the
> > > newest commit + whatever it points to first, then next one, then
> > > another...). Then it would be possible to look at what we got, tell
> > > server we have nothing, and want [the newest commit that was not
> > > complete]. I know the reason why it is sorted the way it is, but I think
> > > that the way data is stored after clone is clients problem, so the
> > > client should reorganize packs the way it wants.
> >
> > That won't buy you much. You should realize that a pack is made of:
> >
> > 1) Commit objects. Yes they're all put together at the front of the pack,
> >    but they roughly are the equivalent of:
> >
> >        git log --pretty=raw | gzip | wc -c
> >
> >    For the Linux repo as of now that is around 32 MB.
>
> For my clone of Git repository this gives 3.8 MB
>
> > 2) Tree and blob objects. Those are the bulk of the content for the top
> >    commit. The top commit is usually not delta compressed because we
> >    want fast access to the top commit, and that is used as the base for
> >    further delta compression for older commits. So the very first
> >    commit is whole at the front of the pack right after the commit
> >    objects. you can estimate the size of this data with:
> >
> >        git archive --format=tar HEAD | gzip | wc -c
> >
> >    On the same Linux repo this is currently 75 MB.
>
> On the same Git repository this gives 2.5 MB

Interesting to see that the commit history is larger than the latest
source tree.  Probably that would be the same with the Linux kernel as
well if all versions since the beginning with adequate commit logs were
included in the repo.

> > 3) Delta objects.
> >    Those are making the rest of the pack, plus a couple
> >    tree/blob objects that were not found in the top commit and are
> >    different enough from any object in that top commit not to be
> >    represented as deltas. Still, the majority of objects for all the
> >    remaining commits are delta objects.
>
> You forgot that delta chains are bound by pack.depth limit, which
> defaults to 50. You would have then additional full objects.

Sure, but that's probably not significant.  The delta chain depth is
limited, but not the width.  A given base object can have unlimited delta
"children", and so on at each depth level.

> The single packfile for this (just gc'ed) Git repository is 37 MB.
> Much more than 3.8 MB + 2.5 MB = 6.3 MB.

What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to
be occupied by deltas.

> [cut]
>
> There is another way which we can go to implement resumable clone.
> Let's git first try to clone whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?).

We currently fail.  It seems that no one has ever had a problem with that
so far.  We'd have to split the pack stream into multiple packs on the
receiving end.  But frankly, if you have a repository large enough to
bust your filesystem's file size limit then maybe you should seriously
reconsider your choice of development environment.

> If it fails, client ask first for first half of
> repository (half as in bisect, but it is server that has to calculate
> it). If it downloads, it will ask server for the rest of repository.
> If it fails, it would reduce size in half again, and ask about 1/4 of
> repository in packfile first.

The problem that people with slow links have won't be helped at all by
this.  What if the network connection gets broken only after 49% of the
transfer, and that took 3 hours to download?
You'll attempt a 25% size transfer, which would take 1.5 hours even
though you already spent that much time downloading that first 1/4 of
the repository.  And what if you're unlucky and the network craps out
on you after 23% of that second attempt?

I think it is better to "prime" the repository with the content of the
top commit in the most straightforward manner using git-archive, which
has the potential to be fully restartable at any point with little
complexity on the server side.


Nicolas
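
The cost of the halving scheme in the 49%-after-3-hours example can be
worked out with a few lines of shell arithmetic.  This is only a sketch:
the helper name minutes_for is hypothetical, the link speed is assumed to
be constant, no further failures are assumed, and the halving scheme is
assumed to discard the partial pack (as discussed earlier in the thread).

```shell
#!/bin/sh
# Hypothetical helper: minutes needed to fetch $3 percent of the repository,
# given that $2 percent took $1 minutes (integer math, rounds down).
minutes_for() {
    echo $(( $1 * $3 / $2 ))
}

spent=180   # minutes used before the link dropped (3 hours)
got=49      # percent of the pack received by then

# Halving scheme (assumption: the partial pack is discarded), so the whole
# repository still has to be fetched again, in smaller pieces.
halving_total=$(( spent + $(minutes_for "$spent" "$got" 100) ))

# Restartable transfer: only the missing part is fetched.
resume_total=$(( spent + $(minutes_for "$spent" "$got" $(( 100 - got ))) ))

echo "first 25% retry alone: $(minutes_for "$spent" "$got" 25) min"
echo "halving scheme total:  ${halving_total} min"
echo "restartable total:     ${resume_total} min"
```

The difference between the two totals is the time already spent before
the failure, which is exactly what a restartable transfer would save.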