On Thu, 2 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Thu, Sep 2, 2010 at 9:45 PM, Jakub Narebski <jnareb@xxxxxxxxx> wrote:
>
> > If I remember the discussion stalled (i.e. no working implementation),
> > and one of the latest proposals was to have some way of recovering
> > objects from partially downloaded file, and a way to request packfile
> > without objects that got already downloaded.
>
> oo. ouch. i can understand why things stalled, then. you're
> effectively adding an extra layer in, and even if you could add a
> unique naming scheme on those objects (if one doesn't already exist?),
> those object might (or might not!) come up the second time round (for
> reasons mentioned already - threads resulting in different deltas
> being picked etc.) ... and if they weren't picked for the re-generated
> pack, you'd have to _delete_ them from the receiving end so as to
> avoid polluting the recipient's object store haaarrgh *spit*, *cough*.

Well, actually there is no need to delete anything.  Git can cope with
duplicated objects just fine.  A subsequent gc will get rid of the
duplicates automatically.

> what _might_ work however iiiiIiis... to split the pack-object into
> two parts. or, to add an "extra part", to be more precise:
>
>  a) complete list of all objects. _just_ the list of objects.
>  b) existing pack-object format/structure.
>
> in this way, the sender having done all the hard work already of
> determining what objects are to go into a pack-object, transfers that
> *first*. _theeen_ you begin transferring the pack-object. theeeen,
> if the pack-object transfer is ever interrupted, you simply send back
> that list of objects, and ask "uhh, you know that list of objects we
> were talking about? well, here it is *splat* - are you able to
> recreate the pack-object from that, for me, and if so please gimme
> again"

Well, it isn't that simple.

First, a resumable clone is useful only when there is a big transfer
in play.  Otherwise it isn't worth the trouble.  So, if the clone is
big, then this list of objects can be in the millions.  For example,
my Linux kernel repo with a couple of branches currently has:

$ git rev-list --all --objects | wc -l
2808136

So that's 2808136 objects, and with a 20-byte SHA1 for each of them
you already have a 54 MB object list to transfer.  This is a
significant overhead that we prefer to avoid, given that the actual
pack transfer is:

$ git pack-objects --all --stdout --progress < /dev/null | wc -c
Counting objects: 2808136, done.
Compressing objects: 100% (384219/384219), done.
645201934
Total 2808136 (delta 2422420), reused 2788225 (delta 2402700)

The output from wc is 645201934 bytes = 615 MB for this repository.
Hence the list of objects alone is quite significant.  And even then,
what if the transfer crashes during the object list transfer itself?
On a flaky connection this can easily happen within 54 MB.

> and, 10^N-1 times out of 10^N, for reasons that shawn kindly
> explained, i bet you the answer would be "yes".

For the list of objects, sure.  But that isn't a big deal.  It is easy
enough to tell the remote about the commits we already have and ask
for the rest.  With a commit SHA1, the remote can figure out all the
objects we have.  But everything hinges on determining the latest
commit we have.  If we get a partial pack, it should be possible to
salvage as many objects as we can from it, and to determine what top
commit(s) they correspond to.
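For illustration, a rough sketch of that salvage step (untested, and
the pack file name here is made up): git unpack-objects already has a
recovery flag that keeps going past a corruption instead of dying on
it, and git fsck can then report which of the salvaged commits dangle:

$ # unpack whatever complete objects made it across the wire;
$ # -r = best-effort recovery from a corrupt/truncated pack
$ git unpack-objects -r < partial-transfer.pack
$ # dangling commits reported here are candidates for the
$ # "latest commit we have"
$ git fsck

The salvaged objects end up loose in the object store, so a resumed
fetch that sends some of them again would merely create duplicates for
gc to clean up, as noted above.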
It is possible to set up your local repo just as if you had requested
a shallow clone, and then the resume would simply be a deepening of
that shallow clone.  But usually the very first commit in a pack is
huge, as it typically isn't delta compressed (a delta chain has to
start somewhere), and that first commit will roughly represent the
same size as a tarball of that commit.  If you don't get at least that
first commit in its entirety then you are screwed.  And if you don't
get a complete second commit when deepening the clone, you are still
screwed.

Another issue is what to do with objects that are themselves huge.

Yet another issue: what to do with all those objects I've got in my
partial pack but can't connect to any commit yet.  We don't want them
transferred again, but it isn't easy to tell the remote about them.

You could tell the remote: "I have this pack for this commit from this
commit, but I got only this amount of bytes from it, so please resume
the transfer here."  But as mentioned before, the pack stream is not
deterministic, and we really don't want to make it single-threaded on
a server.  Furthermore, this is a lot of work for the server: even if
the pack stream were deterministic, the server would still have to
recreate the first part of the pack just to throw it away until the
desired offset is reached.  And caching pack results has all sorts of
implications we've preferred to avoid on a server, for security
reasons (better to keep serving operations read-only).

> ... um... in fact... um... i believe i'm merely talking about the .idx
> index file, aren't i? because... um... the index file contains the
> list of object refs in the pack, yes?

In one pack, yes.  But you might have multiple packs.  And that
doesn't mean that all the objects from a pack are relevant to the
actual branches you are willing to export.

> sooo.... taking a wild guess, here: if you were to parse the .idx file
> and extract the list of object-refs, and then pass that to "git
> pack-objects --window=0 --delta=0", would you end up with the exact
> same pack file, because you'd forced git pack-objects to only return
> that specific list of object-refs?

If you do this, i.e. turn off delta compression, then the 615 MB pack
above will turn itself into a multi-gigabyte one!


Nicolas
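P.S. On the .idx question: the list of object names in a pack index
can be extracted without writing any parser, since git show-index
reads an .idx file on stdin and prints one line per object (a rough
sketch; the pack file name here is invented):

$ git show-index < .git/objects/pack/pack-1234abcd.idx | awk '{print $2}'

Each line of show-index output starts with the pack offset followed by
the object's SHA1, so the awk keeps just the SHA1s.  As said above,
though, that covers only one pack, and may include objects irrelevant
to the branches being exported.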
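P.P.S. To put a number on that last point (note that the actual
pack-objects option is --depth; there is no --delta flag), the earlier
measurement can be redone with delta search disabled; expect the byte
count to come out several times larger than the 645201934 above
(untested sketch):

$ git pack-objects --all --stdout --window=0 --depth=0 \
	< /dev/null | wc -c

With --window=0 no delta candidates are ever considered, so every
object is stored whole (still zlib-deflated, but not deltified), which
is where the multi-gigabyte figure comes from.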