On Tue, 24 Feb 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> >
> > I (and Nicolas) by 'sorting order' mean here ordering of objects and
> > deltas in the pack file, i.e. whether we get _exactly_ the same (byte
> > for byte) packfile for the same want/have exchange (your proposal), or
> > even for the same arguments to git-pack-objects (which is a necessary,
> > although I think not sufficient condition).
>
> I know.
>
> My proposal though didn't require the same byte-for-byte pack file.
> Only that the objects were in a predictable order.  It didn't permit
> resuming in the middle of an object.  If the last object in the pack
> was truncated the client would resume by getting that object again,
> and may get a different byte sequence for that object representation.

Ah, so you meant skipping the first N _objects_, and not the first N
_bytes_ of a re-generated pack.  That's better.

Although in the case where packfiles are cached, I think you could
support resuming at a byte offset.  But I guess only in that case,
where exactly the same byte-for-byte packfile is re-sent / reused.

> Its a b**ch to know where you stopped though, as you could be in
> a long string of deltas whose base is in the portion you didn't
> yet receive.  Which means you can't identify that string that you
> already have, and pack-objects on resume can't assume you have
> those objects, because you only have the deltas for them and are
> lacking a way to restore them.

Moreover, from what I understand, the want/have exchange is about
_commits_, and it assumes that if you 'have' a commit, you have all
its ancestors, all trees (including those of the ancestors), and all
blobs (including those of the ancestors), in full: not just a delta
without its base.  Besides, if I remember correctly, we always write
a base before its delta; or am I mistaken here?

But one could take a look at the patches (present in the git mailing
list archive) which tried to add 'lazy clone' / 'remote alternates'
support.
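The "have implies ancestors" assumption above is just a reachability
closure over the commit graph.  A toy sketch (hypothetical commit
names, not real git code or data):

```python
# Toy illustration of the want/have assumption: a client that "has"
# a commit is assumed to have everything reachable from it, in full
# (commits plus their trees and blobs), never a bare delta.
# The parent mapping below is hypothetical.
parents = {"C": ["B"], "B": ["A"], "A": []}

def reachable(commit):
    """Return the set of commits reachable from `commit`, inclusive."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

# A single "have C" line lets the server assume the client
# already has A and B as well, and everything they reference.
print(sorted(reachable("C")))
```

This is exactly why a partial string of deltas is useless to the
exchange: "have" speaks in whole reachable closures, not in bytes.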
IIRC there was a 'haveonly' extension to the exchange protocol, which
was meant to say that you have (in full) only the given object, but
not necessarily its prerequisites.  Then you could filter those
'haveonly' objects out of the list of objects fed to git-pack-objects,
couldn't you?

> > Can we assume that packfiles are named correctly (i.e. the name of
> > a packfile matches its SHA-1 footer)?
>
> Wrong.
>
> The hash in "pack-$hash.pack"/"pack-$hash.idx" is *not* the 20 byte
> SHA-1 footer.  Its the 20 byte SHA-1 of the sorted object names who
> are in that pack.
>
> We should try not to assume that the pack's file name matches the
> sorted object names, but we can assume that the pack file name is
> "pack-$hash.pack" where $hash is a 40 character hexadecimal string.
> The dumb commit walkers already have this restriction built into
> them, and have for quite some time.
>
> Any pack writers, including fast-import, honor this naming standard
> in order to ensure they are compatible with the existing dumb
> commit walkers.

Ah.  So it is a _bit_ harder (for "dumb" protocols) than I thought.
Still much easier than resumable clone for the smart (pack
generating) protocols.

> > Therefore I think that restartable clone for "dumb" (commit walker)
> > protocols is an easy GSoC project, while restartable clone for
> > "smart" (generate packfile) protocols is at least of medium
> > difficulty, and might be harder.
>
> Probably quite right.  Unfortunately the majority of the git
> repositories out there are served with the smart protocol, because
> it is more efficient.  :)

A long, long time ago the rsync:// protocol was recommended for the
initial clone.  It has the serious disadvantage of possibly returning
a silently corrupted repository, as it didn't ensure that references
and objects were fetched in the correct sequence; it is thus
deprecated, and support for it has bit-rotted ;) in places...  I
wonder if it is possible to make rsync:// more robust...

[...]
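As an aside, the naming rule Shawn describes can be sketched in a few
lines.  This is only my reading of what he wrote, so treat it as an
assumption: $hash is the SHA-1 over the pack's object names, sorted,
fed to the hash in binary (20-byte) form:

```python
import hashlib

def pack_file_name(object_ids):
    """Sketch: derive "pack-$hash.pack" from the pack's object names.

    object_ids: 40-char hex SHA-1 strings of every object in the pack.
    Assumption (not checked against git source): $hash is the SHA-1 of
    the sorted object names, hashed in binary form.
    """
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(bytes.fromhex(oid))
    return "pack-%s.pack" % h.hexdigest()

# The two hex IDs below are just sample SHA-1 values for illustration.
ids = ["da39a3ee5e6b4b0d3255bfef95601890afd80709",
       "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"]
print(pack_file_name(ids))
```

If the rule really is this, the name depends only on *which* objects
are in the pack, not on entry order, delta choices, or compression,
which is why the name cannot be assumed to match the pack's trailing
checksum.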
> > I'll try to add a 'pack file cache for git-daemon' proposal to the
> > GSoC2009Ideas page... but I cannot be a mentor (or even co-mentor)
> > for this idea.
>
> The pack file cache project is likely easier than restarting a
> pack file.  Especially in the face of the threaded delta code.
>
> There are difficult details about making the cache secure so we can't
> overwrite repository data due to a buffer overflow.  Or making
> the cache prune itself so it doesn't run out of disk.  Etc.
> We've talked about a cache before on list.

Well, this is a _cache_.  On the one hand, having a pack cache would
make it easy to have resumable clone, if on resume you hit one of the
cached packfiles...  On the other hand, I wonder what improvement it
would give, as generating packs with delta reuse is, I think, quite
fast...

-- 
Jakub Narebski
Poland