Re: git pack/unpack over bittorrent - works!

replying now that i have a bit more time.

On Fri, Sep 3, 2010 at 1:29 AM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
>> pack, you'd have to _delete_ them from the receiving end so as to
>> avoid polluting the recipient's object store haaarrgh *spit*, *cough*.
>
> Well, actually there is no need to delete anything.  Git can cope with
> duplicated objects just fine.  A subsequent gc will get rid of the
> duplicates automatically.

 excellent.  good to hear.

>>  what _might_ work however iiiiIiis... to split the pack-object into
>> two parts.  or, to add an "extra part", to be more precise:
>>
>> a) complete list of all objects.  _just_ the list of objects.
>> b) existing pack-object format/structure.
>>
>> in this way, the sender having done all the hard work already of
>> determining what objects are to go into a pack-object, transfers that
>> *first*.  _theeen_ you begin transferring the pack-object.  theeeen,
>> if the pack-object transfer is ever interrupted, you simply send back
>> that list of objects, and ask "uhh, you know that list of objects we
>> were talking about?  well, here it is *splat* - are you able to
>> recreate the pack-object from that, for me, and if so please gimme
>> again"
>
> Well, it isn't that simple.
>
> First, a resumable clone is useful only when there is a big transfer in
> play.  Otherwise it isn't worth the trouble.
>
> So, if the clone is big, then this list of objects can be in the
> millions.  For example my linux kernel repo with a couple branches
> currently has:
>
> $ git rev-list --all --objects | wc -l
> 2808136
>
> So, 2808136 objects, with a 20-byte SHA1 for each of them, and you have a
> 54 MB object list to transfer already.

 ok:

 a) that's fine.  first time, you have to do that, you have to do that.

 b) i have some ideas in mind, to say things like "i already have the
following objects up to here, please give me a list of everything
since then".

> And even then, what if the transfer crashes during that object list
> transfer?  On flaky connections this might happen within 54 MB.

 that's fine: i envisage the object list being cached at the remote
end (by the first seed), and also being a "shared file", such that
there may even be complete copies of that "file" out there already,
making resumption a non-issue.
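
 to make that concrete, a rough sketch of what the first seed might
do.  the filename and the name-the-file-by-its-own-hash scheme are
purely illustrative, my own invention:

$ # generate the full object list once...
$ git rev-list --all --objects | cut -d' ' -f1 | sort > objects.list
$ # ...then name it by its own SHA1, so every peer can refer to the
$ # identical "shared file" without ambiguity
$ mv objects.list "$(sha1sum objects.list | cut -d' ' -f1).list"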

>> and, 10^N-1 times out of 10^N, for reasons that shawn kindly
>> explained, i bet you the answer would be "yes".
>
> For the list of objects, sure.  But that isn't a big deal.  It is easy
> enough to tell the remote about the commits we already have and ask for
> the rest.  With a commit SHA1, the remote can figure out all the objects
> we have.  But everything hinges on determining the latest commit we
> have.  If we get a partial pack, it is possible to salvage as many
> objects as possible from it and determine what top commit(s) they
> correspond to.
> It is possible to set your local repo just as if you had requested a
> shallow clone and then the resume would simply be a deepening of that
> shallow clone.

 i'll need to re-read this when i have more time.  apologies.
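
 (one quick note for that re-read: if i understand the shallow-clone
suggestion right, the resume might look roughly like the below.  the
URL and depths are made up, and the real salvage step - reconstructing
.git/shallow from whatever top commit survives the partial pack - is
hand-waved entirely.)

$ # treat the salvaged state as if it were a shallow clone...
$ git clone --depth=1 git://example.org/repo.git repo
$ cd repo
$ # ...then "resume" is just deepening the shallow history
$ git fetch --depth=999999 origin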

> Another issue is what to do with objects that are themselves huge.

 that's fine, too: in fact, that's the perfect scenario where a
file-sharing protocol excels.

> Yet another issue: what to do with all those objects I've got in my
> partial pack, but that I can't connect to any commit yet.  We don't want
> them transferred again but it isn't easy to tell the remote about them.
>
> You could tell the remote: "I have this pack for this commit from this
> commit but I got only this amount of bytes from it, please resume
> transfer here."

 ok, i have a couple of ideas/thoughts:

a) one of which was to send the commit index list back to the remote
end, but that would be baaaaad as it could be 54mb as you say, so it
would be necessary to say "here is the SHA1 of the index file you gave
me earlier - do you still have it? if so, please can we resume"
(sketch below).

b) as long as _somebody_ in the file-sharing network has a complete
copy of that pack, distributed throughout the swarm, "resume transfer
here" isn't needed - the concept is moot.  the bittorrent protocol
covers the concept of "resume" very, very easily.
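
 in shell terms, (a) amounts to sending 40 hex digits back instead of
54mb.  a sketch, reusing the objects.list naming from the earlier
snippet:

$ # receiver side: re-identify the cached list purely by its hash...
$ sha1sum objects.list | cut -d' ' -f1
$ # ...and send just that back: "do you still have this one?"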

> But as mentioned before the pack stream is not
> deterministic,

 on a one-off basis, it is; and even then, i believe it could be made
to "not matter".  so you ask a server for a pack object and get a
different SHA-1?  so what: you just make that part of the
file-sharing-network unique key - {ref}-{objref}-{SHA-1} instead of
just {ref}-{objref}.  if the connection's lost, wow, big deal: you
just ask again and end up with a different SHA-1.  you're back to a
situation which is no different from, and no less efficient than, the
present http transfer system.

 ... but i'd rather avoid this scenario, if possible.
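
 if it did come to that, though, constructing the key is trivial.  a
sketch - $REF and $OBJREF are placeholders of mine, and the pack's own
trailing checksum supplies the {SHA-1} part:

$ # a v2 pack file ends with a 20-byte SHA-1 over everything before it;
$ # hex-encode that trailer and use it in the swarm key
$ PACKSUM=$(tail -c 20 repo.pack | xxd -p -c 20)  # "repo.pack" is illustrative
$ KEY="${REF}-${OBJREF}-${PACKSUM}"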

> and we really don't want to make it single-threaded on a
> server.  Furthermore this is a lot of work for the server as even if the
> pack stream is deterministic, then the server still has to recreate the
> first part of the pack just to throw it away until the desired offset is
> reached.  And caching pack results also has all sorts of implications
> we've preferred to avoid on a server for security reasons (better keep
> serving operations read-only).

 i've already thrown out the idea of caching the pack objects
themselves, but am still exploring the concept of caching the .idx
file, even for short periods of time.

 so the server does a lot of work creating that .idx file, but it
contains the complete list of all objects, which you _could_ obtain
again simply by asking explicitly for each and every single one of
those objects - no more, no less, no deltas, no windows: the list,
the whole list and nothing but the list.
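
 incidentally, getting the list back out of a cached .idx looks like a
one-liner.  a sketch ($IDX is a placeholder for the cached pack-*.idx
path):

$ # git show-index reads a pack index on stdin and prints
$ # "offset sha1 crc"; field 2 is the object name
$ git show-index < "$IDX" | awk '{print $2}' > objects.list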

>> ... um... in fact... um... i believe i'm merely talking about the .idx
>> index file, aren't i?  because... um... the index file contains the
>> list of object refs in the pack, yes?
>
> In one pack, yes.  You might have multiple packs.  And that doesn't mean
> that all the objects from a pack are relevant to the actual branches
> you are willing to export.

 yes that's fine.  multiple packs are considered to be independent
files of the "VFS layer" in the file-sharing network.  that's taken
care of.  what i need to know is: can you recreate a pack object given
the list of objects in its .idx file?

>> sooo.... taking a wild guess, here: if you were to parse the .idx file
>> and extract the list of object-refs, and then pass that to "git
>> pack-objects --window=0 --depth=0", would you end up with the exact
>> same pack file, because you'd forced git pack-objects to only return
>> that specific list of object-refs?
>
> If you do this, i.e. turn off delta compression, then the 615 MB
> repository above will turn itself into a multi-gigabyte pack!

 ok this was covered in my previous post, hope it's clearer.  perhaps
"git pack-objects --window=0 --depth=0 <
{list-of-objects-extracted-from-the-idx-file}" isn't the way to
achieve what i envisage - if not, does anyone have any ideas on how
a pack containing exactly the list of objects previously given by a
.idx file can be recreated?
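
 concretely, the experiment i have in mind is something like this -
whether the result is byte-identical to the original pack is exactly
the open question (untested, $IDX as above):

$ # feed the .idx's exact object list straight back into pack-objects,
$ # with delta search disabled
$ git show-index < "$IDX" | awk '{print $2}' \
      | git pack-objects --window=0 --depth=0 regenerated
$ # on success, pack-objects prints the new pack's hash and writes
$ # regenerated-<hash>.pack and regenerated-<hash>.idx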

l.