Re: git pack/unpack over bittorrent - works!

On Sat, Sep 4, 2010 at 6:23 AM, Kyle Moffett <kyle@xxxxxxxxxxxxxxx> wrote:
> So I think the real problem to solve would be:  Given a collection of
> local computers each with many local repositories, what is the best
> way to optimize a clone of a "new" remote repository (over a slow
> link) by copying most of the data from other local repositories
> accessible via a fast link?

 the most immediate algorithm that occurs to me as ideal would be -
rsync!  whilst i had the privilege, ten years ago, of listening to
tridge describe rsync in detail, and so am aware that it is
parallelisable (which was the whole point of his thesis), a) i'm not
sure how to _distribute_ it, and b) i wouldn't have a clue where to
start!

so, i believe that a much simpler algorithm is to follow nicolas' advice, and:

* split up a pack-index file by its fanout (1st byte of SHAs in the idx)
* create SHA1s of the list of object-refs within an individual fanout
* compare the per-fanout SHA1s, remote against local
* if they match, deduce "oh look, we have that per-fanout list already"
* if they differ, grab that per-fanout object-ref list using standard p2p filesharing

in this way you'd end up breaking e.g. the 50mb pack-index of
linux-2.6.git down into chunks of roughly 200k each, and you'd
exchange roughly 50k of network traffic to find out that you already
had some of those fanout object-ref-lists.  which is nice.

(see Documentation/technical/pack-format.txt, "Pack Idx File", for a
description of fanouts; according to gitdb/pack.py a fanout entry is
keyed on just the 1st byte of the SHA1s it points to)

> The goal would be to design a P2P protocol capable of rapidly and
> efficiently building distributed searchable indexes of ordered commits
> that identify which peer(s) contain each commit.

 yyyyup.
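
 just to check we're picturing the same thing: the very simplest form
of that index might look something like the below.  class name, method
names and peer addresses are all made up - it's not a protocol
proposal, just the shape of the data.

    from collections import defaultdict

    class CommitPeerIndex(object):
        def __init__(self):
            # commit sha1 (hex) -> set of peers claiming to have it
            self.peers_by_commit = defaultdict(set)

        def advertise(self, peer, commit_shas):
            # a peer announces the commits it can serve
            for sha in commit_shas:
                self.peers_by_commit[sha].add(peer)

        def peers_with(self, sha):
            # who could we fetch this commit from?
            return sorted(self.peers_by_commit.get(sha, ()))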

> When you attempt to perform a "git fetch --peer" from a repository, it
> would quickly connect to a few of the metadata index nodes in the P2P
> network and use them to negotiate "have"s with the upstream server.
> The client would then sequentially perform the local "fetch"
> operations necessary to obtain all the objects it used to minimize the
> commit range with the server.  Once all of those "fetch" operations
> completed, it could proceed to fetch objects from the server normally.

 why stop at fetching objects only from the server?  why not have
the objects themselves distributed as well?  after all, if one peer
has just gone to all the trouble of getting an object, surely it can
share it too?

 or am i misunderstanding what you're describing?
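
 i.e. something with roughly this shape, reusing the made-up
CommitPeerIndex sketch from above.  fetch_from(peer, sha) is a
hypothetical stand-in for "get this object from that peer", not a real
git call.

    def fetch_object(sha, index, origin, fetch_from):
        obj = None
        # try every peer that advertises the object first
        for peer in index.peers_with(sha):
            obj = fetch_from(peer, sha)
            if obj is not None:
                break
        if obj is None:
            obj = fetch_from(origin, sha)        # slow link, last resort
        index.advertise("this-host:9418", [sha])  # now we can share it too
        return obj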

> Some amount of design and benchmarking would need to be done in order
> to figure out the most efficient indexing algorithm for finding a
> minimal set of "have"s of potentially thousands of refs, many with
> independent root commits.  For example if the index was grouped
> according to "root commit" (of which there may be more than one), you
> *should* be able to quickly ask the server about a small list of root
> commits and then only continue asking about commits whose roots are
> all known to the server.

 intuitively i follow what you're saying.
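
 to make it concrete for myself: grouping refs by their root commits
can be done with plain git plumbing, something like the sketch below
(the grouping structure itself is just illustrative).

    import subprocess
    from collections import defaultdict

    def root_commits(ref):
        # a root commit shows up in `git rev-list --parents` as a line
        # containing a single sha (no parents listed after it)
        out = subprocess.check_output(["git", "rev-list", "--parents", ref])
        return set(line.split()[0]
                   for line in out.decode().splitlines()
                   if len(line.split()) == 1)

    def refs_by_root(refs):
        # root commit -> refs whose history contains that root; ask the
        # server about the (small) set of roots first, and only keep
        # negotiating over refs whose roots the server already knows
        grouped = defaultdict(set)
        for ref in refs:
            for root in root_commits(ref):
                grouped[root].add(ref)
        return grouped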

> The actual P2P software would probably involve 2 different daemon
> processes.  The first would communicate with each other and with the
> repositories, maintaining the ref and commit indexes.  These daemons
> would advertise themselves with Avahi,

 NO.

 ok.  more to the point: if you want to waste time forcing people to
install a pile of shite called d-bus just so that they can use git, go
ahead.

can we gloss quickly over the mention of avahi as that *sweet-voice*
most delightful be-all and solve-all solution *normal-voice* and move
on?

> or alternatively in an
> enterprise environment they would be managed by your sysadmins and be
> automatically discovered using DNS-SD.

 and what about on the public, hostile internet?  no - i feel that the
basis should be something that is already proven, something that has
had at least ten years to prove itself.  deviating from that basis is
fine - at least then there isn't a massive amount of change required
for re-implementors to write a compatible version (in c, or *shudder*
java).

> The second daemon would be a modified git-daemon that connects to the
> advertised "index" daemons and advertises its own refs and commit
> lists, as well as its IP address and port.

 yes, there are definitely two distinct purposes.  i'm not sure it's
necessary - or a good idea - to split the two out into separate
daemons, though, because you may wish to use a single daemon to serve
multiple git repositories, just as a single bittorrent client can use
one port to share multiple torrents comprising multiple files.

 if you start from the basis of splitting things out, then you have a
bit of a headache on your hands wrt publishing multiple git repos.
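
 something along these lines is what i'm picturing - one port, many
repos, the repo picked out by name, much as git-daemon does with
--base-path and much as one bittorrent client serves many torrents.
the handler and its request/reply lines are made-up stubs, not a real
transfer protocol.

    import os
    import socketserver

    REPO_BASE = "/srv/git"   # e.g. /srv/git/linux-2.6.git, /srv/git/git.git

    class ManyRepoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # first line of the (made-up) request just names the repository
            name = self.rfile.readline().strip().decode()
            path = os.path.join(REPO_BASE, os.path.basename(name))
            if not os.path.isdir(path):
                self.wfile.write(b"ERR no such repository\n")
                return
            # ... from here on, serve index chunks / objects out of `path` ...
            self.wfile.write(b"OK " + path.encode() + b"\n")

    if __name__ == "__main__":
        # one port (9419 picked arbitrarily), any number of repos underneath
        socketserver.TCPServer(("", 9419), ManyRepoHandler).serve_forever()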

l.