Re: git pack/unpack over bittorrent - works!

Ted,

I think your "canonical pack" idea has value, but I'd be inclined to
try to optimize more for the "common case" of developing on a fast
local network with many local checkouts, where you occasionally
push/fetch external sources via a slow link.

Specifically, let's look at the very reasonable scenario of a
developer working over a slow DSL or dialup connection.  He's probably
got many copies of various Git repositories cloned all over the place
(hey, disk is cheap!), but right now he just wants a fresh clean copy
of somebody else's new tree with whatever its 3 feature branches are.
Furthermore, he's probably even got 80% of the commit objects from
that tree archived in his last clone from linux-next.

In theory he could very carefully arrange his repositories with
judicious use of alternate object directories.  From personal
experience, though, such arrangements are *VERY* prone to accidentally
purging wanted objects, unless you *never* ever delete a branch in the
"reference" repository.

So I think the real problem to solve would be:  Given a collection of
local computers each with many local repositories, what is the best
way to optimize a clone of a "new" remote repository (over a slow
link) by copying most of the data from other local repositories
accessible via a fast link?

The goal would be to design a P2P protocol capable of rapidly and
efficiently building distributed, searchable indexes of ordered commits
that identify which peer(s) contain each commit.
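
To make that concrete, the index could boil down to something like the
following toy sketch (Python, all names invented): peers advertise the
commits they can serve, and clients ask which nearby peers hold a given
commit.

from collections import defaultdict

class CommitIndex:
    """Toy, in-memory stand-in for the index a metadata node might keep."""

    def __init__(self):
        # commit sha1 -> set of peer addresses ("host:port") claiming to have it
        self.holders = defaultdict(set)

    def advertise(self, peer, commits):
        """A peer announces commits (e.g. its ref tips) that it can serve."""
        for sha in commits:
            self.holders[sha].add(peer)

    def peers_for(self, sha):
        """Which nearby peers could serve this commit, if any?"""
        return self.holders.get(sha, set())

idx = CommitIndex()
idx.advertise("10.0.0.5:9418", ["1234abcd", "cafef00d"])
print(idx.peers_for("1234abcd"))     # {'10.0.0.5:9418'}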

When you attempt to perform a "git fetch --peer" from a repository, the
client would quickly connect to a few of the metadata index nodes in
the P2P network and use them to negotiate "have"s with the upstream
server.  It would then sequentially perform the local "fetch"
operations necessary to obtain all of the objects it offered as "have"s
to minimize the commit range negotiated with the server.  Once those
local fetches completed, it could proceed to fetch the remaining
objects from the server normally.
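
In rough Python pseudo-code, the client side of that flow would look
something like this (every helper passed in below is hypothetical; the
sketch only shows the ordering of the steps):

def fetch_with_peers(remote_url, index_daemons,
                     query_index_daemon, negotiate_haves,
                     fetch_from_peer, fetch_from_server):
    """Sketch of the proposed flow; the four callables are stand-ins."""
    # 1. Ask a few nearby index daemons which commits local peers can serve.
    local_candidates = {}                        # sha -> peer address
    for daemon in index_daemons[:3]:
        local_candidates.update(query_index_daemon(daemon))

    # 2. Offer those commits as "have"s to the slow upstream, shrinking the
    #    range of objects the upstream would otherwise have to send.
    useful_haves = negotiate_haves(remote_url, local_candidates.keys())

    # 3. Fetch the commits that actually narrowed the range from fast local
    #    peers first...
    for sha in useful_haves:
        fetch_from_peer(local_candidates[sha], sha)

    # 4. ...then fetch whatever is still missing from the upstream normally.
    fetch_from_server(remote_url)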

Some amount of design and benchmarking would be needed to figure out
the most efficient indexing algorithm for finding a minimal set of
"have"s out of potentially thousands of refs, many with independent
root commits.  For example, if the index were grouped by "root commit"
(of which a given history may have more than one), you *should* be able
to quickly ask the server about a small list of root commits and then
only continue asking about commits whose roots are all known to the
server.
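
As a sketch of that pruning (the index shape here is invented): map
each set of root commits to the candidate commits descending from them,
probe the server with the roots, and drop whole groups whose roots it
has never seen.

def candidate_haves(index, server_knows):
    """index: frozenset of root sha1s -> list of candidate "have" commits.
    server_knows(sha) asks the upstream whether it recognizes a commit."""
    useful = []
    for roots, commits in index.items():
        if all(server_knows(root) for root in roots):
            useful.extend(commits)   # shared ancestry, worth offering as "have"s
        # otherwise the whole group is unrelated history and can be skipped
    return useful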

The actual P2P software would probably involve two different daemon
processes.  Daemons of the first kind would communicate with each other
and with the local repositories, maintaining the ref and commit
indexes.  They would advertise themselves with Avahi; alternatively, in
an enterprise environment they would be managed by your sysadmins and
be discovered automatically using DNS-SD.  Clients looking to perform a
P2P fetch would query these index daemons first.
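
On a typical Linux box that advertisement could be as simple as
registering a service with Avahi/DNS-SD; here is a sketch assuming the
third-party python-zeroconf package (the "_gitp2p-index" service type
is something I just made up):

import socket
from zeroconf import ServiceInfo, Zeroconf

info = ServiceInfo(
    "_gitp2p-index._tcp.local.",                  # invented service type
    "myhost._gitp2p-index._tcp.local.",           # this daemon's instance name
    addresses=[socket.inet_aton("10.0.0.5")],     # hypothetical address
    port=9419,                                    # hypothetical port
)
zc = Zeroconf()
zc.register_service(info)   # clients browse for the type to find index daemons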

The second daemon would be a modified git-daemon that connects to the
advertised "index" daemons and advertises its own refs and commit
lists, as well as its IP address and port.
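
The advertisement itself could be little more than the repository's ref
tips plus an address; roughly something like this (the payload format
and the send_to_index callable are invented):

import json
import subprocess

def advertise_repo(repo_path, my_addr, index_addr, send_to_index):
    """Push this repository's ref tips to an index daemon.
    send_to_index(addr, payload) is a stand-in for the real transport."""
    out = subprocess.run(
        ["git", "for-each-ref", "--format=%(refname) %(objectname)"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    refs = dict(line.split() for line in out.splitlines())
    send_to_index(index_addr, json.dumps(
        {"peer": my_addr, "repo": repo_path, "refs": refs}))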

My apologies if there are any blatant typos or thinkos; it's a bit
later here than I would normally be writing about technical topics.

Cheers,
Kyle Moffett